sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

llava 1.6 34b fails to load #220

Status: Closed (nivibilla closed this issue 8 months ago)

nivibilla commented 8 months ago

Using 8xA10s

!python -m sglang.launch_server --model-path /local_disk0/dillonlaird/hf-llava-v1.6-34b --host 0.0.0.0 --port 1234 --tp 8 --model-mode flashinfer
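For context, the server starts up cleanly and the crash only appears on the first generation request (the `/get_model_info` call in the trace succeeds, then the first fill batch fails). A minimal sketch of a client request against the port from the launch command above; the `/generate` endpoint and `sampling_params` field names follow sglang's SRT server API as of this issue and should be treated as an illustration, not a definitive spec:

```python
import json
import urllib.request

# Matches --host/--port from the launch command above.
BASE_URL = "http://0.0.0.0:1234"


def build_generate_payload(prompt: str, max_new_tokens: int = 32) -> dict:
    """Build a request body for sglang's /generate endpoint.

    The field names here mirror the SRT server API at the time of this
    issue; adjust them if the server rejects the request.
    """
    return {
        "text": prompt,
        "sampling_params": {
            "temperature": 0.0,
            "max_new_tokens": max_new_tokens,
        },
    }


def generate(prompt: str) -> dict:
    """POST one prompt to the running server and return the JSON reply."""
    payload = json.dumps(build_generate_payload(prompt)).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the llava-v1.6-34b server above, calling `generate("Describe the image.")` is the point at which the `BatchPrefillWithPagedKVCache failed to dispatch with dtype Half` error in the trace below is raised.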

Trace

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
2024-02-22 19:53:04.011658: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[message repeated 2x]
server started on [0.0.0.0]:10005
server started on [0.0.0.0]:10006
server started on [0.0.0.0]:10009
server started on [0.0.0.0]:10008
server started on [0.0.0.0]:10010
server started on [0.0.0.0]:10007
server started on [0.0.0.0]:10012
server started on [0.0.0.0]:10011
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[message repeated 2x]
accepted ('127.0.0.1', 49852) with fd 35
welcome ('127.0.0.1', 49852)
accepted ('127.0.0.1', 37168) with fd 39
welcome ('127.0.0.1', 37168)
accepted ('127.0.0.1', 33530) with fd 57
accepted ('127.0.0.1', 53720) with fd 42
welcome ('127.0.0.1', 33530)
welcome ('127.0.0.1', 53720)
accepted ('127.0.0.1', 54686) with fd 42
accepted ('127.0.0.1', 60674) with fd 38
welcome ('127.0.0.1', 54686)
welcome ('127.0.0.1', 60674)
accepted ('127.0.0.1', 52836) with fd 37
welcome ('127.0.0.1', 52836)
accepted ('127.0.0.1', 55468) with fd 37
welcome ('127.0.0.1', 55468)
[2024-02-22 19:53:34,639] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[message repeated 8x, once per tensor-parallel rank]
Rank 6: load weight begin.
Rank 7: load weight begin.
Rank 0: load weight begin.
Rank 3: load weight begin.
Rank 5: load weight begin.
Rank 4: load weight begin.
Rank 1: load weight begin.
Rank 2: load weight begin.
config.json: 100%|█████████████████████████| 4.76k/4.76k [00:00<00:00, 39.7MB/s]
pytorch_model.bin:  91%|██████████████████▏ | 1.55G/1.71G [00:03<00:00, 400MB/s]Error while downloading from https://cdn-lfs.huggingface.co/repos/aa/ef/aaef666503e18a889e4a927d9595921c7011b713a81cf619fbc411be6f69e9d6/c6032c2e0caae3dc2d4fba35535fa6307dbb49df59c7e182b1bc4b3329b81801?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27pytorch_model.bin%3B+filename%3D%22pytorch_model.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1708889857&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcwODg4OTg1N319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy9hYS9lZi9hYWVmNjY2NTAzZTE4YTg4OWU0YTkyN2Q5NTk1OTIxYzcwMTFiNzEzYTgxY2Y2MTlmYmM0MTFiZTZmNjllOWQ2L2M2MDMyYzJlMGNhYWUzZGMyZDRmYmEzNTUzNWZhNjMwN2RiYjQ5ZGY1OWM3ZTE4MmIxYmM0YjMzMjliODE4MDE%7EcmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qJnJlc3BvbnNlLWNvbnRlbnQtdHlwZT0qIn1dfQ__&Signature=dYgVtWJUGRzMp4uqzW9G-7U7tWPLAhhyMsvVWWtfSnbp46s85Ffjs0H-vT84DjSj6VYoQG65Y77BoKKQbQEio6Dfww73uf2OLMF-P7ylcjM8LO1rU8hAiTfkDtAf9fs6u2mizJ2UTfrT2kCE-3Ftns4iPB70f-ulYpaq0fHIKDgdu4Aqz6zc1IaRJV8DxsAPuAOv72p5zjV4qXTTUYoJzfPzc79qzvwvAhrzdYIKE6A495Op4VK9Dve7N-tKjrcqOaRHllz4-oIXQxfhLQIv%7EhF6VCHF447eh2-hgbcaYUWLsSA8pcgJg4SvUSk0moUMmO9DZpImd5FUr7dKwUBcvg__&Key-Pair-Id=KVTP0A1DKRTAX: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Read timed out.
Trying to resume download...
[the same download timeout and "Trying to resume download..." message was logged a second time]

pytorch_model.bin:  92%|████████████████████████▊  | 1.57G/1.71G [00:00<?, ?B/s]
pytorch_model.bin:  94%|██████████████████▋ | 1.60G/1.71G [00:00<00:00, 273MB/s]
pytorch_model.bin:  96%|███████████████████▏| 1.65G/1.71G [00:00<00:00, 319MB/s]
pytorch_model.bin: 100%|████████████████████| 1.71G/1.71G [00:00<00:00, 329MB/s]
pytorch_model.bin:  92%|██████████████████▎ | 1.57G/1.71G [00:15<00:01, 101MB/s]
/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
[warning repeated 8x, once per rank]
Rank 1: load weight end.
Rank 5: load weight end.
Rank 3: load weight end.
Rank 4: load weight end.
Rank 6: load weight end.
Rank 2: load weight end.
Rank 7: load weight end.
Rank 0: load weight end.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[message repeated 16x]
Rank 2: max_total_num_token=293751, max_prefill_num_token=48958, context_len=4096, model_mode=['flashinfer']
Rank 5: max_total_num_token=293751, max_prefill_num_token=48958, context_len=4096, model_mode=['flashinfer']
Rank 0: max_total_num_token=293751, max_prefill_num_token=48958, context_len=4096, model_mode=['flashinfer']
Rank 1: max_total_num_token=293751, max_prefill_num_token=48958, context_len=4096, model_mode=['flashinfer']
Rank 6: max_total_num_token=293751, max_prefill_num_token=48958, context_len=4096, model_mode=['flashinfer']
Rank 4: max_total_num_token=293751, max_prefill_num_token=48958, context_len=4096, model_mode=['flashinfer']
Rank 7: max_total_num_token=293751, max_prefill_num_token=48958, context_len=4096, model_mode=['flashinfer']
Rank 3: max_total_num_token=293751, max_prefill_num_token=48958, context_len=4096, model_mode=['flashinfer']
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[message repeated 16x]
INFO:     Started server process [11146]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:1234/ (Press CTRL+C to quit)
INFO:     127.0.0.1:54944 - "GET /get_model_info HTTP/1.1" 200 OK
new fill batch. #seq: 1. #cached_token: 0. #new_token: 8. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 0.00%.
Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 176, in exposed_step
    self.forward_step()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 191, in forward_step
    self.forward_fill_batch(new_batch)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 404, in forward_fill_batch
    ) = self.model_runner.forward(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 478, in forward
    return self.forward_extend_multi_modal(**kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 455, in forward_extend_multi_modal
    return self.model.forward(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/models/llava.py", line 232, in forward
    return self.language_model(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 269, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, skip_embed)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 239, in forward
    hidden_states, residual = layer(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 191, in forward
    hidden_states = self.self_attn(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 140, in forward
    attn_output = self.attn(q, k, v, input_metadata)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/layers/radix_attention.py", line 123, in forward
    return self.extend_forward(q, k, v, input_metadata)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/layers/radix_attention.py", line 99, in prefill_forward_flashinfer
    o = input_metadata.prefill_wrapper.forward(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/flashinfer/prefill.py", line 461, in forward
    return self._wrapper.forward(
RuntimeError: BatchPrefillWithPagedKVCache failed to dispatch with dtype Half

Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 176, in exposed_step
    self.forward_step()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 191, in forward_step
    self.forward_fill_batch(new_batch)
  File "/local_disk0/.ephemeral_nfs/envs/p

*** WARNING: max output size exceeded, skipping output. ***

KVCache failed to dispatch with dtype Half

Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 176, in exposed_step
    self.forward_step()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 191, in forward_step
    self.forward_fill_batch(new_batch)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 404, in forward_fill_batch
    ) = self.model_runner.forward(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 478, in forward
    return self.forward_extend_multi_modal(**kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 455, in forward_extend_multi_modal
    return self.model.forward(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/models/llava.py", line 232, in forward
    return self.language_model(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 269, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, skip_embed)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 239, in forward
    hidden_states, residual = layer(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 191, in forward
    hidden_states = self.self_attn(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 140, in forward
    attn_output = self.attn(q, k, v, input_metadata)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/layers/radix_attention.py", line 123, in forward
    return self.extend_forward(q, k, v, input_metadata)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/layers/radix_attention.py", line 99, in prefill_forward_flashinfer
    o = input_metadata.prefill_wrapper.forward(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/flashinfer/prefill.py", line 461, in forward
    return self._wrapper.forward(
RuntimeError: BatchPrefillWithPagedKVCache failed to dispatch with dtype Half

Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 176, in exposed_step
    self.forward_step()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 191, in forward_step
    self.forward_fill_batch(new_batch)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 404, in forward_fill_batch
    ) = self.model_runner.forward(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 478, in forward
    return self.forward_extend_multi_modal(**kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/managers/router/model_runner.py", line 455, in forward_extend_multi_modal
    return self.model.forward(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/models/llava.py", line 232, in forward
    return self.language_model(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 269, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, skip_embed)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 239, in forward
    hidden_states, residual = layer(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 191, in forward
    hidden_states = self.self_attn(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/models/llama2.py", line 140, in forward
    attn_output = self.attn(q, k, v, input_metadata)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/layers/radix_attention.py", line 123, in forward
    return self.extend_forward(q, k, v, input_metadata)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/layers/radix_attention.py", line 99, in prefill_forward_flashinfer
    o = input_metadata.prefill_wrapper.forward(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/flashinfer/prefill.py", line 461, in forward
    return self._wrapper.forward(
RuntimeError: BatchPrefillWithPagedKVCache failed to dispatch with dtype Half

[identical "Exception in ModelRpcClient" traceback repeated by the other workers; duplicates omitted]

/local_disk0/.ephemeral_nfs/envs/pythonEnv-5dda76ca-2de8-4eb5-a880-8a69d0d2d70e/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py:231: UserWarning: Warning: available_size=293743, max_total_num_token=293751
KV cache pool leak detected!
  warnings.warn(
[the same warning is repeated 7 more times]
HTTPConnectionPool(host='0.0.0.0', port=1234): Read timed out. (read timeout=60)
comaniac commented 8 months ago

This is caused by flashinfer, but I'm not sure why. As a workaround, you can remove `--model-mode flashinfer` for now.
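Concretely, that workaround is just the original launch command from the report with the flashinfer flag dropped (same local model path and ports):

```shell
python -m sglang.launch_server \
  --model-path /local_disk0/dillonlaird/hf-llava-v1.6-34b \
  --host 0.0.0.0 --port 1234 --tp 8
```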

nivibilla commented 8 months ago

Interestingly, the Mistral 7B version works as-is with flashinfer.

nivibilla commented 8 months ago

But yeah thanks for the tip!

Gutianpei commented 8 months ago

I hit the same error with flashinfer too; LLaVA 1.5 is fine with flashinfer. Sounds like there is a bug specific to loading the LLaVA 1.6 models with flashinfer.

nivibilla commented 8 months ago

It might be specific to the 34B Yi model, since 1.6 Mistral works fine for me even with flashinfer.
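One plausible explanation (an assumption on my part, not confirmed): flashinfer dispatches its prefill kernels by GQA group size, i.e. `num_attention_heads / num_key_value_heads`, and the Yi-34B backbone uses an unusual ratio. Comparing the head counts from each backbone's published `config.json` (values assumed here; worth double-checking against the actual files):

```python
# Head counts from each base model's published config.json
# (assumed correct here; double-check against the actual files).
CONFIGS = {
    "Yi-34B (llava-v1.6-34b)": (56, 8),          # (num_attention_heads, num_key_value_heads)
    "Mistral-7B (llava-v1.6-mistral-7b)": (32, 8),
    "Llama-2-7B (llava-v1.5-7b)": (32, 32),
}

def gqa_group_size(num_heads: int, num_kv_heads: int) -> int:
    """Number of query heads sharing each key/value head."""
    return num_heads // num_kv_heads

for name, (heads, kv_heads) in CONFIGS.items():
    print(f"{name}: GQA group size = {gqa_group_size(heads, kv_heads)}")
```

If those numbers are right, the 34B model is the only one with a group size of 7 (not a power of two); a flashinfer build that only compiled kernels for power-of-two group sizes would then fail to dispatch for it while the 7B models work. Tensor parallelism would not change this: with `--tp 8` each rank still holds 7 query heads per KV head.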