haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0
20.19k stars 2.23k forks

Error when launching SGLang worker with llava-v1.6-34b #1289

Open zhaohm14 opened 8 months ago

zhaohm14 commented 8 months ago

Thank you for your wonderful work! I have been following the demo instructions and successfully launched the controller and the Gradio web server. However, I encountered an issue when trying to launch an SGLang worker with the local llava-v1.6-34b model.

Here's the command I used:

$ CUDA_VISIBLE_DEVICES=1,2,3,4 python -m sglang.launch_server --model-path models/llava-v1.6-34b --tokenizer-path models/llava-v1.6-34b-tokenizer --port 30000 --tp 4

And here's the terminal output:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
server started on [0.0.0.0]:10011
server started on [0.0.0.0]:10010
server started on [0.0.0.0]:10012
server started on [0.0.0.0]:10013
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
accepted ('127.0.0.1', 54758) with fd 41
welcome ('127.0.0.1', 54758)
accepted ('127.0.0.1', 50916) with fd 32
welcome ('127.0.0.1', 50916)
accepted ('127.0.0.1', 34778) with fd 35
welcome ('127.0.0.1', 34778)
accepted ('127.0.0.1', 46910) with fd 33
welcome ('127.0.0.1', 46910)
Rank 0: load weight begin.
Rank 1: load weight begin.
Rank 2: load weight begin.
Rank 3: load weight begin.
/home/zhaohm14/anaconda3/envs/llava/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/home/zhaohm14/anaconda3/envs/llava/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/home/zhaohm14/anaconda3/envs/llava/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/home/zhaohm14/anaconda3/envs/llava/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Rank 0: load weight end.
Rank 1: load weight end.
Rank 3: load weight end.
Rank 2: load weight end.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Rank 0: max_total_num_token=12849, max_prefill_num_token=4096, context_len=4096, 
disable_radix_cache=False, enable_flashinfer=False, disable_regex_jump_forward=False, disable_disk_cache=False, attention_reduce_in_fp32=False
Rank 3: max_total_num_token=12849, max_prefill_num_token=4096, context_len=4096, 
disable_radix_cache=False, enable_flashinfer=False, disable_regex_jump_forward=False, disable_disk_cache=False, attention_reduce_in_fp32=False
Rank 1: max_total_num_token=12849, max_prefill_num_token=4096, context_len=4096, 
disable_radix_cache=False, enable_flashinfer=False, disable_regex_jump_forward=False, disable_disk_cache=False, attention_reduce_in_fp32=False
Rank 2: max_total_num_token=12849, max_prefill_num_token=4096, context_len=4096, 
disable_radix_cache=False, enable_flashinfer=False, disable_regex_jump_forward=False, disable_disk_cache=False, attention_reduce_in_fp32=False
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [83740]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
INFO:     127.0.0.1:41876 - "GET /get_model_info HTTP/1.1" 200 OK
new fill batch. #seq: 1. #cached_token: 0. #new_token: 8. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 0.00%.
python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
Process Process-1:
Traceback (most recent call last):
  File "/home/zhaohm14/anaconda3/envs/llava/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/zhaohm14/anaconda3/envs/llava/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhaohm14/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/srt/managers/router/manager.py", line 79, in start_router_process
    loop.run_until_complete(router.loop_for_forward())
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/home/zhaohm14/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/srt/managers/router/manager.py", line 38, in loop_for_forward
    out_pyobjs = await self.model_client.step(next_step_input)
  File "/home/zhaohm14/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 648, in _func
    await asyncio.gather(*[asyncio.to_thread(t.wait) for t in tasks])
  File "/home/zhaohm14/anaconda3/envs/llava/lib/python3.10/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/home/zhaohm14/anaconda3/envs/llava/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/zhaohm14/anaconda3/envs/llava/lib/python3.10/site-packages/rpyc/core/async_.py", line 51, in wait
    self._conn.serve(self._ttl, waiting=self._waiting)
  File "/home/zhaohm14/anaconda3/envs/llava/lib/python3.10/site-packages/rpyc/core/protocol.py", line 464, in serve
    data = self._channel.poll(timeout) and self._channel.recv()
  File "/home/zhaohm14/anaconda3/envs/llava/lib/python3.10/site-packages/rpyc/core/channel.py", line 55, in recv
    header = self.stream.read(self.FRAME_HEADER.size)
  File "/home/zhaohm14/anaconda3/envs/llava/lib/python3.10/site-packages/rpyc/core/stream.py", line 280, in read
    raise EOFError("connection closed by peer")
EOFError: connection closed by peer
HTTPConnectionPool(host='127.0.0.1', port=30000): Read timed out. (read timeout=60)

Could you please help me understand what might be causing this issue? I am eager to get the worker up and running and would greatly appreciate any assistance you can provide.
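For anyone debugging a similar crash: a minimal health probe (a sketch only, using the `http://127.0.0.1:30000` address and `/get_model_info` path that appear in the access log above; adjust for your setup) can tell you whether the worker process is still responding after the Triton assertion fires, or whether it has died and the web server is timing out against a dead backend:

```python
# Minimal health probe for the SGLang worker. The URL and the /get_model_info
# path are taken from the access log above; adjust host/port for your setup.
import json
import urllib.request


def probe(url="http://127.0.0.1:30000/get_model_info", timeout=5):
    """Return the model info as a dict, or {'error': ...} if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read())
    except OSError as exc:  # urllib.error.URLError subclasses OSError
        return {"error": str(exc)}


if __name__ == "__main__":
    print(probe())
```

If the probe returns an error dict right after the assertion messages, the worker is gone and needs to be relaunched rather than waited on.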

jiezhangGt commented 7 months ago

Hello, I also hit an error at this step, though mine occurs earlier than yours. My error is:

[2024-04-11 10:31:54,306] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-04-11 10:32:22,116] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-11 10:32:22,117] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-04-11 10:32:42,473] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-11 10:32:42,473] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
router init state: Traceback (most recent call last):
  File "/ssd11/exec/zhangjie07/HOME/miniconda3/envs/llava/lib/python3.10/site-packages/sglang/srt/managers/router/manager.py", line 68, in start_router_process
    model_client = ModelRpcClient(server_args, port_args)
  File "/ssd11/exec/zhangjie07/HOME/miniconda3/envs/llava/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 633, in __init__
    self.model_servers = [x[0] for x in rets]
  File "/ssd11/exec/zhangjie07/HOME/miniconda3/envs/llava/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 633, in <listcomp>
    self.model_servers = [x[0] for x in rets]
  File "/ssd11/exec/zhangjie07/HOME/miniconda3/envs/llava/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/ssd11/exec/zhangjie07/HOME/miniconda3/envs/llava/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/ssd11/exec/zhangjie07/HOME/miniconda3/envs/llava/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/ssd11/exec/zhangjie07/HOME/miniconda3/envs/llava/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/ssd11/exec/zhangjie07/HOME/miniconda3/envs/llava/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/ssd11/exec/zhangjie07/HOME/miniconda3/envs/llava/lib/python3.10/site-packages/sglang/srt/managers/router/model_rpc.py", line 683, in start_model_process
    raise RuntimeError("init rpc env error!")
RuntimeError: init rpc env error!

detoken init state: init ok

Have you ever encountered this error?
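Not an answer from the maintainers, just a sketch: the traceback above shows `RuntimeError("init rpc env error!")` raised while starting the per-rank model RPC servers, and one plausible local cause (an assumption, not confirmed by the report) is that a needed TCP port is already held by a stale process. A quick standard-library check, with `port_is_free` being a hypothetical helper and not part of sglang:

```python
# Hedged sketch: check whether a TCP port is actually free before launching.
# port_is_free is a hypothetical helper for local debugging, not sglang API.
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        # connect_ex returns 0 on success (port busy), an errno otherwise.
        return s.connect_ex((host, port)) != 0
```

If a port the worker needs reports busy, killing leftover worker processes from a previous failed launch before retrying may help.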