sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

Unable to run qwen successfully #199

Closed: maxin9966 closed this issue 3 months ago

maxin9966 commented 8 months ago

env: 2× 2080Ti, CUDA 12.3 (cuda_12.3.r12.3/compiler.33567101_0), Python 3.9, installed with pip install "sglang[all]"

error:

new fill batch. #seq: 1. #cached_token: 0. #new_token: 8. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 0.00%.
python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.

detailed log:

(sglang2) ma@ubuntu-server:~$ python -m sglang.launch_server --model-path Qwen/Qwen1.5-0.5B --host 0.0.0.0 --port 1235 --mem-fraction-static 0.9 --tp 2
config.json: 661B [00:00, 48.4kB/s]
tokenizer_config.json: 1.16kB [00:00, 105kB/s]
vocab.json: 2.78MB [00:00, 6.09MB/s]
merges.txt: 1.67MB [00:00, 7.41MB/s]
tokenizer.json: 7.03MB [00:01, 6.60MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
server started on [0.0.0.0]:10008
server started on [0.0.0.0]:10009
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
accepted ('127.0.0.1', 38192) with fd 30
welcome ('127.0.0.1', 38192)
accepted ('127.0.0.1', 56310) with fd 26
welcome ('127.0.0.1', 56310)
Rank 0: load weight begin.
Rank 1: load weight begin.
INFO 02-17 04:29:34 weight_utils.py:163] Using model weights format ['*.safetensors']
INFO 02-17 04:29:34 weight_utils.py:163] Using model weights format ['*.safetensors']
model.safetensors: 100%|██████████| 1.24G/1.24G [01:52<00:00, 11.0MB/s]
Rank 1: load weight end.
Rank 0: load weight end.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Rank 0: max_total_num_token=382390, max_prefill_num_token=63731, context_len=32768, model_mode=[]
Rank 1: max_total_num_token=382390, max_prefill_num_token=63731, context_len=32768, model_mode=[]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [135494]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:1235 (Press CTRL+C to quit)
INFO:     127.0.0.1:56426 - "GET /get_model_info HTTP/1.1" 200 OK
new fill batch. #seq: 1. #cached_token: 0. #new_token: 8. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 0.00%.
python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
Process Process-1:
Traceback (most recent call last):
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/site-packages/sglang/srt/managers/router/manager.py", line 79, in start_router_process
    loop.run_until_complete(router.loop_for_forward())
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/site-packages/sglang/srt/managers/router/manager.py", line 38, in loop_for_forward
    out_pyobjs = await self.model_client.step(next_step_input)
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/site-packages/sglang/srt/managers/router/model_rpc.py", line 635, in _func
    await asyncio.gather(*[asyncio.to_thread(t.wait) for t in tasks])
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/site-packages/rpyc/core/async_.py", line 51, in wait
    self._conn.serve(self._ttl)
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/site-packages/rpyc/core/protocol.py", line 438, in serve
    data = self._channel.poll(timeout) and self._channel.recv()
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/site-packages/rpyc/core/channel.py", line 55, in recv
    header = self.stream.read(self.FRAME_HEADER.size)
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/site-packages/rpyc/core/stream.py", line 280, in read
    raise EOFError("connection closed by peer")
EOFError: connection closed by peer
HTTPConnectionPool(host='0.0.0.0', port=1235): Read timed out. (read timeout=60)

maxin9966 commented 8 months ago

Is anyone in the same situation as me?

horiacristescu commented 8 months ago

I am getting a similar timeout on Mistral

both the "Unexpected mma -> mma layout conversion and Read timed out. (read timeout=60)

python -m sglang.launch_server --model-path TheBloke/Mistral-7B-Merge-14-v0.1-GPTQ --port 30000

Rank 0: load weight begin.
quant_config: GPTQConfig(weight_bits=4, group_size=128, desc_act=True)
INFO 02-18 18:30:38 weight_utils.py:163] Using model weights format ['*.safetensors']
Rank 0: load weight end.
Rank 0: max_total_num_token=45819, max_prefill_num_token=32768, context_len=32768, model_mode=[]
INFO:     Started server process [25956]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:10000 (Press CTRL+C to quit)
INFO:     127.0.0.1:56830 - "GET /get_model_info HTTP/1.1" 200 OK
new fill batch. #seq: 1. #cached_token: 0. #new_token: 9. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 0.00%.
python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
INFO:     127.0.0.1:57832 - "GET /get_model_info HTTP/1.1" 200 OK
HTTPConnectionPool(host='127.0.0.1', port=10000): Read timed out. (read timeout=60)
INFO:     127.0.0.1:39578 - "GET /get_model_info HTTP/1.1" 200 OK
INFO:     127.0.0.1:46008 - "GET /get_model_info HTTP/1.1" 200 OK
INFO:     127.0.0.1:47792 - "GET /get_model_info HTTP/1.1" 200 OK
INFO:     127.0.0.1:40964 - "GET /get_model_info HTTP/1.1" 200 OK
INFO:     127.0.0.1:34552 - "GET /get_model_info HTTP/1.1" 200 OK
INFO:     127.0.0.1:54472 - "GET /get_model_info HTTP/1.1" 200 OK
INFO:     127.0.0.1:56052 - "GET /get_model_info HTTP/1.1" 200 OK

Running the samples also fails; they just hang forever with no message.

CSWellesSun commented 8 months ago

I encounter the same problem when running test/srt/model/test_llama_low_api.py: python test_llama_low_api.py crashes with the same output. The error happens in the Prefill function.

IrelandC commented 8 months ago

I am getting a similar timeout with Qwen-7B-Chat (GPU: NVIDIA A800, Python 3.8).

When I call the /generate endpoint, a timeout occurs:

CUDA_VISIBLE_DEVICES=4 python -m sglang.launch_server \
--model-path ./llm_models/Qwen-7B-Chat \
--port 7080 
151645
Rank 0: load weight begin.
Rank 0: load weight end.
151645
./sglang/python/sglang/srt/hf_transformers_utils.py:142: UserWarning: Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
  warnings.warn(
Rank 0: max_total_num_token=49560, max_prefill_num_token=8260, context_len=2048, model_mode=[]
151645
INFO:     Started server process [1917361]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
INFO:     127.0.0.1:50848 - "GET /get_model_info HTTP/1.1" 200 OK
HTTPConnectionPool(host='127.0.0.1', port=30000): Read timed out. (read timeout=60)

The error HTTPConnectionPool(host='127.0.0.1', port=30000): Read timed out. (read timeout=60) comes from the launch_server function in sglang/srt/server.py:

# Warmup
try:
    # print("Warmup...", flush=True)
    res = requests.post(
        url + "/generate",
        json={
            "text": "Say this is a warmup request.",
            "sampling_params": {
                "temperature": 0,
                "max_new_tokens": 16,
            },
        },
        timeout=60,
    )
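
The same warmup request can be re-sent by hand to confirm the hang. A small standalone sketch (the port 7080 matches my launch command above; the payload is taken from the warmup code):

import requests

# Re-send the warmup request manually; if the router process has died,
# this call hangs and eventually raises a ReadTimeout.
res = requests.post(
    "http://127.0.0.1:7080/generate",
    json={
        "text": "Say this is a warmup request.",
        "sampling_params": {"temperature": 0, "max_new_tokens": 16},
    },
    timeout=60,
)
print(res.json())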

After debugging, I found it blocks at await event.wait() in sglang/srt/managers/tokenizer_manager.py:

lock = asyncio.Lock()
event = asyncio.Event()
state = ReqState([], False, event, lock)
self.rid_to_state[rid] = state

while True:
    await event.wait()
    yield state.out_list[-1]

The wait() function comes from asyncio/locks.py, and the debugger shows it stuck at await fut:

async def wait(self):
    """Block until the internal flag is true.

    If the internal flag is true on entry, return True
    immediately.  Otherwise, block until another coroutine calls
    set() to set the flag to true, then return True.
    """
    if self._value:
        return True

    fut = self._loop.create_future()
    self._waiters.append(fut)
    try:
        await fut
        return True
    finally:
        self._waiters.remove(fut)
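
In other words, the tokenizer manager is waiting on an asyncio.Event that is never set, presumably because the router process that would set it has already crashed. A minimal standalone sketch (not sglang code) of that failure mode:

import asyncio

async def main():
    event = asyncio.Event()
    # Nothing ever calls event.set(), mirroring a crashed router process,
    # so wait() blocks until the surrounding timeout fires.
    try:
        await asyncio.wait_for(event.wait(), timeout=1.0)
    except asyncio.TimeoutError:
        print("timed out waiting for event (cf. the 60 s warmup timeout)")

asyncio.run(main())
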
IrelandC commented 8 months ago

> env: 2080Ti * 2 cuda_12.3.r12.3/compiler.33567101_0 python3.9 pip install "sglang[all]"
> […]

Has this problem been solved?

maxin9966 commented 8 months ago

> Has this problem been solved?

No model loads successfully. I don't know what the problem is, and no one has answered me.

Seumi commented 8 months ago

> I am getting a similar timeout on Mistral
> […]
> Running the samples also fails; they just hang forever with no message.

I've encountered the same issue. Have you solved it?

Jasonsey commented 7 months ago

I have the same issue

zhaohm14 commented 7 months ago

I am encountering the same issue. In my case, it occurs while loading a local copy of llava-v1.6-34b.

stevezxzhou commented 7 months ago

same issue

sneglen commented 7 months ago

I had a similar issue with Mistral; a workaround was to update triton from 2.1.0 to 2.2.0. I found a hint here.

This triggers a dependency conflict, since torch 2.1.2 pins triton==2.1.0, but so far everything seems to work fine:

torch 2.1.2 requires triton==2.1.0; platform_system == "Linux" and platform_machine == "x86_64", but you have triton 2.2.0 which is incompatible.
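Applying the workaround is a single pip command (a sketch assuming a pip-managed environment; pip will print the resolver warning quoted above, which can apparently be ignored):

pip install triton==2.2.0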

My original error:

python -m sglang.launch_server --model-path /home/llm/mistral7Iv02 --tokenizer-path /home/llm/mistral7Iv02 --port 30000 --mem-fraction-static 0.95
    Rank 0: load weight begin.
    Rank 0: load weight end.
    Rank 0: max_total_num_token=1569, max_prefill_num_token=32768, context_len=32768, 
    disable_radix_cache=False, enable_flashinfer=False, disable_regex_jump_forward=False, disable_disk_cache=False, attention_reduce_in_fp32=False
    INFO:     Started server process [9624]
    INFO:     Waiting for application startup.
    INFO:     Application startup complete.
    INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
    INFO:     127.0.0.1:50064 - "GET /get_model_info HTTP/1.1" 200 OK
    new fill batch. #seq: 1. #cached_token: 0. #new_token: 9. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 0.00%.
    python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
    HTTPConnectionPool(host='127.0.0.1', port=30000): Read timed out. (read timeout=60)

pikaro commented 7 months ago

Can confirm that updating Triton to 2.2.0 on Linux x64 helped for me - same dependency error of course, but without any apparent negative consequences. Original error was with TheBloke/Mistral-7B-Instruct-v0.2-GPTQ, so this seems unrelated to the model.

m0g1cian commented 6 months ago

I had the same HTTPConnectionPool(host='127.0.0.1', port=30000): Read timed out. (read timeout=60) problem when using Python 3.8 on a CentOS 7 machine with 8× A800.

After using a recompiled Python 3.11 with OpenSSL 1.1.1+ to install all the dependencies, the Read timed out issue no longer appears when --tp is enabled. Simply updating to triton==2.2.0 did not solve the issue for me and caused a KV cache leak when starting the sglang server, so I am currently sticking with triton==2.1.0.

I've also tested the --enable-flashinfer option after installing flashinfer-0.0.3+cu121torch2.1-cp311-cp311-linux_x86_64.whl, and everything seems to work fine.
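
For anyone who wants to try the same path, a launch command along these lines should exercise it (the model path is a placeholder; --enable-flashinfer is the flag named above):

python -m sglang.launch_server --model-path /path/to/your/model --port 30000 --enable-flashinfer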

Iven2132 commented 5 months ago

Did anyone find a solution? I'm getting the same error with llava-next-72b.

M0gician commented 5 months ago

I had the same issue when using Python 3.8. After switching to Python 3.11 and reinstalling sglang, it seems to work.

I'd definitely suggest trying a newer Python version to see if it helps.
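
For example, recreating the environment on a newer interpreter (a sketch assuming conda; any Python 3.11 install works, using the same pip install "sglang[all]" as the original report):

conda create -n sglang python=3.11
conda activate sglang
pip install "sglang[all]"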

github-actions[bot] commented 3 months ago

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.