sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0

Unable to run qwen successfully #199

Closed: maxin9966 closed this issue 3 months ago

maxin9966 commented 8 months ago

env: 2× 2080Ti, CUDA 12.3 (cuda_12.3.r12.3/compiler.33567101_0), Python 3.9, installed with pip install "sglang[all]"

error:

new fill batch. #seq: 1. #cached_token: 0. #new_token: 8. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 0.00%.
python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.

detailed log:

(sglang2) ma@ubuntu-server:~$ python -m sglang.launch_server --model-path Qwen/Qwen1.5-0.5B --host 0.0.0.0 --port 1235 --mem-fraction-static 0.9 --tp 2
config.json: 661B [00:00, 48.4kB/s]
tokenizer_config.json: 1.16kB [00:00, 105kB/s]
vocab.json: 2.78MB [00:00, 6.09MB/s]
merges.txt: 1.67MB [00:00, 7.41MB/s]
tokenizer.json: 7.03MB [00:01, 6.60MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
server started on [0.0.0.0]:10008
server started on [0.0.0.0]:10009
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
accepted ('127.0.0.1', 38192) with fd 30
welcome ('127.0.0.1', 38192)
accepted ('127.0.0.1', 56310) with fd 26
welcome ('127.0.0.1', 56310)
Rank 0: load weight begin.
Rank 1: load weight begin.
INFO 02-17 04:29:34 weight_utils.py:163] Using model weights format ['*.safetensors']
INFO 02-17 04:29:34 weight_utils.py:163] Using model weights format ['*.safetensors']
model.safetensors: 100%|██████████| 1.24G/1.24G [01:52<00:00, 11.0MB/s]
Rank 1: load weight end.
Rank 0: load weight end.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Rank 0: max_total_num_token=382390, max_prefill_num_token=63731, context_len=32768, model_mode=[]
Rank 1: max_total_num_token=382390, max_prefill_num_token=63731, context_len=32768, model_mode=[]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [135494]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:1235 (Press CTRL+C to quit)
INFO:     127.0.0.1:56426 - "GET /get_model_info HTTP/1.1" 200 OK
new fill batch. #seq: 1. #cached_token: 0. #new_token: 8. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 0.00%.
python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
Process Process-1:
Traceback (most recent call last):
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/site-packages/sglang/srt/managers/router/manager.py", line 79, in start_router_process
    loop.run_until_complete(router.loop_for_forward())
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/site-packages/sglang/srt/managers/router/manager.py", line 38, in loop_for_forward
    out_pyobjs = await self.model_client.step(next_step_input)
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/site-packages/sglang/srt/managers/router/model_rpc.py", line 635, in _func
    await asyncio.gather(*[asyncio.to_thread(t.wait) for t in tasks])
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/site-packages/rpyc/core/async_.py", line 51, in wait
    self._conn.serve(self._ttl)
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/site-packages/rpyc/core/protocol.py", line 438, in serve
    data = self._channel.poll(timeout) and self._channel.recv()
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/site-packages/rpyc/core/channel.py", line 55, in recv
    header = self.stream.read(self.FRAME_HEADER.size)
  File "/home/ma/ENTER/envs/sglang2/lib/python3.9/site-packages/rpyc/core/stream.py", line 280, in read
    raise EOFError("connection closed by peer")
EOFError: connection closed by peer
HTTPConnectionPool(host='0.0.0.0', port=1235): Read timed out. (read timeout=60)

maxin9966 commented 8 months ago

Is anyone in the same situation as me?

horiacristescu commented 8 months ago

I am getting a similar timeout on Mistral

both the "Unexpected mma -> mma layout conversion and Read timed out. (read timeout=60)

python -m sglang.launch_server --model-path TheBloke/Mistral-7B-Merge-14-v0.1-GPTQ --port 30000

Rank 0: load weight begin.
quant_config: GPTQConfig(weight_bits=4, group_size=128, desc_act=True)
INFO 02-18 18:30:38 weight_utils.py:163] Using model weights format ['*.safetensors']
Rank 0: load weight end.
Rank 0: max_total_num_token=45819, max_prefill_num_token=32768, context_len=32768, model_mode=[]
INFO:     Started server process [25956]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:10000 (Press CTRL+C to quit)
INFO:     127.0.0.1:56830 - "GET /get_model_info HTTP/1.1" 200 OK
new fill batch. #seq: 1. #cached_token: 0. #new_token: 9. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 0.00%.
python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
INFO:     127.0.0.1:57832 - "GET /get_model_info HTTP/1.1" 200 OK
HTTPConnectionPool(host='127.0.0.1', port=10000): Read timed out. (read timeout=60)
INFO:     127.0.0.1:39578 - "GET /get_model_info HTTP/1.1" 200 OK
INFO:     127.0.0.1:46008 - "GET /get_model_info HTTP/1.1" 200 OK
INFO:     127.0.0.1:47792 - "GET /get_model_info HTTP/1.1" 200 OK
INFO:     127.0.0.1:40964 - "GET /get_model_info HTTP/1.1" 200 OK
INFO:     127.0.0.1:34552 - "GET /get_model_info HTTP/1.1" 200 OK
INFO:     127.0.0.1:54472 - "GET /get_model_info HTTP/1.1" 200 OK
INFO:     127.0.0.1:56052 - "GET /get_model_info HTTP/1.1" 200 OK

Running the samples also fails; they just hang forever with no message.

CSWellesSun commented 8 months ago

I encounter the same problem when running test/srt/model/test_llama_low_api.py: python test_llama_low_api.py crashes with the same output. The error happens in the Prefill function.

IrelandC commented 8 months ago

I am getting a similar timeout with Qwen-7B-Chat (GPU: NVIDIA A800, Python 3.8).

When I call the /generate endpoint, a timeout occurs:

CUDA_VISIBLE_DEVICES=4 python -m sglang.launch_server \
--model-path ./llm_models/Qwen-7B-Chat \
--port 7080 
151645
Rank 0: load weight begin.
Rank 0: load weight end.
151645
./sglang/python/sglang/srt/hf_transformers_utils.py:142: UserWarning: Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
  warnings.warn(
Rank 0: max_total_num_token=49560, max_prefill_num_token=8260, context_len=2048, model_mode=[]
151645
INFO:     Started server process [1917361]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
INFO:     127.0.0.1:50848 - "GET /get_model_info HTTP/1.1" 200 OK
HTTPConnectionPool(host='127.0.0.1', port=30000): Read timed out. (read timeout=60)

The error HTTPConnectionPool(host='127.0.0.1', port=30000): Read timed out. (read timeout=60) comes from the launch_server function in sglang/srt/server.py:

# Warmup
try:
    # print("Warmup...", flush=True)
    res = requests.post(
        url + "/generate",
        json={
            "text": "Say this is a warmup request.",
            "sampling_params": {
                "temperature": 0,
                "max_new_tokens": 16,
            },
        },
        timeout=60,
    )
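
The same warmup request can be re-sent by hand to confirm the hang. A small standalone sketch (the port 7080 matches my launch command above; the payload is taken from the warmup code):

import requests

# Re-send the warmup request manually; if the router process has died,
# this call hangs and eventually raises a ReadTimeout.
res = requests.post(
    "http://127.0.0.1:7080/generate",
    json={
        "text": "Say this is a warmup request.",
        "sampling_params": {"temperature": 0, "max_new_tokens": 16},
    },
    timeout=60,
)
print(res.json())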

After debugging, I found it blocks at await event.wait() in sglang/srt/managers/tokenizer_manager.py:

lock = asyncio.Lock()
event = asyncio.Event()
state = ReqState([], False, event, lock)
self.rid_to_state[rid] = state

while True:
    await event.wait()
    yield state.out_list[-1]

The wait() function comes from asyncio/locks.py, and the debugger shows it stuck at await fut:

async def wait(self):
    """Block until the internal flag is true.

    If the internal flag is true on entry, return True
    immediately.  Otherwise, block until another coroutine calls
    set() to set the flag to true, then return True.
    """
    if self._value:
        return True

    fut = self._loop.create_future()
    self._waiters.append(fut)
    try:
        await fut
        return True
    finally:
        self._waiters.remove(fut)
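
In other words, the tokenizer manager is waiting on an asyncio.Event that is never set, presumably because the router process that would set it has already crashed. A minimal standalone sketch (not sglang code) of that failure mode:

import asyncio

async def main():
    event = asyncio.Event()
    # Nothing ever calls event.set(), mirroring a crashed router process,
    # so wait() blocks until the surrounding timeout fires.
    try:
        await asyncio.wait_for(event.wait(), timeout=1.0)
    except asyncio.TimeoutError:
        print("timed out waiting for event (cf. the 60 s warmup timeout)")

asyncio.run(main())
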
IrelandC commented 8 months ago

> env: 2080Ti * 2 cuda_12.3.r12.3/compiler.33567101_0 python3.9 pip install "sglang[all]"
> […]

Has this problem been solved?

maxin9966 commented 8 months ago

> Has this problem been solved?

No model loads successfully. I don't know what the problem is, and no one has answered me.

Seumi commented 8 months ago

> I am getting a similar timeout on Mistral
> […]
> Running the samples also fails; they just hang forever with no message.

I've encountered the same issue. Have you solved it?

Jasonsey commented 7 months ago

I have the same issue

zhaohm14 commented 7 months ago

I am encountering the same issue. In my case, it occurs while loading a local copy of llava-v1.6-34b.

stevezxzhou commented 7 months ago

same issue

sneglen commented 7 months ago

I had a similar issue with Mistral; a workaround was to update triton from 2.1.0 to 2.2.0. I found a hint here.

This triggers a dependency conflict, since torch 2.1.2 pins triton==2.1.0, but so far everything seems to work fine:

torch 2.1.2 requires triton==2.1.0; platform_system == "Linux" and platform_machine == "x86_64", but you have triton 2.2.0 which is incompatible.
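Applying the workaround is a single pip command (a sketch assuming a pip-managed environment; pip will print the resolver warning quoted above, which can apparently be ignored):

pip install triton==2.2.0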

My original error:

python -m sglang.launch_server --model-path /home/llm/mistral7Iv02 --tokenizer-path /home/llm/mistral7Iv02 --port 30000 --mem-fraction-static 0.95
    Rank 0: load weight begin.
    Rank 0: load weight end.
    Rank 0: max_total_num_token=1569, max_prefill_num_token=32768, context_len=32768, 
    disable_radix_cache=False, enable_flashinfer=False, disable_regex_jump_forward=False, disable_disk_cache=False, attention_reduce_in_fp32=False
    INFO:     Started server process [9624]
    INFO:     Waiting for application startup.
    INFO:     Application startup complete.
    INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
    INFO:     127.0.0.1:50064 - "GET /get_model_info HTTP/1.1" 200 OK
    new fill batch. #seq: 1. #cached_token: 0. #new_token: 9. #remaining_req: 0. #running_req: 0. tree_cache_hit_rate: 0.00%.
    python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
    HTTPConnectionPool(host='127.0.0.1', port=30000): Read timed out. (read timeout=60)

pikaro commented 7 months ago

Can confirm that updating Triton to 2.2.0 on Linux x64 helped for me - same dependency error of course, but without any apparent negative consequences. Original error was with TheBloke/Mistral-7B-Instruct-v0.2-GPTQ, so this seems unrelated to the model.

m0g1cian commented 6 months ago

I had the same HTTPConnectionPool(host='127.0.0.1', port=30000): Read timed out. (read timeout=60) problem when using Python 3.8 on a CentOS 7 machine with 8× A800.

After using a recompiled Python 3.11 with OpenSSL 1.1.1+ to install all the dependencies, the Read timed out issue no longer appears when --tp is enabled. Simply updating to triton==2.2.0 did not solve the issue for me and caused a KV cache leak when starting the sglang server, so I am currently sticking with triton==2.1.0.

I've also tested the --enable-flashinfer option after installing flashinfer-0.0.3+cu121torch2.1-cp311-cp311-linux_x86_64.whl, and everything seems to work fine.
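
For anyone who wants to try the same path, a launch command along these lines should exercise it (the model path is a placeholder; --enable-flashinfer is the flag named above):

python -m sglang.launch_server --model-path /path/to/your/model --port 30000 --enable-flashinfer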

Iven2132 commented 5 months ago

Did anyone find a solution? I'm getting the same error with llava-next-72b.

M0gician commented 5 months ago

I had the same issue when using Python 3.8. After switching to Python 3.11 and reinstalling sglang, it seems to work.

I'd definitely suggest trying a newer Python version to see if it helps.
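
For example, recreating the environment on a newer interpreter (a sketch assuming conda; any Python 3.11 install works, using the same pip install "sglang[all]" as the original report):

conda create -n sglang python=3.11
conda activate sglang
pip install "sglang[all]"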

github-actions[bot] commented 3 months ago

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.