sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/
Apache License 2.0
6.07k stars, 508 forks

Fail to load TheBloke/tulu-2-dpo-70B-AWQ on A800*2: TimeoutError: result expired #99

Closed Penglikai closed 3 months ago

Penglikai commented 9 months ago

Thanks for your great work! I am trying to load the AWQ model of Tulu-2-dpo-70B. Here is my command line input:

CUDA_VISIBLE_DEVICES=0,1 python -m sglang.launch_server --model-path TheBloke/tulu-2-dpo-70B-AWQ --tokenizer-path TheBloke/tulu-2-dpo-70B-AWQ --port 30000 --mem-fraction-static 0.5 --tp-size 2

It took over 20 minutes to load the checkpoint into GPU memory, and I finally got this error:

server started on [0.0.0.0]:10011
accepted ('127.0.0.1', 51884) with fd 6
welcome ('127.0.0.1', 51884)
accepted ('127.0.0.1', 40934) with fd 6
welcome ('127.0.0.1', 40934)
Rank 1: load weight begin.
quant_config: AWQConfig(weight_bits=4, group_size=128, zero_point=True)
Rank 0: load weight begin.
quant_config: AWQConfig(weight_bits=4, group_size=128, zero_point=True)
Rank 1: load weight end.
router init state: Traceback (most recent call last):
  File "/home/share/likai/sglang/python/sglang/srt/managers/router/manager.py", line 68, in start_router_process
    model_client = ModelRpcClient(server_args, port_args)
  File "/home/share/likai/sglang/python/sglang/srt/managers/router/model_rpc.py", line 521, in __init__
    rets = [obtain(x) for x in executor.map(init_model, range(tp_size))]
  File "/home/share/likai/sglang/python/sglang/srt/managers/router/model_rpc.py", line 521, in <listcomp>
    rets = [obtain(x) for x in executor.map(init_model, range(tp_size))]
  File "/home/likai/.conda/envs/powerinfer/lib/python3.10/concurrent/futures/_base.py", line 608, in result_iterator
    yield fs.pop().result()
  File "/home/likai/.conda/envs/powerinfer/lib/python3.10/concurrent/futures/_base.py", line 445, in result
    return self.__get_result()
  File "/home/likai/.conda/envs/powerinfer/lib/python3.10/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/home/likai/.conda/envs/powerinfer/lib/python3.10/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/share/likai/sglang/python/sglang/srt/managers/router/model_rpc.py", line 519, in init_model
    return self.model_servers[i].init_model(i, server_args, port_args)
  File "/home/likai/.conda/envs/powerinfer/lib/python3.10/site-packages/rpyc/core/netref.py", line 240, in __call__
    return syncreq(_self, consts.HANDLE_CALL, args, kwargs)
  File "/home/likai/.conda/envs/powerinfer/lib/python3.10/site-packages/rpyc/core/netref.py", line 63, in syncreq
    return conn.sync_request(handler, proxy, *args)
  File "/home/likai/.conda/envs/powerinfer/lib/python3.10/site-packages/rpyc/core/protocol.py", line 718, in sync_request
    return _async_res.value
  File "/home/likai/.conda/envs/powerinfer/lib/python3.10/site-packages/rpyc/core/async_.py", line 106, in value
    self.wait()
  File "/home/likai/.conda/envs/powerinfer/lib/python3.10/site-packages/rpyc/core/async_.py", line 55, in wait
    raise AsyncResultTimeout("result expired")
TimeoutError: result expired
detoken init state: init ok

Could you please help me check it? Thank you so much.

merrymercy commented 9 months ago

Could you try to run the command again? The first time it may require downloading which can take a lot of time.

Penglikai commented 9 months ago

> Could you try to run the command again? The first time it may require downloading which can take a lot of time.

I updated sglang to the latest version and ran it again.

I get similar output on 2 V100 GPUs. It is probably not a model downloading issue, since the model was already downloaded.

My input:

CUDA_VISIBLE_DEVICES=0,3 python -m sglang.launch_server --model-path TheBloke/tulu-2-dpo-70B-AWQ --tokenizer-path TheBloke/tulu-2-dpo-70B-AWQ --port 30000 --mem-fraction-static 0.8 --tp-size 2

Then it gets stuck for about 20 minutes waiting for Rank 0 to finish loading weights. Output:

server started on [0.0.0.0]:10004
server started on [0.0.0.0]:10005
accepted ('127.0.0.1', 60712) with fd 25
welcome ('127.0.0.1', 60712)
accepted ('127.0.0.1', 40380) with fd 25
welcome ('127.0.0.1', 40380)
Rank 1: load weight begin.
quant_config: AWQConfig(weight_bits=4, group_size=128, zero_point=True)
Rank 0: load weight begin.
quant_config: AWQConfig(weight_bits=4, group_size=128, zero_point=True)
INFO 02-19 15:51:07 weight_utils.py:164] Using model weights format ['*.safetensors']
INFO 02-19 15:51:07 weight_utils.py:164] Using model weights format ['*.safetensors']
Rank 1: load weight end.
router init state: Traceback (most recent call last):
  File "/home/data2/likai/sglang/python/sglang/srt/managers/router/manager.py", line 68, in start_router_process
    model_client = ModelRpcClient(server_args, port_args)
  File "/home/data2/likai/sglang/python/sglang/srt/managers/router/model_rpc.py", line 628, in __init__
    rets = [obtain(x) for x in executor.map(init_model, range(tp_size))]
  File "/home/data2/likai/sglang/python/sglang/srt/managers/router/model_rpc.py", line 628, in <listcomp>
    rets = [obtain(x) for x in executor.map(init_model, range(tp_size))]
  File "/home/likai/.conda/envs/sglang/lib/python3.10/concurrent/futures/_base.py", line 608, in result_iterator
    yield fs.pop().result()
  File "/home/likai/.conda/envs/sglang/lib/python3.10/concurrent/futures/_base.py", line 445, in result
    return self.__get_result()
  File "/home/likai/.conda/envs/sglang/lib/python3.10/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/home/likai/.conda/envs/sglang/lib/python3.10/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/data2/likai/sglang/python/sglang/srt/managers/router/model_rpc.py", line 626, in init_model
    return self.model_servers[i].init_model(i, server_args, port_args)
  File "/home/likai/.conda/envs/sglang/lib/python3.10/site-packages/rpyc/core/netref.py", line 240, in __call__
    return syncreq(_self, consts.HANDLE_CALL, args, kwargs)
  File "/home/likai/.conda/envs/sglang/lib/python3.10/site-packages/rpyc/core/netref.py", line 63, in syncreq
    return conn.sync_request(handler, proxy, *args)
  File "/home/likai/.conda/envs/sglang/lib/python3.10/site-packages/rpyc/core/protocol.py", line 718, in sync_request
    return _async_res.value
  File "/home/likai/.conda/envs/sglang/lib/python3.10/site-packages/rpyc/core/async_.py", line 106, in value
    self.wait()
  File "/home/likai/.conda/envs/sglang/lib/python3.10/site-packages/rpyc/core/async_.py", line 55, in wait
    raise AsyncResultTimeout("result expired")
TimeoutError: result expired
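
Both tracebacks end the same way: the router starts `init_model` for every rank through `executor.map`, one rank never reports "load weight end", and the wait for its result expires. Below is a minimal standard-library sketch of that failure mode. It is not sglang or rpyc code: `init_model` here is a stand-in that sleeps instead of loading weights, and `init_all_ranks`, `load_times`, and `timeout` are hypothetical names chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def init_model(rank: int, load_seconds: float) -> str:
    """Stand-in for a per-rank weight load (assumption: sleeps instead of loading)."""
    time.sleep(load_seconds)
    return f"rank {rank}: init ok"


def init_all_ranks(load_times, timeout):
    """Start every rank in parallel, then wait up to `timeout` seconds on each result in turn."""
    with ThreadPoolExecutor(max_workers=len(load_times)) as executor:
        futures = [
            executor.submit(init_model, rank, seconds)
            for rank, seconds in enumerate(load_times)
        ]
        results = []
        for rank, future in enumerate(futures):
            try:
                results.append(future.result(timeout=timeout))
            except FutureTimeout:
                # The slow rank is still running; only the caller's wait has expired.
                results.append(f"rank {rank}: result expired")
        return results


if __name__ == "__main__":
    # Rank 0 is slower than the caller is willing to wait; rank 1 finishes quickly.
    for line in init_all_ranks([0.5, 0.01], timeout=0.1):
        print(line)
```

In the real stack the limit is applied by rpyc to the remote `init_model` call, and rpyc exposes it as the connection option `sync_request_timeout`. Whether a longer limit would fix this case is not established in the thread, since the underlying question is why Rank 0 never finishes loading.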

github-actions[bot] commented 3 months ago

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.