lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0
36.97k stars 4.56k forks

AYA-101 killing SGLang #3049

Open surak opened 9 months ago

surak commented 9 months ago

I tried Aya-101, the multilingual model, with the SGLang worker, and I get the traceback below. Maybe it happens with other models as well?

2024-02-15 16:04:21 | INFO | stdout | router init state: Traceback (most recent call last):
2024-02-15 16:04:21 | INFO | stdout |   File "/p/haicluster/llama/FastChat/sc_venv_2024/venv/lib/python3.11/site-packages/sglang/srt/managers/router/manager.py", line 68, in start_router_process
2024-02-15 16:04:21 | INFO | stdout |     model_client = ModelRpcClient(server_args, port_args)
2024-02-15 16:04:21 | INFO | stdout |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-15 16:04:21 | INFO | stdout |   File "/p/haicluster/llama/FastChat/sc_venv_2024/venv/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 564, in __init__
2024-02-15 16:04:21 | INFO | stdout |     self.model_servers = [x[0] for x in rets]
2024-02-15 16:04:21 | INFO | stdout |                          ^^^^^^^^^^^^^^^^^^^^
2024-02-15 16:04:21 | INFO | stdout |   File "/p/haicluster/llama/FastChat/sc_venv_2024/venv/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 564, in <listcomp>
2024-02-15 16:04:21 | INFO | stdout |     self.model_servers = [x[0] for x in rets]
2024-02-15 16:04:21 | INFO | stdout |                          ^^^^^^^^^^^^^^^^^^^^
2024-02-15 16:04:21 | INFO | stdout |   File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
2024-02-15 16:04:21 | INFO | stdout |     yield _result_or_cancel(fs.pop())
2024-02-15 16:04:21 | INFO | stdout |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-15 16:04:21 | INFO | stdout |   File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
2024-02-15 16:04:21 | INFO | stdout |     return fut.result(timeout)
2024-02-15 16:04:21 | INFO | stdout |            ^^^^^^^^^^^^^^^^^^^
2024-02-15 16:04:21 | INFO | stdout |   File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/concurrent/futures/_base.py", line 456, in result
2024-02-15 16:04:21 | INFO | stdout |     return self.__get_result()
2024-02-15 16:04:21 | INFO | stdout |            ^^^^^^^^^^^^^^^^^^^
2024-02-15 16:04:21 | INFO | stdout |   File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
2024-02-15 16:04:21 | INFO | stdout |     raise self._exception
2024-02-15 16:04:21 | INFO | stdout |   File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/concurrent/futures/thread.py", line 58, in run
2024-02-15 16:04:21 | INFO | stdout |     result = self.fn(*self.args, **self.kwargs)
2024-02-15 16:04:21 | INFO | stdout |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-15 16:04:21 | INFO | stdout |   File "/p/haicluster/llama/FastChat/sc_venv_2024/venv/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 597, in start_model_process
2024-02-15 16:04:21 | INFO | stdout |     proc.start()
2024-02-15 16:04:21 | INFO | stdout |   File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/multiprocessing/process.py", line 121, in start
2024-02-15 16:04:21 | INFO | stdout |     self._popen = self._Popen(self)
2024-02-15 16:04:21 | INFO | stdout |                   ^^^^^^^^^^^^^^^^^
2024-02-15 16:04:21 | INFO | stdout |   File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/multiprocessing/context.py", line 224, in _Popen
2024-02-15 16:04:21 | INFO | stdout |     return _default_context.get_context().Process._Popen(process_obj)
2024-02-15 16:04:21 | INFO | stdout |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-15 16:04:21 | INFO | stdout |   File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
2024-02-15 16:04:21 | INFO | stdout |     return Popen(process_obj)
2024-02-15 16:04:21 | INFO | stdout |            ^^^^^^^^^^^^^^^^^^
2024-02-15 16:04:21 | INFO | stdout |   File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
2024-02-15 16:04:21 | INFO | stdout |     super().__init__(process_obj)
2024-02-15 16:04:21 | INFO | stdout |   File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
2024-02-15 16:04:21 | INFO | stdout |     self._launch(process_obj)
2024-02-15 16:04:21 | INFO | stdout |   File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 47, in _launch
2024-02-15 16:04:21 | INFO | stdout |     reduction.dump(process_obj, fp)
2024-02-15 16:04:21 | INFO | stdout |   File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/multiprocessing/reduction.py", line 60, in dump
2024-02-15 16:04:21 | INFO | stdout |     ForkingPickler(file, protocol).dump(obj)
2024-02-15 16:04:21 | INFO | stdout | AttributeError: Can't pickle local object 'start_model_process.<locals>._init_service'
2024-02-15 16:04:21 | INFO | stdout | 
2024-02-15 16:04:21 | INFO | stdout | detoken init state: init ok
2024-02-15 16:04:22 | ERROR | stderr | Traceback (most recent call last):
2024-02-15 16:04:22 | ERROR | stderr |   File "/p/haicluster/llama/FastChat/fastchat/serve/sglang_worker.py", line 290, in <module>
2024-02-15 16:04:22 | ERROR | stderr |     runtime = sgl.Runtime(
2024-02-15 16:04:22 | ERROR | stderr |               ^^^^^^^^^^^^
2024-02-15 16:04:22 | ERROR | stderr |   File "/p/haicluster/llama/FastChat/sc_venv_2024/venv/lib/python3.11/site-packages/sglang/api.py", line 39, in Runtime
2024-02-15 16:04:22 | ERROR | stderr |     return Runtime(*args, **kwargs)
2024-02-15 16:04:22 | ERROR | stderr |            ^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-15 16:04:22 | ERROR | stderr |   File "/p/haicluster/llama/FastChat/sc_venv_2024/venv/lib/python3.11/site-packages/sglang/srt/server.py", line 482, in __init__
2024-02-15 16:04:22 | ERROR | stderr |     raise RuntimeError("Launch failed. Please see the error messages above.")
2024-02-15 16:04:22 | ERROR | stderr | RuntimeError: Launch failed. Please see the error messages above.
srun: error: haicluster1: task 0: Exited with exit code 1
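The failure at the bottom of the traceback is a generic CPython limitation rather than anything specific to Aya-101: `pickle` serializes functions by qualified name, so a function defined inside another function (a `<locals>` object) cannot be pickled, and the `spawn` start method pickles the process target before sending it to the child. A minimal reproduction (the function names mirror the traceback but are illustrative, not sglang's actual code):

```python
import pickle

def start_model_process():
    # A nested function's qualified name contains "<locals>", so pickle
    # cannot look it up by name when deserializing in another process.
    def _init_service():
        pass
    return _init_service

fn = start_model_process()
try:
    pickle.dumps(fn)
except AttributeError as err:
    # e.g. "Can't pickle local object 'start_model_process.<locals>._init_service'"
    print(err)
```

This is why the crash surfaces only on the code path that launches extra worker processes: single-process startup never needs to pickle `_init_service`.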
surak commented 9 months ago

I realized that this happens with any model running on multiple GPUs. See #3025

surak commented 9 months ago

Will close this one; it's a duplicate of #3025.