surak opened this issue 9 months ago
I tried Aya-101, the multilingual model, with the sglang worker, and I get the error below. Maybe it happens with other models as well?
```
router init state: Traceback (most recent call last):
  File "/p/haicluster/llama/FastChat/sc_venv_2024/venv/lib/python3.11/site-packages/sglang/srt/managers/router/manager.py", line 68, in start_router_process
    model_client = ModelRpcClient(server_args, port_args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/p/haicluster/llama/FastChat/sc_venv_2024/venv/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 564, in __init__
    self.model_servers = [x[0] for x in rets]
                         ^^^^^^^^^^^^^^^^^^^^
  File "/p/haicluster/llama/FastChat/sc_venv_2024/venv/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 564, in <listcomp>
    self.model_servers = [x[0] for x in rets]
                         ^^^^^^^^^^^^^^^^^^^^
  File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/p/haicluster/llama/FastChat/sc_venv_2024/venv/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 597, in start_model_process
    proc.start()
  File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/easybuild/2024/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'start_model_process.<locals>._init_service'

detoken init state: init ok

Traceback (most recent call last):
  File "/p/haicluster/llama/FastChat/fastchat/serve/sglang_worker.py", line 290, in <module>
    runtime = sgl.Runtime(
              ^^^^^^^^^^^^
  File "/p/haicluster/llama/FastChat/sc_venv_2024/venv/lib/python3.11/site-packages/sglang/api.py", line 39, in Runtime
    return Runtime(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/p/haicluster/llama/FastChat/sc_venv_2024/venv/lib/python3.11/site-packages/sglang/srt/server.py", line 482, in __init__
    raise RuntimeError("Launch failed. Please see the error messages above.")
RuntimeError: Launch failed. Please see the error messages above.
srun: error: haicluster1: task 0: Exited with exit code 1
```
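The key line is the `AttributeError` at the bottom of the first traceback: the router launches each model server with the `spawn` start method, which has to pickle the `Process` target, and a function defined inside another function can't be pickled. A minimal sketch of just the failure mode (not sglang's actual code):

```python
import multiprocessing as mp


def start_model_process():
    # Local (closure) function: pickle stores functions by qualified name,
    # and "start_model_process.<locals>._init_service" can't be looked up
    # from a child process, so pickling it fails.
    def _init_service():
        pass

    # "spawn" serializes the Process object (including its target) before
    # launching the child, which is where the error surfaces.
    proc = mp.get_context("spawn").Process(target=_init_service)
    proc.start()  # AttributeError: Can't pickle local object ...


if __name__ == "__main__":
    try:
        start_model_process()
    except AttributeError as e:
        print(e)
```

Note that with the `fork` start method nothing is pickled, so the same code would run fine; `spawn` (the default on macOS, and what sglang uses here) is what exposes the problem.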
I realized that this happens with any model run on multiple GPUs. See #3025
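For reference, the usual fix for this class of error is to make the process entry point a module-level function, so `spawn` can pickle it by qualified name. A sketch with hypothetical names (not sglang's actual fix):

```python
import multiprocessing as mp


def init_service(gpu_id, server_args):
    """Per-GPU worker entry point. Defined at module level so the "spawn"
    start method can pickle it; a def nested inside the launcher (as in
    the traceback above) cannot be pickled."""
    return gpu_id


def start_model_processes(num_gpus, server_args=None):
    # One child process per GPU; "spawn" pickles init_service and its
    # args for each child, which now succeeds.
    ctx = mp.get_context("spawn")
    return [
        ctx.Process(target=init_service, args=(i, server_args))
        for i in range(num_gpus)
    ]
```

This also explains why the bug only shows up with multiple GPUs: the single-GPU path presumably never spawns extra worker processes, so nothing ever has to be pickled.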
Closing this one; it's a duplicate of #3025.