sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.

Fix Regression: Disable p2p for 4090 #531

Closed · ZX-ModelCloud closed this 3 months ago

ZX-ModelCloud commented 3 months ago

Stacktrace:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[gpu_id=0] Set cuda device.
[gpu_id=1] Set cuda device.
[gpu_id=0] Init nccl begin.
[gpu_id=1] Init nccl begin.
Failed: Cuda error /home/runner/work/vllm/vllm/csrc/custom_all_reduce.cuh:307 'peer access is not supported between these two devices'
Failed: Cuda error /home/runner/work/vllm/vllm/csrc/custom_all_reduce.cuh:307 'peer access is not supported between these two devices'
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[rank1]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Initialization failed. router_init_state: Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/manager_single.py", line 76, in start_controller_process
    model_client = ModelTpClient(
                   ^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 793, in __init__
    self.step = async_wrap("step")
                ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 784, in async_wrap
    fs = [rpyc.async_(getattr(m, func_name)) for m in self.model_servers]
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 784, in <listcomp>
    fs = [rpyc.async_(getattr(m, func_name)) for m in self.model_servers]
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/root/miniconda3/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/sglang/srt/managers/controller/tp_worker.py", line 772, in init_model
    return self.model_services[i].ModelTpServer(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/rpyc/core/netref.py", line 239, in __call__
    return syncreq(_self, consts.HANDLE_CALL, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/rpyc/core/netref.py", line 63, in syncreq
    return conn.sync_request(handler, proxy, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/rpyc/core/protocol.py", line 744, in sync_request
    return _async_res.value
           ^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/rpyc/core/async_.py", line 109, in value
    self.wait()
  File "/root/miniconda3/lib/python3.11/site-packages/rpyc/core/async_.py", line 51, in wait
    self._conn.serve(self._ttl, waiting=self._waiting)
  File "/root/miniconda3/lib/python3.11/site-packages/rpyc/core/protocol.py", line 464, in serve
    data = self._channel.poll(timeout) and self._channel.recv()
                                           ^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/rpyc/core/channel.py", line 55, in recv
    header = self.stream.read(self.FRAME_HEADER.size)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/rpyc/core/stream.py", line 280, in read
    raise EOFError("connection closed by peer")
EOFError: connection closed by peer

Initialization failed. detoken_init_state: init ok
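For context on what "disable p2p for 4090" means in practice: the custom all-reduce path in the trace above requires CUDA peer access, which the 4090 does not provide. Below is a minimal sketch of that kind of guard, assuming a hypothetical `p2p_supported` helper; it is not the PR's actual diff (the real change is around SGLang's `monkey_patch_vllm_p2p_access_check`).

```python
# Hypothetical sketch, not the actual PR change: skip peer-to-peer access
# on GPUs known to lack it, such as the RTX 4090.
import torch

def p2p_supported(gpu_id: int, peer_gpu_id: int) -> bool:
    """Return False for cards without P2P support, e.g. the RTX 4090."""
    # RTX 4090 consumer cards do not support peer access; skip it outright.
    if "4090" in torch.cuda.get_device_name(gpu_id):
        return False
    # Otherwise defer to the driver's own capability report.
    return torch.cuda.can_device_access_peer(gpu_id, peer_gpu_id)
```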
Qubitium commented 3 months ago

As far as I know, the RTX 40 series is the only modern Nvidia GPU line that does not support P2P. Shame on Nvidia. Attempting to force P2P on these cards will cause errors.

fpreiss commented 3 months ago

I stumbled across this issue and used the following dirty workaround. It won't give you p2p and thus degrades performance, but the model will still be split across the cards.

~~1. Comment out the line where `monkey_patch_vllm_p2p_access_check()` is called in `python/sglang/srt/managers/controller/model_runner.py`.
2. In your terminal session, set `export NCCL_IGNORE_DISABLED_P2P=1` as described in https://github.com/vllm-project/vllm/issues/406 (rough sketch below).
3. Try again.~~
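For reference, the environment-variable step can also be set from Python before the server workers start; a rough sketch follows. The launch command, model path, and `--tp` flag here are illustrative assumptions, not something stated in this thread.

```python
# Rough sketch of the struck-out workaround: tell NCCL to ignore disabled
# P2P links. Must be set before NCCL initializes, i.e. before the server
# processes are spawned. Equivalent to `export NCCL_IGNORE_DISABLED_P2P=1`.
import os
import subprocess

os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"

# Launch command shown for illustration only; adjust model and TP size.
subprocess.run(
    ["python", "-m", "sglang.launch_server",
     "--model-path", "meta-llama/Llama-2-7b-chat-hf",
     "--tp", "2"],
    check=True,
    env=os.environ,
)
```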

My bad, didn't realize that I'm looking at a pull request instead of an issue.

fpreiss commented 3 months ago

Alternatively p2p can be enabled for 4090 GPUs with this fork of the gpu kernel modules (have not tried it yet): https://github.com/tinygrad/open-gpu-kernel-modules

Qubitium commented 3 months ago

> Alternatively p2p can be enabled for 4090 GPUs with this fork of the gpu kernel modules (have not tried it yet): https://github.com/tinygrad/open-gpu-kernel-modules

It works! Tested the tiny corp Nvidia driver fork and NCCL p2p works for the 4090 (albeit slowly).
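As a quick sanity check (not part of the thread), PyTorch can report whether the patched kernel modules actually expose peer access between the two cards:

```python
# Verify peer access between GPU 0 and GPU 1 after installing the patched
# open-gpu-kernel-modules. Both directions should report True if P2P works.
import torch

assert torch.cuda.device_count() >= 2, "need at least two GPUs"
print("0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
print("1 -> 0:", torch.cuda.can_device_access_peer(1, 0))
```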