sgl-project / sglang

SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
Apache License 2.0

peer access is not supported between these two devices #552


gmonair commented 2 weeks ago

After upgrading from sglang 0.1.16 to 0.1.17, I get the following error when loading a model with tp=2 on a 2xT4 machine (Kaggle). The same code worked on 0.1.16.

Error:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Failed: Cuda error /home/runner/work/vllm/vllm/csrc/custom_all_reduce.cuh:307 'peer access is not supported between these two devices'
Failed: Cuda error /home/runner/work/vllm/vllm/csrc/custom_all_reduce.cuh:307 'peer access is not supported between these two devices'

[rank1]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

[...]

Code:

import sglang as sgl

runtime = sgl.Runtime(model_path=model_name, tp_size=2)

This used to run fine on 0.1.16 on the same machine. The model being loaded is deepseek-7b, i.e. the LlamaForCausalLM family. Let me know if you want me to test other models.
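For anyone hitting this: the error comes from vLLM's custom all-reduce kernel, which requires CUDA peer-to-peer (P2P) access between the GPUs, and Kaggle's 2xT4 setups typically don't support P2P. A minimal sketch to check your own machine, assuming PyTorch is installed (returns None when CUDA or the devices aren't available):

```python
# Sketch: check whether two GPUs support peer-to-peer (P2P) access.
# Returns True/False from the CUDA runtime, or None if it can't be checked.
def peer_access_supported(dev_a: int, dev_b: int):
    try:
        import torch
    except ImportError:
        return None  # PyTorch not installed
    if not torch.cuda.is_available() or torch.cuda.device_count() <= max(dev_a, dev_b):
        return None  # no CUDA runtime, or not enough GPUs
    # torch.cuda.can_device_access_peer wraps cudaDeviceCanAccessPeer
    return torch.cuda.can_device_access_peer(dev_a, dev_b)

print(peer_access_supported(0, 1))
```

If this prints False, the custom all-reduce path in 0.1.17 cannot work on that machine and needs to be disabled.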

merrymercy commented 1 week ago

See this PR for a temporary fix; it lets you disable custom all-reduce for your setup: https://github.com/sgl-project/sglang/pull/531. If you get it fully fixed, please contribute a PR.
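For reference, a hedged sketch of what the workaround looks like. The keyword name `disable_custom_all_reduce` is an assumption borrowed from vLLM's engine arguments, not confirmed for this sglang version; check the linked PR for the actual interface it adds.

```python
# Hypothetical sketch: turning off the custom all-reduce kernel when P2P
# access is unavailable. The flag name `disable_custom_all_reduce` is an
# assumption (it mirrors vLLM's engine argument); see PR #531 for the
# actual fix.
model_name = "your-model-path"  # placeholder, e.g. the deepseek-7b checkpoint

runtime_kwargs = dict(
    model_path=model_name,
    tp_size=2,
    disable_custom_all_reduce=True,  # fall back to the default all-reduce
)
# runtime = sgl.Runtime(**runtime_kwargs)  # uncomment with sglang installed
```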