SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/

[Bug] tensor parallel run error #1509

Closed. jerryzh168 closed this issue 2 months ago.

jerryzh168 commented 2 months ago


Describe the bug

python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --tensor-parallel-size 2

[19:01:01 TP0] Init nccl begin.
[19:01:01 TP1] Init nccl begin.
NCCL version 2.20.5+cuda12.4
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:307 'peer access is not supported between these two devices'
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:307 'peer access is not supported between these two devices'
[rank1]:[W924 19:01:01.260102563 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[rank0]:[W924 19:01:01.260656038 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
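
For what it's worth, the failing capability can be probed directly from PyTorch before launching the benchmark. A minimal sketch (assuming the same PyTorch install as in the environment below; torch.cuda.can_device_access_peer is the standard PyTorch query for this):

# Check CUDA peer-to-peer (P2P) access between every pair of visible GPUs.
# A False result for a pair is consistent with the 'peer access is not
# supported between these two devices' error raised by custom_all_reduce.cuh.
import itertools

import torch

num_gpus = torch.cuda.device_count()
for a, b in itertools.combinations(range(num_gpus), 2):
    ok = torch.cuda.can_device_access_peer(a, b)
    print(f"GPU{a} <-> GPU{b}: peer access {'supported' if ok else 'NOT supported'}")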

Reproduction

python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --tensor-parallel-size 2

on a machine with 4x H100 GPUs

Environment

CUDA available: True
GPU 0,1,2,3: NVIDIA H100
GPU 0,1,2,3 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.0, V12.0.140
CUDA Driver Version: 525.105.17
PyTorch: 2.4.0+cu121
sglang: 0.3.1.post3
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.115.0
hf_transfer: 0.1.8
huggingface_hub: 0.25.1
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.9.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.5
multipart: 0.0.10
openai: 1.47.1
anthropic: 0.34.2

NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      PHB     PHB     PHB     0-183           N/A
GPU1    PHB      X      PHB     PHB     0-183           N/A
GPU2    PHB     PHB      X      PHB     0-183           N/A
GPU3    PHB     PHB     PHB      X      0-183           N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 524288

jerryzh168 commented 2 months ago

Adding --enable-p2p-check, as suggested in https://github.com/sgl-project/sglang/issues/991, seems to resolve the issue.
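
For reference, the reproduction command above with the workaround applied (only the extra flag differs from the original command):

python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --tensor-parallel-size 2 --enable-p2p-check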