Checklist

[X] 1. I have searched related issues but cannot get the expected help.
[X] 2. The bug has not been fixed in the latest version.
[X] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --tensor-parallel-size 2
[19:01:01 TP0] Init nccl begin.
[19:01:01 TP1] Init nccl begin.
NCCL version 2.20.5+cuda12.4
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:307 'peer access is not supported between these two devices'
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:307 'peer access is not supported between these two devices'
[rank1]:[W924 19:01:01.260102563 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[rank0]:[W924 19:01:01.260656038 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
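The peer-access capability that the custom all-reduce path complains about can also be probed directly from PyTorch. A minimal sketch, assuming the two TP ranks map to CUDA devices 0 and 1:

```python
# Probe CUDA peer-to-peer access between the two GPUs used by TP=2.
# Sketch only; assumes devices 0 and 1 are the two tensor-parallel ranks.
import torch

def can_p2p(dev_a: int, dev_b: int) -> bool:
    # True only if dev_a can directly access dev_b's memory over P2P.
    return torch.cuda.can_device_access_peer(dev_a, dev_b)

if __name__ == "__main__":
    print(f"CUDA devices visible: {torch.cuda.device_count()}")
    print(f"P2P 0 -> 1: {can_p2p(0, 1)}")
    print(f"P2P 1 -> 0: {can_p2p(1, 0)}")
```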
Reproduction
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --tensor-parallel-size 2
on a 4x H100 GPU machine
Environment
CUDA available: True
GPU 0,1,2,3: NVIDIA H100
GPU 0,1,2,3 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.0, V12.0.140
CUDA Driver Version: 525.105.17
PyTorch: 2.4.0+cu121
sglang: 0.3.1.post3
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.2
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.115.0
hf_transfer: 0.1.8
huggingface_hub: 0.25.1
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.9.2
uvicorn: 0.30.6
uvloop: 0.20.0
zmq: 26.2.0
vllm: 0.5.5
multipart: 0.0.10
openai: 1.47.1
anthropic: 0.34.2
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      PHB     PHB     PHB     0-183           N/A
GPU1    PHB      X      PHB     PHB     0-183           N/A
GPU2    PHB     PHB      X      PHB     0-183           N/A
GPU3    PHB     PHB     PHB      X      0-183           N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Hypervisor vendor: KVM
ulimit soft: 524288
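To cross-check the PHB-only topology reported above, the same peer-access query can be run for every visible GPU pair; a short sketch along the same lines as the earlier check:

```python
# Print a peer-access matrix for all visible GPUs, mirroring the
# nvidia-smi topology table above. Sketch; assumes >= 2 visible GPUs.
import torch

n = torch.cuda.device_count()
print("      " + "  ".join(f"GPU{j}" for j in range(n)))
for i in range(n):
    cells = []
    for j in range(n):
        if i == j:
            cells.append("  X ")
        else:
            ok = torch.cuda.can_device_access_peer(i, j)
            cells.append(" P2P" if ok else "  - ")
    print(f"GPU{i} " + "  ".join(cells))
```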