vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Distributed Inference error #1593

Closed. BlackHandsomeLee closed this issue 5 months ago.

BlackHandsomeLee commented 10 months ago

When I execute llm = LLM("/chinese-alpaca-2-13b", tensor_parallel_size=1), the code works fine, but when I change the argument to tensor_parallel_size=2, as in llm = LLM("/chinese-alpaca-2-13b", tensor_parallel_size=2), the following error occurs:

torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.15.5 ncclInternalError: Internal check failed. Last error: Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 27000
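
My understanding is that "Duplicate GPU detected" means both tensor-parallel ranks were mapped to the same physical device. A quick check like the following (a minimal sketch, nothing vLLM-specific) shows which devices the process can actually see:

import os
import torch

# What the CUDA runtime was told it may use (None means "all GPUs"):
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))

# How many devices torch can see; tensor_parallel_size=2 needs at least 2:
print("torch.cuda.device_count() =", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))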

I tried two workarounds, but neither helped.

1. Restricting the visible device explicitly:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(device_id)

2. Binding each rank to its own device:

import torch
# rank is assumed to come from the distributed launcher
device_id = rank % torch.cuda.device_count()
torch.cuda.set_device(device_id)
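
For reference, the full pattern I expected to work is below; the "0,1" device pair is only an assumption about my machine's GPU indexing, and the environment variable has to be set before anything initializes CUDA:

import os

# Expose both GPUs before torch/vLLM initialize CUDA; "0,1" is an assumption
# about which two physical GPUs the tensor-parallel ranks should use.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from vllm import LLM

llm = LLM("/chinese-alpaca-2-13b", tensor_parallel_size=2)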

How can I solve this problem?

shuoYan97 commented 4 months ago

Any suggestions for this problem with vllm==0.4.0.post1? Thanks.