When I execute this code, it works fine:

llm = LLM("/chinese-alpaca-2-13b", tensor_parallel_size=1)

But when I change the argument to

llm = LLM("/chinese-alpaca-2-13b", tensor_parallel_size=2)

the following error occurs:
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.15.5
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 27000
I tried two workarounds, but neither worked.

1. Setting CUDA_VISIBLE_DEVICES before creating the LLM:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(device_id)  # device_id: index of the GPU to expose
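For reference, this is a minimal sketch of what I was attempting. The GPU indices "0,1" are an assumption (any two distinct physical GPUs would do); the point is that the environment variable is set before the library that initialises CUDA is imported, since setting it afterwards has no effect:

```python
import os

# Must happen before torch / vLLM initialise CUDA, otherwise it is ignored
# and both tensor-parallel ranks can end up on the same device.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # assumed: two distinct GPUs, indices 0 and 1

# Import only after the environment variable is set.
# from vllm import LLM
# llm = LLM("/chinese-alpaca-2-13b", tensor_parallel_size=2)

# Sanity check: two distinct device indices are exposed.
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(len(set(visible)))
```
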
How can I solve this problem?