vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai

[Usage]: Got NCCL error when deploying vLLM in k8s with multiple GPUs #7466

Open ZhaoGuoXin opened 1 month ago

ZhaoGuoXin commented 1 month ago

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

I am trying to deploy a Qwen2-72B model in k8s, with 4 GPUs on one node. According to the log, it seems NCCL P2P cannot be enabled inside the k8s pod, even though the GPUs are on the same node. Is there a way to enable it? Here is my k8s deployment file (truncated):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
  namespace: model
spec:
  replicas: 1 # You can scale this up to 10
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      volumes:
```

Here is the error log:

```
2024-08-13T07:37:27.223526999Z vllm-deployment-8577b94b74-fhx85:61:61 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
2024-08-13T07:37:27.401053965Z vllm-deployment-8577b94b74-fhx85:61:61 [ERROR 08-13 07:37:27 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 111 died, exit code: -15
2024-08-13T07:37:27.401078075Z INFO 08-13 07:37:27 multiproc_worker_utils.py:123] Killing local vLLM worker processes
2024-08-13T07:37:27.506937412Z Process Process-1:
2024-08-13T07:37:27.508190691Z Traceback (most recent call last):
2024-08-13T07:37:27.508231292Z   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
2024-08-13T07:37:27.508242732Z   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
2024-08-13T07:37:27.508246402Z     self._target(*self._args, **self._kwargs)
2024-08-13T07:37:27.508249772Z   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
2024-08-13T07:37:27.508253072Z     server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in __init__
2024-08-13T07:37:27.508259922Z     self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
2024-08-13T07:37:27.508263112Z   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
2024-08-13T07:37:27.508266912Z     engine = cls(
2024-08-13T07:37:27.508270963Z   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
2024-08-13T07:37:27.508275343Z     self.engine = self._init_engine(*args, **kwargs)
2024-08-13T07:37:27.508279263Z   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
2024-08-13T07:37:27.508283243Z     return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 249, in __init__
2024-08-13T07:37:27.508291193Z     self.model_executor = executor_class(
2024-08-13T07:37:27.508295083Z   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 215, in __init__
2024-08-13T07:37:27.508298743Z     super().__init__(*args, **kwargs)
2024-08-13T07:37:27.508301943Z   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
2024-08-13T07:37:27.508308343Z   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
2024-08-13T07:37:27.508311943Z     self._init_executor()
2024-08-13T07:37:27.508315093Z   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 137, in _init_executor
    self._run_workers("init_device")
2024-08-13T07:37:27.508321623Z   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
2024-08-13T07:37:27.508324823Z     driver_worker_output = driver_worker_method(*args, **kwargs)
2024-08-13T07:37:27.508328293Z   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 132, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
2024-08-13T07:37:27.508334623Z   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 348, in init_worker_distributed_environment
2024-08-13T07:37:27.508337834Z     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
2024-08-13T07:37:27.508341424Z   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 965, in ensure_model_parallel_initialized
2024-08-13T07:37:27.508344584Z     initialize_model_parallel(tensor_model_parallel_size,
2024-08-13T07:37:27.508347804Z   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 931, in initialize_model_parallel
2024-08-13T07:37:27.508350984Z     _TP = init_model_parallel_group(group_ranks,
2024-08-13T07:37:27.508370514Z   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
2024-08-13T07:37:27.508386234Z     return GroupCoordinator(
2024-08-13T07:37:27.508389804Z   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/parallel_state.py", line 154, in __init__
2024-08-13T07:37:27.508393004Z     self.pynccl_comm = PyNcclCommunicator(
  File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
2024-08-13T07:37:27.508399564Z     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
2024-08-13T07:37:27.508402775Z   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
2024-08-13T07:37:27.508405955Z     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
2024-08-13T07:37:27.508409115Z   File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
2024-08-13T07:37:27.508412315Z     raise RuntimeError(f"NCCL error: {error_str}")
2024-08-13T07:37:27.508415555Z RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
```
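For diagnostics, the two NCCL flags mentioned in the error log can be set directly in the pod spec. A minimal sketch of the container section, assuming a container named `vllm` with a placeholder image (the container part of the manifest above is truncated):

```yaml
# Illustrative pod template spec fragment; container name and image are placeholders.
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      env:
        - name: NCCL_DEBUG
          value: "INFO"                  # verbose NCCL logging, as the error message suggests
        - name: NCCL_IGNORE_DISABLED_P2P
          value: "1"                     # only suppresses the "P2P is disabled" info message
```

Note that `NCCL_DEBUG=INFO` only adds logging, and per the log message itself, `NCCL_IGNORE_DISABLED_P2P=1` suppresses the warning rather than enabling P2P.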

youkaichao commented 1 month ago

please follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to add more details.

ZhaoGuoXin commented 1 month ago

> please follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to add more details.

Here is the output of test.py inside the vLLM pod:

```
root@vllm-deployment-8577b94b74-fhx85:/vllm-workspace# python3 test.py
Traceback (most recent call last):
  File "/vllm-workspace/test.py", line 3, in <module>
    dist.init_process_group(backend="nccl")
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
    func_return = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1361, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    rank = int(_get_env_or_raise("RANK"))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/rendezvous.py", line 231, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
```

ZhaoGuoXin commented 1 month ago

Problem solved. It was because my /dev/shm was too small. But the P2P-disabled log message still appears. So my question is: GPUs can't use P2P to communicate in k8s, right?
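For reference, the usual way to give a pod a larger /dev/shm is an in-memory emptyDir mounted at /dev/shm. A minimal sketch of the pod template spec; the volume name and size limit below are illustrative, not taken from the original manifest:

```yaml
# Illustrative pod template spec fragment; volume name and sizeLimit are examples.
spec:
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory                   # back the volume with RAM (tmpfs)
        sizeLimit: 16Gi                  # example size; pick something comfortably large
  containers:
    - name: vllm                         # placeholder container name
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm            # replaces the small default shm of the container
```

The default container /dev/shm (commonly 64 MiB) is typically far too small for NCCL's shared-memory transport between tensor-parallel workers, which matches the failure described above.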

robertgshaw2-neuralmagic commented 1 month ago

> Problem solved. It was because my /dev/shm was too small. But the P2P-disabled log message still appears. So my question is: GPUs can't use P2P to communicate in k8s, right?

It depends on the interconnect used in your system. Is it PCIe or NVLink?

ZhaoGuoXin commented 1 month ago

> Problem solved. It was because my /dev/shm was too small. But the P2P-disabled log message still appears. So my question is: GPUs can't use P2P to communicate in k8s, right?

> It depends on the interconnect used in your system. Is it PCIe or NVLink?

It's PCIe.