rh-aiservices-bu / llm-on-openshift

Resources, demos, recipes,... to work with LLMs on OpenShift with OpenShift AI or Open Data Hub.
Apache License 2.0

vLLM 0.4.0.post1 image missing libnccl #57

Closed by bbrowning 2 months ago

bbrowning commented 2 months ago

When I try to run multi-GPU inference with this image, I get the following error:

INFO 04-23 19:30:01 pynccl_utils.py:17] Failed to import NCCL library: libnccl.so.2: cannot open shared object file: No such file or directory
INFO 04-23 19:30:01 pynccl_utils.py:18] It is expected if you are not running on NVIDIA GPUs.
INFO 04-23 19:30:03 selector.py:16] Using FlashAttention backend.
(RayWorkerVllm pid=1209) ERROR 04-23 19:30:04 pynccl.py:53] Failed to load NCCL library from libnccl.so.2 .It is expected if you are not running on NVIDIA/AMD GPUs.Otherwise please set the environment variable VLLM_NCCL_SO_PATH to point to the correct nccl library path.
(RayWorkerVllm pid=1209) INFO 04-23 19:30:04 pynccl_utils.py:17] Failed to import NCCL library: libnccl.so.2: cannot open shared object file: No such file or directory
(RayWorkerVllm pid=1209) INFO 04-23 19:30:04 pynccl_utils.py:18] It is expected if you are not running on NVIDIA GPUs.
(RayWorkerVllm pid=1209) INFO 04-23 19:30:05 selector.py:16] Using FlashAttention backend.
(RayWorkerVllm pid=1209) ERROR 04-23 19:30:05 ray_utils.py:44] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerVllm pid=1209) ERROR 04-23 19:30:05 ray_utils.py:44] Traceback (most recent call last):
(RayWorkerVllm pid=1209) ERROR 04-23 19:30:05 ray_utils.py:44]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/ray_utils.py", line 37, in execute_method
(RayWorkerVllm pid=1209) ERROR 04-23 19:30:05 ray_utils.py:44]     return executor(*args, **kwargs)
(RayWorkerVllm pid=1209) ERROR 04-23 19:30:05 ray_utils.py:44]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerVllm pid=1209) ERROR 04-23 19:30:05 ray_utils.py:44]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 100, in init_device
(RayWorkerVllm pid=1209) ERROR 04-23 19:30:05 ray_utils.py:44]     init_distributed_environment(self.parallel_config, self.rank,
(RayWorkerVllm pid=1209) ERROR 04-23 19:30:05 ray_utils.py:44]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 287, in init_distributed_environment
(RayWorkerVllm pid=1209) ERROR 04-23 19:30:05 ray_utils.py:44]     pynccl_utils.init_process_group(
(RayWorkerVllm pid=1209) ERROR 04-23 19:30:05 ray_utils.py:44]   File "/opt/app-root/lib64/python3.11/site-packages/vllm/model_executor/parallel_utils/pynccl_utils.py", line 45, in init_process_group
(RayWorkerVllm pid=1209) ERROR 04-23 19:30:05 ray_utils.py:44]     logger.info(f"vLLM is using nccl=={ncclGetVersion()}")
(RayWorkerVllm pid=1209) ERROR 04-23 19:30:05 ray_utils.py:44]                                        ^^^^^^^^^^^^^^
(RayWorkerVllm pid=1209) ERROR 04-23 19:30:05 ray_utils.py:44] NameError: name 'ncclGetVersion' is not defined
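
The final NameError appears to follow from the earlier import failure: the "Failed to import NCCL library" messages are logged at INFO level and execution continues, so the name ncclGetVersion is never bound by the time init_process_group uses it. Below is a minimal sketch of that pattern. It is not vLLM's actual source, and the module name missing_nccl_bindings is hypothetical and deliberately absent.

```python
import logging

logger = logging.getLogger("pynccl_sketch")

try:
    # Hypothetical module standing in for the NCCL bindings. It does not exist,
    # so the import fails much like it would when libnccl.so.2 is missing.
    from missing_nccl_bindings import ncclGetVersion
except Exception as e:
    # The failure is only logged, so execution continues without the name bound.
    logger.info("Failed to import NCCL library: %s", e)


def init_process_group():
    # ncclGetVersion was never defined, so this lookup raises NameError.
    return f"using nccl=={ncclGetVersion()}"


def demo():
    # Return the NameError message, or None if no error occurred.
    try:
        init_process_group()
    except NameError as err:
        return str(err)
    return None
```

Calling demo() reproduces the same message as the last line of the traceback, which suggests the NameError is a symptom of the missing library rather than a separate bug.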

Looking at the installed packages, I don't see libnccl anywhere in the image. Judging from https://github.com/rh-aiservices-bu/llm-on-openshift/blob/6864d21fdea52a714078d322b4f7b2bc058fdef6/llm-servers/vllm/Containerfile#L53, the intention was perhaps to install a matching libnccl, but that step got missed?
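
For anyone who wants to confirm this inside the running container, here is a minimal sketch that asks the dynamic loader for the library directly. The function name nccl_load_status is made up for illustration. The only vLLM-specific piece is the VLLM_NCCL_SO_PATH variable, taken from the error message in the log, and I'm assuming it holds a path or name the loader can open.

```python
import ctypes
import os


def nccl_load_status(default_name="libnccl.so.2"):
    """Return (library_name_or_path, error_message_or_None)."""
    # VLLM_NCCL_SO_PATH is the override named in the vLLM error message.
    target = os.environ.get("VLLM_NCCL_SO_PATH") or default_name
    try:
        # Ask the dynamic loader to open the shared library.
        ctypes.CDLL(target)
    except OSError as e:
        return target, str(e)
    return target, None


if __name__ == "__main__":
    target, error = nccl_load_status()
    if error is None:
        print(f"{target}: loaded")
    else:
        print(f"{target}: not loadable ({error})")
```

If this reports the library as not loadable with the same Python interpreter the server uses, the image is most likely missing libnccl rather than just misconfigured.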