vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: NCCL locating mechanism in multi-user environment #4224

Open ticoneva opened 4 months ago

ticoneva commented 4 months ago

Your current environment

🐛 Describe the bug

It seems that vLLM's NCCL detection mechanism was written with a single user in mind. The vLLM-managed NCCL .so file is installed only for the user who installed vLLM. find_nccl_library is written so that if VLLM_NCCL_SO_PATH is not set and the vLLM-managed copy of NCCL is not found, it falls back to a hard-coded NCCL filename, so_file, and then reports to the user "Found nccl from library {so_file}".

I see two problems here:

  1. [find_nccl_library] reports "Found nccl..." even when it has not found anything and is only falling back to the default filename. Perhaps it should say "Attempting to load nccl..." instead?
  2. The vLLM-managed NCCL .so file is installed only to the home directory of the user who installed vLLM, so it is inaccessible to other users. Why is it not installed alongside the package, which is accessible to all users?
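
To make the first point concrete, here is a minimal sketch of the lookup order described above with the suggested log wording. It is not vLLM's actual code: the function name find_nccl_library_sketch, the managed_path parameter, and the fallback name DEFAULT_SO_FILE are assumptions made for illustration.

```python
import logging
import os

logger = logging.getLogger(__name__)

# Hypothetical fallback name; the real hard-coded filename may differ.
DEFAULT_SO_FILE = "libnccl.so.2"


def find_nccl_library_sketch(managed_path: str) -> str:
    """Return the NCCL library path (or bare filename) to try loading.

    Lookup order, as described in this issue:
      1. the VLLM_NCCL_SO_PATH environment variable, if set
      2. the vLLM-managed copy at managed_path, if that file exists
      3. a hard-coded default filename, left to the dynamic loader
    """
    env_path = os.environ.get("VLLM_NCCL_SO_PATH")
    if env_path:
        logger.info("Using nccl from VLLM_NCCL_SO_PATH=%s", env_path)
        return env_path

    if os.path.exists(managed_path):
        logger.info("Found vLLM-managed nccl at %s", managed_path)
        return managed_path

    # Nothing has been located at this point, so the message does not
    # claim that the library was found.
    logger.info("Attempting to load nccl from default name %s", DEFAULT_SO_FILE)
    return DEFAULT_SO_FILE
```

With this wording, "Found" appears only when a file was actually located on disk; the fallback case is reported as an attempt.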

youkaichao commented 4 months ago

> The vLLM-managed NCCL .so file is only installed to the home directory of the user who installed vLLM, meaning that it is inaccessible to other users.

You can set VLLM_NCCL_SO_PATH to point to that file. Then all users can find it, provided the file is in a location they can read.
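
As a rough illustration, a launcher script could set the variable before vLLM is started. The helper name use_shared_nccl and the path in the usage comment are hypothetical; only the VLLM_NCCL_SO_PATH variable itself comes from this thread.

```python
import os


def use_shared_nccl(so_path: str) -> None:
    """Point VLLM_NCCL_SO_PATH at a shared NCCL library.

    Raises early if the file is missing or unreadable for the current
    user, instead of failing later when the library is loaded.
    """
    if not os.path.isfile(so_path):
        raise FileNotFoundError(f"NCCL library not found: {so_path}")
    if not os.access(so_path, os.R_OK):
        raise PermissionError(f"NCCL library not readable: {so_path}")
    os.environ["VLLM_NCCL_SO_PATH"] = so_path


# Usage (hypothetical shared location, called before starting vLLM):
# use_shared_nccl("/opt/shared/nccl/libnccl.so.2")
```

The same effect can be achieved by exporting the variable in a system-wide shell profile; the Python form is shown only because it allows a readability check up front.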

> Why is it not installed along the package, which is accessible to all users?

A PyPI package cannot exceed 100 MB.

> [find_nccl_library] reports "Found nccl..." even when it has not. Perhaps it should say instead "Attempting to load nccl..."?

Feel free to file a PR to improve it.