Closed: sigridjineth closed this issue 1 month ago.
Hi, please paste your environment using https://github.com/vllm-project/vllm/blob/main/collect_env.py , so that we can help you better.
@youkaichao I tried to run it before but got this error; can you help me out?
(sionic) sionic@iZmj7ir0ircgij46j89st9Z:~/sigrid/vllm$ python ./collect_env.py
Collecting environment information...
Traceback (most recent call last):
File "/home/sionic/sigrid/vllm/./collect_env.py", line 719, in <module>
main()
File "/home/sionic/sigrid/vllm/./collect_env.py", line 698, in main
output = get_pretty_env_info()
File "/home/sionic/sigrid/vllm/./collect_env.py", line 693, in get_pretty_env_info
return pretty_str(get_env_info())
File "/home/sionic/sigrid/vllm/./collect_env.py", line 499, in get_env_info
pip_version, pip_list_output = get_pip_packages(run_lambda)
File "/home/sionic/sigrid/vllm/./collect_env.py", line 469, in get_pip_packages
out = run_with_pip([sys.executable, '-mpip'])
File "/home/sionic/sigrid/vllm/./collect_env.py", line 465, in run_with_pip
return "\n".join(line for line in out.splitlines()
AttributeError: 'NoneType' object has no attribute 'splitlines'
This is strange. Your environment might be broken. What happens when you manually execute python -mpip list --format=freeze?
@youkaichao I am using the uv manager, which is a Rust-based Python package manager.
and here's the uv pip freeze:
(logickor-pipeline) sionic@iZmj7ir0ircgij46j89st9Z:~/sigrid/logickor-pipeline$ uv pip freeze
aiosignal==1.3.1
aiosqlite==0.20.0
annotated-types==0.6.0
anyio==4.3.0
attrs==23.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==3.0.0
cupy-cuda12x==12.1.0
diskcache==5.6.3
distro==1.9.0
dnspython==2.6.1
email-validator==2.1.1
exceptiongroup==1.2.0
fastapi==0.110.0
fastrlock==0.8.2
filelock==3.13.3
frozenlist==1.4.1
fsspec==2024.3.1
greenlet==3.0.3
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.22.2
idna==3.6
interegular==0.3.3
isort==5.13.2
jinja2==3.1.3
joblib==1.3.2
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
lark==1.1.9
llvmlite==0.42.0
markupsafe==2.1.5
mpmath==1.3.0
msgpack==1.0.8
nest-asyncio==1.6.0
networkx==3.2.1
ninja==1.11.1.1
numba==0.59.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu12==12.1.105
openai==1.14.3
outlines==0.0.37
packaging==24.0
pandas==2.2.1
prometheus-client==0.20.0
protobuf==5.26.1
psutil==5.9.8
pydantic==2.6.4
pydantic-core==2.16.3
pydantic-settings==2.2.1
pynvml==11.5.0
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.1
pyyaml==6.0.1
ray==2.10.0
referencing==0.34.0
regex==2023.12.25
requests==2.31.0
rpds-py==0.18.0
ruff==0.3.4
safetensors==0.4.2
scipy==1.12.0
sentencepiece==0.2.0
six==1.16.0
sniffio==1.3.1
sqlalchemy==2.0.29
sqlmodel==0.0.16
starlette==0.36.3
sympy==1.12
tokenizers==0.15.2
torch==2.1.2
tqdm==4.66.2
transformers==4.39.2
triton==2.1.0
typing-extensions==4.10.0
tzdata==2024.1
urllib3==2.2.1
uvicorn==0.29.0
uvloop==0.19.0
vllm==0.3.3
watchfiles==0.21.0
websockets==12.0
xformers==0.0.23.post1
@youkaichao this issue happens the same way in a Docker container.
I don't know if uv is supported by vllm (most likely not). I would recommend using conda instead.
@youkaichao uv uses virtualenv under the hood, so do you mean only conda is supported for the vllm library?
I would say conda is the most tested, and I wouldn't be surprised if virtualenv or uv does not work for vllm.
Okay, is anyone trying to run vllm in a Docker setting?
Most of the time I encounter An error occurred: NCCLBackend is not available. Please install cupy. when initializing the LLM instance in a Docker container.
First, I suggest you switch to conda; the problem might be improper package management, with a dependency like cupy ending up corrupted.
Second, which version of vllm do you use? We recently removed the cupy dependency, and also released v0.4.0. You can try the new version.
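After upgrading, one quick sanity check (a generic suggestion, not from this thread) is to confirm that the environment actually serving requests picked up the new package:
# Verify which vllm the active environment resolves to; after upgrading
# this should report 0.4.0 or newer.
import vllm
print(vllm.__version__)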
@youkaichao okay, will try new version.
@sigridjineth Just curious - why not just run the vllm API server as opposed to rebuilding your own?
The API server code you have written is not the right way to use the LLM class. In your /generate method, you are creating a whole new instance of an LLM, which loads the model weights from disk, runs the profiler step to see how much memory there is, allocates the full KV cache, etc. Since this happens for every request passed to /generate, each request will take a very long time :)
The way our API server works is that we load the model weights from disk, run the profiler step, and allocate the full KV cache once, then reuse that state at inference time. If you really do need to build an API server yourself rather than using the interfaces we provide, I would suggest looking at vllm/entrypoints/api_server.py for inspiration on how to do things properly.
But you should have a very good reason for remaking this yourself.
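To make the pattern concrete, here is a minimal sketch of loading the engine once and reusing it per request (the model name and request schema are placeholders, and the bundled server actually uses the async engine, so this only illustrates the load-once idea):
# Minimal sketch: construct the LLM once at startup, so weight loading,
# memory profiling, and KV-cache allocation happen only once.
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()
llm = LLM(model="facebook/opt-125m")  # placeholder model name

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 128

@app.post("/generate")
def generate(req: GenerateRequest):
    # Per request, only generate() is called on the already-initialized engine.
    sampling_params = SamplingParams(max_tokens=req.max_tokens)
    outputs = llm.generate([req.prompt], sampling_params)
    return {"text": outputs[0].outputs[0].text}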
Hey @sigridjineth, regarding your "stuck init" issue, how are you starting your container? Are you by any chance running the container using SageMaker or Vertex AI? In any case, I would guess that you are probably lacking shared memory for inter-GPU communication, so if you start Docker directly, run it with --shm-size="SOME_SIZEgb". Also, make sure the container has enough storage for downloading the model shards; using vLLM you can do:
model = LLM(..., download_dir="/dev/shm/cache/some_sub_dir_name_if_you_wish")
and, if it still fails, before you load the model, add:
import os
import ray

ray_tmp_dir = "/dev/shm/tmp/ray"
os.makedirs(ray_tmp_dir, exist_ok=True)
ray.init(_temp_dir=ray_tmp_dir, num_gpus=model_config.tensor_parallel_size)
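A quick way to confirm from inside the container that --shm-size actually took effect (a small generic check, not from the thread; Docker's default is only 64 MB, which is far too small for NCCL and Ray with multiple GPUs):
# Report how much shared memory the container actually has available.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {total / 1e9:.1f} GB total, {free / 1e9:.1f} GB free")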
We have added documentation for this situation in #5430. Please take a look.
Your current environment
A100 x 8, Ubuntu
🐛 Describe the bug
Hello, I am trying to run vllm inference behind a FastAPI server, but it gets stuck at Using model weights format ['*.safetensors']. Is anyone experiencing such a case? The code I am using is like the below.