elabz closed this issue 5 months ago.
One more detail: the Docker container was started as
docker run -d \
--shm-size=10.24gb \
--gpus '"device=2"' \
-v /data/models:/root/.cache/huggingface \
--env "HF_TOKEN=ma_token" \
-p 8000:8000 \
--restart unless-stopped \
--name vllm-openai \
vllm/vllm-openai \
--host 0.0.0.0 \
--port 8000 \
--model=astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit \
--enforce-eager \
--dtype=half \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size=1
I am closing this myself. The issue appears to be in loading the model: the Python script never gets to the Uvicorn section, so the actual server on port 8000 never starts, even though the Python script is running.
Hello! I have the same issue... Did you solve it? I am very interested if you have any solution :) Thank you!
Hello. I am facing a similar issue. Is there a solution for this? Thanks
TL;DR: The model is still being downloaded. You need to wait until
INFO: Uvicorn running on socket ('0.0.0.0', 8000) (Press CTRL+C to quit)
appears in the logs before you can call the server.
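A quick way to watch for that line (a minimal sketch, assuming the container is named vllm-openai as above and grep is available on the host):
docker logs -f vllm-openai 2>&1 | grep -m1 "Uvicorn running"
Since grep -m1 exits on the first match, this also works as a blocking wait in scripts.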
I originally tried this with Mistral. I did notice my network usage was really high, but didn't think much of it.
docker run --network=host --runtime=nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-v0.1 \
--tokenizer_mode "mistral"
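One way to avoid the long silent startup is to pre-download the weights into the mounted cache before launching the container. A sketch, assuming huggingface_hub's CLI is installed on the host (and HF_TOKEN is set if the repo is gated):
huggingface-cli download mistralai/Mistral-7B-v0.1
Because ~/.cache/huggingface is bind-mounted into the container, the server then loads from the local copy instead of pulling everything over the network at startup.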
However, I then tried another, smaller model, bloom-560m. Same high network usage, but it eventually dropped to ~0 and the logs made progress. I was able to test the model and get a response.
docker run --runtime=nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model bigscience/bloom-560m
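(Once the network usage drops to ~0, the weights should be sitting in the mounted cache; assuming the standard huggingface_hub layout, ls ~/.cache/huggingface/hub should show a models--bigscience--bloom-560m directory, and subsequent container starts skip the download.)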
curl -X POST http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
"model": "bigscience/bloom-560m",
"prompt": "Once upon a time",
"max_tokens": 50,
"temperature": 0.7
}'
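If you are scripting against the server, you can poll for readiness instead of watching the logs. A sketch, assuming the image exposes vLLM's /health endpoint:
until curl -sf http://localhost:8000/health; do sleep 5; done
curl -sf suppresses output and treats connection failures and non-2xx responses as errors, so the loop only exits once the server is actually serving.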
@rmr-code @paulinergt
Your current environment
Please note this is inside a Docker container built locally, so collect_env.py was run as:
docker exec -it vllm-openai sh -c '/usr/bin/python3 /vllm-workspace/collect_env.py'
Verify the Python process is working inside the container:
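(For example, docker top vllm-openai from the host lists the container's processes without needing ps installed inside the image.)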
Install net-tools (for netstat):
docker exec -it vllm-openai sh -c 'apt-get update && apt-get install -y net-tools'
Verify which ports are open in the container:
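With net-tools in place, something like the following lists the listening TCP sockets and the owning processes:
docker exec -it vllm-openai netstat -tlnp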
It listens, just not on 8000, despite the --port 8000 directive in the command. Docker has port 8000 open and forwarded:
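For reference, the host-side mapping can be confirmed with:
docker port vllm-openai
which should print 8000/tcp -> 0.0.0.0:8000 for the -p 8000:8000 flag used above.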
According to the container logs, the model appears to be loading (which is also confirmed by running nvtop and seeing VRAM usage and GPU activity):
docker logs -f vllm-openai
🐛 Describe the bug
Trying to get any output from the API results in a "Connection reset by peer" error:
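For example, even a simple request (the endpoint here is illustrative; any request behaves the same way):
curl http://localhost:8000/v1/models
fails with something like curl: (56) Recv failure: Connection reset by peer.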
Any help or nudge in the right direction will be greatly appreciated!