Closed: pseudotensor closed this issue 3 months ago.
Thanks for reporting this. We have resolved the issue with:
This will be in the next release of vLLM (ideally this week). You can use the nightlies to unblock yourself for now.
@pseudotensor Just to confirm, did building from source (main) work for you? I'm running into the same error at runtime with pretty much the same setup as yours.
Yes, I built a docker image from source about 4 days ago. It seems I used commit 3eeb148f467e3619e8890b1a5ebe86a173f91bc9.
We are seeing the same error when using the Llama3.1-70B-Instruct model. Am I correct in assuming that it will be fixed for the 70B model as well?
@robertgshaw2-neuralmagic We are encountering the same issue when serving Llama-3.1-70B-Instruct-FP8 on 2xH100. I can reproduce it consistently once the number of concurrent requests reaches 256, with all engine arguments at their defaults except tensor parallel size, which is set to 2. Do you think it could be an edge case that remains even after the fix for 405B?
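For reference, this is roughly the kind of client we use to drive that concurrency; a minimal sketch only, with the endpoint URL, served model name, and prompt as placeholders rather than our exact setup:

```python
# Minimal load sketch: keep 256 chat-completion requests in flight against a
# vLLM OpenAI-compatible server. Endpoint, model name, and prompt are placeholders.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def one_request(i: int) -> None:
    resp = await client.chat.completions.create(
        model="vllm-model",
        messages=[{"role": "user", "content": f"Write a short note about item {i}."}],
        max_tokens=100,
    )
    print(i, resp.usage.completion_tokens if resp.usage else "?")


async def main() -> None:
    # 256 requests in flight per round; repeat to keep sustained pressure on the server.
    for _ in range(100):
        await asyncio.gather(*(one_request(i) for i in range(256)))


asyncio.run(main())
```

This only needs the `openai` Python package on the client side and does not depend on vLLM itself.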
We are also having this issue with Qwen-32B-Instruct-FP8.
Error: Failed to initialize the TMA descriptor 700
INFO 10-17 12:26:04 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241017-122604.pkl...
WARNING 10-17 12:26:04 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
WARNING 10-17 12:26:04 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
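As an aside, in this trace the dump could not be written because of the CUDA error, but when such a dump does get written it can be inspected with a short script along these lines. This is only a sketch: the path pattern is taken from the log above, and unpickling needs vLLM importable in the environment since the dump contains vLLM-internal objects.

```python
# Sketch: load the most recent failed-execution dump written by vLLM's model
# runner (path pattern from the log above). Requires vLLM installed so that
# pickle can reconstruct the internal dataclasses.
import glob
import pickle

paths = sorted(glob.glob("/tmp/err_execute_model_input_*.pkl"))
if not paths:
    print("no dump files found")
else:
    with open(paths[-1], "rb") as f:
        dump = pickle.load(f)
    print(type(dump))
    print(dump)
```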
What version of vllm are you running?
As for the vLLM version, we are using 0.6.2 and 0.6.3, and we're seeing the same issue with both. Thanks.
Can you share reproduction instructions?
Hi, we are not sure how to reliably reproduce this error. If you can suggest what instructions or information would help, we are happy to gather it. In our case, we started the openai_server and sent data through it; it could take days, or only a few hours, before we saw this exception.
Thanks.
Please see the attached file for the error log.
@robertgshaw2-neuralmagic This is the command I use to spin up the vLLM server:
docker run -tid --gpus "device=4,5" --shm-size 10g \
  -v /mnt/data/models:/models --ulimit nofile=65535:65535 \
  --name vllm-v0.6.2-llama3.1-70b-instruct-128k-pre-fp8 \
  --network benchmark-network vllm/vllm-openai:v0.6.2 \
  --model=/models/Meta-Llama-3.1-70B-Instruct-FP8 \
  --tensor-parallel-size=2 --served-model-name=vllm-model \
  --port=8080 --disable-log-requests
Then, within the benchmark-network bridge network, I just spin up a Locust server with 128 users constantly sending requests to the above vLLM server. Each request has an input prompt of around 100 tokens, and max_tokens is 100 as well.
Most of the time, the server crashes with the above error within 30 seconds.
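For completeness, the Locust side is nothing special; a minimal sketch of what the locustfile looks like, with the prompt text as a placeholder and the served model name matching the docker command above:

```python
# Minimal locustfile sketch: users (set via -u on the locust CLI) hammer the
# vLLM OpenAI-compatible chat endpoint with ~100-token prompts and max_tokens=100.
from locust import HttpUser, constant, task


class ChatUser(HttpUser):
    wait_time = constant(0)  # no think time, keep requests back to back

    @task
    def chat(self) -> None:
        self.client.post(
            "/v1/chat/completions",
            json={
                "model": "vllm-model",
                "messages": [
                    {"role": "user", "content": "Summarize the benefits of unit testing in about 100 tokens."}
                ],
                "max_tokens": 100,
            },
        )
```

Run headless with something like `locust -f locustfile.py --headless -u 128 -r 128 --host http://<vllm-container-name>:8080`, where the host points at the container on the same bridge network.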
Hi @robertgshaw2-neuralmagic, do you have any updates on this issue? It is currently preventing us from serving FP8 models.
Thank you,
Your current environment
latest docker image
🐛 Describe the bug
Complete logs
llama31-405b.log.zip