vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Error: Failed to initialize the TMA descriptor 700 for LLaMa 3.1 405B on 8*H100 -- prefill error? #6870

Closed pseudotensor closed 3 months ago

pseudotensor commented 3 months ago

Your current environment

latest docker image

docker stop llama31-405b  ; docker remove llama31-405b
docker pull vllm/vllm-openai:latest
docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=0,1,2,3,4,5,6,7"' \
    --shm-size=10.24gb \
    -p 5020:5020 \
    -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:$HOME/.cache/ -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/  \
    --network host \
    --name llama31-405b \
    vllm/vllm-openai:latest \
        --port=5020 \
        --host=0.0.0.0 \
        --model=meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
        --seed 1234 \
        --tensor-parallel-size=8 \
        --max-log-len=100 \
        --max-model-len=65536 \
        --max-num-batched-tokens=512 \
        --max_num_seqs=16 \
        --gpu-memory-utilization 0.98 \
        --enable_chunked_prefill=True \
        --enforce-eager \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.llama31_405b2.txt
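
Once the server is up, traffic goes to its OpenAI-compatible API on port 5020. As a rough sketch of the kind of request involved (the prompt here is an arbitrary placeholder, not the actual failing input):

curl http://localhost:5020/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
          "messages": [{"role": "user", "content": "Summarize the history of the Apollo program in three paragraphs."}],
          "max_tokens": 512
        }'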

🐛 Describe the bug

Complete logs

llama31-405b.log.zip

e.g.


Error: Failed to initialize the TMA descriptor 700
[rank6]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 6] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7e095a9b5897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7e095a965b25 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7e095aa8d718 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7e095bc8a8e6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7e095bc8e9e8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7e095bc9405c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7e095bc94dcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7e09a774bdf4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7e09a880d609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7e09a8947353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
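
The trace itself points at the usual next debugging step: rerunning with CUDA_LAUNCH_BLOCKING=1 so the failing kernel is reported at the correct call site. With the invocation above, that amounts to one extra environment flag (a sketch only; it helps localize the error rather than avoid it):

    # added alongside the other -e flags in the docker run command above
    -e CUDA_LAUNCH_BLOCKING=1 \
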
robertgshaw2-neuralmagic commented 3 months ago

Thanks for reporting this. We have resolved the issue with:

This will be in the next release of vllm (ideally this week). You can use the nightlies to unblock yourself for now.

hsubbaraj commented 3 months ago

@pseudotensor just to confirm, did building from source (main) work for you? I'm running into the same error at runtime with pretty much the same setup as yours.

pseudotensor commented 3 months ago

Yes, I built a docker image from source about 4 days ago. It seems I used commit 3eeb148f467e3619e8890b1a5ebe86a173f91bc9.
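
For anyone following along, a rough sketch of building the OpenAI-server image from source at that commit, assuming the vllm-openai target in the repository's Dockerfile (the tag name here is arbitrary):

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 3eeb148f467e3619e8890b1a5ebe86a173f91bc9
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai:3eeb148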

soodrohit commented 1 month ago

We are seeing the same error when using the Llama-3.1-70B-Instruct model. Am I correct in assuming that this will be fixed for the 70B model as well?

YouNeedCryDear commented 1 month ago

@robertgshaw2-neuralmagic We are encountering the same issue when serving Llama-3.1-70B-Instruct-FP8 with 2xH100. I can reproduce it consistently when the number of concurrent requests goes up to 256, with all engine arguments at their defaults except tensor parallel size, which is set to 2. Do you think this could be an edge case that remains even after the fix for 405B?

chapter544 commented 4 weeks ago

We are also having this issue with Qwen-32B-Instruct-FP8

Error: Failed to initialize the TMA descriptor 700
INFO 10-17 12:26:04 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241017-122604.pkl...
WARNING 10-17 12:26:04 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
WARNING 10-17 12:26:04 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

robertgshaw2-neuralmagic commented 4 weeks ago

@robertgshaw2-neuralmagic We are encountering the same issue when serving Llama-3.1-70B-Instruct-FP8 with 2xH100. I can reproduce it consistently when the number of concurrent requests goes up to 256, with all engine arguments at their defaults except tensor parallel size, which is set to 2. Do you think this could be an edge case that remains even after the fix for 405B?

What version of vllm are you running?

chapter544 commented 4 weeks ago

As for vllm version, we are using 0.6.2 and 0.6.3, and we're having the same issue with both versions. Thanks.

robertgshaw2-neuralmagic commented 4 weeks ago

As for vllm version, we are using 0.6.2 and 0.6.3, and we're having the same issue with both versions. Thanks.

Can you share reproduction instructions?

chapter544 commented 4 weeks ago

Hi, we are not sure how to reliably reproduce this error. If you can provide some instructions or hints, we are happy to gather the information. In our case, we started the openai server and sent data through it. It could take days, or only a few hours, before we saw this exception.

Thanks.

Please see the attached file for the error log.

vllm-error-log-10-17-2024.txt

YouNeedCryDear commented 4 weeks ago

@robertgshaw2-neuralmagic This is the command I use to spin up the vLLM server:

docker run -tid --gpus \"device=4,5\" --shm-size 10g \
    -v /mnt/data/models:/models \
    --ulimit nofile=65535:65535 \
    --name vllm-v0.6.2-llama3.1-70b-instruct-128k-pre-fp8 \
    --network benchmark-network \
    vllm/vllm-openai:v0.6.2 \
    --model=/models/Meta-Llama-3.1-70B-Instruct-FP8 \
    --tensor-parallel-size=2 \
    --served-model-name=vllm-model \
    --port=8080 \
    --disable-log-requests

Then, within the benchmark bridge network, I spin up a locust server with 128 users constantly sending requests to the vLLM server above. Each request uses around 100 tokens as the input prompt, and max_tokens is 100 as well. Most of the time the server crashes with the above error within 30 seconds.
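
As a reproduction aid, a rough shell sketch of an equivalent load generator using plain curl instead of locust; the base URL, prompt text, and model name are assumptions based on the description above:

# run from another container attached to benchmark-network; the hostname is the
# vLLM container's name and the port matches --port above (an assumption)
BASE_URL=http://vllm-v0.6.2-llama3.1-70b-instruct-128k-pre-fp8:8080
PROMPT="Write a short story about a benchmark that never finishes."  # placeholder standing in for a ~100-token prompt
for i in $(seq 1 128); do
  (
    # each background loop acts like one "user" sending requests back-to-back
    while true; do
      curl -s "$BASE_URL/v1/completions" \
        -H "Content-Type: application/json" \
        -d "{\"model\": \"vllm-model\", \"prompt\": \"$PROMPT\", \"max_tokens\": 100}" \
        > /dev/null
    done
  ) &
done
wait  # runs until interrupted; stop with Ctrl-C or by killing the background jobs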

chapter544 commented 3 weeks ago

Hi @robertgshaw2-neuralmagic, do you have any updates on this issue? It is preventing us from serving FP8 models.

Thank you,