tlrmchlsmth closed this issue 1 month ago.
Might be related to https://github.com/vllm-project/vllm/issues/4108
Actually, the previous MoE illegal memory access error was also caused by slicing input tensors. Not sure what is happening underneath...
This looks like a red herring, but with #6788 we can at least try using TORCH_CUDA_SANITIZER=1 to investigate the underlying issue.
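Both this report and the earlier MoE case involve slicing input tensors. As background only, not as a diagnosis of this issue: in PyTorch, basic slicing such as block_tables[:batch_size] returns a view over the same memory as the original tensor rather than a copy. The CPU-only sketch below illustrates that aliasing; the tensor shape, batch_size, and the shares_storage helper are hypothetical illustrations, not vLLM's code. If one CUDA stream writes the full buffer while another reads the slice without synchronization, both kernels touch the same allocation, which is the kind of access pattern a stream sanitizer is designed to flag.

```python
import torch


def shares_storage(a: torch.Tensor, b: torch.Tensor) -> bool:
    # Two tensors alias if they are backed by the same underlying allocation.
    return a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()


# Hypothetical padded block table: rows = sequences, columns = KV-cache block ids.
block_tables = torch.zeros(8, 4, dtype=torch.int32)
batch_size = 3

sliced = block_tables[:batch_size]          # basic slicing: a view, no copy
copied = block_tables[:batch_size].clone()  # explicit copy into fresh memory

block_tables[0, 0] = 42  # write through the full buffer

print(shares_storage(sliced, block_tables))  # True
print(int(sliced[0, 0]))                     # 42: the write is visible through the view
print(shares_storage(copied, block_tables))  # False
print(int(copied[0, 0]))                     # 0: the copy was taken before the write
```

Swapping a slice for a .clone() is one way to check whether aliasing matters, at the cost of an extra allocation and copy; it does not by itself show where the race comes from.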
Your current environment
🐛 Describe the bug
I'm debugging some hard-to-repro illegal memory accesses that are happening while running fp8 llama3 405b.
I am running the following command.
If I set CUDA_LAUNCH_BLOCKING=1 and TORCH_CUDA_SANITIZER=1, I get warnings about a possible data race in the following log. This points to the following code: https://github.com/vllm-project/vllm/blob/5689e256baf0c45148a01ad147abf11ad82c9690/vllm/worker/model_runner.py#L1137-L1152
Not sure why the slicing of block_tables would be an issue here.
If you don't set CUDA_LAUNCH_BLOCKING=1, you may see an error like the following: