cc @njhill
Yeah, we've fixed this issue on our fork (as you found here). Let me create a PR to contribute the fix upstream.
I was able to work around this bug by building the latest Triton code from source.
It seems that Triton main has updated the related code; I don't have more time to dive into this update. For now, I believe #6140 can serve as a temporary solution to this issue. Once vLLM updates its Triton version, we can revisit it.
@randxie Interesting. I actually tried to test these changes that were merged into Triton main in our fork, but it didn't help. I don't really see much else that has changed in the meantime (at least in python/triton/runtime/cache.py), so I'm wondering how it could be fixed upstream.
Agree that getting the fix from Triton is the long-term solution though.
@tdoublep I believe there were more changes involved. If you check the log above, Triton tried to load files with names matching the pattern *.tmp.pid_*, but when I went into the container, those files had not been generated. I haven't spent time figuring out the exact change; I just want to provide a data point that the latest Triton has fixed this issue.
Removing the entire Triton cache directory before every vLLM run can be a temporary workaround; this is what I do when running DeepSeek-V2. It may not always work, though.
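For reference, a minimal sketch of that workaround, assuming the default cache location (~/.triton/cache) unless TRITON_CACHE_DIR points elsewhere:

```python
# Sketch: wipe the Triton kernel cache before launching vLLM so no stale or
# partially written cache files from a previous run are picked up.
import os
import shutil

cache_dir = os.environ.get("TRITON_CACHE_DIR",
                           os.path.expanduser("~/.triton/cache"))
if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)

# ...then start vLLM as usual, e.g.
#   python -m vllm.entrypoints.openai.api_server --model <model> --tensor-parallel-size 8
```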
There was a PR merged into Triton yesterday that tries to address this issue: https://github.com/triton-lang/triton/pull/4295.
This fix is not yet included in triton==3.0.0, which was released on PyPI yesterday.
So I've been digging into this a bit more and here is a summary of my findings:
- Triton recently released v3.0.0, but it does not seem to include the fix for this issue.
- Nevertheless, multiple people have reported, before yesterday, that recent Triton nightlies resolve the issue (or similar issues).
- I can't really understand why any Triton nightly before that PR would actually be safe w.r.t this issue (see the nice diagram from the PR to understand why). Everything is confounded by the fact this is a race condition and not deterministically reproducible.
- Therefore, in my view, to be fully protected from this issue we will need to pull in whatever Triton version comes after 3.0.0, which requires waiting for torch and the other dependencies to upgrade to it.
- I would therefore propose that we proceed to merge [Bugfix] Add custom Triton cache manager to resolve MoE MP issue #6140 in the meantime (a rough sketch of the approach is included after this list).
- Main remaining question is whether this issue can potentially also affect Ray deployments.
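For anyone curious, the gist of that fix is to give each worker process its own Triton cache directory, so concurrent compilations in the multiprocessing backend can no longer race on the same cache files. Below is a simplified sketch of the idea, not the exact code from the PR; the class name is illustrative and the override/dump code paths of FileCacheManager are ignored:

```python
# Simplified sketch: a Triton cache manager whose cache directory is suffixed
# with the current PID, so sibling worker processes never share cache files.
import os

from triton.runtime.cache import FileCacheManager, default_cache_dir


class PerProcessCacheManager(FileCacheManager):  # illustrative name
    def __init__(self, key, override=False, dump=False):
        self.key = key
        # Give every process its own cache root by appending the PID.
        cache_root = os.environ.get("TRITON_CACHE_DIR", default_cache_dir())
        cache_root = f"{cache_root}_{os.getpid()}"
        self.cache_dir = os.path.join(cache_root, self.key)
        self.lock_path = os.path.join(self.cache_dir, "lock")
        os.makedirs(self.cache_dir, exist_ok=True)
```

Triton picks a custom manager up from the TRITON_CACHE_MANAGER environment variable (a "module.path:ClassName" string), which has to be set in each worker process before any kernel is compiled.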
Firstly, I strongly agree with merging #6140.
Regarding "multiple people have reported (https://github.com/vllm-project/vllm/issues/6103#issuecomment-2209298536), before yesterday, that recent Triton nightlies resolve the issue (or similar issues)": this bug is occasional. In my experience with MoE models, it occurs more often with TP=8, typically during the CUDA graph capture stage. IMHO, it's possible that the error simply hasn't been reproduced again rather than actually being addressed.
Just for context, I got the same bug with the latest vLLM server v0.5.1 on mistralai/Mixtral-8x22B-Instruct-v0.1:
python -u -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x22B-Instruct-v0.1 --dtype auto --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --swap-space 4 --download-dir /data/vllm-data --max-seq-len-to-capture 8192 --host 0.0.0.0 --port 8079 --max-model-len 16384
Reproduced 6 out of 7 times on the first request to the server.
Any updates on this?
Fix #6140 is ready from my pov, will try to get it approved and merged asap.
Your current environment
🐛 Describe the bug
Description
#5036 will raise the following error when using TP>1. Similar issues also occurred when I tested the Qwen1.5-MoE model, due to the use of the fused_moe_kernel Triton kernel.
Possible solutions
I believe it's a Triton bug. I debugged the Triton code, investigated the related issue, and found that the cause of this issue is well explained here. Until Triton officially addresses it, we may need to resolve it in a similar manner. Another approach is to set distributed_executor_backend to ray when using the Triton kernel, as sketched below.
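A minimal sketch of that Ray-based workaround (the model name and tensor-parallel size are placeholders, and this assumes a vLLM version that exposes the distributed_executor_backend engine argument):

```python
# Sketch: run the MoE model with the Ray executor instead of the default
# multiprocessing executor, so worker processes do not race on the shared
# Triton compilation cache. Model name and TP size are placeholders.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",
    tensor_parallel_size=2,
    distributed_executor_backend="ray",
)
print(llm.generate("Hello, world!"))
```

The OpenAI-compatible server takes the same setting as a CLI flag (--distributed-executor-backend ray).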
Reproducible Code