Closed: NicolasDrapier closed this issue 2 months ago.
For hardware-related problems, the general answer is to ask your admin and vendor.
Thank you @youkaichao for your answer.
As I mentioned in my initial post, I'm confident that the hardware isn't the issue, as all stress tests ran for over 10 hours each without any problems (using gpuburn and dcgm).
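(For reference, typical invocations of these tools look like the following; the durations and flags shown are illustrative, not necessarily the exact ones from my runs.)

```bash
# Sustained full-load GPU stress test with gpu-burn
# (https://github.com/wilicc/gpu-burn); the argument is the
# run duration in seconds (10 hours here).
./gpu_burn 36000

# Extended NVIDIA DCGM diagnostics (-r 3 selects the long test suite).
dcgmi diag -r 3
```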
I'm simply asking whether a potentially corrupted file could be causing errors in vLLM, which might then trigger a kernel panic or something similar (I'm unable to pinpoint the exact trigger, and I'm not entirely certain that a kernel panic is in fact the cause).
I suggest this because I’ve encountered this issue with mistralai/Mistral-Large-Instruct-2407 and meta-llama/Meta-Llama-3.1-70B-Instruct, but the problem does not occur with other models I use, such as Qwen/CodeQwen1.5-7B-Chat and microsoft/Phi-3-mini-128k-instruct.
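One way to rule out a corrupted download is to force a clean re-fetch of the model weights. A sketch using the Hugging Face CLI (the model name here is one of those mentioned above; the cache path assumes the default Hugging Face cache location):

```bash
# Force a clean re-download of the full model snapshot, bypassing
# any cached (possibly corrupted) files.
huggingface-cli download mistralai/Mistral-Large-Instruct-2407 --force-download

# Alternatively, remove the cached snapshot and let vLLM fetch it
# again on the next startup.
rm -rf ~/.cache/huggingface/hub/models--mistralai--Mistral-Large-Instruct-2407
```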
Have there been any similar cases observed? Is there a protocol or method that I could follow to help identify what might be causing this issue?
I haven't seen similar cases, but you might be interested in https://docs.vllm.ai/en/latest/getting_started/debugging.html; it lists some tools you can experiment with to find more clues.
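For example, that page (as of the version linked) suggests enabling more detailed logging and tracing via environment variables; a minimal sketch:

```bash
# More verbose vLLM logging.
export VLLM_LOGGING_LEVEL=DEBUG
# Report CUDA errors at the offending kernel launch instead of asynchronously.
export CUDA_LAUNCH_BLOCKING=1
# Verbose NCCL logs (useful when running with tensor parallelism).
export NCCL_DEBUG=TRACE
# Trace every vLLM function call; extremely verbose, for crashes/hangs only.
export VLLM_TRACE_FUNCTION=1
```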
After enabling much more verbose server logging, it turned out that the power supplies are undersized.
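For anyone hitting something similar: correlating GPU power draw with the PSU sensor readings is one way to confirm this. A sketch, assuming NVIDIA GPUs and an IPMI-accessible BMC (tool availability on your system is an assumption):

```bash
# Log per-GPU power draw against the enforced power limit, once per second.
nvidia-smi --query-gpu=index,power.draw,power.limit --format=csv -l 1

# Read the BMC's power-supply sensor readings (requires ipmitool
# and access to the BMC).
ipmitool sdr type "Power Supply"
```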
Thank you for your help, @youkaichao.
Your current environment
How would you like to use vllm
Description:
I've noticed some strange behavior when using vllm. After running a few requests on the server, my hardware sometimes crashes completely; this includes all four physical power supplies shutting off. I followed the hardware testing procedures as outlined in this guide, and all tests passed successfully.
However, there are instances where, even when all four power supplies don't shut off, one or two of them will physically turn off. I tried running the same workload with text-generation-inference, and while I did not encounter a full server crash, some of the power supplies still tripped.

The command:
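(The exact command was not preserved in this thread; the following is a hypothetical invocation for one of the models mentioned above, with flags that are assumptions for illustration only.)

```bash
# Hypothetical invocation; the actual command from the report
# was not preserved in this thread.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4
```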
My Questions: