[Open] servient-ashwin opened this issue 3 months ago
@servient-ashwin Could you please reduce the gpu-memory-utilization
to 0.9 (the default) or 0.95? Because vLLM's memory profiling is not 100% accurate, setting gpu-memory-utilization too high
may lead to OOMs when there is extra memory usage that is not captured in the memory profiling.
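For example, with the launch command from this issue, that would be:

```bash
python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.3 --download-dir /tmp/ --port 8006 --tensor-parallel-size 1 --gpu-memory-utilization 0.9
```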
Got it. The reason we set it to 1 was that the A10G has just enough memory (24 GB) to load the model mentioned (full precision, non-quantized). At 0.9
the model wouldn't load with its full context length.
@WoosukKwon I tried all the combinations of memory utilization, including the one you suggested, and I continue to see this error with long-form contexts.
At this point I am unable to figure out the root cause of the memory leak behind this OOM, beyond my observations around request lengths, GPU usage, and token-generation latencies. What I'd like to know as a stop-gap is whether there is any way to hot reload
the current server on a CUDA OOM.
Since there could be a variety of reasons for OOM errors, I also came across NVIDIA's Compute Sanitizer toolkit, but that feels like a superficial solution to this issue.
What can I do to implement a hot reload of the model server on OOM for now?
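The crude stop-gap I have in mind (just a sketch, not a real hot reload) is an external watchdog that periodically sends a tiny probe request and relaunches the server once requests start failing, along these lines:

```bash
#!/usr/bin/env bash
# Naive watchdog sketch: relaunch the vLLM OpenAI server whenever a small probe request fails.
# Not a hot reload -- in-flight requests are lost and the model is reloaded from scratch.

start_server() {
  python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --download-dir /tmp/ --port 8006 \
    --tensor-parallel-size 1 --gpu-memory-utilization 1 &
  SERVER_PID=$!
}

probe() {
  # Tiny completion request; -f makes curl fail on HTTP errors, -m bounds a hung server.
  curl -sf -m 30 http://localhost:8006/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mistralai/Mistral-7B-Instruct-v0.3", "prompt": "ping", "max_tokens": 1}' \
    > /dev/null
}

start_server
while true; do
  sleep 60
  if ! probe; then
    echo "probe failed, restarting server..." >&2
    kill "$SERVER_PID" 2>/dev/null
    wait "$SERVER_PID" 2>/dev/null
    start_server
  fi
done
```

I probe an actual completion rather than a health endpoint because, in my case, the process stays up and keeps accepting requests; it just returns CUDA OOM errors for every one of them.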
Your current environment
🐛 Describe the bug
When requests are too large and too frequent, I get CUDA OOM errors. That part is a user/application concern, i.e. how requests are shaped before they reach the server.
However, after that first OOM, every subsequent request, regardless of its size, also fails with a CUDA OOM error unless the server is restarted. Is there a way to soft reload when you hit OOM errors, or any other way this could be handled, since restarting the server every time this happens is not practical?
Other steps that have already been tried include reducing GPU memory utilization, adjusting timeouts, and changing context lengths, but they feel like stop-gaps for the underlying GPU memory issue.
Steps to reproduce
vLLM 0.5.1
NVIDIA-SMI 550.54.14, Driver Version: 550.54.14, CUDA Version: 12.4
NVIDIA A10G
python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.3 --download-dir /tmp/ --port 8006 --tensor-parallel-size 1 --gpu-memory-utilization 1 &
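A long-context request like the following (illustrative only; the real traffic is long documents and the exact lengths vary) is enough to trigger the first OOM, after which even small requests keep failing:

```bash
# Build an artificially long prompt (placeholder text; real requests carry long documents).
PROMPT=$(printf 'This is filler text to pad the prompt. %.0s' $(seq 1 4000))

curl http://localhost:8006/v1/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"mistralai/Mistral-7B-Instruct-v0.3\", \"prompt\": \"$PROMPT\", \"max_tokens\": 1024}"
```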