torch.cuda.OutOfMemoryError: CUDA out of memory when setting prompt_logprobs=1 for a large batch

Open · qaz-wsx-1 opened 4 months ago

Your current environment

🐛 Describe the bug

torch.cuda.OutOfMemoryError: CUDA out of memory is raised when setting prompt_logprobs=1 for a large batch.
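A minimal sketch of the kind of call that can trigger this, assuming the vLLM offline LLM API — the model name and workload are placeholders, not taken from the report; only prompt_logprobs=1 and the large batch match the description:

```python
# Hypothetical minimal reproduction (placeholder model and workload).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model

params = SamplingParams(
    max_tokens=1,
    prompt_logprobs=1,  # request a logprob for every prompt token
)

# A large batch of long prompts: the per-token logprob tensors for all
# prompts are materialized together, which can exhaust GPU memory.
prompts = ["word " * 2000] * 256  # placeholder workload
outputs = llm.generate(prompts, params)  # can raise torch.cuda.OutOfMemoryError
```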
Hi, same problem here, any idea how to fix it?
It is known that requesting prompt_logprobs can cause a spike in memory usage and lead to a crash. OOMs due to logprobs are being tracked in https://github.com/vllm-project/vllm/issues/5907.

There is no fix yet, but the current workaround is to use --enable-chunked-prefill. This reduces the memory spike by limiting the number of tokens processed in a batch, though chunked prefill is not yet supported in combination with all other parameter configurations. See the sketch below.
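For reference, a minimal sketch of the workaround with the offline LLM API — the enable_chunked_prefill and max_num_batched_tokens engine arguments are the Python-side counterparts I'd expect for the --enable-chunked-prefill server flag; model, limits, and workload are placeholders:

```python
# Sketch of the suggested workaround, assuming the offline LLM API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder model
    enable_chunked_prefill=True,       # chunk prefills to cap tokens per step
    max_num_batched_tokens=512,        # optional: tighten the per-step budget
)

params = SamplingParams(max_tokens=1, prompt_logprobs=1)
outputs = llm.generate(["word " * 2000] * 256, params)  # placeholder workload
print(outputs[0].prompt_logprobs[:5])  # logprobs for the first prompt tokens
```

Smaller prefill chunks mean the logprob tensors are materialized for fewer tokens at a time, which is why this tends to smooth out the memory spike rather than eliminate the cost.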