[Open] janphilippfranken opened this issue 3 months ago
There is a pending PR trying to address this problem: https://github.com/vllm-project/vllm/pull/5355.
Meanwhile, you can try the chunked prefill feature which worked for me as a workaround: https://docs.vllm.ai/en/latest/models/performance.html#chunked-prefill.
Would you mind sharing your code? Let's say I have `n_prompts=10` and set `prompt_logprobs=0`; I'd ideally get the logprobs for all 10 prompts from a single call to `model.generate(prompts=prompts, sampling_params=sampling_params)`.
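A minimal sketch of the usage described in this comment (the model name and prompts below are placeholders, not the original code):

```python
from vllm import LLM, SamplingParams

model = LLM(model="mistralai/Mistral-7B-v0.1")        # placeholder model
prompts = [f"example prompt {i}" for i in range(10)]  # n_prompts = 10

# prompt_logprobs=0 requests only the logprob of each actual prompt token;
# max_tokens=1 because the goal is scoring the prompts, not generating.
sampling_params = SamplingParams(prompt_logprobs=0, max_tokens=1, temperature=0.0)

outputs = model.generate(prompts=prompts, sampling_params=sampling_params)
for out in outputs:
    # out.prompt_logprobs is aligned with the prompt tokens; the first
    # entry is None because the first token has no preceding context.
    print(out.prompt_logprobs)
```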
Something like this: `model = LLM(..., enable_chunked_prefill=True, max_num_batched_tokens=512, gpu_memory_utilization=0.9)`. Try smaller values of `gpu_memory_utilization` and/or `max_num_batched_tokens` if you still see OOM.
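Spelled out, that workaround might look like the following sketch (the model name is a placeholder, and the numbers can be tuned down further):

```python
from vllm import LLM

# Chunked prefill processes long prompts in pieces of at most
# max_num_batched_tokens tokens per step, which can keep peak memory
# during prefill (and logprob computation) lower.
model = LLM(
    model="mistralai/Mistral-7B-v0.1",  # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=512,
    gpu_memory_utilization=0.9,
)
# generate() is then called exactly as before; only the engine config changes.
```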
Your current environment
🐛 Describe the bug
I have a standard setup like:
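(Representative sketch; the model name and parameters are placeholders, not the reporter's exact setup.)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-v0.1",  # placeholder model
    tensor_parallel_size=1,
    gpu_memory_utilization=0.9,
)

# A large batch of long prompts (placeholder content).
prompts = ["some very long prompt ..."] * 256
```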
And running a function like:
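(Continuing the placeholder sketch above.)

```python
def run_batch(llm, prompts):
    # Default sampling: prompt_logprobs is None, so no per-prompt-token
    # logprobs are requested.
    sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
    return llm.generate(prompts=prompts, sampling_params=sampling_params)

outputs = run_batch(llm, prompts)
```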
This works fine with very long prompts and a very large batch size.
However, as soon as I do something like
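(Same placeholder sketch, with the single change described below.)

```python
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256,
    prompt_logprobs=1,  # also return logprobs for every prompt token
)
outputs = llm.generate(prompts=prompts, sampling_params=sampling_params)
```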
I.e., with prompt_logprobs=1 instead of the default (None), I immediately get OOM for the exact same prompts, which does not make sense to me. It should just return the logprobs in addition to the generations and not otherwise affect anything. My OOM error:
Note that the above works if len(prompts) == 1.