vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai

Error when prompt_logprobs + enable_prefix_caching #3251

Open bgyoon opened 7 months ago

bgyoon commented 7 months ago
  File "vllm/model_executor/layers/sampler.py", line 98, in forward
    logits.div_(sampling_tensors.temperatures.unsqueeze_(dim=1))
RuntimeError: The size of tensor a (5) must match the size of tensor b (117) at non-singleton dimension 0
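For illustration, the failing line boils down to an in-place broadcast like the sketch below (a minimal standalone reproduction with the shapes taken from the error message above, not the actual vLLM code path):

```python
import torch

vocab_size = 32000

# Only the 5 non-cached prompt tokens produced logits in this step...
logits = torch.randn(5, vocab_size)

# ...but the sampling tensors were built for all 117 prompt tokens.
temperatures = torch.full((117,), 0.8)

# Same pattern as sampler.py line 98; raises:
# RuntimeError: The size of tensor a (5) must match the size of tensor b (117)
# at non-singleton dimension 0
logits.div_(temperatures.unsqueeze_(dim=1))
```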

I think the problem is that the first 112 prompt tokens (16 × 7 blocks) are prefix-cached, so only the last 5 input tokens are actually computed. To return prompt logprobs, the sampler expects logits for all 117 prompt tokens, but only the 5 freshly computed logits are available there. It seems the logits for the 112 cached tokens would need to be returned as well, but I don't know how...
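For reference, a minimal script along these lines should hit the same code path (the model name and prompts are placeholders; the relevant parts are enabling prefix caching, requesting prompt_logprobs, and sending a second request that shares a long prefix so some blocks are served from the cache):

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching so shared prompt blocks are reused across requests
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

# prompt_logprobs asks the sampler for logprobs of every prompt token
params = SamplingParams(temperature=0.8, max_tokens=16, prompt_logprobs=1)

long_prefix = "word " * 200  # long enough to span several KV-cache blocks

# First request populates the prefix cache.
llm.generate([long_prefix + "first question"], params)

# Second request reuses the cached prefix blocks; only the suffix tokens are
# recomputed, which is where the logits/sampling-tensor shape mismatch appears.
llm.generate([long_prefix + "second question"], params)
```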

DouHappy commented 7 months ago

Same error.

thefirebanks commented 6 months ago

Same error here!