File "vllm/model_executor/layers/sampler.py", line 98, in forward
logits.div_(sampling_tensors.temperatures.unsqueeze_(dim=1))
RuntimeError: The size of tensor a (5) must match the size of tensor b (117) at non-singleton dimension 0
I think the problem is that the first 112 prompt tokens (7 blocks of 16) are served from the prefix cache, so only the last 5 input tokens are actually computed. To return the prompt logprobs, the sampler expects logits for all 117 prompt tokens, but only the 5 newly computed logits are passed to it. It seems the logits for the 112 cached tokens would need to be returned as well, but I don't know how to do that.
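To make the numbers concrete, here is a rough sketch of the arithmetic behind the mismatch. It is plain Python, not vLLM code: `split_prompt` and the variable names are hypothetical, and the assumption that the sampler holds one temperature row per prompt token is inferred only from the sizes in the error message.

```python
def split_prompt(num_prompt_tokens: int, block_size: int, num_cached_blocks: int):
    """Return (cached, computed) token counts for a partially prefix-cached prompt.

    Hypothetical helper for illustration; not part of vLLM.
    """
    cached = block_size * num_cached_blocks
    if cached > num_prompt_tokens:
        raise ValueError("cached prefix cannot be longer than the prompt")
    computed = num_prompt_tokens - cached
    return cached, computed


# Numbers taken from the report: 117 prompt tokens, 7 cached blocks of 16 tokens.
cached, computed = split_prompt(num_prompt_tokens=117, block_size=16, num_cached_blocks=7)

# Assumption: the forward pass yields one logits row per computed token,
# while the sampler expects one row per prompt token for prompt logprobs.
logits_rows = computed
temperature_rows = 117

shapes_match = logits_rows == temperature_rows
print(cached, computed, shapes_match)
```

If that assumption holds, the in-place division in the traceback fails because a 5-row tensor cannot broadcast against a 117-row one along dimension 0.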