Open tf-ninja opened 5 days ago
As mentioned in issue #9417, it works with `--enforce-eager`
+1
> As mentioned in issue #9417, it works with `--enforce-eager`

I'm having the same issue. Running with `--enforce-eager` fixes it for now.
Your current environment
vLLM 0.6.3
Model Input Dumps
The input is a long context of over 8k tokens.
🐛 Describe the bug
vLLM 0.6.2 does not have this bug.
We are running vLLM 0.6.3 with speculative decoding. When we send a long context (over 8k tokens) to the model, the output is truncated and we get incomplete answers. The command we are using has the following shape:
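A minimal sketch of such a launch, assuming the `vllm serve` OpenAI-compatible entrypoint; the target model, draft model, and flag values here are placeholders, not the actual configuration from this report:

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --speculative-model Qwen/Qwen2.5-0.5B-Instruct \
    --num-speculative-tokens 5 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.9 \
    --port 8000
# Per issue #9417 and the comments above, appending --enforce-eager
# works around the truncation.
```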
We then run vLLM 0.6.3 without speculative decoding, but we still get incomplete or repeated answers. The command we use has the same shape, minus the speculative-decoding flags:
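Again a placeholder sketch, not the exact command from the report:

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.9 \
    --port 8000
```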
How we call the vLLM model is shown below.
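A sketch of the client side, assuming the server launched as above and the OpenAI Python client; the endpoint, model name, input file, and sampling parameters are assumptions, not the original call:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the api_key is unused locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Long-context input (over 8k tokens) that triggers the truncated output.
with open("long_context.txt") as f:
    long_context = f.read()

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": long_context}],
    max_tokens=1024,
    temperature=0,
)
# On v0.6.3 this comes back truncated or with repeated text.
print(response.choices[0].message.content)
```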