vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: vllm 0.6.3 generates incomplete/repeated answers for long length (over 8k) input #9448

Open tf-ninja opened 5 days ago

tf-ninja commented 5 days ago

Your current environment

vllm 0.6.3

Model Input Dumps

The input is a long context with over 8k tokens.

🐛 Describe the bug

  1. vllm 0.6.2 does not have this bug.

  2. We are running vllm 0.6.3 with speculative decoding. When we feed a long context (over 8k tokens) into the model, the output is truncated and the answer is incomplete. The command we are using is:

    python -m vllm.entrypoints.openai.api_server  --host 0.0.0.0  --port 8083  --model /home/downloaded_model/Llama-3.2-3B-Instruct/  --speculative_model /home/downloaded_model/Llama-3.2-1B-Instruct/  --served-model-name  LLM  --tensor-parallel-size 8  --max-model-len 34336  --max-num-seqs 128  --enable-prefix-caching --disable-log-requests --use-v2-block-manager --seed 42 --num_speculative_tokens 5  --gpu_memory_utilization 0.95  --spec-decoding-acceptance-method typical_acceptance_sampler
  3. We then run vllm 0.6.3 without speculative decoding, but we still get incomplete or repeated answers. The command we use is:

    python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8083 --model /home/downloaded_model/Llama-3.2-3B-Instruct/ --served-model-name  LLM --tensor-parallel-size 8 --max-model-len 34336 --max-num-seqs 128 --enable-prefix-caching --enable_chunked_prefill --disable-log-requests --seed 42 --gpu_memory_utilization 0.95
  4. We call the vLLM model as shown below; a minimal usage sketch follows the function:

    import openai

    # API_KEY and BASE_URL point at the vLLM OpenAI-compatible server started above.
    def call_vllm_api(message_log):
        vllm_client = openai.OpenAI(api_key=API_KEY, base_url=BASE_URL)

        response = vllm_client.chat.completions.create(
            model="LLM",
            messages=message_log,
            max_tokens=4096,
            temperature=0.2,
            presence_penalty=0,
            frequency_penalty=0,
        )

        response_content = response.choices[0].message.content

        return response_content
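
For context, a minimal sketch of how we build message_log and call the helper; the document string here is only a placeholder, not our actual 8k+ token input:

    # Hypothetical example input; our real prompt is a document of over 8k tokens.
    long_document = "..."  # placeholder for the long context
    message_log = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the following document:\n" + long_document},
    ]

    answer = call_vllm_api(message_log)
    print(answer)  # with vllm 0.6.3 this comes back truncated or repeated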
tf-ninja commented 4 days ago

As mentioned in issue #9417, it works with --enforce-eager.
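
For anyone who wants to try the workaround, this is a sketch of the non-speculative serving command from above with --enforce-eager appended (it disables CUDA graph capture, so expect some throughput cost):

    python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8083 --model /home/downloaded_model/Llama-3.2-3B-Instruct/ --served-model-name LLM --tensor-parallel-size 8 --max-model-len 34336 --max-num-seqs 128 --enable-prefix-caching --enable_chunked_prefill --disable-log-requests --seed 42 --gpu_memory_utilization 0.95 --enforce-eager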

yudian0504 commented 1 day ago

+1

Jason-CKY commented 1 day ago

> As mentioned in issue #9417, it works with --enforce-eager.

I'm having the same issue. Running with --enforce-eager fixes this issue for now.