vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Truncated response -- repro code #2464

Closed · pseudotensor closed this 5 months ago

pseudotensor commented 10 months ago

We noticed Mixtral behaving oddly and narrowed it down to a (maybe) 100% repro on 0.2.7. The script is in the zip file; just replace base_url's FILLIN with your endpoint.

testmixnew1.py.zip

Mixtral was run like:

export CUDA_HOME=/usr/local/cuda-12.3
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cu123"
pip install git+https://github.com/vllm-project/vllm.git
pip install mosaicml-turbo --upgrade
pip install git+https://github.com/stanford-futuredata/megablocks.git
pip install fschat==0.2.34
export CUDA_VISIBLE_DEVICES=6,7

python -m vllm.entrypoints.openai.api_server --port=5002 --host=0.0.0.0 --model mistralai/Mixtral-8x7B-Instruct-v0.1 --seed 1234 --tensor-parallel-size=2 --max-num-batched-tokens=163840

The output is:

 The Commonwealth Bank of Australia (CBA) reported strong financial results for the first half of fiscal year 2

This is bad output compared to normal, as it is truncated. The server says it was a normal stop, but I don't believe it.
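For reference, the reported stop reason can be checked directly on the response object. Here is a minimal sketch against the same endpoint (the base_url, prompt, and model name below are placeholders to fill in):

from openai import OpenAI

# Placeholder endpoint; point this at the vLLM OpenAI-compatible server.
client = OpenAI(base_url='FILLME', api_key='EMPTY')

response = client.completions.create(
    model='mistralai/Mixtral-8x7B-Instruct-v0.1',
    prompt='<s>[INST] ...same prompt as in the attached script... [/INST]',
    max_tokens=1024,
    temperature=0,
)

choice = response.choices[0]
# finish_reason is 'length' when max_tokens is hit; 'stop' means the model
# emitted an end-of-sequence token on its own, i.e. a "normal" stop.
print(choice.finish_reason, repr(choice.text))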

The prompt we used is a bit odd in order to repro what we see with normal prompts, so ignore that aspect.

There are several \u escape sequences in the text, and I'm worried that they lead to the premature stop.

pseudotensor commented 10 months ago

Here's a simpler repro that happens about 90% of the time.

prompt_llm = """<s>[INST] In order to write a concise single-paragraph summary, pay attention to the following text:

\"\"\"
 The Commonwealth Bank of Australia (CBA) reported strong financial results for the first half of fiscal year 2023, with a statutory net profit after tax of AUD 5.216 billion, up 10% from the same period last year. Cash net profit after tax stood at AUD 5.153 billion, a 9% increase. Operating performance also improved by 18% to AUD 7.820 billion. The bank's home and consumer lending gross lending reached AUD 77 billion, while business and corporate lending gross lending amounted to AUD 18 billion. CBA's net promoter scores (NPS) remained high, with the bank ranking first in the consumer, business, and institutional categories. The bank's liquid assets and deposit funding increased, and its weighted average maturity stood at 5.8 years. CBA's CET1 ratio was 11.4%, and it declared a dividend per share of AUD 2.10 (35 cents). However, the bank warned that forward-looking statements should be treated with caution due to current economic uncertainties and geopolitical risks.
\"\"\"
Using only the text above, write a condensed and concise summary of key results (preferably as one paragraph):
 [/INST]"""

base_url = 'FILLME'
base_model = 'mistralai/Mixtral-8x7B-Instruct-v0.1'
api_key = 'EMPTY'
stream_output = False

client_kwargs = dict(model=base_model,
                     max_tokens=1024,
                     temperature=0,
                     stream=stream_output)

from openai import OpenAI

# Point the OpenAI client at the vLLM OpenAI-compatible server
openai_client = OpenAI(base_url=base_url, api_key=api_key)
client = openai_client.completions
client_kwargs.update(dict(prompt=prompt_llm))
responses = client.create(**client_kwargs)
text = responses.choices[0].text
print(text)

gives:

 The Commonwealth Bank of Australia (CBA) announced robust financial results for the first half of fiscal year 2

pseudotensor commented 10 months ago

Until I see otherwise, I'm going to assume the strict model-card format, with a space between <s> and [INST], is required as they say, unless Mistral ships models that drop the space. With that change, these particular cases do not have issues. Will re-open if I see others.
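Concretely, the change is just the space after the BOS string. A quick sketch of the two variants (the actual summary prompt is elided), plus a way to see that they tokenize differently:

from transformers import AutoTokenizer

# What I was sending before: no space between <s> and [INST]
bad_prompt = "<s>[INST] ...text to summarize... [/INST]"
# What the model card shows: a space after <s>
good_prompt = "<s> [INST] ...text to summarize... [/INST]"

tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
# The token ids diverge right after the BOS token, which is presumably
# why generation behaves differently downstream.
print(tok(bad_prompt, add_special_tokens=False).input_ids[:6])
print(tok(good_prompt, add_special_tokens=False).input_ids[:6])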

palazski commented 9 months ago

I am experiencing the same thing with OpenHermes 2.5 Mistral 7B AWQ. Fixing the chat template (I was applying ChatML by hand; I switched to tokenizer.apply_chat_template) didn't seem to help. Does anyone have a fix?
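For context, this is roughly how I build the prompt now (a sketch; the model id is a placeholder for the AWQ checkpoint I'm serving):

from transformers import AutoTokenizer

# Placeholder; substitute the actual OpenHermes 2.5 Mistral 7B AWQ repo id.
tok = AutoTokenizer.from_pretrained("MODEL_ID_HERE")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the document in one paragraph."},
]

# Let the tokenizer's built-in chat template emit the ChatML string
# instead of writing <|im_start|>/<|im_end|> markers by hand.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)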

vibhuagrawal14 commented 9 months ago

@pseudotensor can you please reopen this issue? I too am facing this with Mixtral. I'm trying to generate JSON, and it often gets truncated, always ending at the character "2", just like in your case (while trying to generate years like 2023 and 2024).

vibhuagrawal14 commented 9 months ago

@WoosukKwon very interesting/maddening bug!

pseudotensor commented 9 months ago

Sure, I re-opened it. I agree it's unlikely that the prompt change should have mattered so much.

chemrahul82 commented 8 months ago

@vibhuagrawal14 I am seeing exactly the same bug: while writing years or dates, it stops at 2. This is with the Mixtral model. @pseudotensor: any fixes or suggestions? Thanks.

chemrahul82 commented 8 months ago

@pseudotensor fixing the spacing between the BOS string and [INST] does appear to have fixed the issue. Thanks.

zoltan-fedor commented 7 months ago

I actually needed to use a double space between the BOS (<s>) and [INST] for it to work, although in my case the response was truncated at numbers other than 2.

abhibisht89 commented 7 months ago

We are facing this issue as well. Is there any workaround?

gmittal commented 7 months ago

Adding a space between BOS and [INST] fixes this issue for us as well.

simon-mo commented 7 months ago

It sounds like a fix is needed in the chat template: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/blob/1e637f2d7cb0a9d6fb1922f305cb784995190a83/tokenizer_config.json#L42

Here's a fix, https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/discussions/176/files, but we're waiting on the Mistral team.

In the meantime, you can load the fixed version of the chat template into vLLM: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#chat-template
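For example, something along these lines (the template filename is a placeholder for wherever you save the corrected template):

# Save the corrected Jinja chat template locally, then point the
# OpenAI-compatible server at it instead of the model's built-in one.
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --chat-template ./mixtral_fixed_template.jinja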

keskival commented 7 months ago

See also this discussion: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/discussions/182. The spaces almost completely fix the issue, but not entirely. It seems to arise from the Mistral training corpus, which likely includes corrupted files.