turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Why can't the llama2 model output EOS id? #295

Closed · pangr closed this 9 months ago

pangr commented 9 months ago

In our deployed application, the assistant's dialog output can contain newlines, so we cannot use a newline as the stop character and have to rely on the EOS id as the end-of-reply marker. However, with this framework, no matter what the inputs and parameters are, the EOS id is never output.

USER:Hello ASSISTANT:
Hello, how can I assist you today? 
 everyone 
Nice to meet you! What is your name? 
Assistant 
My name is Sarah. 
Sarah, what can I help you with today? 
Assistant 
Just checking in to see how your day is going. Is there anything specific you
kopyl commented 8 months ago

@pangr why did you close it? I still have this issue

kopyl commented 8 months ago

@pangr ah, sorry. It's not exllama 2 repo...

turboderp commented 8 months ago

It's really up to the model to decide when it's done talking. The first thing is to make sure the model actually has some concept of a user-assistant exchange in the first place, and that it predicts an EOS token at the end of the assistant's reply. This all depends on how the model was finetuned and how you're prompting it.

Llama2, the base model, doesn't know any particular format for interacting with users, and while it may be able to follow along sometimes, there aren't any guarantees. It will try to find patterns in the context and repeat those patterns, and although it can sometimes recognize very advanced patterns like `User: {question}\nAssistant: {factual answer to that specific question}`, it really needs finetuning if you want it to work reliably as an artificial intelligence, especially if you want special tokens in very specific places.

In lieu of finetuning, you can still prime a base model with multiple examples to try to teach it, in-context, what the format of the exchange is. Think of the base model as a universal text predictor, which is really good at guessing what comes next in a sequence but has no idea what you actually want from it if it isn't obvious from the prompt you give it. Examples can help, e.g.:

USER: Hello.</s>
ASSISTANT: Hi, I'm a virtual assistant. How may I help you?</s>
USER: What's the tallest building in Paris?</s>
ASSISTANT: The tallest building in Paris is the Eiffel Tower.</s>
USER: Thanks!</s>
ASSISTANT: You're welcome. Is there anything else I can help you with?</s>
USER: {first actual user prompt goes here}</s>
ASSISTANT: {generation starts from here...}
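As an aside, if you assemble such a prompt in code, one way to make sure the `</s>` markers actually become EOS tokens (rather than being tokenized as plain text) is to append the EOS token id explicitly after each turn. A minimal sketch, assuming a tokenizer that exposes `encode()` (returning a 1 x n tensor of ids) and `eos_token_id`, as exllama's tokenizer does, and that `tokenizer` is already loaded:

```python
import torch

def build_fewshot_prompt(turns, tokenizer):
    """Concatenate (speaker, text) turns into one id tensor, appending a real
    EOS token after every turn instead of the literal string "</s>"."""
    pieces = []
    for speaker, text in turns:
        ids = tokenizer.encode(f"{speaker}: {text}")  # assumed shape (1, n); add "\n" if your template puts turns on separate lines
        pieces.append(ids)
        pieces.append(torch.tensor([[tokenizer.eos_token_id]], dtype=torch.long))
    return torch.cat(pieces, dim=-1)

# Usage: prime with the example exchange, then append
# "USER: {actual prompt}</s>" and a bare "ASSISTANT:" (no EOS after it)
# so generation starts inside the assistant's turn.
turns = [
    ("USER", "Hello."),
    ("ASSISTANT", "Hi, I'm a virtual assistant. How may I help you?"),
    ("USER", "What's the tallest building in Paris?"),
    ("ASSISTANT", "The tallest building in Paris is the Eiffel Tower."),
]
prompt_ids = build_fewshot_prompt(turns, tokenizer)
```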

This of course still has some problems. For one, the </s> token might already have a very special meaning to the base model, so you can end up confusing it. For instance, it might have never seen anything in its pretraining that followed an EOS token, or it might have only ever seen <s> as the next token, so anything else has a zero probability and you end up with undefined behavior.

Personally I find it better not to rely on the EOS token for that reason, if you're working with a base model. I'd use something like the above, but without the </s> tokens, instead relying on the string "USER:" as the stop condition. That tends to be much more reliable.
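For reference, a stop-on-string loop might look roughly like the sketch below. It assumes exllama's token-by-token generator interface (`gen_begin()` / `gen_single_token()` and the `sequence` tensor), as used in the repo's example scripts; exact names may differ slightly between versions:

```python
def generate_until(generator, tokenizer, prompt_ids, stop_str="USER:", max_new_tokens=256):
    # Feed the prompt, then sample one token at a time and watch the decoded
    # continuation for the stop string.
    generator.gen_begin(prompt_ids)
    prompt_len = prompt_ids.shape[-1]
    text = ""
    for _ in range(max_new_tokens):
        token = generator.gen_single_token()
        text = tokenizer.decode(generator.sequence[0, prompt_len:])
        if stop_str in text:                        # model started a new "USER:" turn
            return text[: text.index(stop_str)].rstrip()
        if token.item() == tokenizer.eos_token_id:  # EOS, if the model does emit one
            break
    return text
```

Re-decoding the whole continuation on every step is a bit wasteful, but it keeps the sketch simple and avoids problems with stop strings that span token boundaries.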

Keep in mind that the language model doesn't know that it's playing the role of "ASSISTANT:" and will happily predict more questions from "USER:" if you let it. It may also decide that the most logical continuation is something like:

...
ASSISTANT: Is there anything else I can help you with?
USER: Not right now, thanks.
ASSISTANT: Thank you for using our services.
END

Here we see an example of a successful interaction between a random user and our AI assistant. As is evident, the new version is more polite than previous versions, though also less verbose. We discuss this tradeoff further in section 4.

Section 3: Safety and alignment
...

Because that looks very much like some of the model's pretraining data, and it's an entirely plausible way to continue the prediction. For more well-defined behavior you'd want to use a finetuned model with a specific prompt format, then stick to that format very carefully.
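As one concrete example (a sketch, not something this thread depends on): the official Llama-2-chat finetunes expect the [INST] template below, and deviating from it tends to hurt both the answers and the EOS behavior. Other finetunes use different templates, so check the model card:

```python
def llama2_chat_prompt(system_msg, user_msg):
    # Template used by Meta's Llama-2-chat finetunes. The BOS token (<s>) is
    # normally added by the tokenizer, so it is not included in the string.
    return (
        f"[INST] <<SYS>>\n{system_msg}\n<</SYS>>\n\n"
        f"{user_msg} [/INST]"
    )

prompt = llama2_chat_prompt(
    "You are a helpful assistant.",
    "What's the tallest building in Paris?",
)
```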

kopyl commented 8 months ago

@turboderp I found why the 2nd exllama (exllamav2) was not outputting the EOS token. Because of this: settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])

When I removed it, everything started working as expected.
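For anyone hitting the same thing in exllamav2: the fix is simply not to mask out the EOS token in the sampler settings. A minimal sketch, based on the call quoted above (attribute names as in the exllamav2 examples; adjust for your version):

```python
from exllamav2.generator import ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

# Don't do this if you want the model to be able to stop on its own:
# settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])
# Masking the EOS id gives it zero probability, so it can never be sampled.
```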