Closed pangr closed 9 months ago
@pangr why did you close it? I still have this issue
@pangr why did you close it? I still have this issue
@pangr ah, sorry. It's not exllama 2 repo...
It's really up to the model to decide when it's done talking. First thing is to make sure the model actually has some concept of a user-assistant exchange in the first place, and that it predicts an EOS token at the end of the assistant's reply. This all depends on how the model was finetuned and how you're prompting it.
Llama2, the base model, doesn't know any particular format for interacting with users, and while it may be able to follow along sometimes, there aren't any guarantees. It will try to find patterns in the context and repeat those patterns, and although it can recognize very advanced patterns sometimes like `User: {question}\nAssistant: {factual answer to that specific question}" it really needs finetuning if you want it to work reliably as a artificial intelligence, especially if you want special tokens in very specific places.
In lieu of finetuning, you can still prime a base model it with multiple examples to try to teach it, in-context, what the format of the exchange is. Think of the base model as a universal text predictor, which is really good at guessing what comes next in a sequence but has no idea what you actually want from it if it isn't obvious from the prompt you give it. Examples can help, e.g.:
USER: Hello.</s>
ASSISTANT: Hi, I'm a virtual assistant. How may I help you</s>
USER: What's the tallest building in Paris?</s>
ASSISTANT: The tallest building in Paris is the Eiffel Tower.</s>
USER: Thanks!</s>
ASSISTANT: You're welcome. Is there anything else I can help you with?</s>
USER: {first actual user prompt goes here}</s>
ASSISTANT: {generation starts from here...}
This of course still has some problems. For one, the </s>
token might already have a very special meaning to the base model, so you can end up confusing it. For instance, it might have never seen anything in its pretraining that followed an EOS token, or it might have only ever seen <s>
as the next token, so anything else has a zero probability and you end up with undefined behavior.
Personally I find it better not to rely on the EOS token for that reason, if you're working with a base model. I'd use something like the above, but without the tokens instead relying on the string "USER:" as the stop condition. That tends to be much more reliable.
Keep in mind that the language model doesn't know that it's playing the role of "ASSISTANT:" and will happily predict more questions from "USER:" if you let it. It may also decide that the most logical continuation is something like:
...
ASSISTANT: Is there anything else I can help you with?
USER: Not right now, thanks.
ASSISTANT: Thank you for using our services.
END
Here we see an example of a successful interaction between a random user and our AI assistant. As is evident, the new version is more polite than previous versions, though also less verbose. We discuss this tradeoff further in section 4.
Section 3: Safety and alignment
...
Because that's very conceivably like some of the model's pretraining data, and it's an entirely plausible way to continue the prediction. For more well-defined behavior you'd want to use a finetuned model with a specific prompt format, then stick to that format very carefully.
@turboderp i found why the 2nd exllama was not outputting the eos token. Because of this:
settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])
When I removed it, everything started working as expected.
In our application deployment dialog, there will be a newline in the output, so we cannot use a newline as the end character, and can only use the EOS id as the end identifier. However, using this framework, no matter what the input and parameters are, the EOS id cannot be output.