turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Reply is too short #216

Closed hengjiUSTC closed 11 months ago

hengjiUSTC commented 11 months ago

Hey, I tried to load models like https://huggingface.co/TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ, but the reply is quite short. Is there any parameter that limits length?

(Screenshot: 2023-08-02, 2:15 PM)

I changed these parameters but the replies don't seem to get any longer. I'm curious about a few things:

  1. Is this a problem with the model?
  2. Should I change the prompt to make the LLM reply longer?
  3. Are there any other parameters to change in ExLlama?
EyeDeck commented 11 months ago

Usually it's mostly a prompt issue. LLaMA will imitate whatever pattern it sees, and if all there is in context is

Chatbot: <one line>
User: <one line>
Chatbot: <one line>
etc

it's gonna keep imitating that pattern basically forever.

For the built-in ExLlama chatbot UI, I tried an experiment to see if I could gently break the model out of that specific pattern here: https://github.com/turboderp/exllama/pull/172. I find it works pretty well. It tends to work best to bump the min tokens slider up a little at a time until the model starts producing a more desirable length, then just turn the slider off.

Technically it's possible to ban the stop token completely, so the model has literally no choice but to keep going, but output quality suffers very heavily in my experience.
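
For illustration, the general idea behind a minimum-length constraint looks roughly like this (a rough sketch, not the actual code from #172; the function name, tensor shapes, and temperature default are all made up):

```python
import torch

def sample_with_min_tokens(logits, eos_token_id, tokens_generated, min_tokens, temperature=0.8):
    """Mask the stop token until `min_tokens` have been generated, then sample normally."""
    logits = logits.clone()                   # don't modify the caller's logits
    if tokens_generated < min_tokens:
        logits[eos_token_id] = float("-inf")  # stop token is off the table for now
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```

A hard floor like this still skews the sampling a little, which is why bumping the slider up gradually and then turning it off tends to work better than banning the stop token outright.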

turboderp commented 11 months ago

"Max tokens" is just the limit at which the generator cuts off the response. It doesn't affect the generation in any way.

"Chunk tokens" is the number of tokens it makes room for when truncating the context. E.g. if you have a sequence length of 2000 and a chunk size of 100, it will truncate the context to at most 1900 tokens before it starts generating the response. Then, if the response grows longer than 100 tokens, it will truncate the context again at that point, to 1900 tokens, and continue like that until a stop condition is met or until the total response reaches the "max tokens" limit.

You'll want to keep the chunk size fairly low so you're not discarding more of the context than you need to, but at the same time if it's too low you'll have frequent pauses, since truncating the context is relatively expensive (you have to reevaluate the entire thing from the beginning every time).
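
For what it's worth, the truncation scheme described above can be sketched roughly like this (illustration only, not the actual generator code; `model.next_token` and `model.eos_token_id` are hypothetical stand-ins):

```python
def truncate_context(context_ids, max_seq_len, chunk_tokens):
    # Keep the most recent tokens, leaving `chunk_tokens` of headroom for generation.
    return context_ids[-(max_seq_len - chunk_tokens):]

def generate_response(model, context_ids, max_seq_len=2000, chunk_tokens=100, max_tokens=512):
    response = []
    context = truncate_context(list(context_ids), max_seq_len, chunk_tokens)  # e.g. 2000 -> 1900
    while len(response) < max_tokens:          # "max tokens" just cuts the response off here
        if len(context) >= max_seq_len:
            # The response has used up the headroom: truncate again. This is the
            # expensive step, since the whole context gets re-evaluated.
            context = truncate_context(context, max_seq_len, chunk_tokens)
        token = model.next_token(context)      # hypothetical single-step generation call
        if token == model.eos_token_id:        # a stop condition was met
            break
        context.append(token)
        response.append(token)
    return response
```

With those numbers, a larger chunk size means fewer re-evaluations but more context thrown away at each truncation, which is the trade-off described above.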

The best way to get longer responses is to use a model and a prompt that likes to talk a lot. Disable "end on newline" if you want multi-line responses, and keep in mind that, as EyeDeck points out, the model is just trying to predict the continuation of some text. If the text looks like a chat between two people going "what's up?", "not much, hbu?", "same", "ok"... then the continuation of that chat will be more of the same. Different models/finetunes will have preferences for shorter or longer responses, but once a context starts to build up, one way or another, it always tends to override those preferences.

You could also try to edit some of the responses to "prime" the conversation in the direction you want.

dspasyuk commented 11 months ago

The problem you are seeing is that "end on newline" is enabled by default in the code, which is a terrible idea I must say. Took me an hour to figure this out. Also, the number of tokens in example_chatbot.py is set to only 256. So to fix it, uncheck "End on Newline" in the web UI, and in example_chatbot.py use the -nnl option to disable that and change max_response_tokens to something like 1024.

turboderp commented 11 months ago

> which is a terrible idea I must say

I find it's literally impossible to create a set of default settings that are a perfect fit for more than 1% of users or use cases. I have enough experience by now to say that I would be getting the same feedback if it was disabled by default instead, just from the other half of the users.

The chatbot example is just that, an example of how the generator can be used to create streaming output, how various stopping conditions can be implemented and so on. And short of having Clippy pop up to offer helpful suggestions, I think the web UI does what it can to make the available settings pretty apparent. (?)

Anyway, trying to shorten the list of issues, so I'll close for now.