turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

generate_simple still having issues with eos_token_id #385

Closed: shensmobile closed this issue 3 weeks ago

shensmobile commented 3 months ago

I'm trying to get a simple inference script running to test exllamav2 outside of text-generation-webui. In oobabooga everything appears to run essentially perfectly, but I'm not getting anywhere near the same results from a standalone script.

generate_simple() appears to always generate to the full length of max_new_tokens. I had to remove "settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])" from the settings configuration. I see that generate_simple() does respect the EOS token now (there was an earlier issue, https://github.com/turboderp/exllamav2/issues/9, where turboderp suggested manually setting a stop condition on the generator, but that no longer appears to be relevant).

It's "working" right now with the above change (IE it does stop early), but I'm wondering if I'm setting up inference code incorrectly. My code is essentially:

model_directory = <path to mixtral-instruct>
config = ExLlamaV2Config()
config.model_dir = model_directory
config.max_input_len=8192
config.prepare()
model = ExLlamaV2(config)

cache = ExLlamaV2Cache_Q4(model, lazy = True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.5
settings.top_k = 50
settings.top_p = 0.9
settings.min_p = 0.05
settings.token_repetition_penalty = 1.15
#settings.disallow_tokens(tokenizer, [tokenizer.eos_token_id])

output = generator.generate_simple(prompt, settings, 128)

The reason I'm questioning my own sanity right now is that the results from this script and from oobabooga are often quite different. I'm asking mixtral-instruct to follow tightly guardrailed instructions, and the output from oobabooga is nearly perfect, while from my script it's really hit or miss. That's why I feel like my inference script is missing something critical (which could also be what's causing the generator to output the full 128 tokens).

Also, is there a way for generate_simple to only output the response, and not include the prompt? I know I can manually remove the prompt, but it seems bizarre for generate_simple to always return the full text, including the original prompt.

turboderp commented 3 months ago

You're initializing the ExLlamaV2Config object a little incorrectly. max_input_len is the maximum length of a single forward pass, not the total context length. For the total context length you want to change max_seq_len after calling prepare(), which loads all the default settings (including context length) from config.json. If you do want to increase max_input_len for more efficient prompt processing, you should also set max_attention_size to 8192**2. I think what you probably want is:

config = ExLlamaV2Config()
config.model_dir = model_directory
config.prepare()
config.max_seq_len=8192
model = ExLlamaV2(config)

With the latest version (dev) this also simplifies to:

config = ExLlamaV2Config(model_directory)
config.max_seq_len=8192
model = ExLlamaV2(config)
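
And if you also want the larger forward-pass chunk for faster prompt processing, a rough sketch of that variant (just applying the max_input_len / max_attention_size note above) would be:

config = ExLlamaV2Config(model_directory)
config.max_seq_len = 8192               # total context length
config.max_input_len = 8192             # tokens processed per forward pass
config.max_attention_size = 8192 ** 2   # attention scratch space sized for the larger chunk
model = ExLlamaV2(config)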

As for why the results are unreliable compared to ooba, it's probably down to prompt formatting. Mixtral expects a Llama-style prompt like <s>[INST] <<SYS>>\nYou are a chatbot etc...\n<</SYS>>\n\nWhat's the deal with dogs? [/INST]. To get the <s> added properly you'll want to either include it in the prompt and set encode_special_tokens = True in the call to generate_simple, or you can set add_bos = True and it will be prepended when tokenizing.
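
For instance, the first route would look roughly like this (just a sketch of the encode_special_tokens path; the full add_bos version is in the example below):

# <s> is written into the prompt text, and encode_special_tokens = True tells the
# tokenizer to encode it as the BOS token rather than as literal characters.
prompt = "<s>[INST] <<SYS>>\nYou are a chatbot etc...\n<</SYS>>\n\nWhat's the deal with dogs? [/INST]"
output = generator.generate_simple(prompt, settings, 128, encode_special_tokens = True)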

The choice of whether to return the completed text or just the completion is arbitrary. There are cases where you'd want one and cases where you'd want the other. I'll think about adding an option to control it.

turboderp commented 3 months ago

FWIW, here's an example of how you could use Mixtral with correct formatting:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

model_directory = "/mnt/str/models/mixtral-8x7b-instruct-exl2/3.5bpw/"
config = ExLlamaV2Config()
config.model_dir = model_directory
config.prepare()
config.max_seq_len = 8192
model = ExLlamaV2(config)

cache = ExLlamaV2Cache_Q4(model, lazy = True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.5
settings.top_k = 50
settings.top_p = 0.9
settings.min_p = 0.05
settings.token_repetition_penalty = 1.15

system_prompt = "You are a helpful AI assistant."
question = "Why is there no atmosphere on the Moon?"
prompt = f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{question} [/INST]"

output = generator.generate_simple(prompt, settings, 256, add_bos = True)  # add_bos prepends the <s> token when tokenizing
response = output[len(prompt):]  # generate_simple returns prompt + completion, so trim off the prompt

print(f"Q: {question}")
print(f"A: {response}")

I increased num_tokens a bit because Mixtral can be a little talkative and 128 tokens isn't an awful lot. I get:

Q: Why is there no atmosphere on the Moon?
A:  The moon lacks an atmosphere primarily because it doesn't have enough gravity to hold onto the gases that might otherwise form one. The Earth, in contrast, has a strong gravitational pull which allows it to retain an atmosphere. 

The moon's escape velocity - the speed needed for an object to escape its gravitational field - is only about 2.38 kilometers per second, compared to the Earth's 11.2 kilometers per second. This means that any gas molecules present on the moon would be quickly lost to space due to the moon's lower gravity. 

Furthermore, the moon lacks a magnetic field, which on Earth helps to deflect solar wind and radiation away from our planet's atmosphere. Without such protection, the solar wind would strip away any significant gaseous formation on the moon.

shensmobile commented 3 months ago

Thanks for the heads up on max_seq_len! I've made that change now. I may also have been a bit too harsh in calling the performance between Ooba and exllamav2 different; after more testing I started seeing similar instruction-breaking behavior in Ooba as well. Is there a method guide or documentation for exllamav2? I'd love to be able to print out the sampler's default settings to compare against what I have in Ooba, to make sure I'm comparing apples to apples.
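
In the meantime I'm assuming I can just introspect the Settings object to see its defaults, something along these lines (not sure this catches everything):

from exllamav2.generator import ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()
# Print every public, non-callable attribute and its current (default) value.
for name in dir(settings):
    if name.startswith("_"):
        continue
    value = getattr(settings, name)
    if not callable(value):
        print(f"{name} = {value}")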

And you're right, it's easy enough to trim the prompt out of the output. I was just curious whether I was missing something or had set something up wrong, and whether the prompt was eating into my output tokens. Thanks for the example script, that's perfect!

shensmobile commented 2 months ago

Sorry to bring this back up, and maybe this should be its own topic, but I've been trying to figure out why I get different results between Oobabooga's API and its UI, so I decided to test exllamav2 directly in code, and now I'm simply unable to get an EXL2 quant of Qwen1.5-14B to run. In Ooba I have it set to a max context length of 8192 (no 4-bit or 8-bit cache) and it loads onto my 4090 using about 19GB of VRAM.

When I duplicate those same settings in a script (using the code above), it maxes out my memory and says there's insufficient VRAM. What is Ooba doing that I'm not calling correctly in code? I added max_seq_len=8192 to the Q4 cache and that works, but no such luck adding it to the non-quantized cache.

turboderp commented 2 months ago

Firstly, you should set max_seq_len in the ExLlamaV2Config (before loading the model), not in the cache, unless you're doing something special with it. Secondly, especially for models like Qwen, and if you only want to use the model for generating text, you should reduce config.max_output_len (16 is a sensible value). This limits how much space the implementation has to allocate for logits.

If this doesn't work, could you paste the script?

shensmobile commented 2 months ago

from exllamav2 import(
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache,
    ExLlamaV2Cache_Q4,
    ExLlamaV2Tokenizer,
)

from exllamav2.generator import (
    ExLlamaV2BaseGenerator,
    ExLlamaV2Sampler
)

model_directory = "/home/server2/text-generation-webui-main/models/LoneStriker_Qwen1.5-14B-Chat-8.0bpw-h8-exl2"
config = ExLlamaV2Config()
config.model_dir = model_directory
config.max_sequence_len=8192
config.max_output_len = 16 # Added this as per recommendation
config.prepare()
model = ExLlamaV2(config)

#cache = ExLlamaV2Cache_Q4(model, lazy=True)
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache) # <-Crashes here

It's still running out of VRAM. Definitely the same model that Ooba is able to load, with the same settings.

turboderp commented 2 months ago

The attribute should be max_seq_len, not max_sequence_len. Also, it should be set after prepare(), since that function loads all the defaults from config.json and would overwrite it. The new method is a little cleaner:

config = ExLlamaV2Config(model_directory)
config.max_seq_len=8192
config.max_output_len = 16 # Added this as per recommendation
model = ExLlamaV2(config)

shensmobile commented 2 months ago

Ah, what an annoying mistake. Thanks for catching that!

I'm still not getting the same results that I get from Ooba's UI, but I think I can debug on my own for a bit. I just have to figure out what other settings Ooba sets that I'm not. Really appreciate your help!