turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Quantized Llama3 inference not working #435

Closed BenjaminGantenbein closed 2 months ago

BenjaminGantenbein commented 2 months ago

Hi!

First of all, thanks for the nice repo!

I have already tried many of the solutions proposed here:

https://github.com/oobabooga/text-generation-webui/issues/5885

but I always get either answers that never stop, or answers decorated with role prefixes such as "ASSISTANT: Hi, I am the assistant", "USER: Hi, I am the assistant", or "> Hi, I am the assistant".

I tried changing the eos_token in tokenizer_config.json, various stop token ids in the exllamav2 chat template, and setting the three encoding options to false. I am using a llama3-70b fine-tuned with axolotl and then quantized with exllamav2. Has anyone found a setup (tokenizer_config etc.) that worked for them?

Thanks

turboderp commented 2 months ago

You should be able to turn off skip_special_tokens in the UI and set <|eot_id|> as a stop condition. Alternatively, change eos_token_id from 128001 to 128009 in config.json. If neither of those works, then there's something else going on.
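
For reference, a minimal sketch of the stop-condition approach when driving exllamav2's streaming generator directly (outside the webui) could look like the following. The model path, prompt and sampler settings are placeholders, and the exact API may differ slightly between exllamav2 versions:

```python
# Sketch: load an EXL2-quantized Llama-3 model and stop generation at <|eot_id|>.
# Paths and settings below are placeholders, not taken from this issue.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/quantized-llama3-70b"   # placeholder path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

# Stop on the Llama-3 end-of-turn token as well as the configured EOS token.
# (The alternative is editing config.json so eos_token_id is 128009, i.e. <|eot_id|>.)
generator.set_stop_conditions([tokenizer.single_id("<|eot_id|>"), tokenizer.eos_token_id])

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

# Llama-3-instruct style prompt; special tokens must be encoded as tokens, not text.
prompt = ("<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|>"
          "<|start_header_id|>assistant<|end_header_id|>\n\n")
input_ids = tokenizer.encode(prompt, encode_special_tokens=True)

generator.begin_stream(input_ids, settings)
while True:
    chunk, eos, _ = generator.stream()
    print(chunk, end="")
    if eos:
        break
```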

Do you have any idea what format that model was finetuned for? Is it finetuned with the Llama3-instruct template? Or finetuned from Llama3-instruct to some other format using extra tokens that aren't merged properly? It's hard to speculate as to why it's not working without those details.

BenjaminGantenbein commented 2 months ago

Thanks for the quick reply. I was using the sharegpt format but didn't add the eos token in the axolotl configuration file. I guess that's the issue.
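
For anyone hitting the same problem: an axolotl config for a Llama-3 sharegpt fine-tune would typically declare the end-of-turn token explicitly, along the lines of the illustrative excerpt below (not the original poster's actual file; dataset and template settings depend on your axolotl version):

```yaml
# Illustrative axolotl config excerpt.
# Declaring <|eot_id|> as the EOS token teaches the model to emit it at the end of
# each turn, which exllamav2 (or the webui) can then use as a stop condition.
special_tokens:
  eos_token: "<|eot_id|>"
```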