turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

[BUG] chat-instruct Llama 3.1 end word "assistant " #632

Closed Katehuuh closed 2 months ago

Katehuuh commented 2 months ago

OS

Windows

GPU Library

CUDA 12.x

Python version

3.10

Pytorch version

3.10.8

Model

turboderp/Llama-3.1-8B-Instruct-exl2

Describe the bug

I always receive "assistant" at the end of each response.

[Screenshot: chat response ending with the word "assistant"]

The bug only occurs with the Llama-3.1 family; the other models I tested were unaffected.

I am using oobabooga/text-gen, and the issue only occurs in chat-instruct mode (the chat mode works correctly).

INFO     Loading "bartowski_Meta-Llama-3.1-8B-Instruct-exl2_8_0"
INFO     Loaded "bartowski_Meta-Llama-3.1-8B-Instruct-exl2_8_0" in 18.84 seconds.
INFO     LOADER: "ExLlamav2"
INFO     TRUNCATION LENGTH: 131072
INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"

This also happens with Llama-v3 and its instruction template. The ExLlamav2_HF and Transformers loaders work correctly.

Reproduction steps

To clarify, using the official meta-llama/Meta-Llama-3.1-8B-Instruct with ExLlamav2 as the loader causes the issue.

Expected behavior

Responses should stop at the end-of-turn token, without the trailing "assistant".

Logs

No response

Additional context

No response


turboderp commented 2 months ago

This is likely an instruct template issue. Not sure how the TGW loader works with templates, but probably you can fix it by modifying the config.json to only list token 128009 under "eos_token_id".
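For reference, a minimal sketch of that edit as a small Python script; the model directory path is an assumption (point it at your local copy), and Llama 3.1 configs typically ship eos_token_id as a list such as [128001, 128008, 128009]:

```python
# Sketch: trim config.json so only <|eot_id|> (128009) remains the default EOS.
# The model directory below is an assumption -- adjust to your local path.
import json
from pathlib import Path

config_path = Path("models/bartowski_Meta-Llama-3.1-8B-Instruct-exl2_8_0/config.json")
config = json.loads(config_path.read_text())

# Llama 3.1 usually lists several ids here, e.g. [128001, 128008, 128009].
config["eos_token_id"] = 128009

config_path.write_text(json.dumps(config, indent=2))
```

After the edit, reload the model so the loader picks up the new default.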

Katehuuh commented 2 months ago

This is likely an instruct template issue. Not sure how the TGW loader works with templates, but probably you can fix it by modifying the config.json to only list token 128009 under "eos_token_id".

It works, thanks. Should this be reported as a template/loader issue to oobabooga then?

Katehuuh commented 2 months ago

@turboderp: by removing the other eos_token_id entries (128008, etc.) and keeping only 128009, will it cause further issues?

turboderp commented 2 months ago

It's really up to the frontend to specify what the stop conditions are, as part of the instruct template. But because HF has a very confused format, these conflicts occur every now and again. ExLlama has a single token which is used as a default stop condition, so it doesn't really know what to do with models that decided they wanted multiple stop tokens all of a sudden. The frontend can still set as many stop conditions as it likes, though, to suit whatever instruct format it decides to use.

Switching to 128009 works for Llama3 specifically because that token marks the end of model responses in the L3 instruct template, and some frontends assume that a) model responses are supposed to end with EOS and b) models only define a single EOS token. Making matters a little more complicated, L3 had some errors in its config when it first launched (defining <|end_of_text|> in tokenizer_config.json instead of <|eot_id|>, etc.), and even as the models have been updated, those changes aren't always reflected in the many quantized versions already on HF.

But to be clear, the eos_token_id value in config.json is only really used as a default, and it's more of a suggestion. Changing it shouldn't break anything.
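To illustrate setting stop conditions on the frontend side rather than relying on the config default, here is a rough sketch against ExLlamaV2's dynamic generator. The model path and prompt layout are assumptions, and it presumes generate() accepts encode_special_tokens and a stop_conditions list as in the repo's example scripts:

```python
# Sketch: pass explicit stop conditions instead of relying on config.json's
# default eos_token_id. Paths and the prompt string here are assumptions.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

model_dir = "models/bartowski_Meta-Llama-3.1-8B-Instruct-exl2_8_0"

config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

# Llama 3 style instruct prompt: each turn ends with <|eot_id|>.
prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Why is the sky blue?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

output = generator.generate(
    prompt=prompt,
    max_new_tokens=256,
    encode_special_tokens=True,
    # Stop on both end-of-turn and end-of-text, regardless of config.json.
    stop_conditions=[
        tokenizer.single_id("<|eot_id|>"),
        tokenizer.single_id("<|end_of_text|>"),
    ],
)
print(output)
```

With explicit stop_conditions set by the frontend, the eos_token_id default in config.json no longer decides when generation ends, which is why editing it is safe.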