turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

[BUG] chat-instruct Llama 3.1 end word "assistant " #632

Closed Katehuuh closed 1 hour ago

Katehuuh commented 2 hours ago

OS

Windows

GPU Library

CUDA 12.x

Python version

3.10

Pytorch version

3.10.8

Model

turboderp/Llama-3.1-8B-Instruct-exl2

Describe the bug

I always see "assistant" appended at the end of each response.

[screenshot: chat reply ending with an extra "assistant"]

The bug only occurs with the Llama-3.1 family (several models were tested).

I am using oobabooga/text-generation-webui, and the issue only occurs in chat-instruct mode (chat mode works correctly).

INFO     Loading "bartowski_Meta-Llama-3.1-8B-Instruct-exl2_8_0"
INFO     Loaded "bartowski_Meta-Llama-3.1-8B-Instruct-exl2_8_0" in 18.84 seconds.
INFO     LOADER: "ExLlamav2"
INFO     TRUNCATION LENGTH: 131072
INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"

This also happens with Llama-v3 and its instruction template. The ExLlamav2_HF and Transformers loaders work correctly.

Reproduction steps

To clarify, the issue also occurs with the official meta-llama/Meta-Llama-3.1-8B-Instruct when ExLlamav2 is the loader.
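To help isolate whether the stray token comes from the library or from the TGW loader, here is a minimal standalone sketch using exllamav2's streaming generator with an explicit `<|eot_id|>` stop condition. The model path and the bare-bones prompt are placeholders, not taken from this report:

```python
# Standalone check: run the same EXL2 model through exllamav2 directly,
# stopping explicitly on <|eot_id|> instead of relying on config.json.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "models/Meta-Llama-3.1-8B-Instruct-exl2_8_0"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()

# Minimal Llama-3.1 chat-format prompt (single user turn).
prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
input_ids = tokenizer.encode(prompt, add_bos=False, encode_special_tokens=True)

# Stop on <|eot_id|> (token 128009) explicitly.
generator.set_stop_conditions([tokenizer.single_id("<|eot_id|>")])
generator.begin_stream_ex(input_ids, settings)

while True:
    res = generator.stream_ex()
    print(res["chunk"], end="")
    if res["eos"]:
        break
```

If this stops cleanly, the library's stop handling is fine and the problem is in how the loader picks up the EOS tokens.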

Expected behavior

The response should end cleanly, without a trailing "assistant".

Logs

No response

Additional context

No response


turboderp commented 1 hour ago

This is likely an instruct template issue. I'm not sure how the TGW loader handles templates, but you can probably fix it by modifying the config.json to list only token 128009 under "eos_token_id".
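For reference, a minimal sketch of that edit; the model path is a placeholder, and Llama 3.1's config.json normally lists three IDs under "eos_token_id":

```python
# Sketch: trim "eos_token_id" in the model's config.json down to <|eot_id|>.
# The path below is a placeholder for wherever text-generation-webui stores the model.
import json

path = "models/Meta-Llama-3.1-8B-Instruct-exl2_8_0/config.json"  # placeholder
with open(path) as f:
    cfg = json.load(f)

print("before:", cfg.get("eos_token_id"))  # typically [128001, 128008, 128009]
cfg["eos_token_id"] = 128009               # keep only <|eot_id|>

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
```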

Katehuuh commented 1 hour ago

> This is likely an instruct template issue. I'm not sure how the TGW loader handles templates, but you can probably fix it by modifying the config.json to list only token 128009 under "eos_token_id".

It works, thanks. Should this be filed as a template loader issue against oobabooga, then?

Katehuuh commented 1 hour ago

@turboderp By removing 128001 and 128008 from "eos_token_id" and keeping only 128009, will it cause further issues?
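For context, a quick way to check what those IDs decode to; this is a sketch using the transformers tokenizer for the base model and assumes access to the gated meta-llama repo or a local copy:

```python
# Sketch: inspect what each candidate EOS ID maps to in the Llama 3.1 tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
for tid in (128001, 128008, 128009):
    print(tid, tok.convert_ids_to_tokens(tid))
# Expected: 128001 <|end_of_text|>, 128008 <|eom_id|>, 128009 <|eot_id|>
```

If that mapping holds, keeping only 128009 should mainly matter for tool-calling prompts, where Llama 3.1 ends messages with <|eom_id|>; ordinary chat turns end with <|eot_id|>.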