turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Severe model degradation observed when upgrading from v0.1.8 to v0.2.0 #610

Closed · Thireus closed this issue 1 month ago

Thireus commented 1 month ago

Using the same model and loading parameters, I'm observing a severe degradation in the model's ability to understand my requests. Could it be related to tensor parallelism? I have to roll back to v0.1.8 in the meantime.

0zl commented 1 month ago

You should at least provide more information about the model and show some example outputs along with the parameters used.

DocShotgun commented 1 month ago

To help diagnose your issue, it would be helpful to know:

- What hardware are you using?
- What model/quant are you using?
- What settings are you loading the model with?
- What prompt/sampler settings/frontend are you generating with?
- What exactly is wrong with the model outputs compared to 0.1.8?

Ph0rk0z commented 1 month ago

I never noticed degradation, but CR+ is broken now, so maybe the problem isn't specific to that model. Largestral, Qwen, and L3.1 appeared to work fine.

Thireus commented 1 month ago

Sorry, I was not able to provide prompt examples, as they involve a complex and large set of instructions which I cannot disclose. The model had trouble following all the instructions and appeared to focus only on the last portion of the prompt, almost ignoring the first and middle portions.

My observations were based on `turboderp_Llama-3.1-70B-Instruct-exl2_6.0bpw` using `--loader exllamav2_hf --max_seq_len 32768 --cache_4bit`.
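
For reference, a roughly equivalent setup through exllamav2's Python API looks like the sketch below (a minimal sketch assuming the current exllamav2 API; the model path and prompt are hypothetical). It loads an EXL2 quant with a 32k context and a Q4 quantized cache on the regular autosplit code path, i.e. without tensor parallelism, which also helps isolate whether TP is implicated:

```python
# Minimal sketch: load an EXL2 quant with a 32k context and a Q4 cache,
# roughly matching --max_seq_len 32768 --cache_4bit. Model path is hypothetical.
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache_Q4,
    ExLlamaV2Tokenizer,
)
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/models/Llama-3.1-70B-Instruct-exl2_6.0bpw")
config.max_seq_len = 32768

model = ExLlamaV2(config)

# A lazy Q4 cache plus load_autosplit spreads weights across available GPUs
# without tensor parallelism.
cache = ExLlamaV2Cache_Q4(model, lazy=True)
model.load_autosplit(cache, progress=True)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

output = generator.generate(
    prompt="Summarize the key points of the following instructions...",
    max_new_tokens=256,
)
print(output)
```

If a regression only reproduces with the TP loader enabled but not with this code path, that would point at the tensor-parallel changes introduced in v0.2.0.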

It appears that v0.2.1 resolves the issue.