turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Mistral fails/garbage at context > 8192, transformers works fine #245

Closed · matatonic closed this issue 11 months ago

matatonic commented 11 months ago

I've been trying a few of the new Mistral merges (Toten5/Marcoroni-neural-chat-7B-v1, OpenPipe/mistral-ft-optimized-1218) and am hitting the same problem with each: the original model works fine in transformers up to 32k context, but in exllamav2 it only works up to about 8k context; after that all I get is gibberish. This is with the same generation settings in both cases (min_p 0.05, temp 0.7), and I see the same behavior with exl2 quants (5 and 6 bit tested so far). This happens with exllamav2 0.0.11 via text-generation-webui. Not sure what's wrong here; maybe a rope scaling issue?
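For anyone trying to reproduce this outside text-generation-webui, a rough sketch using the exllamav2 Python API is below. The model path and prompt are placeholders, and the sampler attribute names reflect my reading of the 0.0.11 API, so they may differ slightly between versions.

```python
# Hypothetical repro sketch; paths are placeholders and sampler attribute
# names are assumptions about the exllamav2 0.0.11 Python API.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/mistral-ft-optimized-1218-6bpw-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 16384  # anything well past 8k shows the problem

model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.min_p = 0.05  # assumed attribute name for min-p sampling

long_prompt = "..."  # a prompt of roughly 10k+ tokens of context
print(generator.generate_simple(long_prompt, settings, 200))
```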

Example response when it fails: new Q a Q Question1 a Q Question1 a Q a Q Question3 Q a a the Q a Question2 Question1 the Q Question1 Question1 the the Q Q Q Q Q Q Q a..... etc.

turboderp commented 11 months ago

It's not a scaling issue; it's because Mistral is an 8k model. It keeps producing sensible output past 8k in Transformers because Transformers treats it as a 32k model with a 4k sliding window, i.e. it only ever attends to the last 4k tokens of the context (sliding window attention). I just haven't implemented that, mainly because there's no straightforward way to do it transparently (it complicates the interface), Mistral is the only model using it, and the benefits are dubious compared to just truncating the context periodically.
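For readers following along, here is a minimal sketch (plain PyTorch, not exllamav2 or Transformers internals) of the two attention patterns being contrasted; the 4096-token window comes from Mistral's published config (`sliding_window: 4096`, `max_position_embeddings: 32768`).

```python
# Minimal illustration of full causal attention vs. sliding-window attention.
import torch

def full_causal_mask(seq_len: int) -> torch.Tensor:
    # Every position attends to all earlier positions (full causal attention).
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def sliding_window_mask(seq_len: int, window: int = 4096) -> torch.Tensor:
    # Every position attends only to the most recent `window` positions,
    # which is how Transformers runs Mistral as a "32k" model.
    ones = torch.ones(seq_len, seq_len, dtype=torch.bool)
    return torch.tril(ones) & torch.triu(ones, diagonal=-(window - 1))

# With the sliding-window mask, a query at position 10000 still only sees
# 4096 keys, so the model never attends further back than it was trained for;
# with the full causal mask it attends over all 10000 positions and the
# output degrades into gibberish.
print(sliding_window_mask(8, window=4).int())  # small example: 8 positions, window of 4
```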

matatonic commented 11 months ago

got it, thanks!