turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Smaug #339

Closed bdambrosio closed 7 months ago

bdambrosio commented 7 months ago

This is a dup of a post on HF:

I just pulled the latest exllamav2; 5-bit Qwen/Smaug won't load (same behavior with 3-bit). I have 3x4090s: at the very end of the load it suddenly starts adding more to gpu1. I've tried splits all the way down to {16, 19, 23}. The initial load and allocations look OK, then gpu1 VRAM blows up. Plenty of room left on gpu2 and gpu3... Tried setting cache max_seq_len down to 8192, same behaviour. Thanks for all your work on exllamav2!

bdambrosio commented 7 months ago

Update: got 3-bit to load with {10, 14, 23}. Actual final allocation with max_seq_len 8192: 19320, 22900, 10990 MB. Should it take 53 GB for 3-bit?
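
For reference, this is roughly how the manual split and context cap look in exllamav2's Python API; the model path is a placeholder and the split values just mirror the numbers above, so treat it as a sketch rather than the exact setup:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/Smaug-72B-exl2-3bpw"  # placeholder path
config.prepare()
config.max_seq_len = 8192  # cap the context so the cache fits

model = ExLlamaV2(config)
model.load(gpu_split=[10, 14, 23])  # per-GPU VRAM budget in GB

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)  # sized from config.max_seq_len
```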

turboderp commented 7 months ago

Sounds about right. It's not a very efficient model since it doesn't use grouped-query attention (GQA). That makes context really expensive, 2.5 MB per token. Llama2-70B, for comparison, needs 320 kB per token.
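
As a sanity check, those figures follow directly from the published layer and head counts, assuming an FP16 (2-byte) cache; this is plain arithmetic, not exllamav2 internals:

```python
# Rough KV cache cost per token from the model configs.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values are each [n_kv_heads * head_dim] per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

qwen_72b   = kv_bytes_per_token(80, 64, 128)  # no GQA: 64 KV heads
llama2_70b = kv_bytes_per_token(80, 8, 128)   # GQA: 8 KV heads

print(qwen_72b / 2**20)          # ~2.5 MB per token
print(llama2_70b / 2**10)        # ~320 kB per token
print(qwen_72b * 4096 / 2**30)   # ~10 GB for a 4096-token cache
print(qwen_72b * 32768 / 2**30)  # ~80 GB for the full 32k context
```

With no GQA, keys and values are cached for all 64 heads in every layer, which is where the 8x gap versus Llama2-70B's 8 KV heads comes from.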

bdambrosio commented 7 months ago

Ah, makes sense, thanks. Got 3-bit and 4-bit to load with 8k context; seeing if I can juggle parameters for 5-bit. Thanks for the quick response.

Ph0rk0z commented 7 months ago

Wow, 4096 ctx is 10 GB?

bdambrosio commented 7 months ago

Not sure what you mean. I can load 4-bit Qwen on 3x4090s if I set the context size to 8192.

Ph0rk0z commented 7 months ago

4-bit Qwen chat fits on 2x3090s, but only with 3600 context. The performance for me was milquetoast compared to MiquLiz. I'm not sure if Smaug is any better in this regard since it's based on the older version. For 3x24 GB I'd rather have MiquLiz: 120B, same bits, same context. Qwen chat also had strange lags while generating; I'll confirm whether that's due to my power limiting or just how it is when my p/s shows up this week.

turboderp commented 7 months ago

I'm not sure what would be causing lags, but it's possible you're running into the way the tokenizer sometimes combines multiple tokens into one character. I've worked out a way to deal with it, but I still don't understand the scheme used by this model. It seems to be close to UTF-8, except not all the components are single-character tokens that map to ordinals in an 8-bit range. Perhaps they just need to be translated to some specific ASCII codepage and then converted to byte values? Not sure.

In any case, it means that in some situations (for instance with certain emojis or most Chinese characters) you'd get stutters while tokens are held in the generator until enough are collected to form a complete character that can be emitted as a string (so 20 tokens per second could mean 5-7 characters per second in those instances). Also, the only way I've found to accomplish this is to run tokens that don't map to valid Unicode strings through the HF tokenizer, which is super slow. And I mean really slow, accounting for more latency than the language model itself.
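
A minimal sketch of that buffering idea (not exllamav2's actual generator; `decode_bytes` is a hypothetical helper that returns the raw bytes for a list of token IDs):

```python
def stream_text(token_ids, decode_bytes):
    held = []  # token IDs held back until they form complete characters
    for tid in token_ids:
        held.append(tid)
        raw = decode_bytes(held)
        text = raw.decode("utf-8", errors="ignore")
        # Emit only when the held bytes decode cleanly; a trailing partial
        # UTF-8 sequence keeps the buffer growing instead.
        if text and text.encode("utf-8") == raw:
            yield text
            held = []
    if held:
        # Flush whatever is left at end-of-stream, best effort.
        yield decode_bytes(held).decode("utf-8", errors="replace")
```

A single emoji or CJK character can span several tokens, so nothing is emitted until the last piece arrives, which is what the stutter looks like from the outside.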

No idea why that implementation is so inefficient, but presumably it's some bad algorithmic choices combined with the very large vocabulary. It's fine for most text, which can be trivially decoded, and for actual UTF-8 characters. But I might have to add a wrapper for the Tiktoken library to avoid falling back on HF Tokenizers for all those other cases, idk.

And yes, 4096-token context is 10 GB. The full 32k context is 80 GB.

Ph0rk0z commented 7 months ago

Yeah! That was it: emojis it would freeze on, or some of the CN characters. I would lag out waiting for a smiley face. I'll have to load it post that commit. I really hope they add GQA in the next release. It isn't a bad model, but it really needs the advantage of being a 72B.. right now you may as well run larger merges.