kennylin0309 opened this issue 1 month ago
Did flash attention break on you? I am getting similar memory usage, even all the way up to git dev.
I disabled flash attention and the result is the same. With exllamav2-0.0.20+cu121, Command-R plus 6.0 bpw OOMs. I have to use 5.5 bpw if I want to use 0.0.20 or 0.0.21, or go back to 0.0.18 so I can still use 6.5 bpw.
What frontend are you using? And can you try with v0.1.0 to see if it's any different?
text-generation-webui + SillyTavern. I tried v0.1.0 (x86_64, torch 2.2.0, compiled from source): 6.5 bpw works again, but context length is 16K (with 0.0.18 it's 32K and works very well). I'm OK with 16K, at least I can use 6.5 bpw. If I want more than 32K I'll use 6.0/5.5 bpw. I appreciate your efforts.
I tried TabbyAPI (exllamav2 v0.1.0) + SillyTavern today. Only 5.5 bpw with 16K context length works; 5.5 bpw with 32K doesn't, and neither does 6.0 bpw or 6.5 bpw.
What do your split settings look like? Auto or manual?
I tried auto split but it OOMs; only manual works. TabbyAPI (exllamav2 v0.1.0) + SillyTavern, 5.5 bpw, 16K context, 5 GPUs, 112 GB VRAM total:
max_seq_len: 16384
gpu_split_auto: false
gpu_split: [16.5, 15, 15.5, 15.5, 16]
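For what it's worth, here is roughly what that manual split corresponds to when loading the model with exllamav2 directly. This is only a sketch: the model path is a placeholder, and the split values are just the ones from my config above.

```python
# Sketch: loading with a manual per-GPU split via exllamav2 itself
# (model path is a placeholder; split values are GB per device, like Tabby's gpu_split)
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/models/command-r-plus-5.5bpw-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 16384

model = ExLlamaV2(config)
model.load(gpu_split=[16.5, 15, 15.5, 15.5, 16])  # manual split, one entry per GPU

cache = ExLlamaV2Cache(model)
tokenizer = ExLlamaV2Tokenizer(config)
```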
Couple of things to clear up: is flash-attn actually installed in the environment you're launching from? `pip show flash-attn` should tell you.
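If it shows up there but you want to confirm it actually imports against your torch build, a quick check from the same environment (nothing exllamav2-specific here):

```python
# Confirm flash-attn imports and report torch / CUDA / flash-attn versions
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("flash-attn:", flash_attn.__version__)
```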
Here are some more things to try:

- Run `nvidia-smi` to see what memory usage you end up with. If there's room left over on a device you can increase its allocation a bit.
- The `chunk_size` config option in Tabby can be reduced, which lowers the overhead per device. Especially relevant if you have five GPUs. Try a value of 1024 maybe. (Not sure what TGW defaults to for this.)
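As far as I can tell, Tabby's `chunk_size` maps onto exllamav2's `max_input_len` / `max_attention_size` settings, so a smaller chunk means smaller temporary buffers on every device; that mapping is my reading, not something confirmed in this thread. A sketch of the same reduction at the library level, with a placeholder model path:

```python
# Sketch: shrinking the prefill chunk to reduce per-device temp buffers
from exllamav2 import ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/models/command-r-plus-5.5bpw-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 16384

chunk_size = 1024                            # assumed equivalent of chunk_size: 1024 in Tabby
config.max_input_len = chunk_size            # max tokens pushed through one forward pass
config.max_attention_size = chunk_size ** 2  # attention workspace sized to that chunk
```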
With exllamav2-0.0.18+cu121, Command-R plus 6.5 bpw works very well.
With exllamav2-0.0.19+cu121, Command-R plus 6.5 bpw OOMs, so I use 6.0 bpw and it works.
With exllamav2-0.0.21+cu121, Command-R plus 6.0 bpw OOMs, so I have to go back to 0.0.18.
My setup has 112 GB VRAM in total.