kennylin0309 opened this issue 1 month ago
Did flash attention break on you? I am getting similar memory usage, even all the way up to git dev.
I disabled flash attention and the result is the same. With exllamav2-0.0.20+cu121, Command-R plus 6.0 bpw OOMs. I have to use 5.5 bpw if I want to use 0.0.20 or 0.0.21, or go back to 0.0.18 so I can still use 6.5 bpw.
What frontend are you using? And can you try with v0.1.0 to see if it's any different?
text-generation-webui + SillyTavern. I tried v0.1.0 (x86_64, torch 2.2.0, compiled from source): 6.5 bpw works again, but context length is 16K (with 0.0.18 it's 32K and works very well). I'm OK with 16K, at least I can use 6.5 bpw. If I want more than 32K I'll use 6.0/5.5 bpw. I appreciate your efforts.
I tried TabbyAPI (exllamav2 v0.1.0) + SillyTavern today. Only 5.5 bpw with 16K context length works; 5.5 bpw with 32K doesn't, and neither does 6.0 bpw or 6.5 bpw.
What do your split settings look like? Auto or manual?
I tried auto split but it OOMs; only manual works. TabbyAPI (exllamav2 v0.1.0) + SillyTavern, 5.5 bpw, 16K context, 5 GPUs, 112 GB VRAM total:
max_seq_len: 16384
gpu_split_auto: false
gpu_split: [16.5, 15, 15.5, 15.5, 16]
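For what it's worth, here is roughly what that manual split corresponds to when loading the model with exllamav2 directly. This is only a sketch: the model path is a placeholder, and the split values are just the ones from my config above.

```python
# Sketch: loading with a manual per-GPU split via exllamav2 itself
# (model path is a placeholder; split values are GB per device, like Tabby's gpu_split)
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/models/command-r-plus-5.5bpw-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 16384

model = ExLlamaV2(config)
model.load(gpu_split=[16.5, 15, 15.5, 15.5, 16])  # manual split, one entry per GPU

cache = ExLlamaV2Cache(model)
tokenizer = ExLlamaV2Tokenizer(config)
```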
Couple of things to clear up: is flash-attn actually installed in the environment you're launching from? `pip show flash-attn` should tell you.
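If it shows up there but you want to confirm it actually imports against your torch build, a quick check from the same environment (nothing exllamav2-specific here):

```python
# Confirm flash-attn imports and report torch / CUDA / flash-attn versions
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("flash-attn:", flash_attn.__version__)
```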
Here are some more things to try:

- Run `nvidia-smi` to see what memory usage you end up with. If there's room left over on a device you can increase its allocation a bit.
- The `chunk_size` config option in Tabby can be reduced, which lowers the overhead per device. Especially relevant if you have five GPUs. Try a value of 1024 maybe. (Not sure what TGW defaults to for this.)
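As far as I can tell, Tabby's `chunk_size` maps onto exllamav2's `max_input_len` / `max_attention_size` settings, so a smaller chunk means smaller temporary buffers on every device; that mapping is my reading, not something confirmed in this thread. A sketch of the same reduction at the library level, with a placeholder model path:

```python
# Sketch: shrinking the prefill chunk to reduce per-device temp buffers
from exllamav2 import ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/models/command-r-plus-5.5bpw-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 16384

chunk_size = 1024                            # assumed equivalent of chunk_size: 1024 in Tabby
config.max_input_len = chunk_size            # max tokens pushed through one forward pass
config.max_attention_size = chunk_size ** 2  # attention workspace sized to that chunk
```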
With exllamav2-0.0.18+cu121, Command-R plus 6.5 bpw works very well.
With exllamav2-0.0.19+cu121, Command-R plus 6.5 bpw OOMs, so I use 6.0 bpw and it works.
With exllamav2-0.0.21+cu121, Command-R plus 6.0 bpw OOMs, so I have to go back to 0.0.18.
My setup has 112 GB VRAM in total.