turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

Command-R plus OOM 0.0.18 -> 0.0.19 #465

Open kennylin0309 opened 1 month ago

kennylin0309 commented 1 month ago

With exllamav2-0.0.18+cu121, Command-R plus at 6.5 bpw works very well.

With exllamav2-0.0.19+cu121, Command-R plus at 6.5 bpw OOMs, so I use 6.0 bpw and it works.

With exllamav2-0.0.21+cu121, Command-R plus at 6.0 bpw OOMs, so I have to go back to 0.0.18.

My setup has 112 GB VRAM.

Ph0rk0z commented 1 month ago

Did flash attention break on you? I am getting similar memory usage, even all the way up to git dev.

kennylin0309 commented 1 month ago

I disabled flash attention; the result is the same. exllamav2-0.0.20+cu121 with Command-R plus at 6.0 bpw also OOMs. I have to use 5.5 bpw if I want to run 0.0.20 or 0.0.21, or go back to 0.0.18 so I can still use 6.5 bpw.
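
For anyone reproducing this outside text-generation-webui, here is a minimal sketch of the same experiment driven through the exllamav2 Python API, with flash attention disabled and a reduced context; the model path is hypothetical:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/models/command-r-plus-6.0bpw-exl2"  # hypothetical path
config.prepare()
config.no_flash_attn = True   # force the non-flash attention path
config.max_seq_len = 16384    # smaller context -> smaller K/V cache

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # allocated while the loader splits layers
model.load_autosplit(cache)               # fill the available GPUs automatically
```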

turboderp commented 1 month ago

What frontend are you using? And can you try with v0.1.0 to see if it's any different?

kennylin0309 commented 1 month ago

text-generation-webui + SillyTavern. I tried v0.1.0 (x86_64, torch 2.2.0, compiled from source), and 6.5 bpw works again, but only at 16K context length (with 0.0.18 it was 32K and worked very well). I'm OK with 16K; at least I can use 6.5 bpw. If I want more than 32K I'll use 6.0/5.5 bpw. I appreciate your efforts.
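
One option that might claw back context at the same bpw: v0.1.0 added a quantized Q4 cache, which stores keys/values in 4 bits and needs roughly a quarter of the VRAM of the default FP16 cache. A minimal sketch, reusing the hypothetical model path pattern from above:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4

config = ExLlamaV2Config()
config.model_dir = "/models/command-r-plus-6.5bpw-exl2"  # hypothetical path
config.prepare()
config.max_seq_len = 32768    # try the old 32K context again

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy=True)  # ~4x smaller K/V cache than FP16
model.load_autosplit(cache)
```

In TabbyAPI the equivalent should be the cache_mode: Q4 setting in config.yml.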

turboderp commented 1 month ago

I recommend you give TabbyAPI a try.

kennylin0309 commented 1 month ago

I tried TabbyAPI (exllamav2 v0.1.0) + SillyTavern today. Only 5.5 bpw with 16K context length works; 5.5 bpw with 32K doesn't work, and neither do 6.0 bpw or 6.5 bpw.

turboderp commented 1 month ago

What do your split settings look like? Auto or manual?

kennylin0309 commented 1 month ago

I tried auto split but it OOMs; only a manual split works. Setup: TabbyAPI (exllamav2 v0.1.0) + SillyTavern, 5.5 bpw, 16K context, 5 GPUs, 112 GB VRAM total.

```yaml
max_seq_len: 16384
gpu_split_auto: false
gpu_split: [16.5, 15, 15.5, 15.5, 16]
```
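
For reference, a sketch of the same two split modes driven through the exllamav2 Python API directly (model path hypothetical; the gpu_split values are per-GPU VRAM budgets in GB, matching the TabbyAPI config above):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/models/command-r-plus-5.5bpw-exl2"  # hypothetical path
config.prepare()
config.max_seq_len = 16384

model = ExLlamaV2(config)

# Manual split: cap usage per GPU, in GB, first device first
model.load(gpu_split=[16.5, 15, 15.5, 15.5, 16])
cache = ExLlamaV2Cache(model)

# Auto split alternative (what gpu_split_auto: true does):
# cache = ExLlamaV2Cache(model, lazy=True)
# model.load_autosplit(cache)
```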

turboderp commented 1 month ago

A couple of things to clear up:

Here are some more things to try: