oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

ExLlama_HF and AutoAWQ do not use the same VRAM after ticking no_inject_fused_attention #4605

Closed roseodimm closed 9 months ago

roseodimm commented 11 months ago

Describe the bug

I can run 20B and 30B GPTQ models with ExLlama_HF (alpha_value = 1, compress_pos_emb = 1, max_seq_len = 4096):

- 20B, VRAM 4,4,8,8: 9-14 tokens/s
- 30B, VRAM 2,2,8,8: 4-6 tokens/s

When I switch to AutoAWQ for the AWQ version of the same model (cpu-memory in MiB = 0, max_seq_len = 4096):

- 20B, VRAM 4096,4096,8190,8190, no_inject_fused_attention ticked: "CUDA out of memory" about halfway through the generation
- 20B, VRAM 2048,2048,8190,8190, no_inject_fused_attention ticked: 1-2 tokens/s

I can't run the model at all with no_inject_fused_attention unticked.

Did I do something wrong? Why does AutoAWQ give me such bad results compared to ExLlama_HF?

I use four RTX 2070 Super 8 GB cards.

Is there an existing issue for this?

Reproduction

For GPTQ (ExLlama_HF): alpha_value = 1, compress_pos_emb = 1, max_seq_len = 4096

- 20B, VRAM 4,4,8,8: 9-14 tokens/s
- 30B, VRAM 2,2,8,8: 4-6 tokens/s

For AWQ (AutoAWQ): cpu-memory in MiB = 0, max_seq_len = 4096

- 20B, VRAM 4096,4096,8190,8190, no_inject_fused_attention ticked: "CUDA out of memory"
- 20B, VRAM 2048,2048,8190,8190, no_inject_fused_attention ticked: 1-2 tokens/s

Screenshot

No response

Logs

[For GPTQ]
2023-11-15 21:59:29 INFO:Loading TheBloke_Emerhyst-20B-GPTQ...
2023-11-15 21:59:42 INFO:Loaded the model in 12.28 seconds.
Output generated in 23.45 seconds (14.54 tokens/s, 341 tokens, context 22, seed 262973826)

[For AWQ]
C:\Users\PC\Desktop\text-generation-webui\installer_files\env\Lib\site-packages\gradio\components\dropdown.py:231: UserWarning: The value passed into gr.Dropdown() is not in the list of choices. Please update the list of choices to include: 4 or set allow_custom_value=True.
  warnings.warn(
C:\Users\PC\Desktop\text-generation-webui\installer_files\env\Lib\site-packages\gradio\components\dropdown.py:231: UserWarning: The value passed into gr.Dropdown() is not in the list of choices. Please update the list of choices to include: 128 or set allow_custom_value=True.
  warnings.warn(
2023-11-15 22:00:53 INFO:Loading TheBloke_Emerhyst-20B-AWQ...
Replacing layers...: 100%|█████████████████████████████████████████████████████████████| 62/62 [00:09<00:00,  6.63it/s]
2023-11-15 22:01:37 INFO:Loaded the model in 43.91 seconds.
Output generated in 137.14 seconds (2.79 tokens/s, 383 tokens, context 22, seed 110844991)

System Info

CPU: AMD Ryzen Threadripper 1900X 8-Core Processor, 3.80 GHz
GPU: RTX 2070 Super 8 GB x4
RAM: 64 GB
OS: Windows 10 Pro x64
Ph0rk0z commented 11 months ago

AutoAWQ is not VRAM-efficient.

casper-hansen commented 11 months ago

ExLlamaV2 uses an INT8 cache, which makes it more memory-efficient. AutoAWQ may do the same in the future, and may also adopt flash decoding instead of the current kernels.
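
For context, this is roughly how the INT8 cache is enabled when using exllamav2 directly, outside the web UI. A minimal sketch only, assuming exllamav2's Python API from around this time; class and method names such as ExLlamaV2Cache_8bit and load_autosplit should be checked against the installed version, and the model path is illustrative.

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "models/TheBloke_Emerhyst-20B-GPTQ"  # illustrative path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)  # INT8 KV cache: roughly half the cache VRAM of FP16
model.load_autosplit(cache)                    # split the weights across the available GPUs
tokenizer = ExLlamaV2Tokenizer(config)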

blueyred commented 10 months ago

I have a similar issue. I have two GPU devices, a 24GB card (device0) and an 8GB card (device1). When attempting to load an AWQ model, it does not respect the --gpu-memory setting, whether set on the command line or via the web UI. It always tries to split the model symmetrically across the two cards and hits "torch.cuda.OutOfMemoryError: CUDA out of memory" as it loads too much onto the smaller device.

casper-hansen commented 10 months ago

Have you tried this on the newest version?

pip install autoawq==0.1.7

If you set the device_map argument on AutoAWQForCausalLM.from_quantized, it will respect the argument when loading.
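
For reference, a minimal sketch of what that looks like when calling AutoAWQ directly, assuming autoawq==0.1.7 as suggested above; the model path, the max_memory caps, and the exact set of keyword arguments accepted here are illustrative assumptions, not taken from this issue.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "TheBloke/Emerhyst-20B-AWQ"  # example model from the logs above

# Let accelerate place the layers, but cap each device instead of an even split.
model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=False,                    # roughly what the web UI's no_inject_fused_attention toggle controls
    device_map="auto",
    max_memory={0: "22GiB", 1: "6GiB"},   # assumed accelerate-style per-GPU caps; adjust to your cards
)
tokenizer = AutoTokenizer.from_pretrained(model_path)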

github-actions[bot] commented 9 months ago

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.