oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

Add AutoAWQ as backend #3782

Closed casper-hansen closed 1 year ago

casper-hansen commented 1 year ago

Description

I have created AutoAWQ as a package to more easily quantize and run inference for AWQ models. I wish to have AutoAWQ integrated into text-generation-webui to make it easier for people to use AWQ quantized models.
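For context, loading and generating with an AWQ checkpoint through AutoAWQ looks roughly like this (a minimal sketch based on the package README at the time; exact keyword arguments may vary between AutoAWQ versions):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# One of the AWQ checkpoints linked under "Additional Context" below.
quant_path = "casperhansen/vicuna-7b-v1.5-awq"

# fuse_layers enables AutoAWQ's fused modules for faster generation.
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

tokens = tokenizer("Explain AWQ quantization in one paragraph.", return_tensors="pt").input_ids.cuda()
output = model.generate(tokens, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```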

It works with a wide range of models and runs fast when you use a good CPU+GPU combination:

| Model | GPU | Tokens/s |
|---|---|---|
| LLaMA-2-7B | 4090 | 115.47 |
| LLaMA-2-13B | 4090 | 73.85 |
| Vicuna-7B | 4090 | 116.14 |
| Vicuna-13B | 4090 | 82.17 |
| MPT-7B | 4090 | 79.49 |
| MPT-30B | 4090 | 42.52 |
| Falcon-7B | 4090 | 50.40 |
| LLaMA-2-7B | A6000 | 80.35 |
| LLaMA-2-13B | A6000 | 49.36 |
| Vicuna-7B | A6000 | 80.45 |
| Vicuna-13B | A6000 | 57.80 |
| MPT-7B | A6000 | 59.29 |
| MPT-30B | A6000 | 31.68 |
| Falcon-7B | A6000 | 36.59 |

Additional Context

Package: https://github.com/casper-hansen/AutoAWQ
Vicuna AWQ: https://huggingface.co/casperhansen/vicuna-7b-v1.5-awq
MPT AWQ: https://huggingface.co/casperhansen/mpt-7b-8k-chat-gptq

Ph0rk0z commented 1 year ago

Should be trivial to add. Just copy the AutoGPTQ loader or GPTQ loader files and make a PR for an AutoAWQ loader.

More than speed, I think the questions are whether it performs better than GPTQ quants on perplexity, and whether it can handle multi-GPU in any reasonable way.

Take a look at: https://github.com/oobabooga/text-generation-webui/blob/main/modules/AutoGPTQ_loader.py and https://github.com/oobabooga/text-generation-webui/blob/main/modules/GPTQ_loader.py
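A loader module along those lines might look something like the sketch below. This is hypothetical and only mirrors the structure of AutoGPTQ_loader.py; it is not any loader that was actually merged, and the `fuse_layers` command-line flag is assumed, not an existing option:

```python
# modules/AutoAWQ_loader.py -- hypothetical sketch, not the merged implementation
from pathlib import Path

from awq import AutoAWQForCausalLM

import modules.shared as shared
from modules.logging_colors import logger


def load_quantized(model_name):
    # Resolve the model folder the same way the other loaders do.
    path_to_model = Path(f"{shared.args.model_dir}/{model_name}")
    logger.info(f"Loading AWQ quantized model: {path_to_model}")

    # fuse_layers gives a large generation speedup but, at the time of this
    # thread, did not work together with multi-GPU splitting.
    model = AutoAWQForCausalLM.from_quantized(
        str(path_to_model),
        fuse_layers=shared.args.fuse_layers,  # assumed new CLI flag
        safetensors=True,
    )
    return model
```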

casper-hansen commented 1 year ago

Perplexity is equivalent to or slightly better than GPTQ with reordering, and much better than GPTQ without reordering.

Multi-GPU is not supported yet.

Ph0rk0z commented 1 year ago

Have you tried feeding it through accelerate? It would at least allow loading the 70b.

casper-hansen commented 1 year ago

> Have you tried feeding it through accelerate? It would at least allow loading the 70b.

Let me correct myself: it does support multi-GPU, but not when running with fused layers. However, the fused layers provide a 2.5x speedup, which is what makes it run fast.

Ph0rk0z commented 1 year ago

Will have to try it at some point and see how slow it gets. I want the better perplexity of a 70B at decent speed, at least 10 t/s.

7B/13B models aren't super relevant, at least for me. They already run faster than reading speed and don't measure up, even in FP16.

loretoparisi commented 1 year ago

> Will have to try it at some point and see how slow it gets. I want the better perplexity of a 70B at decent speed, at least 10 t/s.
>
> 7B/13B models aren't super relevant, at least for me. They already run faster than reading speed and don't measure up, even in FP16.

This depends on the GPU you measure the 7B/13B models on. For edge deployments I think they should be taken into account and included in the AutoAWQ benchmarks as a baseline (I'm not even talking about the GPU-poor here 🤣; seriously, they are relevant in contexts with constraints like WASM / WebGPU).

casper-hansen commented 1 year ago

Multi-GPU is now supported, and I am working on a deeper FasterTransformer integration that will make the models even faster.

rpeinl commented 1 year ago

I agree that on older GPUs the speed of 7B / 13B models DOES matter. I'm running text-generation-webui on Windows 11 with an Nvidia 1080 8 GB card and get ~3 tokens per second. A speedup from AWQ models would therefore be much appreciated.

yhyu13 commented 1 year ago

@casper-hansen Would we be able to see its debut in textgen v1.7 release?

JackCloudman commented 1 year ago

some update? O.O

Ph0rk0z commented 1 year ago

Merge the PR and use it. I'm gonna try it and see how well it does on 70B vs K-quants and EXL2.

casper-hansen commented 1 year ago

I have actually not tested 70B models much as I do not have access to the right hardware. I would love to see some results. Speed should be good on Linux machines when fused layers are enabled. As soon as PR #3999 and AutoAWQ play well together, I believe it will be merged.

norton-chris commented 12 months ago

> Multi-GPU is now supported, and I am working on a deeper FasterTransformer integration that will make the models even faster.

Is this multi-GPU support for AWQ on a different branch? AutoAWQ is still only using GPU 0 and running over the VRAM limit I set for it. I have plenty of unused VRAM on GPU 1, but it OOM errors while trying to load TheBloke/deepseek-llm-67b-chat-AWQ. My setup is a 4090 (GPU 0) and a 3090 (GPU 1).

DuckY-Y commented 11 months ago

Yep, still not fixed: it ignores the MiB limits set in the webui and uses the wrong GPU.

casper-hansen commented 11 months ago

I'm not sure how it's implemented in ooba, but AutoAWQ is compatible with a device map, which you can use to set which layers go where.
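For anyone hitting the OOM above, a split across two GPUs might look roughly like the sketch below. This assumes the installed AutoAWQ version exposes `device_map` / `max_memory` on `from_quantized` (support varies by version), and the memory caps are illustrative rather than tuned values:

```python
from awq import AutoAWQForCausalLM

# Cap per-GPU memory so layers spill onto the second card instead of OOMing.
# device_map/max_memory support depends on the AutoAWQ version; limits are illustrative.
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/deepseek-llm-67b-chat-AWQ",
    fuse_layers=False,  # fused modules did not work with multi-GPU at the time
    device_map="auto",
    max_memory={0: "22GiB", 1: "22GiB"},
)
```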

swizzcheeze commented 1 month ago

I don't understand why it was decided to remove AWQ, as it was working fine albeit needing a few tweaks. I can see the reasoning for removing a few other things that are more bugged than the AWQ loader.