Closed: casper-hansen closed this issue 1 year ago.
Should be trivial to add. Just copy the AutoGPTQ loader or GPTQ loader files and make a PR for an AutoAWQ loader.
More than speed, I think the questions are whether it beats GPTQ quants on perplexity and whether it can handle multi-GPU in a reasonable way.
Take a look at: https://github.com/oobabooga/text-generation-webui/blob/main/modules/AutoGPTQ_loader.py and https://github.com/oobabooga/text-generation-webui/blob/main/modules/GPTQ_loader.py
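For illustration, here is a minimal sketch of what a `modules/AutoAWQ_loader.py` could look like if it mirrors the AutoGPTQ loader linked above. The `shared.args.no_fuse_layers` flag is a made-up name, and the exact `from_quantized` arguments depend on the AutoAWQ version; treat this as a starting point, not the final loader.

```python
# modules/AutoAWQ_loader.py (sketch, modeled on AutoGPTQ_loader.py)
from pathlib import Path

from awq import AutoAWQForCausalLM

import modules.shared as shared
from modules.logging_colors import logger


def load_quantized(model_name):
    # Resolve the model folder the same way the other loaders do
    path_to_model = Path(f'{shared.args.model_dir}/{model_name}')
    logger.info(f'Loading AWQ model: "{path_to_model}"')

    # Fused layers give a big speedup on a single GPU but currently
    # conflict with splitting the model across several GPUs
    model = AutoAWQForCausalLM.from_quantized(
        str(path_to_model),
        fuse_layers=not shared.args.no_fuse_layers,  # hypothetical CLI flag
        safetensors=True,
    )
    return model
```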
Perplexity is equivalent to or slightly better than GPTQ with reordering, and much better than GPTQ without reordering.
Multi-GPU is not supported yet.
Have you tried feeding it through accelerate? It would at least allow loading the 70b.
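For reference, the generic Accelerate big-model flow being suggested looks roughly like this (a sketch with a placeholder checkpoint path; whether AWQ's custom quantized layers survive this kind of dispatch was exactly the open question):

```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint_dir = "/path/to/llama-2-70b"  # placeholder local folder with the shards

# Build the model skeleton without allocating real weights
config = AutoConfig.from_pretrained(checkpoint_dir)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Let Accelerate place layers across the available GPUs (and CPU if needed)
model = load_checkpoint_and_dispatch(
    model,
    checkpoint=checkpoint_dir,
    device_map="auto",
    no_split_module_classes=["LlamaDecoderLayer"],
)
```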
Let me correct myself: it does support multi-GPU, but it does not work when fused layers are enabled. However, the fused layers provide a 2.5x speedup, which makes it run fast.
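In practice that trade-off looks something like this (a sketch; the model path is just an example, and whether `from_quantized` accepts a `device_map` argument depends on the AutoAWQ version):

```python
from awq import AutoAWQForCausalLM

quant_path = "casperhansen/vicuna-7b-v1.5-awq"  # example AWQ checkpoint

# Single GPU: fused layers give the ~2.5x generation speedup mentioned above
fast_model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)

# Multiple GPUs: disable fusion so the layers can be sharded across devices
sharded_model = AutoAWQForCausalLM.from_quantized(
    quant_path,
    fuse_layers=False,
    device_map="auto",  # assumption: newer AutoAWQ versions forward this to Accelerate
)
```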
Will have to try it at some point and see how slow it gets. I want better perplexity on 70B at a decent speed, at least 10 t/s.
7B/13B aren't super relevant, at least for me. They already run faster than reading speed and still don't measure up, even in FP16.
That depends on the GPU you measure 7B/13B on. For edge deployments, I think they should be taken into account and included as a baseline in the AutoAWQ benchmarks (I'm not just talking about the GPU poor here 🤣; seriously, they are relevant under constraints like WASM / WebGPU).
Multi-GPU is now supported, and I am working on a deeper FasterTransformer integration that will make the models even faster.
I agree that on older GPUs the speed of 7B / 13B models DOES matter. I'm running text-generation-webui on Windows 11 with an Nvidia 1080 8GB card and get ~3 tokens per second, so a speedup with the AWQ models would be much appreciated.
@casper-hansen Would we be able to see its debut in textgen v1.7 release?
Any updates? O.O
Merge the PR and use it. I'm gonna try it and see how well it does on 70B vs K-quants and EXL2.
I have actually not tested 70B models much as I do not have access to the right hardware. I would love to see some results. Speed should be good on Linux machines when fused layers are enabled. As soon as PR #3999 and AutoAWQ play well together, I believe it will be merged.
> Multi-GPU is now supported, and I am working on a deeper FasterTransformer integration that will make the models even faster.
Is this multi-GPU support for AWQ on a different branch? AutoAWQ is still only using GPU0 and running over the memory limit I set for it. I have plenty of unused VRAM on GPU1, but it OOMs. I'm trying to load TheBloke/deepseek-llm-67b-chat-AWQ. My setup is a 4090 (GPU0) and a 3090 (GPU1).
Yep, still not fixed: it ignores the MiB limits set in the webui and uses the wrong GPU.
I'm not sure how it's implemented in ooba, but AutoAWQ is compatible with a device map, which you can use to set which layers go where.
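On the AutoAWQ side that looks roughly like the sketch below, independent of how the webui wires it up. The `max_memory` keyword (per-GPU caps in the Accelerate style) is an assumption about recent AutoAWQ versions, and the limits shown are only examples for a 4090 + 3090 setup:

```python
from awq import AutoAWQForCausalLM

quant_path = "TheBloke/deepseek-llm-67b-chat-AWQ"

model = AutoAWQForCausalLM.from_quantized(
    quant_path,
    fuse_layers=False,  # fused layers and multi-GPU sharding do not mix (see above)
    device_map="auto",
    # Cap what each device may receive so GPU0 is not filled past its limit
    max_memory={0: "20GiB", 1: "22GiB", "cpu": "64GiB"},
)
```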
I don't understand why it was decided to remove AWQ; it was working fine, apart from needing a few tweaks. I can see the reasoning for removing a few other things that are more bugged than the AWQ loader.
Description
I have created AutoAWQ as a package to more easily quantize and run inference for AWQ models. I wish to have AutoAWQ integrated into text-generation-webui to make it easier for people to use AWQ quantized models.
It works with a wide range of models and runs fast when you use a good CPU+GPU combination.
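For reference, quantizing a model with the package looks roughly like this (adapted from the AutoAWQ README; exact `quant_config` keys may differ between versions):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "lmsys/vicuna-7b-v1.5"
quant_path = "vicuna-7b-v1.5-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize to 4-bit AWQ and save the result next to the tokenizer files
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```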
Additional Context
Package: https://github.com/casper-hansen/AutoAWQ
Vicuna AWQ: https://huggingface.co/casperhansen/vicuna-7b-v1.5-awq
MPT AWQ: https://huggingface.co/casperhansen/mpt-7b-8k-chat-gptq