Closed roseodimm closed 9 months ago
AutoAWQ is not VRAM efficient.
ExLlamav2 uses an INT8 cache, which makes it more memory efficient. AutoAWQ may do the same in the future and may also use flash decoding instead of the current kernels.
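To give a rough sense of why cache precision matters at max_seq_len = 4096, here is a minimal back-of-the-envelope sketch of key/value cache size. The model shape (40 layers, hidden size 5120, standard multi-head attention) is an assumption chosen for illustration, not a figure from this thread, and the function name is hypothetical.

```python
def kv_cache_gib(num_layers, hidden_size, seq_len, bytes_per_element, batch_size=1):
    """Estimate KV cache size in GiB for plain multi-head attention."""
    # Two tensors (keys and values) per layer, each [batch, seq_len, hidden_size].
    total_bytes = 2 * num_layers * batch_size * seq_len * hidden_size * bytes_per_element
    return total_bytes / 2**30


# Assumed shape for illustration only: 40 layers, hidden size 5120.
fp16_cache = kv_cache_gib(40, 5120, 4096, bytes_per_element=2)
int8_cache = kv_cache_gib(40, 5120, 4096, bytes_per_element=1)
print(f"FP16 cache: {fp16_cache:.4f} GiB, 8-bit cache: {int8_cache:.4f} GiB")
```

Under these assumptions the FP16 cache is about 3.1 GiB and an 8-bit cache is half that, which is a noticeable share of an 8GB card once the weights are loaded.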
I have a similar issue. I have 2 GPU devices: a 24GB card (device0) and an 8GB card (device1). When I attempt to load an AWQ model, it does not respect the --gpu-memory setting, whether set on the command line or via the web UI. It always attempts to split the model symmetrically across the 2 cards and hits "torch.cuda.OutOfMemoryError: CUDA out of memory" because it tries to load too much onto the smaller device.
Have you tried this on the newest version?
pip install autoawq==0.1.7
If you set the device_map argument on AutoAWQForCausalLM.from_quantized, it will respect the argument when loading.
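As a rough illustration of the suggestion above, the sketch below builds an uneven device map for a 24GB + 8GB setup and passes it to from_quantized. The helper names build_device_map and load_awq are hypothetical, the module names assume a Llama-style model layout, and whether a dict-style device_map is accepted may depend on the AutoAWQ version, so treat this as a starting point rather than a confirmed recipe.

```python
from typing import Dict, List


def build_device_map(num_layers: int, budgets_gib: List[float]) -> Dict[str, int]:
    """Assign decoder layers to GPUs in proportion to each GPU's memory budget.

    Module names follow the Llama layout in Hugging Face Transformers
    (an assumption; other architectures use different names).
    """
    total = sum(budgets_gib)
    device_map: Dict[str, int] = {"model.embed_tokens": 0}
    layer = 0
    cumulative = 0.0
    for device, budget in enumerate(budgets_gib):
        cumulative += budget
        # Exclusive index of the last layer this device should hold.
        end = round(num_layers * cumulative / total)
        while layer < end:
            device_map[f"model.layers.{layer}"] = device
            layer += 1
    # Final norm and output head go on the last device.
    last_device = len(budgets_gib) - 1
    device_map["model.norm"] = last_device
    device_map["lm_head"] = last_device
    return device_map


def load_awq(quant_path: str, device_map):
    # Imported lazily so the helper above works without AutoAWQ installed.
    from awq import AutoAWQForCausalLM

    # device_map on from_quantized is the argument mentioned in this thread;
    # passing a dict (rather than a string preset) is an assumption.
    return AutoAWQForCausalLM.from_quantized(quant_path, device_map=device_map)


# Example: 40 decoder layers split across a 24GB and an 8GB card.
example_map = build_device_map(40, [24, 8])
print(example_map["model.layers.29"], example_map["model.layers.30"])
```

In practice the budgets should probably be set below the cards' full capacity, since the cache and activations also need room on each device.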
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
Describe the bug
I can run 20B and 30B GPTQ models with ExLlama_HF:
alpha_value = 1
compress_pos_emb = 1
max_seq_len = 4096
20B, VRAM 4,4,8,8: 9-14 tokens per second
30B, VRAM 2,2,8,8: 4-6 tokens per second
When I switch to the AutoAWQ loader for the AWQ version of the same model:
cpu-memory in MiB = 0
max_seq_len = 4096
20B, VRAM 4096,4096,8190,8190, no_inject_fused_attention checked: "CUDA out of memory." about halfway through generation
20B, VRAM 2048,2048,8190,8190, no_inject_fused_attention checked: 1-2 tokens per second
I can't run the model at all with no_inject_fused_attention unchecked.
Did I do something wrong? Why does AutoAWQ give me such poor results compared to ExLlama_HF?
I use RTX 2070s 8GB x4.
Is there an existing issue for this?
Reproduction
For GPTQ:
alpha_value = 1
compress_pos_emb = 1
max_seq_len = 4096
20B, VRAM 4,4,8,8: 9-14 tokens per second
30B, VRAM 2,2,8,8: 4-6 tokens per second
For AWQ:
cpu-memory in MiB = 0
max_seq_len = 4096
20B, VRAM 4096,4096,8190,8190, no_inject_fused_attention: "CUDA out of memory."
20B, VRAM 2048,2048,8190,8190, no_inject_fused_attention: 1-2 tokens per second
Screenshot
No response
Logs
System Info