johnsmith0031 / alpaca_lora_4bit

MIT License

Version of GPTQ #104

Open juanps90 opened 1 year ago

juanps90 commented 1 year ago

Some users on Reddit are reporting that a new version of GPTQ-for-LLaMA is providing better performance.

I wonder whether it's better than the version used by this repo, and whether it's still compatible?

johnsmith0031 commented 1 year ago

Thanks! I'll look into it

johnsmith0031 commented 1 year ago

Tested it and found about a 50% performance improvement. I think I'll implement the Triton quant_attn and fused_mlp with LoRA support in this repo.
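For reference, a rough sketch of how those fused rewrites are usually wired up after loading the quantized model. The `make_quant_attn` / `make_fused_mlp` helper names follow GPTQ-for-LLaMa's triton branch; the import paths, the `load_llama_4bit` loader, and the LoRA target module name are assumptions for illustration, not this repo's final API:

```python
from peft import LoraConfig, get_peft_model

# Assumed helpers, modelled on GPTQ-for-LLaMa's triton branch:
# make_quant_attn fuses each attention block's q/k/v projections into a single
# quantized matmul; make_fused_mlp swaps the MLP for a fused Triton kernel.
from quant.fused_attn import make_quant_attn   # assumption: path may differ
from quant.fused_mlp import make_fused_mlp     # assumption: path may differ


def load_fused_lora_model(load_llama_4bit, config_path, model_path,
                          use_quant_attn=True, use_fused_mlp=True):
    """load_llama_4bit is a stand-in for this repo's 4-bit model loader."""
    model, tokenizer = load_llama_4bit(config_path, model_path)

    # Apply the fused rewrites before attaching LoRA, so the adapters are
    # created against the module layout that will actually run.
    if use_quant_attn:
        make_quant_attn(model)
    if use_fused_mlp:
        make_fused_mlp(model)

    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["qkv_proj"],  # assumption: name of the fused projection
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    return model, tokenizer
```

The main wrinkle for LoRA support is exactly that ordering: if the adapters are injected before the attention/MLP modules are fused, the fused rewrite would discard them.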

Ph0rk0z commented 1 year ago

I finally tried the CUDA quant attention by itself; fused_mlp doesn't help at all. On a P6000 it gives the fastest inference, even compared to AutoGPTQ with the same settings. Should have checked it sooner.

autograd/quant_attn:

Output generated in 2.69 seconds (6.70 tokens/s, 18 tokens, context 66, seed 1725726905)

vs

AutoGPTQ:

Output generated in 3.27 seconds (5.50 tokens/s, 18 tokens, context 66, seed 59570363)

Triton is consistently slower on my Ampere cards.
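For anyone wanting to reproduce this kind of comparison outside the webui, a minimal timing helper that prints lines in the same format as above (the `generate_fn` callable and its return value are assumptions; text-generation-webui prints these numbers itself):

```python
import time


def time_generation(generate_fn, prompt, max_new_tokens=18, seed=None):
    """Time one generation call and print an 'Output generated in ...' line.

    generate_fn is assumed to take (prompt, max_new_tokens=...) and return
    the number of newly generated tokens.
    """
    start = time.time()
    new_tokens = generate_fn(prompt, max_new_tokens=max_new_tokens)
    elapsed = time.time() - start
    rate = new_tokens / elapsed if elapsed > 0 else float("inf")
    print(f"Output generated in {elapsed:.2f} seconds "
          f"({rate:.2f} tokens/s, {new_tokens} tokens, seed {seed})")
```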