mit-han-lab / llm-awq

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
MIT License

difference from gptq when inferring #127

Open frankxyy opened 10 months ago

frankxyy commented 10 months ago

Hi, is there any difference when inferring with an AWQ-quantized model versus a GPTQ-quantized model? There seems to be no difference.

fxmarty commented 10 months ago

They are the same for act_order=False - just the packing is different. So AWQ kernels & exllama/exllamav2 kernels are essentially doing the same thing.
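To make this concrete, here is a minimal sketch (plain PyTorch, not the actual CUDA kernels of either project) of the per-group dequantization that both AWQ and act_order=False GPTQ kernels effectively perform. The function name, tensor shapes, and the `group_size` default are illustrative assumptions; the real difference between the formats lies in how the 4-bit values and zero points are packed into int32 words.

```python
import torch

def dequantize_group_wise(qweight, scales, zeros, group_size=128):
    """Hypothetical unpacked view: qweight is (in_features, out_features) with
    integer values in [0, 15]; scales/zeros are (in_features // group_size, out_features)."""
    in_features = qweight.shape[0]
    # Map every input channel to its quantization group.
    group = torch.arange(in_features, device=qweight.device) // group_size
    # Same affine dequantization in both AWQ and (act_order=False) GPTQ:
    # w = (q - zero) * scale, applied per group and per output channel.
    return (qweight.float() - zeros[group]) * scales[group]

# The matmul is then y = x @ w, regardless of which packing format produced qweight.
```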

frankxyy commented 10 months ago

@fxmarty Hi, it seems that the act_order option only matters at quantization time for GPTQ; at inference time there is no act_order behavior. Am I right?

fxmarty commented 10 months ago

@frankxyy As far as I know, the quantization yields a g_idx reordering tensor. The best strategy with act_order that I know of is then to:

  1. Reorder the weights, scales, and zero points in advance.
  2. Reorder the activations on the fly during inference.

This is the strategy in exllama/exllamav2 (a rough sketch follows below).
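A minimal sketch of how the two steps fit together, assuming already-dequantized weights for readability. The function and tensor names are illustrative, not the actual exllama/exllamav2 buffers; in the real kernels the reordering is applied to the packed quantized weights, scales, and zero points rather than to a float matrix.

```python
import torch

def prepare_reordered_weights(w, g_idx):
    """Step 1 (offline): sort input channels so each quantization group is contiguous.
    w: (in_features, out_features) dequantized weight, g_idx: (in_features,) group indices."""
    perm = torch.argsort(g_idx)        # permutation induced by act_order
    return w[perm, :], perm            # reordered weights + permutation to reuse at runtime

def forward(x, w_sorted, perm):
    """Step 2 (on the fly): gather the activation columns with the same permutation,
    so x[:, perm] @ w_sorted equals x @ w with the original channel order."""
    return x[:, perm] @ w_sorted
```

Because the permutation is applied consistently to both operands, the product is unchanged; the extra runtime cost is a single gather over the activation columns per layer.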

frankxyy commented 10 months ago

@fxmarty Strategy 2 seems too time-consuming. From my current understanding of exllama, there seems to be no activation reordering.

fxmarty commented 10 months ago

Oh, 1 and 2 go together. For reference: https://github.com/turboderp/exllama/issues/95#issuecomment-1606199301

frankxyy commented 10 months ago

@fxmarty Got it, I think I misunderstood the codebase.