frankxyy opened 10 months ago
They compute the same thing for act_order=False; only the weight packing format differs. So the AWQ kernels and the exllama/exllamav2 kernels are essentially doing the same thing.
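For intuition, here is a minimal, self-contained sketch (not the actual kernel code; the helper names and the interleave pattern used for the AWQ-style layout are illustrative) showing that two different 4-bit packings of the same quantized values unpack back to the same weights, so the dequantized matmul is identical either way:

```python
import torch

BITS = 4
MASK = (1 << BITS) - 1  # 0b1111

def pack(q, order):
    """Pack eight 4-bit values per int32 word; nibble i of each word is taken
    from logical position order[i]."""
    q = q.reshape(-1, 32 // BITS)
    word = torch.zeros(q.shape[0], dtype=torch.int32)
    for nibble, src in enumerate(order):
        word |= (q[:, src] & MASK) << (nibble * BITS)
    return word

def unpack(word, order):
    """Invert pack(): read nibble i and write it back to logical position order[i]."""
    out = torch.zeros(word.shape[0], 32 // BITS, dtype=torch.int32)
    for nibble, src in enumerate(order):
        out[:, src] = (word >> (nibble * BITS)) & MASK
    return out

q = torch.randint(0, 16, (16,), dtype=torch.int32)
sequential = list(range(8))              # GPTQ-style: nibbles in sequential order
interleaved = [0, 2, 4, 6, 1, 3, 5, 7]   # AWQ-style interleaving (illustrative)

# Either layout round-trips to the same logical 4-bit values, and after
# unpacking both dequantize identically: w = (q - zero) * scale.
for order in (sequential, interleaved):
    assert torch.equal(unpack(pack(q, order), order), q.reshape(-1, 8))
```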
@fxmarty Hi, it seems that the act_order option only takes effect at quantization time for GPTQ. At inference time there is no act_order behavior. Am I right?
@frankxyy As far as I know, with act_order the quantization yields a g_idx ordering tensor. The best strategy with act_order that I know of is then to:

1. Reorder the quantized weights (together with their scales and zero points) ahead of time, following g_idx, so that rows belonging to the same group are contiguous.
2. Reorder the activations along the hidden dimension accordingly, on the fly, just before the matmul.

This is the strategy in exllama/exllamav2.
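A minimal sketch of these two steps in PyTorch, assuming a g_idx tensor that maps each input channel to its quantization group (the tensor names are illustrative, not the actual exllama API). The key identity is that permuting the activation columns and the weight rows by the same permutation leaves the matmul result unchanged:

```python
import torch

in_features, out_features, group_size = 8, 4, 2
torch.manual_seed(0)

W = torch.randn(in_features, out_features)         # dequantized weight, [in, out]
g_idx = torch.randperm(in_features) // group_size  # non-monotonic g_idx, as with act_order

# Step 1 (offline, once at load): sort by g_idx so that rows sharing a
# quantization group become contiguous. The same permutation would be applied
# to scales and zero points.
perm = torch.argsort(g_idx)
W_sorted = W[perm]

# Step 2 (runtime, per forward pass): shuffle the activation's hidden
# dimension with the same permutation, then do the matmul.
x = torch.randn(3, in_features)
y_ref = x @ W                   # reference: no reordering anywhere
y = x[:, perm] @ W_sorted       # x @ W == x[:, perm] @ W[perm, :]
assert torch.allclose(y_ref, y)
```

The runtime cost is just an index-select over the hidden dimension, which is cheap compared to the matmul itself; the expensive reordering of weights happens only once, at load time.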
@fxmarty Strategy 2 seems too time-consuming. From my reading of exllama, there seems to be no activation reordering.
Oh, 1 and 2 go together. For reference https://github.com/turboderp/exllama/issues/95#issuecomment-1606199301
@fxmarty Got it, I think I misunderstood the codebase.
Hi, is there any difference between running inference with an AWQ-quantized model and with a GPTQ-quantized model? It seems there is no difference?