turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

what does `make_sequential` do when using gptq inference? #454

Open sleepwalker2017 opened 1 month ago

sleepwalker2017 commented 1 month ago

I run GPTQ using auto-gptq, and it calls the exllamaV2 kernel.

It seems to change the layout of q_weight, but when I checked the map and map_iv, they just map to themselves in GPTQ.

And after running the kernel, the weights are unchanged.

Is that redundant, or is it used in other quantization methods?

turboderp commented 1 month ago

You're probably looking at models quantized without act-order or without group size.

make_sequential rearranges the rows (i.e. the input features) of the weights tensor to correspond to the activation order used when quantizing that matrix.

This activation order is stored in the .g_idx tensor for each weight matrix. For non-act-order models this index is trivial and redundant, but for act-order models it maps each row of weights to a group. A group defines the grid (scale and zero-point) for the weights within it, so it's problematic when neighboring rows don't belong to the same group. Quantizing in activation order gives the best accuracy with GPTQ, but it also means that to dequantize a column slice of 128 weights, you'll need 128 lookups into the .scales and .qzeros tensors. These lookups don't coalesce and become a severe bottleneck.
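
To make the lookup problem concrete, here's a toy illustration (hypothetical sizes, not exllamav2 code): with act-order, neighboring rows can belong to different groups, so dequantizing even a short column slice needs one scattered scale/zero lookup per row.

```python
import torch

# Hypothetical toy example: 8 input rows, 2 groups, with an act-order
# permutation already baked into g_idx (one group index per row).
g_idx  = torch.tensor([1, 0, 1, 0, 0, 1, 0, 1])   # row -> group, scrambled by act-order
scales = torch.tensor([0.10, 0.20])               # one scale per group (single column shown)
zeros  = torch.tensor([8, 8])                     # one zero-point per group
qcol   = torch.tensor([3, 7, 1, 5, 2, 6, 0, 4])   # quantized values of one weight column

# Each row needs its own gather into scales/zeros; in the CUDA kernel these
# per-row lookups land on scattered addresses and don't coalesce.
dequant = (qcol - zeros[g_idx]).float() * scales[g_idx]
```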

Instead, ExLlama pre-applies the act-order permutation to the weights (in make_sequential) so the quantized weights are stored in global memory in the same order in which they were quantized (i.e. the order in which they were grouped together). For a group size 128 model, then, rows 0-127 will all belong to group 0, 128-255 belong to group 1, etc., and the bottleneck goes away.
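
A rough sketch of that idea in PyTorch (an approximation for illustration, not the library's actual implementation, which operates on the packed q_weight/qzeros tensors): sort the rows by their group index so rows of the same group become contiguous, and keep the inverse permutation around for the activations.

```python
import torch

def make_sequential_sketch(weight_rows: torch.Tensor, g_idx: torch.Tensor):
    """Reorder weight rows so that rows sharing a group are stored contiguously."""
    perm = torch.argsort(g_idx, stable=True)   # new row order: all group-0 rows, then group-1, ...
    invperm = torch.argsort(perm)              # inverse map, used to permute activations at runtime
    return weight_rows[perm], perm, invperm

# After this, for a group-size-128 model, permuted rows 0-127 all read group 0's
# scale/zero, rows 128-255 read group 1's, and so on: one lookup per 128 rows.
```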

As long as the input tensor is permuted as well, the matmul output will be the same, since a[:, perm] @ b[perm, :] == a @ b. Applying this permutation in the matmul kernel is very cheap since the inputs tend to be very small, so it's a good tradeoff.
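
A quick way to convince yourself of that identity (just a sanity check, not library code):

```python
import torch

a = torch.randn(4, 8)      # small activation batch
b = torch.randn(8, 16)     # weight matrix, input features on dim 0
perm = torch.randperm(8)   # any permutation of the input features

# Permuting a's columns and b's rows with the same index leaves the product unchanged.
assert torch.allclose(a[:, perm] @ b[perm, :], a @ b, atol=1e-5)
```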

sleepwalker2017 commented 1 month ago


Thank you for the detailed explanation!

I need to dive into the code to fully understand it.

BTW, I'm reading a paper that uses a special method to avoid bank conflicts during dequantization: https://arxiv.org/abs/2402.10076

I wonder if you have read it.

I'm quite confused about why dequantization introduces bank conflicts.

I profiled reconstruct_gptq_kernel and found that it does indeed introduce a lot of bank conflicts, which I find confusing.