Closed polarathene closed 3 months ago
The file format is safetensors, and the overall structure of an EXL2 model is the same as the source HF model that it's converted from. A few "special" architectures use a little bit of key remapping to try to massage everything into a similar shape so I can share as much code as possible between them. Also a couple of architectures that use tied embeddings have their embedding layer copied to produce an output layer that can be quantized so the FP16 embeddings can remain in system RAM.
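The tied-embeddings step above can be sketched in a few lines. This is a hedged illustration, not the project's code; the key names (`model.embed_tokens.weight`, `lm_head.weight`) follow the usual HF Llama layout and are assumptions here:

```python
import numpy as np

# Hypothetical tied-embeddings model: only an embedding table, no
# separate output layer (key names assumed from the common HF layout).
vocab, dim = 4, 3
emb = np.ones((vocab, dim), dtype=np.float16)
state_dict = {"model.embed_tokens.weight": emb}

# Copy the embedding to a standalone output layer so the output side
# can be quantized while the FP16 embedding table stays in system RAM.
if "lm_head.weight" not in state_dict:
    state_dict["lm_head.weight"] = state_dict["model.embed_tokens.weight"].copy()
```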
Aside from that, the difference between an HF model and an EXL2 model is more or less just that most of the `.weight` tensors have been replaced by the following, to represent a `[k, n]` row-major matrix with `g` groups:
- `.q_invperm` (uint16[k]): inverse row permutation (equivalent to the argsort of a GPTQ `.g_idx` tensor)
- `.q_scale` (uint4[g, n] packed as uint32[g, n/8]): quantized 4-bit group scales
- `.q_scale_max` (half[n]): maximum scale per output feature
- `.q_groups` (uint16[2g]): one pair of (group_bits, group_size) per group
- `.q_group_map` (uint16[2k]): precalculated group index, only used by thread blocks to quickly index into k
- `.q_weights` (uint_variable[k, n] packed as uint32[k', n]): the quantized weights. Each group of rows looks like a slice of a GPTQ tensor of that bitrate, but the bitrate varies across groups, hence k' is a little complicated to work out. There's also a unique shuffle operation per bitrate < 8; see `qdq_*.h` here.

I wrote a bit about it here, specifically how it relates to the matmul kernel, but covering most of how the scale is quantized, what the group map is, and so on.
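Two of the points above can be sketched concretely: `.q_invperm` as the argsort of a GPTQ `.g_idx` tensor, and unpacking the 4-bit `.q_scale` values from their packed uint32 storage. This is a hedged numpy sketch, not the project's kernels; in particular the nibble order within each 32-bit word (low nibble first) is an assumption:

```python
import numpy as np

# Hypothetical GPTQ .g_idx: the group index of each of the k input rows.
g_idx = np.array([0, 0, 1, 1, 0, 1], dtype=np.int64)

# .q_invperm is equivalent to the argsort of .g_idx: rows sorted so
# that each group's rows are contiguous.
q_invperm = np.argsort(g_idx, kind="stable").astype(np.uint16)

# .q_scale stores uint4[g, n] packed as uint32[g, n/8]:
# eight 4-bit group scales per 32-bit word.
def unpack_q_scale(packed: np.ndarray) -> np.ndarray:
    """Unpack uint32[g, n/8] into the uint4 values, as uint8[g, n]."""
    g, n8 = packed.shape
    shifts = np.arange(8, dtype=np.uint32) * 4       # nibble offsets 0..28
    nibbles = (packed[..., None] >> shifts) & 0xF    # -> [g, n/8, 8]
    return nibbles.reshape(g, n8 * 8).astype(np.uint8)

packed = np.array([[0x76543210]], dtype=np.uint32)   # one group, n = 8
print(unpack_q_scale(packed))  # -> [[0 1 2 3 4 5 6 7]]
```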
I should probably clarify that the shuffle is only applied at load time.
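To illustrate the `.q_group_map` idea from the list above, here is a hedged sketch of precalculating a per-row group lookup from the (group_bits, group_size) pairs in `.q_groups`. The real `.q_group_map` packs two uint16 values per row, and the second value isn't specified above, so this only shows the group-index half:

```python
import numpy as np

# Hypothetical .q_groups content: (group_bits, group_size) pairs.
q_groups = np.array([4, 2,   # group 0: 4-bit, 2 rows
                     3, 4],  # group 1: 3-bit, 4 rows
                    dtype=np.uint16)

bits = q_groups[0::2]    # per-group bitrates
sizes = q_groups[1::2]   # per-group row counts

# Expand to one group index per row of k, so a thread block can map a
# row straight to its group without scanning the group table.
row_to_group = np.repeat(np.arange(len(sizes)), sizes).astype(np.uint16)
print(row_to_group)  # -> [0 0 1 1 1 1]
```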
Also need to close some of these issues. If there's anything else, lmk.
No worries! Thanks for sharing those insights; I've passed them on to someone who better understands the technical details of implementing EXL2 loading support 👍
Is the EXL2 format documented anywhere like GGUF is? Or is it only intended to be used with this project?