Closed polarathene closed 3 months ago
The file format is safetensors, and the overall structure of an EXL2 model is the same as the source HF model that it's converted from. A few "special" architectures use a little bit of key remapping to try to massage everything into a similar shape so I can share as much code as possible between them. Also a couple of architectures that use tied embeddings have their embedding layer copied to produce an output layer that can be quantized so the FP16 embeddings can remain in system RAM.
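The tied-embeddings step above can be sketched in a few lines. This is a hedged illustration, not the project's code; the key names (`model.embed_tokens.weight`, `lm_head.weight`) follow the usual HF Llama layout and are assumptions here:

```python
import numpy as np

# Hypothetical tied-embeddings model: only an embedding table, no
# separate output layer (key names assumed from the common HF layout).
vocab, dim = 4, 3
emb = np.ones((vocab, dim), dtype=np.float16)
state_dict = {"model.embed_tokens.weight": emb}

# Copy the embedding to a standalone output layer so the output side
# can be quantized while the FP16 embedding table stays in system RAM.
if "lm_head.weight" not in state_dict:
    state_dict["lm_head.weight"] = state_dict["model.embed_tokens.weight"].copy()
```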
Aside from that, the difference between an HF model and an EXL2 model is more or less just that most of the `.weight` tensors have been replaced by the following, to represent a `[k, n]` row-major matrix with `g` groups:
- `.q_invperm` (uint16[k]): inverse row permutation (equivalent to the argsort of a GPTQ `.g_idx` tensor)
- `.q_scale` (uint4[g, n] packed as uint32[g, n/8]): quantized 4-bit group scales
- `.q_scale_max` (half[n]): maximum scale per output feature
- `.q_groups` (uint16[2g]): one pair of (group_bits, group_size) per group
- `.q_group_map` (uint16[2k]): precalculated group index, only used by thread blocks to quickly index into k
- `.q_weights` (uint_variable[k, n] packed as uint32[k', n]): the quantized weights. Each group of rows looks like a slice of a GPTQ tensor of that bitrate, but the bitrate varies across groups, hence k' is a little complicated to work out. There's also a unique shuffle operation per bitrate < 8; see `qdq_*.h` here.

I wrote a bit about it here, specifically how it relates to the matmul kernel, but covering most of how the scale is quantized, what the group map is, and so on.
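Two of the points above can be sketched concretely: `.q_invperm` as the argsort of a GPTQ `.g_idx` tensor, and unpacking the 4-bit `.q_scale` values from their packed uint32 storage. This is a hedged numpy sketch, not the project's kernels; in particular the nibble order within each 32-bit word (low nibble first) is an assumption:

```python
import numpy as np

# Hypothetical GPTQ .g_idx: the group index of each of the k input rows.
g_idx = np.array([0, 0, 1, 1, 0, 1], dtype=np.int64)

# .q_invperm is equivalent to the argsort of .g_idx: rows sorted so
# that each group's rows are contiguous.
q_invperm = np.argsort(g_idx, kind="stable").astype(np.uint16)

# .q_scale stores uint4[g, n] packed as uint32[g, n/8]:
# eight 4-bit group scales per 32-bit word.
def unpack_q_scale(packed: np.ndarray) -> np.ndarray:
    """Unpack uint32[g, n/8] into the uint4 values, as uint8[g, n]."""
    g, n8 = packed.shape
    shifts = np.arange(8, dtype=np.uint32) * 4       # nibble offsets 0..28
    nibbles = (packed[..., None] >> shifts) & 0xF    # -> [g, n/8, 8]
    return nibbles.reshape(g, n8 * 8).astype(np.uint8)

packed = np.array([[0x76543210]], dtype=np.uint32)   # one group, n = 8
print(unpack_q_scale(packed))  # -> [[0 1 2 3 4 5 6 7]]
```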
I should probably clarify that the shuffle is only applied at load time.
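To illustrate the `.q_group_map` idea from the list above, here is a hedged sketch of precalculating a per-row group lookup from the (group_bits, group_size) pairs in `.q_groups`. The real `.q_group_map` packs two uint16 values per row, and the second value isn't specified above, so this only shows the group-index half:

```python
import numpy as np

# Hypothetical .q_groups content: (group_bits, group_size) pairs.
q_groups = np.array([4, 2,   # group 0: 4-bit, 2 rows
                     3, 4],  # group 1: 3-bit, 4 rows
                    dtype=np.uint16)

bits = q_groups[0::2]    # per-group bitrates
sizes = q_groups[1::2]   # per-group row counts

# Expand to one group index per row of k, so a thread block can map a
# row straight to its group without scanning the group table.
row_to_group = np.repeat(np.arange(len(sizes)), sizes).astype(np.uint16)
print(row_to_group)  # -> [0 0 1 1 1 1]
```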
Also need to close some of these issues. If there's anything else, lmk.
No worries! Thanks for sharing those insights; I've passed them on to someone who better understands the technical details of implementing EXL2 loading support 👍
Is the EXL2 format documented anywhere like GGUF is? Or is it only intended to be used with this project?