turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Quantization Error (2) #305

Closed by 152334H 5 months ago

152334H commented 5 months ago

Observed when attempting to quantize alpindale/miqu-1-70b-fp16

$ python convert.py -i ~/.cache/huggingface/hub/models--alpindale--miqu-1-70b-fp16/snapshots/f8267dda117a9192ed42ad31aba3c4b4e1fb9907/ -o ./miqu-3.0bpw -b 3.0
<... some output I didn't fully log ...>
 --   model.layers.78.mlp                                3.3595 bpw - exp. error: 0.01445030
 --   model.layers.79.self_attn                          2.2185 bpw - exp. error: 0.00499376
 --   model.layers.79.mlp                                2.5856 bpw - exp. error: 0.01804905
 -- Tokenizing samples...
 -- Token embeddings again...
 -- Quantizing...
 -- Layer: model.layers.0 (Attention)
 -- Linear: model.layers.0.self_attn.q_proj -> 0.1:3b_64g/0.9:2b_64g s4, 2.16 bpw
 ## Quantization error (2)

The output directory looks like this:

$ tree miqu-3.0bpw/
miqu-3.0bpw/
├── cal_data.safetensors
├── hidden_states.safetensors
├── job_new.json
├── measurement.json
└── out_tensor
    └── model.layers.0.self_attn.q_proj.safetensors

1 directory, 5 files

Re-running the command resumes the existing job and hits the same error:

 -- Resuming job
 !! Note: Overriding options with settings from existing job
 -- Input: /home/user/.cache/huggingface/hub/models--alpindale--miqu-1-70b-fp16/snapshots/f8267dda117a9192ed42ad31aba3c4b4e1fb9907/
 -- Output: ./miqu-3.0bpw
 -- Using default calibration dataset
 -- Target bits per weight: 3.0 (decoder), 6 (head)
 -- Max shard size: 8192 MB
 -- Quantizing...
 -- Layer: model.layers.0 (Attention)
 -- Linear: model.layers.0.self_attn.q_proj -> 0.1:3b_64g/0.9:2b_64g s4, 2.16 bpw
 ## Quantization error (2)

The error is raised by this line of exllamav2; I'm not sure why it happens.
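
Because resuming just replays the same failing tensor, a clean retry after fixing the input weights means clearing the persisted job state. Below is a minimal sketch, assuming the directory layout shown above, that convert.py keys resuming on job_new.json, and that the kept measurement.json can be reused via convert.py's -m option to skip the measurement pass:

# Minimal sketch: clear the persisted job state so convert.py starts a fresh
# job instead of resuming into the same failure. measurement.json is kept so
# it can (assumption) be passed back with convert.py's -m option.
import shutil
from pathlib import Path

out_dir = Path("./miqu-3.0bpw")
for name in ("job_new.json", "hidden_states.safetensors", "cal_data.safetensors"):
    (out_dir / name).unlink(missing_ok=True)               # per-job state
shutil.rmtree(out_dir / "out_tensor", ignore_errors=True)  # partially written tensors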

turboderp commented 5 months ago

I've been working on that model for a bit, and it appears to be a broken conversion, though I can't figure out what's broken about it. Yet. FP16 inference on wikitext gives really high perplexity even before quantizing.

The error during quantization is because it catastrophically fails to quantize and then reconstruct the first layer, which suggests out-of-bounds values or some such, maybe even in the RMS norm weights or the embeddings. I'm looking into it.
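
One quick way to test that hypothesis is to scan the FP16 shards for non-finite or extreme values. A rough sketch using safetensors and torch; the model path and the 1e4 threshold are illustrative, not from the issue:

# Scan every tensor in the checkpoint for NaN/Inf and unusually large values.
import glob
import torch
from safetensors.torch import safe_open

model_dir = "/path/to/miqu-1-70b-fp16"  # hypothetical local path
for shard in sorted(glob.glob(f"{model_dir}/*.safetensors")):
    with safe_open(shard, framework="pt", device="cpu") as f:
        for name in f.keys():
            t = f.get_tensor(name).float()
            nonfinite = (~torch.isfinite(t)).sum().item()
            amax = t.abs().max().item()
            if nonfinite or amax > 1e4:
                print(f"{name}: non-finite={nonfinite}, abs-max={amax:.3g}")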

152334H commented 5 months ago

My suspicion is that the FP16 GGUF -> FP16 PyTorch conversion is broken for the attention weights specifically, due to unaddressed permutations from the original PyTorch -> GGUF conversion.

I don't think the rest of the weights can be bad. In particular, the embeddings have to be shaped correctly, because the PyTorch model still generates semi-relevant text given an input prompt.
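
For context on that suspicion: llama.cpp's convert script permutes the rows of q_proj and k_proj when writing GGUF to match its rotary-embedding layout, so a GGUF -> PyTorch conversion has to apply the inverse permutation to exactly those two tensors; every other weight passes through unchanged, which is consistent with the embeddings and MLPs still looking fine. A sketch of the permutation and its inverse, modeled on that helper (head count and shapes are illustrative; for k_proj under grouped-query attention the KV-head count is used instead):

import torch

def gguf_permute(w: torch.Tensor, n_heads: int) -> torch.Tensor:
    # Interleave rows the way llama.cpp expects: (rows, cols) ->
    # (n_heads, 2, rows // n_heads // 2, cols) -> swap axes -> flatten.
    return (w.reshape(n_heads, 2, w.shape[0] // n_heads // 2, *w.shape[1:])
             .swapaxes(1, 2)
             .reshape(w.shape))

def gguf_unpermute(w: torch.Tensor, n_heads: int) -> torch.Tensor:
    # Inverse of gguf_permute: recovers the original HF row order.
    return (w.reshape(n_heads, w.shape[0] // n_heads // 2, 2, *w.shape[1:])
             .swapaxes(1, 2)
             .reshape(w.shape))

# Round-trip check on a dummy q_proj-shaped weight (64 heads, hidden size 8192).
w = torch.randn(8192, 8192, dtype=torch.float16)
assert torch.equal(gguf_unpermute(gguf_permute(w, 64), 64), w)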

turboderp commented 5 months ago

It turns out the weights were indeed broken, and there are now correct FP16 (and EXL2) versions on HF.