turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

Quantizing goliath120b @ 3bpw : calibration perplexity (quant): 2745.1239 #246

Closed: alexconstant9108 closed this issue 6 months ago

alexconstant9108 commented 6 months ago

Hi, I tried the new quant method (master branch) with goliath 120b, using the built-in calibration dataset (no -c parameter) and -b 3.0 -hb 8 -rs 1.0. The conversion finished with:

 -- Module quantized, calibration perplexity (quant): 2745.1239

The model outputs only garbage tokens when loaded. Example: teksttekteksttekstteksttekcesstekcesstekcesstekcesstekcesstekcesstekce... I also tried quantizing at a higher bpw but got similar results. Is it a rope scaling issue, a tokenization issue, or something else with the new quant method?

Another test: I quantized deepseek-coder-33b with -b 4.0 -hb 8 -rs 4.0 (rope scale 4.0, because I know that is what DS Coder expects) and got:

 -- Module quantized, calibration perplexity (quant): 12.5544

The perplexity is again quite high, but at least it is NOT > 2000 :D The output of the model looks OK.
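
For reference, both conversions above were plain convert.py runs along these lines (the directory paths here are placeholders, not my actual ones):

python convert.py -i /path/to/goliath-120b -o /path/to/work-goliath -b 3.0 -hb 8 -rs 1.0
python convert.py -i /path/to/deepseek-coder-33b -o /path/to/work-deepseek -b 4.0 -hb 8 -rs 4.0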

What could be the problem with quantizing goliath 120b?

turboderp commented 6 months ago

I'm not sure. It's not a very well-behaved model necessarily, being merged together from two 70B models in a highly unscientific process. Do you have a console log from the conversion?

alexconstant9108 commented 6 months ago

Most of the console history got lost, but toward the end it looks like this:

 -- Layer: model.layers.134 (MLP)
 -- Linear: model.layers.134.mlp.gate_proj -> 0.05:3b_64g/0.95:2b_64g s4, 2.12 bpw
 -- Linear: model.layers.134.mlp.up_proj -> 0.05:3b_64g/0.95:2b_64g s4, 2.12 bpw
 -- Linear: model.layers.134.mlp.down_proj -> 0.05:6b_32g/0.2:3b_64g/0.75:2b_64g s4, 2.47 bpw
 -- Module quantized, rfn_error: 0.005277
 -- Layer: model.layers.135 (Attention)
 -- Linear: model.layers.135.self_attn.q_proj -> 0.1:3b_64g/0.9:2b_64g s4, 2.16 bpw
 -- Linear: model.layers.135.self_attn.k_proj -> 0.1:3b_64g/0.9:2b_64g s4, 2.16 bpw
 -- Linear: model.layers.135.self_attn.v_proj -> 0.1:4b_128g/0.9:3b_128g s4, 3.14 bpw
 -- Linear: model.layers.135.self_attn.o_proj -> 0.1:3b_64g/0.9:2b_64g s4, 2.16 bpw
 -- Module quantized, rfn_error: 0.000982
 -- Layer: model.layers.135 (MLP)
 -- Linear: model.layers.135.mlp.gate_proj -> 0.05:3b_64g/0.95:2b_64g s4, 2.12 bpw
 -- Linear: model.layers.135.mlp.up_proj -> 0.05:3b_64g/0.95:2b_64g s4, 2.12 bpw
 -- Linear: model.layers.135.mlp.down_proj -> 0.05:6b_32g/0.2:3b_64g/0.75:2b_64g s4, 2.47 bpw
 -- Module quantized, rfn_error: 0.005241
 -- Layer: model.layers.136 (Attention)
 -- Linear: model.layers.136.self_attn.q_proj -> 0.1:3b_64g/0.9:2b_64g s4, 2.16 bpw
 -- Linear: model.layers.136.self_attn.k_proj -> 0.1:3b_64g/0.9:2b_64g s4, 2.16 bpw
 -- Linear: model.layers.136.self_attn.v_proj -> 0.1:4b_128g/0.9:3b_128g s4, 3.14 bpw
 -- Linear: model.layers.136.self_attn.o_proj -> 0.1:3b_64g/0.9:2b_64g s4, 2.16 bpw
 -- Module quantized, rfn_error: 0.000556
 -- Layer: model.layers.136 (MLP)
 -- Linear: model.layers.136.mlp.gate_proj -> 0.1:4b_128g/0.9:3b_128g s4, 3.14 bpw
 -- Linear: model.layers.136.mlp.up_proj -> 0.25:4b_128g/0.75:3b_128g s4, 3.28 bpw
 -- Linear: model.layers.136.mlp.down_proj -> 0.05:8b_32g/0.1:4b_128g/0.85:3b_128g s4, 3.39 bpw
 -- Module quantized, rfn_error: 0.004609
 -- Layer: model.norm (RMSNorm)
 -- Module quantized, rfn_error: 0.000000
 -- Layer: lm_head (Linear)
 -- Linear: lm_head -> 1.0:8b_128g s4, 8.03 bpw
 -- Module quantized, calibration perplexity (quant): 2745.1239
 -- Compiling output file...
turboderp commented 6 months ago

I suppose so, yes. You could also maybe try the model_diff.py script on the original and quantized models.

python model_diff.py -ma <original_dir> -mb <quantized_dir> -ed <wikitext-test-or-whatever.parquet>

alexconstant9108 commented 6 months ago

So far the rfn_error seems to be climbing steadily: [Edit: it dipped a bit at model.layers.16 (MLP) and model.layers.17 (Attention)]

wikitext-103-v1-test.parquet
 -- Model A: ../goliath120b
 -- Model B: ../goliath120bMyQuant/3.0bpw/
 -- Loading tokenizer
 -- Tokenizing eval data
 -- First 50 tokens of dataset:
    ' = Robert Boulter = \n  Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed'
 -- Last 50 tokens of dataset:
    'the brink of defeat . \n  By this time , the US 2nd Infantry Division suffered 1 @,@ 120 killed , 2 @,@ 563 wounded , 67 captured and 6'
 -- Embeddings
 -- model.layers.0 (Attention)               rfn_error: 0.015922
 -- model.layers.0 (MLP)                     rfn_error: 0.027177
 -- model.layers.1 (Attention)               rfn_error: 0.031267
 -- model.layers.1 (MLP)                     rfn_error: 0.032254
 -- model.layers.2 (Attention)               rfn_error: 0.033474
 -- model.layers.2 (MLP)                     rfn_error: 0.017809
 -- model.layers.3 (Attention)               rfn_error: 0.020256
 -- model.layers.3 (MLP)                     rfn_error: 0.023616
 -- model.layers.4 (Attention)               rfn_error: 0.026043
 -- model.layers.4 (MLP)                     rfn_error: 0.029989
 -- model.layers.5 (Attention)               rfn_error: 0.031825
 -- model.layers.5 (MLP)                     rfn_error: 0.036167
 -- model.layers.6 (Attention)               rfn_error: 0.038758
 -- model.layers.6 (MLP)                     rfn_error: 0.042597
 -- model.layers.7 (Attention)               rfn_error: 0.044790
 -- model.layers.7 (MLP)                     rfn_error: 0.045228
 -- model.layers.8 (Attention)               rfn_error: 0.046283
 -- model.layers.8 (MLP)                     rfn_error: 0.035437
 -- model.layers.9 (Attention)               rfn_error: 0.036156
 -- model.layers.9 (MLP)                     rfn_error: 0.038263
 -- model.layers.10 (Attention)              rfn_error: 0.038777
 -- model.layers.10 (MLP)                    rfn_error: 0.041209
 -- model.layers.11 (Attention)              rfn_error: 0.042014
 -- model.layers.11 (MLP)                    rfn_error: 0.042669
 -- model.layers.12 (Attention)              rfn_error: 0.043561
 -- model.layers.12 (MLP)                    rfn_error: 0.044494
 -- model.layers.13 (Attention)              rfn_error: 0.046362
 -- model.layers.13 (MLP)                    rfn_error: 0.046964
 -- model.layers.14 (Attention)              rfn_error: 0.048840
 -- model.layers.14 (MLP)                    rfn_error: 0.049265
 -- model.layers.15 (Attention)              rfn_error: 0.051290
 -- model.layers.15 (MLP)                    rfn_error: 0.051815
 -- model.layers.16 (Attention)              rfn_error: 0.052383
 -- model.layers.16 (MLP)                    rfn_error: 0.044537
 -- model.layers.17 (Attention)              rfn_error: 0.044996
 -- model.layers.17 (MLP)                    rfn_error: 0.045652
 -- model.layers.18 (Attention)              rfn_error: 0.045929
 -- model.layers.18 (MLP)                    rfn_error: 0.047184
 -- model.layers.19 (Attention)              rfn_error: 0.047517
 -- model.layers.19 (MLP)                    rfn_error: 0.048966
 -- model.layers.20 (Attention)              rfn_error: 0.049343
 -- model.layers.20 (MLP)                    rfn_error: 0.050308
 -- model.layers.21 (Attention)              rfn_error: 0.051239
 -- model.layers.21 (MLP)                    rfn_error: 0.052267
 -- model.layers.22 (Attention)              rfn_error: 0.053304
 -- model.layers.22 (MLP)                    rfn_error: 0.054426
 -- model.layers.23 (Attention)              rfn_error: 0.055847
 -- model.layers.23 (MLP)                    rfn_error: 0.056801
 -- model.layers.24 (Attention)              rfn_error: 0.057731
 -- model.layers.24 (MLP)                    rfn_error: 0.057488
 -- model.layers.25 (Attention)              rfn_error: 0.058635
 -- model.layers.25 (MLP)                    rfn_error: 0.058085
 -- model.layers.26 (Attention)              rfn_error: 0.059225
 -- model.layers.26 (MLP)                    rfn_error: 0.058937
 -- model.layers.27 (Attention)              rfn_error: 0.060438
 -- model.layers.27 (MLP)                    rfn_error: 0.060675
 -- model.layers.28 (Attention)              rfn_error: 0.061720
 -- model.layers.28 (MLP)                    rfn_error: 0.062242
 -- model.layers.29 (Attention)              rfn_error: 0.063532
 -- model.layers.29 (MLP)                    rfn_error: 0.064182
 -- model.layers.30 (Attention)              rfn_error: 0.064893
 -- model.layers.30 (MLP)                    rfn_error: 0.065751
 -- model.layers.31 (Attention)              rfn_error: 0.066470
 -- model.layers.31 (MLP)                    rfn_error: 0.067452
 -- model.layers.32 (Attention)              rfn_error: 0.068474
 -- model.layers.32 (MLP)                    rfn_error: 0.068648
 -- model.layers.33 (Attention)              rfn_error: 0.069634
 -- model.layers.33 (MLP)                    rfn_error: 0.069190
 -- model.layers.34 (Attention)              rfn_error: 0.070316
 -- model.layers.34 (MLP)                    rfn_error: 0.070618
 -- model.layers.35 (Attention)              rfn_error: 0.071625
 -- model.layers.35 (MLP)                    rfn_error: 0.072434
 -- model.layers.36 (Attention)              rfn_error: 0.073511
 -- model.layers.36 (MLP)                    rfn_error: 0.074606
 -- model.layers.37 (Attention)              rfn_error: 0.075349
 -- model.layers.37 (MLP)                    rfn_error: 0.076719
 -- model.layers.38 (Attention)              rfn_error: 0.077374
 -- model.layers.38 (MLP)                    rfn_error: 0.078987
 -- model.layers.39 (Attention)              rfn_error: 0.079390
 -- model.layers.39 (MLP)                    rfn_error: 0.079783
 -- model.layers.40 (Attention)              rfn_error: 0.080729
 -- model.layers.40 (MLP)                    rfn_error: 0.080540
 -- model.layers.41 (Attention)              rfn_error: 0.081597
 -- model.layers.41 (MLP)                    rfn_error: 0.081705
 -- model.layers.42 (Attention)              rfn_error: 0.082440
 -- model.layers.42 (MLP)                    rfn_error: 0.082805
 -- model.layers.43 (Attention)              rfn_error: 0.083692
 -- model.layers.43 (MLP)                    rfn_error: 0.084192
 -- model.layers.44 (Attention)              rfn_error: 0.085154
 -- model.layers.44 (MLP)                    rfn_error: 0.085747
 -- model.layers.45 (Attention)              rfn_error: 0.086519
 -- model.layers.45 (MLP)                    rfn_error: 0.087305
 -- model.layers.46 (Attention)              rfn_error: 0.088304
 -- model.layers.46 (MLP)                    rfn_error: 0.089137
 -- model.layers.47 (Attention)              rfn_error: 0.089982
 -- model.layers.47 (MLP)                    rfn_error: 0.090458
turboderp commented 6 months ago

I'm currently trying to reproduce it here, and that will take some hours to complete. But that's the cumulative error between the original and the quantized model. It's expected to go up a bit and then stabilize or decrease after a while. At least up until layer 17 there's no indication that anything has gone wrong.
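
For reference, the rfn_error number is a relative Frobenius-norm error between the two models' hidden states at each layer; in spirit it is something like this simplified sketch (not the actual code in model_diff.py):

import torch

def rfn_error(ref: torch.Tensor, quant: torch.Tensor) -> float:
    # Relative Frobenius-norm error between the reference (FP16) model's
    # hidden state and the quantized model's hidden state at the same layer.
    return (torch.linalg.norm(quant - ref) / torch.linalg.norm(ref)).item()

So values in the 0.01 to 0.09 range like the ones you posted are small relative deviations, which is why nothing there looks obviously broken.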

It could be in the head layer, I suppose. The -hb argument is supposed to be an integer, but I guess argparse is just rounding it down automatically. The 8-bit head should be fine, but I haven't tested it as extensively. I'll know more once I have a quantized model and can run some tests on it.

turboderp commented 6 months ago

Okay, so I did the same conversion here, to 3.0 bpw, both with -hb 6 and -hb 8, and I'm not seeing any issues on my end.

-- Module quantized, calibration perplexity (quant): 6.2759

The model also seems to work fine afterwards:

(q4) [bb@bbc exllamav2]$ python test_inference.py -m /mnt/str/models/goliath-120b-exl2/3.0bpw_h8/ -p "Once upon a time," -ed /mnt/str/datasets/wikitext-test.parquet -er 10 -gs auto
 -- Model: /mnt/str/models/goliath-120b-exl2/3.0bpw_h8/
 -- Options: ['gpu_split: auto', 'rope_scale: 1.0', 'rope_alpha: 1.0']
 -- Loading tokenizer...
 -- Loading model...
 -- Warmup...
 -- Generating...

Once upon a time, there was a little girl named Goldilocks. She had three bears who were her best friends: Papa Bear, Mama Bear, and Baby Bear. One day, Goldilocks went for a walk in the forest with her three bear friends.

As they walked, they came across a river. "Oh no!" exclaimed Goldilocks. "How are we going to cross this river?"

Papa Bear, being the strongest, said, "Don't worry, I'll carry you all on my back." So, he carefully carried Goldilocks and the three be

 -- Response generated in 8.05 seconds, 128 tokens, 15.89 tokens/second (includes prompt eval.)
 -- Running perplexity test
 -- Dataset: /mnt/str/datasets/wikitext-test.parquet
 -- Tokenizing eval data, 10 rows x 2048 tokens...
 -- First 50 tokens of dataset:
    'Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed by a starring role in the play Herons'
 -- Last 50 tokens of dataset:
    'lost earlier in the day and sank Chiyoda . He ordered the Fourth Carrier Division to reverse course and engage the Americans , but the battleships were unable to find them , and Ozawa ordered them to reverse course'
 -- Inference.
 -- Evaluation perplexity: 6.4392
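
For scale, that perplexity figure is just exp of the mean negative log-likelihood per token over the eval data, so 2745 vs. the ~6.3 I'm getting means your quant was assigning a (geometric) average per-token probability of about 1/2745 to the calibration text, i.e. the weights are genuinely broken rather than just degraded. A quick sketch of the calculation (not the exact code in the repo):

import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # logits: (num_tokens, vocab_size), targets: (num_tokens,) of token ids.
    nll = F.cross_entropy(logits, targets)  # mean negative log-likelihood
    return torch.exp(nll).item()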

I did identify one place where the quantizer might run out of memory and misinterpret that as a math problem. I'll fix that with a commit in a moment. But even then it shouldn't be able to recover from that state and go on to quantize the model incorrectly.

Are you sure the FP16 version of your model isn't somehow corrupted? I converted this one for reference. The model_diff.py script should also output perplexity for the unquantized model at the end.

turboderp commented 6 months ago

model_diff shouldn't use a lot of VRAM, so that's a little suspicious too. It also only takes like 5 minutes to run (on my 4090), so maybe something's up with that. At any rate, I've uploaded the measurement here. I will also upload the conversion as soon as I have available bandwidth for it.

I checked the MD5 hashes of the files I downloaded yesterday, and it does look like there's a difference:

f71f0eed982443645b3e92f97369c2a6  ./model-00020-of-00024.safetensors

Vs yours:

572ade9c14aa6bf963a3d4832eb97fd8  ./model-00020-of-00024.safetensors

All the other files match. My copy also matches the SHA256 sum from here, so it looks like your download failed somehow, just enough to still yield a valid safetensors file.
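
If you want to re-check on your end, plain coreutils is enough, e.g.:

md5sum ./model-*.safetensors
sha256sum ./model-00020-of-00024.safetensors

and then compare against the checksums listed for each file on the original repo.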

alexconstant9108 commented 6 months ago

Redownloading the 20th partition solved the issue. Thanks!