turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Inquiring about Calibration Procedures and Issues for Model Using Specified Dataset #262

Closed MatrixC7 closed 10 months ago

MatrixC7 commented 10 months ago

Context and Issue

I'm attempting to quantize the model alpindale/goliath-120b using royallab/PIPPA-cleaned for role-play applications, employing exllamav2 (commit 26ffee3). The quantization is done in 2 steps:

# To get the measurement of the model with a certain dataset for calibration
python.exe .\convert.py -i "F:\goliath-120b" -o "F:\llm-models-exl2\temp" -nr -om "F:\llm-models-exl2\goliath-120b-rpcal-measurement.json" -c "F:\llm-models-exl2\PIPPA-cleaned\pippa_raw_fix.parquet"

# Use the measurement to quantize the model to 2.4 bpw
python.exe .\convert.py -i "F:\goliath-120b" -o "F:\llm-models-exl2\temp" -nr -m "F:\llm-models-exl2\goliath-120b-rpcal-measurement.json" -cf "F:\llm-models-exl2\goliath-120b-rpcal-2.4bpw-h6-exl2" -b 2.4 -c "F:\llm-models-exl2\PIPPA-cleaned\pippa_raw_fix.parquet"

The quantization process completes smoothly, with the calibration perplexity reported as Module quantized, calibration perplexity (quant): 7.5812. However, recalculating perplexity on a dataset like wikitext yields significantly higher values.
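(For reference, the perplexity figures below are the exponential of the mean per-token negative log-likelihood over the tokenized evaluation rows. A minimal sketch of that computation, not the actual test_inference.py code:)

```python
import torch
import torch.nn.functional as F

# Sketch only: perplexity = exp(mean NLL) of next-token predictions.
# logits: (seq_len, vocab_size), token_ids: (seq_len,). Not the actual
# test_inference.py implementation.
def perplexity(logits: torch.Tensor, token_ids: torch.Tensor) -> float:
    nll = F.cross_entropy(logits[:-1].float(), token_ids[1:])
    return torch.exp(nll).item()
```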

(exllamav2) PS F:\exllamav2> python.exe .\test_inference.py -m 'F:\llm-models-exl2\goliath-120b-rpcal-2.4bpw-h6-exl2\' -ed F:\wikitext\wikitext-103-v1\test-00000-of-00001.parquet -gs 18,24
 -- Model: F:\llm-models-exl2\goliath-120b-rpcal-2.4bpw-h6-exl2\
 -- Options: ['gpu_split: 18,24']
 -- Loading model...
 -- Loading tokenizer...
 -- Running perplexity test
 -- Dataset: F:\wikitext\wikitext-103-v1\test-00000-of-00001.parquet
 -- Tokenizing eval data, 128 rows x 2048 tokens...
 -- First 50 tokens of dataset:
    ' = Robert Boulter = \n  Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed'
 -- Last 50 tokens of dataset:
    'was secured by community activists for the first time on 5 January 1969 following an incursion into the <unk> by members of the Royal Ulster Constabulary ( RUC ) . Residents built barric'
 -- Inference.............
 -- Evaluation perplexity: 141.4893

(exllamav2) PS F:\exllamav2> python.exe .\test_inference.py -m 'F:\llm-models-exl2\goliath-120b-rpcal-2.4bpw-h6-exl2\' -ed F:\wikitext\wikitext-2-v1\test-00000-of-00001.parquet -gs 18,24
 -- Model: F:\llm-models-exl2\goliath-120b-rpcal-2.4bpw-h6-exl2\
 -- Options: ['gpu_split: 18,24']
 -- Loading model...
 -- Loading tokenizer...
 -- Running perplexity test
 -- Dataset: F:\wikitext\wikitext-2-v1\test-00000-of-00001.parquet
 -- Tokenizing eval data, 128 rows x 2048 tokens...
 -- First 50 tokens of dataset:
    ' = Robert <unk> = \n  Robert <unk> is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed'
 -- Last 50 tokens of dataset:
    'his 1999 <unk> Treatment of Palm <unk> <unk> , American botanist Sidney F. <unk> divided the group into five genera — a more narrowly defined <unk> , <'
 -- Inference.............
 -- Evaluation perplexity: 39.3915

Perplexity stays close to the reported 7.5812 only when evaluating on the original calibration dataset, where it comes out at 8.4998.

(exllamav2) PS F:\exllamav2> python.exe .\test_inference.py -m 'F:\llm-models-exl2\goliath-120b-rpcal-2.4bpw-h6-exl2\' -ed F:\llm-models-exl2\PIPPA-cleaned\pippa_raw_fix.parquet -gs 18,24
 -- Model: F:\llm-models-exl2\goliath-120b-rpcal-2.4bpw-h6-exl2\
 -- Options: ['gpu_split: 18,24']
 -- Loading model...
 -- Loading tokenizer...
 -- Running perplexity test
 -- Dataset: F:\llm-models-exl2\PIPPA-cleaned\pippa_raw_fix.parquet
 -- Tokenizing eval data, 128 rows x 2048 tokens...
 -- First 50 tokens of dataset:
    'You are now roleplaying as Kamisato Ayaka. Kamisato Ayaka can be described as such: She is Akane the maid of Ayaka. One day Ayaka found a gold ring that grant all the wishes of its'
 -- Last 50 tokens of dataset:
    'highness” "It will, I will make sure of it."\n\nThe matriarch gave Diego a kind smile, before disappearing from his view. Diego could feel his presence slipping away from him, he was about to be free'
 -- Inference.............
 -- Evaluation perplexity: 8.4998

Comparing the original and quantized models with model_diff.py also fails with an error.

(exllamav2) PS F:\exllamav2> python.exe .\model_diff.py -ma F:\goliath-120b\ -mb F:\llm-models-exl2\goliath-120b-rpcal-2.4bpw-h6-exl2\ -ed F:\wikitext\wikitext-2-v1\test-00000-of-00001.parquet
 -- Model A: F:\goliath-120b\
 -- Model B: F:\llm-models-exl2\goliath-120b-rpcal-2.4bpw-h6-exl2\
 -- Loading tokenizer
 -- Tokenizing eval data
 -- First 50 tokens of dataset:
    ' = Robert <unk> = \n  Robert <unk> is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed'
 -- Last 50 tokens of dataset:
    'by September 9 , the division continued to harass rear areas around <unk> with infiltrating groups as large as companies . <unk> daily had to open the main supply road and clear the town . \n  North Korean and'
Traceback (most recent call last):
  File "F:\exllamav2\model_diff.py", line 70, in <module>
    attn_mask = model[0].build_attn_mask(1, seq_len, 0, None, "cuda:0")
                ^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'ExLlamaV2' object has no attribute 'build_attn_mask'

Inquiries

  1. Quantization Correctness: Is my approach to dataset-specific calibrated quantization appropriate? If not, could you guide me on the correct method?

  2. Perplexity Variance: I understand why perplexity might be low on the dataset the model was calibrated to, but why does it climb so high on wikitext? Is dataset-specific calibrated quantization expected to behave this way?

  3. Calibration Mechanism: Could you provide a brief explanation of the underlying mechanics of dataset calibration? And if the observed behavior is expected, why so?

  4. Default Calibration Dataset: What is the default calibration dataset used? For perplexity calculation with wikitext, which subset should I utilize: test, train, or validation?

  5. Unknown Parameters: In some online code snippets,

    python3 convert.py -i <path to model> -o <path to working directory> -nr -om <model name>_measurement.json -mr 10 -gr 10 -c <parquet dataset file> && python3 convert.py -i <path to model> -o <path to working directory> -nr -m <model name>_measurement.json -b 4.85 -gr 40 -c <same parquet dataset file> -cf <model name>-exl2-4.85bpw

    I've encountered parameters like -mr and -gr which aren't documented in the manual. Could you explain their functions and usage, possibly with examples?

  6. Additional Parameters in test_inference.py: Are there hidden parameters similar to -mr and -gr in test_inference.py?

  7. Comparison Errors: What might be causing the errors during model comparison, and how can I resolve them?

Attempts and Documentation

I've gone through the documentation and my own attempts without success, so I feel I need your assistance. I deeply appreciate your time in addressing these queries and issues.

turboderp commented 10 months ago

First off, Goliath is a bit of a wildcard, because it's a Frankenstein model. It's two 70B finetunes sliced up and glued back together, so it's hard to speculate as to why it works when it does, or what might make it fail to perform well under extreme circumstances like heavy quantization.

As for your questions:

  1. You're using the script correctly from what I can tell. But whether it's appropriate in the first place to use a task-specific calibration set is another matter. In theory it makes sense: you want the model to be better at reproducing internal states it would experience while doing RP, and calibrating with a dataset like Pippa might in theory do that. But the more aggressive the quantization, the more heavily the quantizer is going to rely on the calibration data, to the point where at 2.4bpw it might be overfitting somewhat.

  2. The perplexity is going to vary with the difference between the calibration dataset and the test dataset, though I generally don't see it varying this much. Values as high as 141 do suggest that the model is working, but obviously the results will be unacceptable. Still, this is different from results in the thousands or millions, which would be the case if the weights had been completely scrambled due to an outright bug in the code. So I would assume quantization is working "as intended" but you've hit a limit with that combination of model, dataset and bitrate.

  3. The calibration process is the same as what's described in the GPTQ paper, which is a variation on OBQ.

The basic idea is that each quantized matrix is produced not by rounding the floating-point weights to the nearest point on a quantization grid, but instead to treat it as a reconstruction problem: Find the quantized matrix that does the same thing as the original, with the smallest possible error with respect to a representative sample of all the actual input states that the matrix is going to be working on in inference.

For a high enough bitrate, the solution is trivial, but as you lower the bitrate the solution starts to rely more and more on correlations between weights and patterns in the input. E.g. if the input data always shows two parameters perfectly correlating, the corresponding weights can "cooperate" to project those two coordinates more precisely.

The idea is that you're going to have an error one way or another, but you can concentrate the error in the unimportant features, shifting it away from the important ones. But that only works as long as the calibration data is actually representative, and as long as it remains representative throughout the forward pass. Otherwise you end up trying to navigate the space that you shifted the error into, and that's obviously problematic.
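As a rough illustration of that reconstruction idea (a sketch only, not the exllamav2 quantizer, and ignoring the Hessian-based error compensation that GPTQ actually uses to solve it efficiently):

```python
import torch

# Sketch: for one linear layer with calibration inputs x and original
# weights w, GPTQ-style quantization seeks grid-constrained weights w_q
# minimizing the *output* error ||x @ w - x @ w_q||^2, rather than simply
# rounding every weight to the nearest grid point.
def reconstruction_error(x: torch.Tensor, w: torch.Tensor, w_q: torch.Tensor) -> float:
    # x: (num_calibration_tokens, in_features); w, w_q: (in_features, out_features)
    return torch.norm(x @ w - x @ w_q).item() ** 2

# Naive round-to-nearest baseline, for comparison:
def round_to_grid(w: torch.Tensor, step: float) -> torch.Tensor:
    return torch.round(w / step) * step
```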

  4. The default dataset is intended to be more general, to avoid the risk of overfitting or of overlooking important features. It contains a mix of different languages, code, legal text, technical text, prose and more. Some rows start with a BOS token and some don't, and it even includes some amount of completely random tokens.

The dataset is built here for reference. The reason it's not just a .parquet file is that it's tokenizer-aware (you can't get random token IDs from tokenizing random text, for instance).
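As a hedged illustration of the tokenizer-aware part: the random rows have to be sampled from the vocabulary directly, since tokenizing random text would only ever produce the token IDs the tokenizer emits for real strings (sizes below are assumptions, and this is not the actual dataset-building code):

```python
import random

# Illustration only: a "completely random tokens" calibration row is drawn
# from the vocabulary directly. Vocab size and row length are assumed here.
vocab_size = 32000   # Llama-family vocabulary size, assumed
row_length = 2048    # matches the 2048-token calibration rows used above
random_row = [random.randrange(vocab_size) for _ in range(row_length)]
```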

I've found this to produce much more stable conversions overall. To be clear, it's not that you can't use a custom dataset. But the more specialized it is, and the more aggressive the quantization, the more you're going to also specialize the quantized model. And I can't give you any pointers on how to recognize when a dataset is too narrow, because I just don't know at the end of the day. Especially since it likely depends on the model undergoing conversion, and as mentioned Goliath is a wildcard.

  5. -mr is documented here and -gr was removed around version 0.0.10.

  6. You can run python test_inference.py -h for a complete list of arguments. They're not thoroughly documented, simply because I don't have the time to keep an updated reference. They also change from time to time since a lot of it is just for experimentation.

  7. The problem you're having with model_diff.py is due to me trying to work around some shortcomings of flash-attn. I made some changes to the attention interface and just forgot to update that particular script. It's fixed with the latest commit.


I noticed in your tests there are a lot of instances of <unk> in one of your wikitext datasets. I'm not sure what that's about but it would probably be a good idea to confirm that the dataset hasn't been incorrectly converted at some point. Otherwise it could indicate a tokenization problem with ExLlamaV2.
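One quick way to check the raw file (a sketch with pandas, not part of exllamav2; the column name is assumed):

```python
import pandas as pd

# Sketch: count rows of the wikitext parquet containing a literal "<unk>".
df = pd.read_parquet(r"F:\wikitext\wikitext-2-v1\test-00000-of-00001.parquet")
text_col = df.columns[0]  # assumed to be the text column
mask = df[text_col].str.contains("<unk>", regex=False, na=False)
print(f"{int(mask.sum())} of {len(df)} rows contain a literal '<unk>'")
```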

Anyway, I converted Goliath-120B to 2.4bpw with the built-in dataset, and here are the results I'm getting:

| bpw  | wiki (128 rows) | c4 (20 rows) | pippa (20 rows) |
|------|-----------------|--------------|-----------------|
| 2.4  | 8.62            | 6.83         | 7.42            |
| 3.0  | 7.95            | 6.29         | 7.59            |
| FP16 | 6.89            | 5.91         | 8.03            |

The results for Pippa are anomalous and perhaps that's worth looking into. But again, Goliath is not expected to be a well-behaved model, so there are a lot of variables to consider. Overall though the results look fairly reasonable, I think.

I'm uploading the 2.4bpw model here but with my upload bandwidth it will take several hours. Once it's done you could use that for reference, I guess.

MatrixC7 commented 10 months ago

Thank you for the clear and insightful explanation! I've learned a lot from your reply.

Background Information

My 2nd measurement.json:

{
    "measurement": {
        "model.layers.0.self_attn": [
            {
                "accuracy": 0.9792405366897583,
                "total_bits": 319709184,

measurement.json downloaded from turboderp/Goliath-120B-exl2:

{
    "measurement": {
        "model.layers.0.self_attn": [
            {
                "accuracy": 0.97910076379776,
                "total_bits": 319709184,


- Attached are the results from model_diff.py comparing the original model with the dataset-quantized model.

<details>
<summary>Result of model_diff.py</summary>

(exllamav2) PS F:\exllamav2> python.exe .\model_diff.py -ma F:\goliath-120b\ -mb F:\llm-models-exl2\goliath-120b-rpcal-2.4bpw-h6-exl2\ -ed F:\wikitext\wikitext-2-v1\test-00000-of-00001.parquet -- Model A: F:\goliath-120b\ -- Model B: F:\llm-models-exl2\goliath-120b-rpcal-2.4bpw-h6-exl2\ -- Loading tokenizer -- Tokenizing eval data -- First 50 tokens of dataset: ' = Robert = \n Robert is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed' -- Last 50 tokens of dataset: 'by September 9 , the division continued to harass rear areas around with infiltrating groups as large as companies . daily had to open the main supply road and clear the town . \n North Korean and' -- Embeddings -- model.layers.0 (Attention) rfn_error: 0.024483 -- model.layers.0 (MLP) rfn_error: 0.046116 -- model.layers.1 (Attention) rfn_error: 0.049563 -- model.layers.1 (MLP) rfn_error: 0.073991 -- model.layers.2 (Attention) rfn_error: 0.071228 -- model.layers.2 (MLP) rfn_error: 0.032060 -- model.layers.3 (Attention) rfn_error: 0.034097 -- model.layers.3 (MLP) rfn_error: 0.050494 -- model.layers.4 (Attention) rfn_error: 0.053541 -- model.layers.4 (MLP) rfn_error: 0.062685 -- model.layers.5 (Attention) rfn_error: 0.064363 -- model.layers.5 (MLP) rfn_error: 0.070385 -- model.layers.6 (Attention) rfn_error: 0.072919 -- model.layers.6 (MLP) rfn_error: 0.079503 -- model.layers.7 (Attention) rfn_error: 0.081505 -- model.layers.7 (MLP) rfn_error: 0.083243 -- model.layers.8 (Attention) rfn_error: 0.085755 -- model.layers.8 (MLP) rfn_error: 0.034274 -- model.layers.9 (Attention) rfn_error: 0.035092 -- model.layers.9 (MLP) rfn_error: 0.038638 -- model.layers.10 (Attention) rfn_error: 0.039203 -- model.layers.10 (MLP) rfn_error: 0.042859 -- model.layers.11 (Attention) rfn_error: 0.043770 -- model.layers.11 (MLP) rfn_error: 0.047151 -- model.layers.12 (Attention) rfn_error: 0.048246 -- model.layers.12 (MLP) rfn_error: 0.051643 -- model.layers.13 (Attention) rfn_error: 0.053995 -- model.layers.13 (MLP) rfn_error: 0.057074 -- model.layers.14 (Attention) rfn_error: 0.059318 -- model.layers.14 (MLP) rfn_error: 0.061869 -- model.layers.15 (Attention) rfn_error: 0.064190 -- model.layers.15 (MLP) rfn_error: 0.066608 -- model.layers.16 (Attention) rfn_error: 0.067776 -- model.layers.16 (MLP) rfn_error: 0.058333 -- model.layers.17 (Attention) rfn_error: 0.058995 -- model.layers.17 (MLP) rfn_error: 0.058674 -- model.layers.18 (Attention) rfn_error: 0.058937 -- model.layers.18 (MLP) rfn_error: 0.059642 -- model.layers.19 (Attention) rfn_error: 0.060017 -- model.layers.19 (MLP) rfn_error: 0.061155 -- model.layers.20 (Attention) rfn_error: 0.061648 -- model.layers.20 (MLP) rfn_error: 0.063130 -- model.layers.21 (Attention) rfn_error: 0.064240 -- model.layers.21 (MLP) rfn_error: 0.066098 -- model.layers.22 (Attention) rfn_error: 0.067355 -- model.layers.22 (MLP) rfn_error: 0.069553 -- model.layers.23 (Attention) rfn_error: 0.071171 -- model.layers.23 (MLP) rfn_error: 0.073588 -- model.layers.24 (Attention) rfn_error: 0.074599 -- model.layers.24 (MLP) rfn_error: 0.075697 -- model.layers.25 (Attention) rfn_error: 0.076972 -- model.layers.25 (MLP) rfn_error: 0.077565 -- model.layers.26 (Attention) rfn_error: 0.078772 -- model.layers.26 (MLP) rfn_error: 0.079835 -- model.layers.27 (Attention) rfn_error: 0.081497 -- model.layers.27 (MLP) rfn_error: 0.083221 -- model.layers.28 (Attention) rfn_error: 0.084244 -- model.layers.28 (MLP) rfn_error: 0.086398 -- 
model.layers.29 (Attention) rfn_error: 0.087746 -- model.layers.29 (MLP) rfn_error: 0.089982 -- model.layers.30 (Attention) rfn_error: 0.090766 -- model.layers.30 (MLP) rfn_error: 0.093338 -- model.layers.31 (Attention) rfn_error: 0.094021 -- model.layers.31 (MLP) rfn_error: 0.096699 -- model.layers.32 (Attention) rfn_error: 0.097867 -- model.layers.32 (MLP) rfn_error: 0.098807 -- model.layers.33 (Attention) rfn_error: 0.100072 -- model.layers.33 (MLP) rfn_error: 0.104670 -- model.layers.34 (Attention) rfn_error: 0.106276 -- model.layers.34 (MLP) rfn_error: 0.113756 -- model.layers.35 (Attention) rfn_error: 0.115171 -- model.layers.35 (MLP) rfn_error: 0.127204 -- model.layers.36 (Attention) rfn_error: 0.128823 -- model.layers.36 (MLP) rfn_error: 0.147130 -- model.layers.37 (Attention) rfn_error: 0.150576 -- model.layers.37 (MLP) rfn_error: 0.223778 -- model.layers.38 (Attention) rfn_error: 0.229935 -- model.layers.38 (MLP) rfn_error: 0.439989 -- model.layers.39 (Attention) rfn_error: 0.446495 -- model.layers.39 (MLP) rfn_error: 0.798936 -- model.layers.40 (Attention) rfn_error: 0.804838 -- model.layers.40 (MLP) rfn_error: 1.211245 -- model.layers.41 (Attention) rfn_error: 1.220758 -- model.layers.41 (MLP) rfn_error: 1.619035 -- model.layers.42 (Attention) rfn_error: 1.627340 -- model.layers.42 (MLP) rfn_error: 2.139340 -- model.layers.43 (Attention) rfn_error: 2.149149 -- model.layers.43 (MLP) rfn_error: 2.639979 -- model.layers.44 (Attention) rfn_error: 2.645124 -- model.layers.44 (MLP) rfn_error: 3.165941 -- model.layers.45 (Attention) rfn_error: 3.177646 -- model.layers.45 (MLP) rfn_error: 3.798840 -- model.layers.46 (Attention) rfn_error: 3.803255 -- model.layers.46 (MLP) rfn_error: 4.371706 -- model.layers.47 (Attention) rfn_error: 4.340717 -- model.layers.47 (MLP) rfn_error: 4.741758 -- model.layers.48 (Attention) rfn_error: 4.729691 -- model.layers.48 (MLP) rfn_error: 5.294932 -- model.layers.49 (Attention) rfn_error: 5.281461 -- model.layers.49 (MLP) rfn_error: 5.918711 -- model.layers.50 (Attention) rfn_error: 5.915104 -- model.layers.50 (MLP) rfn_error: 6.617719 -- model.layers.51 (Attention) rfn_error: 6.607515 -- model.layers.51 (MLP) rfn_error: 7.483106 -- model.layers.52 (Attention) rfn_error: 7.486366 -- model.layers.52 (MLP) rfn_error: 8.581953 -- model.layers.53 (Attention) rfn_error: 8.575595 -- model.layers.53 (MLP) rfn_error: 9.761292 -- model.layers.54 (Attention) rfn_error: 9.748262 -- model.layers.54 (MLP) rfn_error: 11.066325 -- model.layers.55 (Attention) rfn_error: 11.042789 -- model.layers.55 (MLP) rfn_error: 12.212918 -- model.layers.56 (Attention) rfn_error: 12.215701 -- model.layers.56 (MLP) rfn_error: 13.551597 -- model.layers.57 (Attention) rfn_error: 13.554549 -- model.layers.57 (MLP) rfn_error: 14.623949 -- model.layers.58 (Attention) rfn_error: 14.579722 -- model.layers.58 (MLP) rfn_error: 15.280513 -- model.layers.59 (Attention) rfn_error: 15.202538 -- model.layers.59 (MLP) rfn_error: 15.615108 -- model.layers.60 (Attention) rfn_error: 15.555171 -- model.layers.60 (MLP) rfn_error: 15.836576 -- model.layers.61 (Attention) rfn_error: 15.728579 -- model.layers.61 (MLP) rfn_error: 15.879166 -- model.layers.62 (Attention) rfn_error: 15.701415 -- model.layers.62 (MLP) rfn_error: 15.813448 -- model.layers.63 (Attention) rfn_error: 15.727242 -- model.layers.63 (MLP) rfn_error: 15.877780 -- model.layers.64 (Attention) rfn_error: 15.817752 -- model.layers.64 (MLP) rfn_error: 15.953315 -- model.layers.65 (Attention) rfn_error: 15.864428 -- model.layers.65 (MLP) 
rfn_error: 15.970726 -- model.layers.66 (Attention) rfn_error: 15.877405 -- model.layers.66 (MLP) rfn_error: 15.972544 -- model.layers.67 (Attention) rfn_error: 15.896235 -- model.layers.67 (MLP) rfn_error: 15.968322 -- model.layers.68 (Attention) rfn_error: 15.857764 -- model.layers.68 (MLP) rfn_error: 15.855494 -- model.layers.69 (Attention) rfn_error: 15.734972 -- model.layers.69 (MLP) rfn_error: 15.854070 -- model.layers.70 (Attention) rfn_error: 15.742919 -- model.layers.70 (MLP) rfn_error: 15.853381 -- model.layers.71 (Attention) rfn_error: 15.714951 -- model.layers.71 (MLP) rfn_error: 15.786237 -- model.layers.72 (Attention) rfn_error: 15.721155 -- model.layers.72 (MLP) rfn_error: 15.747146 -- model.layers.73 (Attention) rfn_error: 15.562919 -- model.layers.73 (MLP) rfn_error: 15.626210 -- model.layers.74 (Attention) rfn_error: 15.568840 -- model.layers.74 (MLP) rfn_error: 15.643689 -- model.layers.75 (Attention) rfn_error: 15.622794 -- model.layers.75 (MLP) rfn_error: 15.706998 -- model.layers.76 (Attention) rfn_error: 15.575759 -- model.layers.76 (MLP) rfn_error: 15.633969 -- model.layers.77 (Attention) rfn_error: 15.444465 -- model.layers.77 (MLP) rfn_error: 15.392360 -- model.layers.78 (Attention) rfn_error: 15.254514 -- model.layers.78 (MLP) rfn_error: 15.175502 -- model.layers.79 (Attention) rfn_error: 15.120242 -- model.layers.79 (MLP) rfn_error: 14.997802 -- model.layers.80 (Attention) rfn_error: 14.859291 -- model.layers.80 (MLP) rfn_error: 14.805400 -- model.layers.81 (Attention) rfn_error: 14.763048 -- model.layers.81 (MLP) rfn_error: 14.735511 -- model.layers.82 (Attention) rfn_error: 14.716822 -- model.layers.82 (MLP) rfn_error: 14.709574 -- model.layers.83 (Attention) rfn_error: 14.621709 -- model.layers.83 (MLP) rfn_error: 14.604877 -- model.layers.84 (Attention) rfn_error: 14.582900 -- model.layers.84 (MLP) rfn_error: 14.671310 -- model.layers.85 (Attention) rfn_error: 14.658393 -- model.layers.85 (MLP) rfn_error: 14.750892 -- model.layers.86 (Attention) rfn_error: 14.741389 -- model.layers.86 (MLP) rfn_error: 14.844012 -- model.layers.87 (Attention) rfn_error: 14.819396 -- model.layers.87 (MLP) rfn_error: 14.909616 -- model.layers.88 (Attention) rfn_error: 14.879453 -- model.layers.88 (MLP) rfn_error: 14.978202 -- model.layers.89 (Attention) rfn_error: 14.964467 -- model.layers.89 (MLP) rfn_error: 15.051505 -- model.layers.90 (Attention) rfn_error: 15.047162 -- model.layers.90 (MLP) rfn_error: 15.144187 -- model.layers.91 (Attention) rfn_error: 15.112198 -- model.layers.91 (MLP) rfn_error: 15.191370 -- model.layers.92 (Attention) rfn_error: 15.163075 -- model.layers.92 (MLP) rfn_error: 15.152829 -- model.layers.93 (Attention) rfn_error: 15.132187 -- model.layers.93 (MLP) rfn_error: 15.142163 -- model.layers.94 (Attention) rfn_error: 15.104807 -- model.layers.94 (MLP) rfn_error: 15.111040 -- model.layers.95 (Attention) rfn_error: 15.075363 -- model.layers.95 (MLP) rfn_error: 15.100779 -- model.layers.96 (Attention) rfn_error: 15.080756 -- model.layers.96 (MLP) rfn_error: 15.095042 -- model.layers.97 (Attention) rfn_error: 15.087076 -- model.layers.97 (MLP) rfn_error: 15.116823 -- model.layers.98 (Attention) rfn_error: 15.081482 -- model.layers.98 (MLP) rfn_error: 15.093150 -- model.layers.99 (Attention) rfn_error: 15.076149 -- model.layers.99 (MLP) rfn_error: 15.182627 -- model.layers.100 (Attention) rfn_error: 15.180132 -- model.layers.100 (MLP) rfn_error: 15.286374 -- model.layers.101 (Attention) rfn_error: 15.265441 -- model.layers.101 (MLP) rfn_error: 15.365915 
-- model.layers.102 (Attention) rfn_error: 15.362569 -- model.layers.102 (MLP) rfn_error: 15.443079 -- model.layers.103 (Attention) rfn_error: 15.441002 -- model.layers.103 (MLP) rfn_error: 15.420868 -- model.layers.104 (Attention) rfn_error: 15.420726 -- model.layers.104 (MLP) rfn_error: 15.401339 -- model.layers.105 (Attention) rfn_error: 15.396993 -- model.layers.105 (MLP) rfn_error: 15.378838 -- model.layers.106 (Attention) rfn_error: 15.380030 -- model.layers.106 (MLP) rfn_error: 15.355573 -- model.layers.107 (Attention) rfn_error: 15.342151 -- model.layers.107 (MLP) rfn_error: 15.347394 -- model.layers.108 (Attention) rfn_error: 15.309440 -- model.layers.108 (MLP) rfn_error: 15.316791 -- model.layers.109 (Attention) rfn_error: 15.307363 -- model.layers.109 (MLP) rfn_error: 15.320192 -- model.layers.110 (Attention) rfn_error: 15.315373 -- model.layers.110 (MLP) rfn_error: 15.340410 -- model.layers.111 (Attention) rfn_error: 15.335851 -- model.layers.111 (MLP) rfn_error: 15.363124 -- model.layers.112 (Attention) rfn_error: 15.352658 -- model.layers.112 (MLP) rfn_error: 15.379140 -- model.layers.113 (Attention) rfn_error: 15.376225 -- model.layers.113 (MLP) rfn_error: 15.383567 -- model.layers.114 (Attention) rfn_error: 15.383359 -- model.layers.114 (MLP) rfn_error: 15.354123 -- model.layers.115 (Attention) rfn_error: 15.358710 -- model.layers.115 (MLP) rfn_error: 15.344510 -- model.layers.116 (Attention) rfn_error: 15.340006 -- model.layers.116 (MLP) rfn_error: 15.312299 -- model.layers.117 (Attention) rfn_error: 15.317131 -- model.layers.117 (MLP) rfn_error: 15.307445 -- model.layers.118 (Attention) rfn_error: 15.313990 -- model.layers.118 (MLP) rfn_error: 15.295371 -- model.layers.119 (Attention) rfn_error: 15.272911 -- model.layers.119 (MLP) rfn_error: 15.244082 -- model.layers.120 (Attention) rfn_error: 15.223117 -- model.layers.120 (MLP) rfn_error: 15.185397 -- model.layers.121 (Attention) rfn_error: 15.152503 -- model.layers.121 (MLP) rfn_error: 15.072214 -- model.layers.122 (Attention) rfn_error: 15.076782 -- model.layers.122 (MLP) rfn_error: 15.085037 -- model.layers.123 (Attention) rfn_error: 15.075885 -- model.layers.123 (MLP) rfn_error: 14.993953 -- model.layers.124 (Attention) rfn_error: 14.997370 -- model.layers.124 (MLP) rfn_error: 14.925478 -- model.layers.125 (Attention) rfn_error: 14.924323 -- model.layers.125 (MLP) rfn_error: 14.836319 -- model.layers.126 (Attention) rfn_error: 14.798179 -- model.layers.126 (MLP) rfn_error: 14.697093 -- model.layers.127 (Attention) rfn_error: 14.661101 -- model.layers.127 (MLP) rfn_error: 14.556688 -- model.layers.128 (Attention) rfn_error: 14.508231 -- model.layers.128 (MLP) rfn_error: 14.398990 -- model.layers.129 (Attention) rfn_error: 14.377891 -- model.layers.129 (MLP) rfn_error: 14.318698 -- model.layers.130 (Attention) rfn_error: 14.309216 -- model.layers.130 (MLP) rfn_error: 14.233741 -- model.layers.131 (Attention) rfn_error: 14.189313 -- model.layers.131 (MLP) rfn_error: 14.103726 -- model.layers.132 (Attention) rfn_error: 14.046529 -- model.layers.132 (MLP) rfn_error: 13.988785 -- model.layers.133 (Attention) rfn_error: 13.889768 -- model.layers.133 (MLP) rfn_error: 13.828109 -- model.layers.134 (Attention) rfn_error: 13.744127 -- model.layers.134 (MLP) rfn_error: 13.537598 -- model.layers.135 (Attention) rfn_error: 13.459430 -- model.layers.135 (MLP) rfn_error: 13.241351 -- model.layers.136 (Attention) rfn_error: 13.227460 -- model.layers.136 (MLP) rfn_error: 13.241922 -- model.norm (RMSNorm) rfn_error: 0.690526 -- lm_head 
(Linear) rfn_error: 0.533024 -- Testing outputs


4.92154110;37.95624100

2.39380025 0.00000518

1;0.62415730;0.47496336;0.65036639 2;0.73568637;0.56289692;0.38629702 3;0.78891060;0.60779189;0.19167074 4;0.82217880;0.63595506;0.08119199 5;0.84445530;0.65483635;0.03231558

0;0.02448289 1;0.04611557 2;0.04956301 3;0.07399067 4;0.07122838 5;0.03206011 6;0.03409727 7;0.05049352 8;0.05354137 9;0.06268534 10;0.06436253 11;0.07038530 12;0.07291865 13;0.07950331 14;0.08150543 15;0.08324279 16;0.08575454 17;0.03427399 18;0.03509181 19;0.03863776 20;0.03920262 21;0.04285876 22;0.04376979 23;0.04715057 24;0.04824552 25;0.05164316 26;0.05399549 27;0.05707350 28;0.05931760 29;0.06186888 30;0.06418952 31;0.06660769 32;0.06777596 33;0.05833259 34;0.05899484 35;0.05867357 36;0.05893672 37;0.05964238 38;0.06001712 39;0.06115475 40;0.06164806 41;0.06312989 42;0.06424034 43;0.06609832 44;0.06735466 45;0.06955310 46;0.07117135 47;0.07358777 48;0.07459857 49;0.07569653 50;0.07697208 51;0.07756539 52;0.07877212 53;0.07983511 54;0.08149701 55;0.08322144 56;0.08424431 57;0.08639779 58;0.08774580 59;0.08998214 60;0.09076612 61;0.09333778 62;0.09402101 63;0.09669860 64;0.09786724 65;0.09880704 66;0.10007246 67;0.10466992 68;0.10627639 69;0.11375573 70;0.11517064 71;0.12720367 72;0.12882319 73;0.14712983 74;0.15057568 75;0.22377764 76;0.22993474 77;0.43998882 78;0.44649488 79;0.79893553 80;0.80483830 81;1.21124494 82;1.22075784 83;1.61903477 84;1.62734032 85;2.13934016 86;2.14914894 87;2.63997912 88;2.64512396 89;3.16594148 90;3.17764592 91;3.79884028 92;3.80325484 93;4.37170553 94;4.34071732 95;4.74175835 96;4.72969103 97;5.29493189 98;5.28146124 99;5.91871119 100;5.91510439 101;6.61771870 102;6.60751486 103;7.48310566 104;7.48636580 105;8.58195305 106;8.57559490 107;9.76129246 108;9.74826241 109;11.06632519 110;11.04278946 111;12.21291828 112;12.21570110 113;13.55159664 114;13.55454922 115;14.62394905 116;14.57972240 117;15.28051281 118;15.20253754 119;15.61510754 120;15.55517101 121;15.83657551 122;15.72857857 123;15.87916565 124;15.70141506 125;15.81344795 126;15.72724152 127;15.87777996 128;15.81775188 129;15.95331478 130;15.86442757 131;15.97072601 132;15.87740517 133;15.97254372 134;15.89623451 135;15.96832180 136;15.85776424 137;15.85549450 138;15.73497200 139;15.85406971 140;15.74291897 141;15.85338116 142;15.71495056 143;15.78623676 144;15.72115517 145;15.74714565 146;15.56291866 147;15.62621021 148;15.56884003 149;15.64368916 150;15.62279415 151;15.70699787 152;15.57575893 153;15.63396931 154;15.44446468 155;15.39235973 156;15.25451374 157;15.17550182 158;15.12024212 159;14.99780178 160;14.85929108 161;14.80539989 162;14.76304817 163;14.73551083 164;14.71682167 165;14.70957375 166;14.62170887 167;14.60487652 168;14.58290005 169;14.67131042 170;14.65839291 171;14.75089169 172;14.74138927 173;14.84401226 174;14.81939602 175;14.90961647 176;14.87945271 177;14.97820187 178;14.96446705 179;15.05150509 180;15.04716206 181;15.14418697 182;15.11219788 183;15.19137001 184;15.16307545 185;15.15282917 186;15.13218689 187;15.14216328 188;15.10480690 189;15.11104012 190;15.07536316 191;15.10077858 192;15.08075619 193;15.09504223 194;15.08707619 195;15.11682320 196;15.08148193 197;15.09315014 198;15.07614899 199;15.18262672 200;15.18013191 201;15.28637409 202;15.26544094 203;15.36591530 204;15.36256886 205;15.44307899 206;15.44100189 207;15.42086792 208;15.42072582 209;15.40133858 210;15.39699268 211;15.37883759 212;15.38002968 213;15.35557270 214;15.34215069 215;15.34739399 216;15.30943966 217;15.31679058 218;15.30736256 219;15.32019234 220;15.31537342 221;15.34041023 222;15.33585072 223;15.36312389 224;15.35265827 225;15.37913990 226;15.37622547 227;15.38356686 228;15.38335896 229;15.35412312 230;15.35871029 231;15.34451008 232;15.34000587 233;15.31229877 234;15.31713104 
235;15.30744457 236;15.31398964 237;15.29537106 238;15.27291107 239;15.24408245 240;15.22311687 241;15.18539715 242;15.15250301 243;15.07221413 244;15.07678223 245;15.08503723 246;15.07588482 247;14.99395275 248;14.99736977 249;14.92547798 250;14.92432308 251;14.83631897 252;14.79817867 253;14.69709301 254;14.66110134 255;14.55668831 256;14.50823116 257;14.39898968 258;14.37789059 259;14.31869793 260;14.30921650 261;14.23374081 262;14.18931293 263;14.10372639 264;14.04652882 265;13.98878479 266;13.88976765 267;13.82810879 268;13.74412727 269;13.53759766 270;13.45942974 271;13.24135113 272;13.22745991 273;13.24192238 274;0.69052577 275;0.53302395


 -- A, ppl: 4.92154110  acc: 0.6242  0.7357  0.7889  0.8222  0.8445
 -- B, ppl: 37.95624100  acc: 0.4750  0.5629  0.6078  0.6360  0.6548
 -- Top-K agreement: 0.6504  0.3863  0.1917  0.0812  0.0323
 -- KL divergence: 2.39380025
 -- MSE: 0.00000518



</details>
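For context on the summary figures above, the KL divergence is presumably the mean per-token KL(P_A || P_B) between the two models' next-token distributions over the evaluation data. A minimal sketch of that computation, not the actual model_diff.py code, with logits_a and logits_b as assumed placeholder tensors:

```python
import torch
import torch.nn.functional as F

# Sketch only: mean per-token KL(P_A || P_B) between two models' output
# distributions. logits_a / logits_b are assumed tensors of shape
# (num_tokens, vocab_size); this is not the model_diff.py implementation.
def mean_kl_divergence(logits_a: torch.Tensor, logits_b: torch.Tensor) -> float:
    log_p_a = F.log_softmax(logits_a.float(), dim=-1)
    log_p_b = F.log_softmax(logits_b.float(), dim=-1)
    kl_per_token = (log_p_a.exp() * (log_p_a - log_p_b)).sum(dim=-1)
    return kl_per_token.mean().item()
```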

### Questions

1. Could the higher perplexity observed during quantization with the pippa dataset be attributed to different versions of exllamav2? I plan to download and use the code from around December 20, 2023, to reattempt quantization and see if it improves the perplexity of the quantized model.

2. Have there been any significant updates or changes to convert.py since December 20, 2023?

3. Looking at the model_diff.py results, what might explain the substantial error increase starting around layer 35, and the minor fluctuations observed after layer 70?

I appreciate your time and look forward to your insights and suggestions!
MatrixC7 commented 10 months ago
  • A friend of mine recently shared that his rpcal model showed a considerably lower perplexity (around 8), despite him reportedly using identical commands, parameters, and dataset to mine. The notable difference is the version of exllamav2 he used during quantization, from around December 20, 2023.
    1. Could the higher perplexity observed during quantization with the pippa dataset be attributed to different versions of exllamav2? I plan to download and use the code from around December 20, 2023, to reattempt quantization and see if it improves the perplexity of the quantized model.

I've tried commit 162fc5d. However, there was no change in either the measurements or the wikitext perplexity.

# with commit version 162fc5d
{
    "measurement": {
        "model.layers.0.self_attn": [
            {
                "accuracy": 0.985510528087616,
                "total_bits": 319709184,

# with commit version 26ffee3
{
    "measurement": {
        "model.layers.0.self_attn": [
            {
                "accuracy": 0.985510528087616,
                "total_bits": 319709184,                

# Finishing the quantization
 -- Module quantized, calibration perplexity (quant): 7.5700

# Perplexity with wikitext-103-v1
(exllamav2) PS E:\Download\exllamav2-162fc5d62c6d329f8492a8ab8424e5ad05da3dbb> python.exe .\test_inference.py -m F:\llm-models-exl2\goliath-120b-rpcal-2.4bpw-h6-exl2-20240107-oldversion\ -ed F:\wikitext\wikitext-103-v1\test-00000-of-00001.parquet -gs 18,24
 -- Model: F:\llm-models-exl2\goliath-120b-rpcal-2.4bpw-h6-exl2-20240107-oldversion\
 -- Options: ['gpu_split: 18,24', 'rope_scale: 1.0', 'rope_alpha: 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Running perplexity test
 -- Dataset: F:\wikitext\wikitext-103-v1\test-00000-of-00001.parquet
 -- Tokenizing eval data, 128 rows x 2048 tokens...
 -- First 50 tokens of dataset:
    ' = Robert Boulter = \n  Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed'
 -- Last 50 tokens of dataset:
    'was secured by community activists for the first time on 5 January 1969 following an incursion into the <unk> by members of the Royal Ulster Constabulary ( RUC ) . Residents built barric'
 -- Inference.............
 -- Evaluation perplexity: 178.4219
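Incidentally, rather than eyeballing just the first entry of each measurement.json, the two files can be diffed programmatically. A sketch assuming the layout shown above (file names are placeholders):

```python
import json

# Sketch: report per-module accuracy differences between two measurement
# files, assuming the {"measurement": {module: [{"accuracy": ...}, ...]}}
# layout shown above. File names are placeholders.
with open("measurement_162fc5d.json") as fa, open("measurement_26ffee3.json") as fb:
    a = json.load(fa)["measurement"]
    b = json.load(fb)["measurement"]

for module in sorted(set(a) & set(b)):
    acc_a = a[module][0]["accuracy"]
    acc_b = b[module][0]["accuracy"]
    if abs(acc_a - acc_b) > 1e-6:
        print(f"{module}: {acc_a:.6f} vs {acc_b:.6f}")
```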

I also came across Panchovix's quantization which utilized the same dataset for calibration. His quantization doesn't seem to suffer from the high perplexity that I'm experiencing, which adds to my confusion.

(exllamav2) PS E:\Download\exllamav2-162fc5d62c6d329f8492a8ab8424e5ad05da3dbb> python.exe .\test_inference.py -m F:\llm-models-exl2\Panchovix_goliath-120b-exl2-rpcal_3bpw\ -ed F:\wikitext\wikitext-103-v1\test-00000-of-00001.parquet -gs 22.8,24
 -- Model: F:\llm-models-exl2\Panchovix_goliath-120b-exl2-rpcal_3bpw\
 -- Options: ['gpu_split: 22.8,24', 'rope_scale: 1.0', 'rope_alpha: 1.0']
 -- Loading model...
 -- Loading tokenizer...
 -- Running perplexity test
 -- Dataset: F:\wikitext\wikitext-103-v1\test-00000-of-00001.parquet
 -- Tokenizing eval data, 128 rows x 2048 tokens...
 -- First 50 tokens of dataset:
    ' = Robert Boulter = \n  Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed'
 -- Last 50 tokens of dataset:
    'was secured by community activists for the first time on 5 January 1969 following an incursion into the <unk> by members of the Royal Ulster Constabulary ( RUC ) . Residents built barric'
 -- Inference.............
 -- Evaluation perplexity: 12.7348

Given that my method for quantizing the model with the dataset appears correct, I'm puzzled as to why my results differ so significantly. Additionally, I've verified that my GPU works properly: it successfully quantizes models when not using a specified dataset.

Could there be an underlying factor I'm overlooking that might account for this discrepancy in perplexity?

turboderp commented 10 months ago

I'll look into this a little more later, but is it possible your Pippa dataset is bad? I'm also suspecting possible character encoding issues related to Windows. I'll be in a better position to test on Windows soon, but in the meantime I guess I can try converting with Pippa later today to see if I get the same behavior on Linux. Do you have a link to where you got your exact copy of the parquet file?

MatrixC7 commented 10 months ago

> I'll look into this a little more later, but is it possible your Pippa dataset is bad? I'm also suspecting possible character encoding issues related to Windows. I'll be in a better position to test on Windows soon, but in the meantime I guess I can try converting with Pippa later today to see if I get the same behavior on Linux. Do you have a link to where you got your exact copy of the parquet file?

The PIPPA parquet file is from the Hugging Face repository royallab/PIPPA-cleaned. I have also considered this possibility, but the SHA256 checksum is unchanged: E3792FFD85EBB51B05F7636E54F67CB64239D980C6FB29E888BE744E286FF997.
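(One way to recompute that checksum locally, using only the standard library; the path is the one from the convert.py commands above:)

```python
import hashlib

# Recompute the SHA256 of the downloaded parquet and compare it with the
# value quoted above.
with open(r"F:\llm-models-exl2\PIPPA-cleaned\pippa_raw_fix.parquet", "rb") as f:
    print(hashlib.sha256(f.read()).hexdigest().upper())
```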

MatrixC7 commented 10 months ago

Sorry to trouble you but any progress?

turboderp commented 10 months ago

I've had some delays upgrading my PC, so everything's been a little disassembled recently and I haven't had a chance to try converting this model again. It basically prevents me from working on anything else for several hours, so I'm going to try to fit that in somewhere, but I'm not sure when exactly.

It'll also be a little while longer before I have a Windows PC set up, and it really looks like it's a Windows-specific issue you're running into. It could very well have to do with character encoding.

MatrixC7 commented 10 months ago

No worries. I'll try performing the quantization in WSL to check whether it works there, and I'll let you know the results.

MatrixC7 commented 10 months ago

Here's the result from quantizing in WSL, but the perplexity checked in Windows is still high, at about 34. :( I should note that my Windows system is set to a CJK locale and I have enabled the option Beta: Use Unicode UTF-8 for worldwide language support. I'll also give it a try in a Windows virtual machine with an English (en-US) locale when I have the time.

(exllamav2) PS F:\exllamav2> python.exe .\test_inference.py -m F:\llm-models-exl2\goliath-120b-rpcal-2.65bpw-h6-exl2-20240118\ -ed F:\wikitext\wikitext-103-v1\test-00000-of-00001.parquet -gs 19,24
 -- Model: F:\llm-models-exl2\goliath-120b-rpcal-2.65bpw-h6-exl2-20240118\
 -- Options: ['gpu_split: 19,24']
 -- Loading model...
 -- Loading tokenizer...
 -- Running perplexity test
 -- Dataset: F:\wikitext\wikitext-103-v1\test-00000-of-00001.parquet
 -- Tokenizing eval data, 128 rows x 2048 tokens...
 -- First 50 tokens of dataset:
    ' = Robert Boulter = \n  Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed'
 -- Last 50 tokens of dataset:
    'was secured by community activists for the first time on 5 January 1969 following an incursion into the <unk> by members of the Royal Ulster Constabulary ( RUC ) . Residents built barric'
 -- Inference.............
 -- Evaluation perplexity: 34.6642
MatrixC7 commented 10 months ago

Finally, some good news!

The primary reason for the calibration failure is likely not related to the version of ExLlamaV2 or the Windows/Linux encoding, but rather to the dataset.

My entire checking process for all three possibilities:

| ExLlamaV2 version   | Windows/Linux                 | Dataset                | Perplexity |
|---------------------|-------------------------------|------------------------|------------|
| old commit 162fc5d  | Windows (zh-cn locale)        | royallab/PIPPA-cleaned | ×, 30+     |
| new commit 26ffee3  | Windows (zh-cn locale)        | royallab/PIPPA-cleaned | ×, 30+     |
| new commit 26ffee3  | Windows (zh-cn locale)        | royallab/PIPPA-cleaned | ×, 30+     |
| new commit 26ffee3  | WSL in Windows (zh-cn locale) | royallab/PIPPA-cleaned | ×, 30+     |
| new commit 26ffee3  | Windows (en-us locale)        | royallab/PIPPA-cleaned | ×, 30+     |
| new commit 26ffee3  | Windows (zh-cn locale)        | royallab/PIPPA-cleaned | ×, 30+     |
| new commit 26ffee3  | Windows (zh-cn locale)        | VatsaDev/worldbuild    | √, 7.4     |

However, I'm puzzled as to why the parquet file from royallab/PIPPA-cleaned yields unsatisfactory results. To add context, the parquet I use is the one provided directly in the Hugging Face repository, rather than the one converted by the bot in the refs/convert/parquet branch. This leads me to speculate that the directly provided parquet might be the problematic one, especially since the parquet from VatsaDev/worldbuild that works normally was converted from .jsonl by the bot. I plan to revisit this when I have more time.
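For reference, a bot-style conversion from the source .jsonl can also be done locally; a minimal sketch with pandas, with placeholder file names:

```python
import pandas as pd

# Sketch: convert a .jsonl dataset to .parquet, roughly what the Hugging Face
# refs/convert/parquet bot produces. File names are placeholders.
df = pd.read_json("pippa.jsonl", lines=True)
df.to_parquet("pippa_converted.parquet", index=False)
```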

Anyway, I'm happy that two weeks' effort was not in vain. I once again appreciate your detailed explanation and your ongoing attention to this issue!

MatrixC7 commented 10 months ago

Just finished the calibration with the bot-converted PIPPA parquet, and still no luck: Evaluation perplexity: 46.7529. Maybe I just have no luck with this dataset. :( Anyway, I'll use the worldbuild dataset for calibration from now on. Closing the issue!