First off, Goliath is a bit of a wildcard, because it's a Frankenstein model. It's two 70B finetunes sliced up and glued back together, so it's hard to speculate as to why it works when it does, or what might make it fail to perform well under extreme circumstances like heavy quantization.
As for your questions:
You're using the script correctly from what I can tell. But whether it's appropriate in the first place to use a task-specific calibration set is another matter. In theory it makes sense: you want the model to be better at reproducing internal states it would experience while doing RP, and calibrating with a dataset like Pippa might in theory do that. But the more aggressive the quantization, the more heavily the quantizer is going to rely on the calibration data, to the point where at 2.4bpw it might be overfitting somewhat.
The perplexity is going to vary with the difference between the calibration dataset and the test dataset, though I generally don't see it varying this much. Values as high as 141 do suggest that the model is working, but obviously the results will be unacceptable. Still, this is different from results in the thousands or millions, which would be the case if the weights had been completely scrambled due to an outright bug in the code. So I would assume quantization is working "as intended" but you've hit a limit with that combination of model, dataset and bitrate.
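For reference, the perplexity figures being compared here are just the exponential of the average per-token negative log-likelihood over the eval set. A minimal sketch (not the exact code in test_inference.py):

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, token_ids: torch.Tensor) -> float:
    """logits: (seq_len, vocab), token_ids: (seq_len,). Returns exp(mean NLL)."""
    nll = F.cross_entropy(logits[:-1], token_ids[1:])  # predict token t+1 from position t
    return torch.exp(nll).item()
```

At ~141 the model is still putting noticeable probability mass on the correct tokens; a model with genuinely scrambled weights would be much closer to uniform over the vocabulary, i.e. in the thousands or worse.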
The calibration process is the same as what's described in the GPTQ paper, which is a variation on OBQ.
The basic idea is that each quantized matrix is produced not by rounding the floating-point weights to the nearest point on a quantization grid, but by treating it as a reconstruction problem: find the quantized matrix that does the same thing as the original, with the smallest possible error with respect to a representative sample of the actual input states the matrix will encounter during inference.
For a high enough bitrate, the solution is trivial, but as you lower the bitrate the solution starts to rely more and more on correlations between weights and patterns in the input. E.g. if the input data always shows two parameters perfectly correlating, the corresponding weights can "cooperate" to project those two coordinates more precisely.
The idea is that you're going to have an error one way or another, but you can concentrate the error in the unimportant features, shifting it away from the important ones. But that only works as long as the calibration data is actually representative, and as long as it remains representative throughout the forward pass. Otherwise you end up trying to navigate the space that you shifted the error into, and that's obviously problematic.
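To make that concrete, here's a toy sketch of the idea (heavily simplified, not ExLlamaV2's actual quantizer): plain round-to-nearest ignores the inputs entirely, while an OBQ/GPTQ-style pass quantizes one input row at a time and shifts each rounding error onto the rows that haven't been quantized yet, using the calibration inputs' second-moment matrix, so the layer's outputs on that data stay close to the original.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_calib = 64, 64, 256
grid = np.linspace(-1.0, 1.0, 2 ** 3)                                 # crude 3-bit grid
X = rng.normal(size=(n_calib, d_in)) @ rng.normal(size=(d_in, d_in))  # correlated inputs
W = rng.normal(scale=0.3, size=(d_in, d_out))

def rtn(w):
    """Round-to-nearest: snap every weight to the closest grid point, ignoring X."""
    return grid[np.abs(w[..., None] - grid).argmin(-1)]

def obq_like(W, X, damp=1e-2):
    """Greedy OBQ/GPTQ-flavored pass: quantize one input row at a time and shift
    its rounding error onto the not-yet-quantized rows, approximately minimizing
    ||X @ W - X @ Wq||^2 over the calibration inputs X."""
    W, Wq = W.copy(), np.zeros_like(W)
    n = W.shape[0]
    H = X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(n)        # dampening, keeps H invertible
    for i in range(n):
        Hinv = np.linalg.inv(H[i:, i:])                # curvature of the remaining rows
        Wq[i] = rtn(W[i])
        err = (W[i] - Wq[i]) / Hinv[0, 0]
        W[i + 1:] -= np.outer(Hinv[1:, 0], err)        # error compensation step
    return Wq

for name, Wq in (("round-to-nearest", rtn(W)), ("calibration-aware", obq_like(W, X))):
    print(f"{name:17s} output error: {np.linalg.norm(X @ W - X @ Wq):.2f}")
```

With correlated calibration inputs the compensated version typically ends up with a noticeably smaller output error than plain rounding, which is the whole point of feeding the quantizer representative data.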
The dataset is built here for reference. The reason it's not just a .parquet file is that it's tokenizer-aware (you can't get random token IDs from tokenizing random text, for instance).
I've found this to produce much more stable conversions overall. To be clear, it's not that you can't use a custom dataset. But the more specialized it is, and the more aggressive the quantization, the more you're going to also specialize the quantized model. And I can't give you any pointers on how to recognize when a dataset is too narrow, because I just don't know at the end of the day. Especially since it likely depends on the model undergoing conversion, and as mentioned Goliath is a wildcard.
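For illustration only (this is not the actual code linked above, just the gist of why it has to be tokenizer-aware): the built-in set mixes tokenized text with rows of uniformly random token IDs, and the latter can only be produced if you know the vocabulary size.

```python
import torch

def build_calibration_rows(tokenizer, texts, num_random_rows, row_len):
    """Toy sketch: tokenized text rows plus rows of uniformly random token IDs."""
    rows = [torch.tensor(tokenizer.encode(t)[:row_len]) for t in texts]
    for _ in range(num_random_rows):
        # token sequences you can't reach by tokenizing any text
        rows.append(torch.randint(0, tokenizer.vocab_size, (row_len,)))
    return rows
```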
`-mr` is documented here, and `-gr` was removed around version 0.0.10.
You can run `python test_inference.py -h` for a complete list of arguments. They're not thoroughly documented, simply because I don't have the time to keep an updated reference. They also change from time to time since a lot of it is just for experimentation.
The problem you're having with `model_diff.py` is due to me trying to work around some shortcomings of flash-attn. I made some changes to the attention interface and just forgot to update that particular script. It's fixed with the latest commit.
I noticed in your tests there are a lot of instances of `<unk>` in one of your wikitext datasets. I'm not sure what that's about, but it would probably be a good idea to confirm that the dataset hasn't been incorrectly converted at some point. Otherwise it could indicate a tokenization problem with ExLlamaV2.
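A quick way to sanity-check the eval parquet, for example (this assumes the usual single "text" column that the wikitext parquets have):

```python
import pandas as pd

df = pd.read_parquet(r"F:\wikitext\wikitext-103-v1\test-00000-of-00001.parquet")
print(df.columns.tolist(), len(df))
print("rows containing '<unk>':", df["text"].str.contains("<unk>", regex=False).sum())
print(repr(df["text"].iloc[1][:200]))  # eyeball the raw text for encoding damage
```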
Anyway, I converted Goliath-120B to 2.4bpw with the built-in dataset, and here are the results I'm getting:
| bpw | wiki (128 rows) | c4 (20 rows) | pippa (20 rows) |
|---|---|---|---|
| 2.4 | 8.62 | 6.83 | 7.42 |
| 3.0 | 7.95 | 6.29 | 7.59 |
| FP16 | 6.89 | 5.91 | 8.03 |
The results for Pippa are anomalous and perhaps that's worth looking into. But again, Goliath is not expected to be a well-behaved model, so there are a lot of variables to consider. Overall though the results look fairly reasonable, I think.
I'm uploading the 2.4bpw model here but with my upload bandwidth it will take several hours. Once it's done you could use that for reference, I guess.
Thank you for the clear and insightful explanation! I've learned a lot from your reply.
A friend of mine recently shared that his rpcal model exhibited a considerably lower perplexity (around 8), despite, according to him, using identical commands, parameters, and dataset as mine. The notable difference is the version of exllamav2 he used for quantization, from around December 20, 2023.
I've conducted measurements on the goliath-120b model using the pippa dataset twice. I found that for a fixed dataset, the measurements are the same and reproducible.
Following your advice, I noticed that the default dataset is partially randomized in the code. When I measured the goliath-120b model twice without specifying a different dataset via `-c`, the two resulting measurement files did indeed differ.
However, there's a noticeable discrepancy in precision between my generated measurements and your provided measurement file:
```
# my 1st measurement.json
{ "measurement": { "model.layers.0.self_attn": [ { "accuracy": 0.9787838459014893, "total_bits": 319709184,
{ "measurement": { "model.layers.0.self_attn": [ { "accuracy": 0.9792405366897583, "total_bits": 319709184,
{ "measurement": { "model.layers.0.self_attn": [ { "accuracy": 0.97910076379776, "total_bits": 319709184,
```
- Attached are the results from model_diff.py comparing the original model with the pippa-calibrated quantized model.
<details>
<summary>Result of model_diff.py</summary>
(exllamav2) PS F:\exllamav2> python.exe .\model_diff.py -ma F:\goliath-120b\ -mb F:\llm-models-exl2\goliath-120b-rpcal-2.4bpw-h6-exl2\ -ed F:\wikitext\wikitext-2-v1\test-00000-of-00001.parquet
-- Model A: F:\goliath-120b\
-- Model B: F:\llm-models-exl2\goliath-120b-rpcal-2.4bpw-h6-exl2\
-- Loading tokenizer
-- Tokenizing eval data
-- First 50 tokens of dataset:
' = Robert
[per-module comparison data for modules 0–275 omitted]
-- A, ppl: 4.92154110  acc: 0.6242  0.7357  0.7889  0.8222  0.8445
-- B, ppl: 37.95624100  acc: 0.4750  0.5629  0.6078  0.6360  0.6548
-- Top-K agreement: 0.6504  0.3863  0.1917  0.0812  0.0323
-- KL divergence: 2.39380025
-- MSE: 0.00000518
</details>
### Questions
1. Could the higher perplexity observed during quantization with the pippa dataset be attributed to different versions of exllamav2? I plan to download and use the code from around December 20, 2023, to reattempt quantization and see if it improves the perplexity of the quantized model.
2. Have there been any significant updates or changes to convert.py since December 20, 2023?
3. In the model_diff.py results, what might explain the substantial error increase around layer 35 and the minor fluctuations observed after layer 70?
I appreciate your time and look forward to your insights and suggestions!
> A friend of mine recently shared that his rpcal model exhibited a considerably lower perplexity (around 8), despite, according to him, using identical commands, parameters, and dataset as mine. The notable difference is the version of exllamav2 he used for quantization, from around December 20, 2023.

> Could the higher perplexity observed during quantization with the pippa dataset be attributed to different versions of exllamav2? I plan to download and use the code from around December 20, 2023, to reattempt quantization and see if it improves the perplexity of the quantized model.
I've tried commit 162fc5d. However, there were no changes in either the measurements or the wikitext perplexity.
```
# with commit version 162fc5d
{ "measurement": { "model.layers.0.self_attn": [ { "accuracy": 0.985510528087616, "total_bits": 319709184,
# with commit version 26ffee3
{ "measurement": { "model.layers.0.self_attn": [ { "accuracy": 0.985510528087616, "total_bits": 319709184,
```
# Finishing the quantization
-- Module quantized, calibration perplexity (quant): 7.5700
# Perplexity with wikitext-103-v1
(exllamav2) PS E:\Download\exllamav2-162fc5d62c6d329f8492a8ab8424e5ad05da3dbb> python.exe .\test_inference.py -m F:\llm-models-exl2\goliath-120b-rpcal-2.4bpw-h6-exl2-20240107-oldversion\ -ed F:\wikitext\wikitext-103-v1\test-00000-of-00001.parquet -gs 18,24
-- Model: F:\llm-models-exl2\goliath-120b-rpcal-2.4bpw-h6-exl2-20240107-oldversion\
-- Options: ['gpu_split: 18,24', 'rope_scale: 1.0', 'rope_alpha: 1.0']
-- Loading model...
-- Loading tokenizer...
-- Running perplexity test
-- Dataset: F:\wikitext\wikitext-103-v1\test-00000-of-00001.parquet
-- Tokenizing eval data, 128 rows x 2048 tokens...
-- First 50 tokens of dataset:
' = Robert Boulter = \n Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed'
-- Last 50 tokens of dataset:
'was secured by community activists for the first time on 5 January 1969 following an incursion into the <unk> by members of the Royal Ulster Constabulary ( RUC ) . Residents built barric'
-- Inference.............
-- Evaluation perplexity: 178.4219
I also came across Panchovix's quantization which utilized the same dataset for calibration. His quantization doesn't seem to suffer from the high perplexity that I'm experiencing, which adds to my confusion.
(exllamav2) PS E:\Download\exllamav2-162fc5d62c6d329f8492a8ab8424e5ad05da3dbb> python.exe .\test_inference.py -m F:\llm-models-exl2\Panchovix_goliath-120b-exl2-rpcal_3bpw\ -ed F:\wikitext\wikitext-103-v1\test-00000-of-00001.parquet -gs 22.8,24
-- Model: F:\llm-models-exl2\Panchovix_goliath-120b-exl2-rpcal_3bpw\
-- Options: ['gpu_split: 22.8,24', 'rope_scale: 1.0', 'rope_alpha: 1.0']
-- Loading model...
-- Loading tokenizer...
-- Running perplexity test
-- Dataset: F:\wikitext\wikitext-103-v1\test-00000-of-00001.parquet
-- Tokenizing eval data, 128 rows x 2048 tokens...
-- First 50 tokens of dataset:
' = Robert Boulter = \n Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed'
-- Last 50 tokens of dataset:
'was secured by community activists for the first time on 5 January 1969 following an incursion into the <unk> by members of the Royal Ulster Constabulary ( RUC ) . Residents built barric'
-- Inference.............
-- Evaluation perplexity: 12.7348
Given that my method for quantizing the model with the dataset appears correct, I'm puzzled as to why my results differ so significantly. Additionally, I've verified that my GPU works correctly: it successfully quantizes models when no custom dataset is specified.
Could there be an underlying factor I'm overlooking that might account for this discrepancy in perplexity?
I'll look into this a little more later, but is it possible your Pippa dataset is bad? I'm also suspecting possible character encoding issues related to Windows. I'll be in a better position to test on Windows soon, but in the meantime I guess I can try converting with Pippa later today to see if I get the same behavior on Linux. Do you have a link to where you got your exact copy of the parquet file?
The PIPPA parquet file is from huggingface royallab/PIPPA-cleaned. I have also considered this possibility, but the SHA256 checksum stays the same: E3792FFD85EBB51B05F7636E54F67CB64239D980C6FB29E888BE744E286FF997.
Sorry to trouble you, but is there any progress?
I've had some delays upgrading my PC, so everything's been a little disassembled recently and I haven't had a chance to try converting this model again. It basically prevents me from working on anything else for several hours, so I'm going to try to fit that in somewhere, but I'm not sure when exactly.
It'll also be a little while longer before I have a Windows PC set up, and it really looks like it's a Windows-specific issue you're running into. It could very well have to do with character encoding.
No worries. I'll try performing the quantization in WSL to check whether it works there, and I'll let you know once I have an update.
Here's the version quantized in WSL, but the perplexity measured on Windows is still high, at 34. :(
I should note that my Windows system is set to a CJK locale, and I have enabled the option Beta: Use Unicode UTF-8 for worldwide language support.
I'll also give it a try in a Windows virtual machine with an English (EN) locale when I have the time.
(exllamav2) PS F:\exllamav2> python.exe .\test_inference.py -m F:\llm-models-exl2\goliath-120b-rpcal-2.65bpw-h6-exl2-20240118\ -ed F:\wikitext\wikitext-103-v1\test-00000-of-00001.parquet -gs 19,24
-- Model: F:\llm-models-exl2\goliath-120b-rpcal-2.65bpw-h6-exl2-20240118\
-- Options: ['gpu_split: 19,24']
-- Loading model...
-- Loading tokenizer...
-- Running perplexity test
-- Dataset: F:\wikitext\wikitext-103-v1\test-00000-of-00001.parquet
-- Tokenizing eval data, 128 rows x 2048 tokens...
-- First 50 tokens of dataset:
' = Robert Boulter = \n Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed'
-- Last 50 tokens of dataset:
'was secured by community activists for the first time on 5 January 1969 following an incursion into the <unk> by members of the Royal Ulster Constabulary ( RUC ) . Residents built barric'
-- Inference.............
-- Evaluation perplexity: 34.6642
Finally, some good news!
The primary reason for the calibration failure is likely not related to the version of ExLlamaV2 or the Windows/Linux encoding, but rather to the dataset.
Here is my full checking process for all three possibilities:
| ExLlamaV2 Version | Windows/Linux | Dataset | Perplexity |
|---|---|---|---|
| old commit 162fc5d | Windows (zh-cn locale) | royallab/PIPPA-cleaned | ×, 30+ |
| new commit 26ffee3 | Windows (zh-cn locale) | royallab/PIPPA-cleaned | ×, 30+ |
| new commit 26ffee3 | Windows (zh-cn locale) | royallab/PIPPA-cleaned | ×, 30+ |
| new commit 26ffee3 | WSL in Windows (zh-cn locale) | royallab/PIPPA-cleaned | ×, 30+ |
| new commit 26ffee3 | Windows (en-us locale) | royallab/PIPPA-cleaned | ×, 30+ |
| new commit 26ffee3 | Windows (zh-cn locale) | royallab/PIPPA-cleaned | ×, 30+ |
| new commit 26ffee3 | Windows (zh-cn locale) | VatsaDev/worldbuild | √, 7.4 |
However, I'm puzzled as to why the parquet file from royallab/PIPPA-cleaned yields unsatisfactory results. To clarify, the parquet I use is the one provided directly in the Hugging Face repository, rather than the one converted by the bot in the refs/convert/parquet branch. This leads me to speculate that the directly provided parquet might be the problematic one, especially since the parquet from VatsaDev/worldbuild that works normally was converted from .jsonl by the bot. I plan to revisit this issue when I have more time.
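When I get to it, I'll probably compare the two parquet files with something like this (the column name is a guess and the filenames are placeholders):

```python
import hashlib
import pandas as pd

def text_digest(path, column="text"):  # column name is an assumption
    df = pd.read_parquet(path)
    joined = "\n".join(df[column].astype(str))
    return len(df), hashlib.sha256(joined.encode("utf-8")).hexdigest()

print(text_digest("pippa_directly_provided.parquet"))  # placeholder filenames
print(text_digest("pippa_bot_converted.parquet"))
```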
Anyway, I'm happy that two weeks of effort were not in vain. Once again, I appreciate your detailed explanation and your ongoing attention to this issue!
Just finished a calibration with the bot-converted PIPPA parquet, and still no luck: `Evaluation perplexity: 46.7529`. Maybe I just have no luck with this dataset. :(
Anyway, I'll use the worldbuild dataset for calibration from now on.
Closing the issue!
### Context and Issue
I'm attempting to quantize the model alpindale/goliath-120b using royallab/PIPPA-cleaned for role-play applications, employing exllamav2 (commit 26ffee3). The quantization is done in 2 steps:
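Roughly the standard measurement-then-quantize flow from doc/convert.md (the working directory and the calibration parquet path below are placeholders, not my exact command lines):

```
# Step 1: measurement pass with the PIPPA calibration data
python convert.py -i F:\goliath-120b -o F:\exl2-work -om F:\exl2-work\measurement.json -c F:\PIPPA-cleaned\pippa.parquet

# Step 2: quantize to 2.4 bpw (6-bit head) reusing the measurement
python convert.py -i F:\goliath-120b -o F:\exl2-work -m F:\exl2-work\measurement.json -c F:\PIPPA-cleaned\pippa.parquet -b 2.4 -hb 6 -cf F:\llm-models-exl2\goliath-120b-rpcal-2.4bpw-h6-exl2
```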
The quantization process is smooth, with the calibration perplexity reported as `Module quantized, calibration perplexity (quant): 7.5812`. However, recalculating perplexity with a dataset like wikitext yields significantly higher values; perplexity only stays close, at 8.4998, when I evaluate on the original calibration dataset.
Comparing the original and quantized models with model_diff.py also leads to errors.
### Inquiries
1. Quantization Correctness: Is my approach to dataset-specific calibrated quantization appropriate? If not, could you guide me on the correct method?
2. Perplexity Variance: I understand why perplexity might be low on the dataset the model was calibrated to, but why does it get that large with wikitext? Is dataset-specific calibrated quantization supposed to behave this way?
3. Calibration Mechanism: Could you provide a brief explanation of the underlying mechanics of dataset calibration? And if the observed behavior is expected, why so?
4. Default Calibration Dataset: What is the default calibration dataset used? For perplexity calculation with wikitext, which subset should I utilize: test, train, or validation?
5. Unknown Parameters: In some online code snippets, I've encountered parameters like `-mr` and `-gr` which aren't documented in the manual. Could you explain their functions and usage, possibly with examples?
6. Additional Parameters in test_inference.py: Are there hidden parameters similar to `-mr` and `-gr` in test_inference.py?
7. Comparison Errors: What might be causing the errors during model comparison, and how can I resolve them?
### Attempts and Documentation
- My environment: Windows 11 23H2, Python 3.11.7, Torch 2.1.2, CUDA 12.1, nvcc V12.3.103.
- I have verified the integrity of the original alpindale/goliath-120b model via SHA256 checksums.
- I have carefully checked the documentation at https://github.com/turboderp/exllamav2/blob/master/doc/convert.md but found no examples of calibration with a specific dataset; I could only determine that the parameter `-c` is used to pass the calibration dataset.
- I have also attempted to quantize the model in a single step. The perplexity reported in the quantization log is also good, `Module quantized, calibration perplexity (quant): 7.6202`, but when I calculate perplexity on the wikitext dataset, the result is sadly as bad as mentioned before.
- I have also checked repository issues (#246, #130, #129, #83, #44) and scoured the internet for detailed calibration instructions, to no avail.
- Here is also my measurement.json for potential insight: goliath-120b-rpcal-measurement.json
I therefore feel that I need your assistance. I deeply appreciate your time and help in addressing these queries and issues.