Open alex4321 opened 1 year ago
Checked the difference in the way one linear layer works across the modes: https://github.com/alex4321/alpaca_lora_4bit/blob/test-different-faster-modes/test-matmul.ipynb
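For context, a minimal sketch of the kind of per-layer comparison the notebook runs (this is not the notebook's actual code: the `mm.faster_mode` switch is an illustrative assumption about how the mode is toggled, and `quant_layer` is assumed to be one 4-bit quantized linear layer taken from the loaded model):

```python
import torch
import alpaca_lora_4bit.matmul_utils_4bit as mm

def layer_output(layer, x, mode):
    # Assumed switch: the real knob(s) in matmul_utils_4bit may be named differently.
    mm.faster_mode = mode
    with torch.no_grad():
        return layer(x).float()

# quant_layer: one 4-bit quantized linear layer from the loaded model (assumed)
x = torch.randn(1, 128, quant_layer.in_features, dtype=torch.float16, device="cuda")
outs = {m: layer_output(quant_layer, x, m) for m in ("disabled", "faster", "old_faster")}

for a, b in (("disabled", "faster"), ("disabled", "old_faster"), ("faster", "old_faster")):
    print(a.upper(), "-", b.upper(), (outs[a] - outs[b]).abs().mean().item())

# 5% / 95% quantiles of the reference output, to put the MAE numbers on a scale
q = torch.quantile(outs["disabled"].flatten().cpu(), torch.tensor([0.05, 0.95]))
print("DISABLED OUTPUT (5% - 95% quantiles)", q[0].item(), q[1].item())
```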
And, yeah, there is a significant MAE between all the modes (disabled / faster / old_faster):
| Run | DISABLED-FASTER MAE | DISABLED-OLD_FASTER MAE | FASTER-OLD_FASTER MAE | DISABLED output (5%-95% quantiles) |
| --- | --- | --- | --- | --- |
| 1 | 1.0654296875 | 0.86083984375 | 0.90478515625 | -2.06591796875 ... 2.02783203125 |
| 2 | 1.0927734375 | 0.93994140625 | 0.86962890625 | -2.06787109375 ... 2.0029296875 |
| 3 | 1.20703125 | 0.9873046875 | 0.99951171875 | -1.97216796875 ... 2.03076171875 |
| 4 | 1.0576171875 | 0.85595703125 | 0.86328125 | -1.88232421875 ... 1.8505859375 |
| 5 | 1.115234375 | 0.98388671875 | 0.97265625 | -1.98876953125 ... 1.958251953125 |
| 6 | 1.1455078125 | 0.87109375 | 0.92919921875 | -2.00439453125 ... 2.01318359375 |
| 7 | 1.19140625 | 0.98779296875 | 0.90869140625 | -1.967041015625 ... 2.01416015625 |
| 8 | 1.025390625 | 0.90966796875 | 0.880859375 | -2.080078125 ... 2.04296875 |
| 9 | 1.0478515625 | 0.9462890625 | 0.90869140625 | -2.04931640625 ... 2.099609375 |
| 10 | 1.0419921875 | 0.94677734375 | 0.87158203125 | -1.9267578125 ... 1.913330078125 |
So while most of the layer outputs lie within the -2.0 ... 2.0 range, the MAE between the different methods can be up to ~1, i.e. the methods disagree by roughly half the output scale. (Well, I'm not sure that isn't expected for quantization itself, but I doubt we should expect it between different calculation methods applied to the same quantized weights?)
Currently the faster kernel does not support models using act-order, because act-order requires random access into qzeros via g_idx. Random access in VRAM would slow down the whole computation, so there would be some performance loss.
Also, using the non-act-order kernel on a model with act-order may produce inf or nan.
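To illustrate the access-pattern difference, here is a toy dequantization sketch with unpacked integer tensors (real GPTQ weights are bit-packed int32, and the actual kernel code looks nothing like this):

```python
import torch

# Toy shapes; unpacked ints are used here purely for clarity.
in_features, out_features, groupsize = 8, 4, 4
qweight = torch.randint(0, 16, (in_features, out_features))
qzeros  = torch.randint(0, 16, (in_features // groupsize, out_features))
scales  = torch.rand(in_features // groupsize, out_features)

# Non-act-order: the group of row i is simply i // groupsize (sequential, coalesced reads).
g_seq = torch.arange(in_features) // groupsize
w_plain = (qweight - qzeros[g_seq]) * scales[g_seq]

# Act-order: rows were reordered during quantization, so each row's group comes from
# g_idx, i.e. a gather (random access) into qzeros/scales. This is the access pattern
# the faster kernel does not implement.
g_idx = g_seq[torch.randperm(in_features)]  # stand-in for a model's real g_idx
w_act = (qweight - qzeros[g_idx]) * scales[g_idx]
```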
I think you can compare the result from _matmul4bit_v2_recons with the act_order kernel (faster disabled).
Yeah, but in all these cases it's non-act-order (as well as a non-act-order model):
`alpaca_lora_4bit.matmul_utils_4bit.act_order = False`
Okay, will see the difference
Can't reproduce the issue using a fresh setup and the latest winglian-setup_pip branch. So recreating the environment with the latest version of winglian-setup_pip may help whoever is facing a similar issue.
| Mode | Run | Output |
| --- | --- | --- |
| disable | 0 | As an AI language model, I don't have personal beliefs or opinions. However, the meaning of life is |
| disable | 1 | As an AI language model, I don't have personal beliefs or opinions, but I can provide some perspect |
| disable | 2 | As an AI language model, I don't have personal beliefs or opinions. However, the meaning of life is |
| disable | 3 | As an AI language model, I don't have personal beliefs or opinions, but<br>The post The Mean |
| disable | 4 | As an AI language model, I don't have personal beliefs or opinions. However, the meaning of life is |
| faster | 0 | As an AI language model, I don't have personal beliefs or opinions. However, the meaning of life is |
| faster | 1 | As an AI language model, I don't have personal beliefs or opinions. However, the meaning of life is |
| faster | 2 | As an AI language model, I don't have personal beliefs or opinions. However, the meaning of life is |
| faster | 3 | As an AI language model, I don't have personal beliefs or opinions. However, the meaning of life is |
| faster | 4 | As an AI language model, I don't have personal beliefs or opinions. However, the meaning of life is |
| old_faster | 0 | As an AI language model, I don't have personal beliefs or opinions. However, the meaning of life is |
| old_faster | 1 | As an AI language model, I don't have personal beliefs or opinions. However, the meaning of life is |
| old_faster | 2 | As an AI language model, I don't have personal beliefs or opinions. However, the meaning of life is |
| old_faster | 3 | As an AI language model, I don't have personal beliefs or opinions. However, the meaning of life is |
| old_faster | 4 | As an AI language model, I don't have personal beliefs or opinions. However, the meaning of life is |
After fixing #124 I continued debugging my issues.
So I am still using this model: https://huggingface.co/TheBloke/vicuna-7B-GPTQ-4bit-128g
But I was getting gibberish results by default, like "What is the meaning of life" -> "As you лта :tinsarder,tatdenS-L-one-0".
But since I was previously using an old version of this library, and since the blame view https://github.com/alex4321/alpaca_lora_4bit/blame/winglian-setup_pip/src/alpaca_lora_4bit/matmul_utils_4bit.py shows that act_order (which was mentioned in the previous issue) was introduced in one of the relatively late updates, I decided to check what the other changes (regarding "faster_mode") would affect. So I made the following notebook: https://github.com/alex4321/alpaca_lora_4bit/blob/test-different-faster-modes/test.ipynb
And it seems like (in my setup) any non-disabled faster_mode gives me gibberish results (with this model).
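Roughly, the per-mode generation check looks like this (a sketch under assumptions: `mm.faster_mode` is my illustrative name for the mode switch, and `model`/`tokenizer` are assumed to be the already-loaded 4-bit vicuna checkpoint and its tokenizer):

```python
import alpaca_lora_4bit.matmul_utils_4bit as mm

prompt = "What is the meaning of life?"
for mode in ("disabled", "faster", "old_faster"):
    mm.faster_mode = mode  # assumed switch; the real flag(s) may be named differently
    for i in range(5):
        ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
        out = model.generate(ids, max_new_tokens=24, do_sample=True, top_p=0.9)
        print(mode, i, tokenizer.decode(out[0], skip_special_tokens=True))
```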
P.S. I have not checked Linux environments such as Colab yet; I will probably do that later, as well as dig into the differences between the algorithms, e.g. whether they should give exactly the same result or not.