intel / auto-round

Advanced Quantization Algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs"
https://arxiv.org/abs/2309.05516
Apache License 2.0

from auto_round import AutoRoundConfig #235

Closed CrispStrobe closed 1 week ago

CrispStrobe commented 2 months ago

... did not work for me right now, whereas previously it did. I cannot check this further at the moment, but maybe you might want to. The environment was Kaggle and Colab; it occurred after !pip install auto-round and was seemingly resolved per:

!pip uninstall auto_round -y
!pip cache remove auto_round
!pip install --no-cache-dir auto_round
from auto_round.auto_quantizer import AutoRoundConfig
wenhuach21 commented 2 months ago

Sorry for the confusion. The AutoRoundConfig was introduced after version 0.3.0. We'll update the documentation to clarify this. In the meantime, you can install the latest version from source:

git clone https://github.com/intel/auto-round.git && cd auto-round && pip install -vvv --no-build-isolation -e .

For version 0.3.0, we recommend using:

from auto_round.auto_quantizer import AutoHfQuantizer

In this version, the device is chosen automatically, with GPU and HPU taking priority over CPU. To use the CPU on a CUDA machine, you'll need to modify the model's configuration file.

If you're working with CUDA, you'll need to install from source to compile the kernel, as we couldn't include it in the package for various reasons.
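
For reference, a minimal inference sketch on version 0.3.0, using the import above (the model path is a placeholder for your own quantized directory):

from auto_round.auto_quantizer import AutoHfQuantizer  # v0.3.0 import; makes the auto-round format loadable
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model_path = "./tmp_autoround"  # placeholder: your exported model directory
model = AutoModelForCausalLM.from_pretrained(quantized_model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)

text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))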

wenhuach21 commented 2 months ago

We have updated the README; you might also be interested in the introduction of the export formats there. Additionally, we plan to release a new version next month, which will address this issue.

CrispStrobe commented 2 months ago

many thanks, perfect - and wow, that was swift!

sahibpreetsingh12 commented 1 month ago

Is the issue still there? Because I am still getting it.

wenhuach21 commented 1 month ago

Apologies for the delay in the release. In the meantime, please use the following import statement for version v0.3:

from auto_round.auto_quantizer import AutoHfQuantizer
sahibpreetsingh12 commented 1 month ago

Yes, thanks @wenhuach21. The issue I found is that quantization is done in the 'auto-round' format, but during inference, since we don't have support for that, it causes the issue. Do correct me if I am wrong?

wenhuach21 commented 1 month ago

During quantization, the model operates in floating-point format and undergoes fake quantization to simulate the quantization behavior. After the tuning process is complete, this fake-quantized model is converted into a true int4 model that adheres to your specified format. So for real inference, you need to import that code for the auto-round format or install auto_gptq for the GPTQ format.

Please refer to the Model Inference section in the README for more details.
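
For example, for the GPTQ format the loading path is just plain Transformers, assuming optimum and auto-gptq are installed (the path below is a placeholder):

# pip install optimum auto-gptq  # needed for the GPTQ format; no auto_round import required
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model_path = "./tmp_autoround"  # placeholder: directory exported with format='auto_gptq'
model = AutoModelForCausalLM.from_pretrained(quantized_model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)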

sahibpreetsingh12 commented 1 month ago

Sure, I installed optimum and auto-gptq but am still getting ImportError: Loading a GPTQ quantized model requires optimum (pip install optimum) and auto-gptq library (pip install auto-gptq). I am on a Kaggle kernel, so I can't refresh.

wenhuach21 commented 1 month ago

Sure, I installed optimum and auto-gptq but am still getting ImportError: Loading a GPTQ quantized model requires optimum (pip install optimum) and auto-gptq library (pip install auto-gptq). I am on a Kaggle kernel, so I can't refresh.

Which format are you using, auto_round or auto_gptq?

For the GPTQ format, installing Optimum and Auto-GPTQ should suffice. If Transformers still throws an exception, please check your environment; you might have multiple environments in use.

For the auto_round format, please follow our README.
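
A quick way to confirm which environment and packages the kernel is actually using (a minimal sketch with only the standard library; package names are the PyPI distribution names):

import sys
import importlib.metadata as md

print(sys.executable)  # shows which Python interpreter the kernel runs

for pkg in ("optimum", "auto-gptq", "auto-round", "transformers"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed in this environment")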

sahibpreetsingh12 commented 1 month ago

The format is 'auto_gptq'. I installed optimum, auto-gptq, and auto-round==0.3.0, and for now I am quantizing 'facebook/opt-125m'. My Kaggle kernel (12 GB P100 GPU) is crashing, and this is what I am getting:

[Screenshot 2024-09-29 at 9:44:33 PM]
wenhuach21 commented 1 month ago

Remove the second line and try again. If it's OK, it seems to be a bug in our code. If it's not OK, try to run inference with the model ybelkada/opt-125m-gptq-4bit in your env.
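
Something like this should be enough to test it (a sketch based on the inference snippet in the README; it only needs optimum and auto-gptq):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ybelkada/opt-125m-gptq-4bit"  # a known-good GPTQ model to isolate environment issues
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))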

sahibpreetsingh12 commented 1 month ago

I did remove the second line for my model 'facebook/opt-125m', and I got

[Screenshot 2024-09-29 at 9:58:43 PM]

but when I did inference with 'ybelkada/opt-125m-gptq-4bit', yes, I got results.

wenhuach21 commented 1 month ago

Then you exported an AutoRound-format model rather than an AutoGPTQ-format one. For the AutoRound format, you'll need to install it from source with CUDA support. I recommend switching to the AutoGPTQ format, but please note that it may have accuracy issues with asymmetric quantization.

autoround.save_quantized(output_dir, format='auto_gptq', inplace=True) 
sahibpreetsingh12 commented 1 month ago

Yes, I already saved in the 'auto_gptq' format.

sahibpreetsingh12 commented 1 month ago

But still no inference.

wenhuach21 commented 1 month ago

I did remove the second line for my model 'facebook/opt-125m', and I got [Screenshot 2024-09-29 at 9:58:43 PM] but when I did inference with 'ybelkada/opt-125m-gptq-4bit', yes, I got results.

Are you still seeing this issue?

wenhuach21 commented 1 month ago

That's interesting! I assume you're still exporting to the AutoRound format. I ran the following code on version 0.3.0, and it worked fine. Please check the config.json in the quantized model directory; the quant_method should be set to 'gptq' if the format is auto_gptq.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

from auto_round import AutoRound

bits, group_size, sym = 4, 128, False
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, sym=sym)

autoround.quantize()
output_dir = "./tmp_autoround"
## format= 'auto_round'(default in version>0.3.0), 'auto_gptq'(default in version<=0.3.0), 'auto_awq'
autoround.save_quantized(output_dir, format='auto_gptq', inplace=True) 
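
To double-check which format was actually exported, config.json can be inspected directly (a minimal sketch, assuming quant_method is stored either at the top level or under quantization_config):

import json

with open("./tmp_autoround/config.json") as f:
    cfg = json.load(f)

# should print 'gptq' for an auto_gptq export
qcfg = cfg.get("quantization_config", cfg)
print(qcfg.get("quant_method"))
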
sahibpreetsingh12 commented 1 month ago

Yes, this is working. Can you send me the version that worked for inference of this same model? Because what is in the documentation is not working for me.

wenhuach21 commented 1 month ago

This is for the AutoGPTQ format; I just use the same code as in the README.

from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
                                             device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
sahibpreetsingh12 commented 1 month ago

I don't know how it's happening, but I did quantize it in the 'auto_gptq' format and saved the files as a zip, and when I upload it to another Kaggle notebook, 'config.json' shows the 'auto-round' format.

sahibpreetsingh12 commented 1 month ago

And finally it worked.

sahibpreetsingh12 commented 1 month ago

One other question, @wenhuach21: when I am doing the quantization for phi-2 using the GPTQ format, I am getting this

[Screenshot 2024-09-29 at 11:52:46 PM]

and this is my code


from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

from auto_round import AutoRound

bits, group_size, sym = 4, 128, False

autoround = AutoRound(model, tokenizer, nsamples=128, iters=200, seqlen=512, batch_size=4, bits=bits, group_size=group_size, sym=sym)

autoround.quantize()
output_dir = "./sahib_autorounds_phi2"
## format= 'auto_round'(default in version>0.3.0), 'auto_gptq'(default in version<=0.3.0), 'auto_awq'
autoround.save_quantized(output_dir, format='auto_gptq', inplace=True)
wenhuach21 commented 1 month ago

I could not reproduce this issue. May I know your transformers version? BTW, for phi-2, you'd better set sym=True due to the kernel issue of GPTQ.
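
Concretely, that is the same script as above with only sym flipped (a sketch based on your earlier snippet; the output directory name is a placeholder):

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# sym=True sidesteps the asymmetric-kernel accuracy issue of GPTQ mentioned above
bits, group_size, sym = 4, 128, True
autoround = AutoRound(model, tokenizer, nsamples=128, iters=200, seqlen=512, batch_size=4,
                      bits=bits, group_size=group_size, sym=sym)
autoround.quantize()
autoround.save_quantized("./phi2_gptq_sym", format='auto_gptq', inplace=True)  # placeholder output dir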

wenhuach21 commented 3 weeks ago

One other question, @wenhuach21: when I am doing the quantization for phi-2 using the GPTQ format, I am getting this [Screenshot 2024-09-29 at 11:52:46 PM] and this is my code

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

from auto_round import AutoRound

bits, group_size, sym = 4, 128, False

autoround = AutoRound(model, tokenizer, nsamples=128, iters=200, seqlen=512, batch_size=4, bits=bits, group_size=group_size, sym=sym)

autoround.quantize()
output_dir = "./sahib_autorounds_phi2"
## format= 'auto_round'(default in version>0.3.0), 'auto_gptq'(default in version<=0.3.0), 'auto_awq'
autoround.save_quantized(output_dir, format='auto_gptq', inplace=True)

Fixed in https://github.com/intel/auto-round/pull/272