ClaraLovesFunk opened this issue 2 weeks ago
I just tested your example file quantize_sst2_model.py and printed the parameters of the reloaded model, and there too all the parameters are still in float32.
```python
for name, param in model_reloaded.named_parameters():
    print(f"Parameter: {name}, Data Type: {param.dtype}")
```
```
Float model
872 sentences evaluated in 2.08 s. accuracy = 0.9105504587155964
Calibrating ...
872 sentences evaluated in 3.12 s. accuracy = 0.893348623853211
Quantized model (w: quanto.qint8, a: quanto.qint8)
872 sentences evaluated in 1.85 s. accuracy = 0.8979357798165137
```
@ClaraLovesFunk thank you for your feedback. The parameters' dtype is still float32, but if you check their type, you will see that they are now `QTensor` subtypes instead of `Tensor`. `QTensor` subtypes preserve the external dtype, but their internal data is quantized. You can check the `qtype` property to verify that it is correct.
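For instance, a minimal check along those lines (a sketch only; the top-level `quanto` import path for `QTensor` is assumed and may differ between versions) could look like:

```python
from quanto import QTensor  # import path assumed; may differ by quanto version

for name, param in model_reloaded.named_parameters():
    # dtype still reports the external (float) dtype after quantization
    print(f"{name}: dtype={param.dtype}, type={type(param).__name__}")
    # QTensor subtypes keep their data quantized internally; qtype exposes it
    if isinstance(param, QTensor):
        print(f"  qtype={param.qtype}")
```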
Thank you so much for the explanation, David! Will do.
Do you maybe also have an explanation for why I can't use bigger batch sizes after applying quantization and verifying that my model shrank from 413.44 to 169.11 MB?
Dear quanto folks,
I implemented quantization as suggested in your coding example quantize_sst2_model.py. When printing the data types of the parameters, I found that after quantization all the weights remained in float32. Do you have any explanation for this?
Also, do you have any explanation for why I can't use bigger batch sizes when applying quantization to both weights and activations? I used PubMedBERT from Hugging Face, fine-tuned it myself, and applied static quantization (see code below).
And do you know why inference speed slows down significantly when I use the reloaded statically quantized model (code below) as opposed to the directly statically quantized model? I again followed the instructions of the coding example.
Any help is greatly appreciated since I'm just wrapping up my soon-due master's thesis on this <3 Clara
Direct Static Quantization:
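A minimal sketch of this step, following the quantize / Calibration / freeze flow from quantize_sst2_model.py (`model`, `tokenizer`, `calibration_loader`, and `device` are placeholder names standing in for the fine-tuned PubMedBERT setup, not the exact script):

```python
import torch
from quanto import Calibration, freeze, qint8, quantize

# Quantize both weights and activations to int8 (static quantization)
quantize(model, weights=qint8, activations=qint8)

# Calibrate activation ranges by running representative batches through the model
with torch.no_grad(), Calibration():
    for batch in calibration_loader:  # placeholder DataLoader over calibration sentences
        inputs = tokenizer(batch["text"], padding=True, truncation=True,
                           return_tensors="pt").to(device)
        model(**inputs)

# Freeze replaces the float weights with their quantized counterparts
freeze(model)
```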
Reloading statically quantized model:
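And a minimal sketch of the reload step (the state_dict round trip, `model_id`, the file path, and the quantize-then-load ordering are assumptions here; the authoritative flow is the one in quantize_sst2_model.py):

```python
import torch
from transformers import AutoModelForSequenceClassification
from quanto import freeze, qint8, quantize

# Save the quantized model's state dict to disk
torch.save(model.state_dict(), "quantized_model.pt")  # placeholder path

# Recreate the architecture and apply the same quantization spec,
# so the parameter classes match the saved quantized tensors, then load
model_reloaded = AutoModelForSequenceClassification.from_pretrained(model_id)  # placeholder id
quantize(model_reloaded, weights=qint8, activations=qint8)
freeze(model_reloaded)
model_reloaded.load_state_dict(torch.load("quantized_model.pt"))
```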