WZMIAOMIAO opened this issue 2 years ago
@WZMIAOMIAO Quantizing models used to be a fairly involved process, as Eager Mode Quantization often requires rewriting the model class to introduce additional Quant/DeQuant stubs and structuring your code so that elements can be easily replaced (using nn modules instead of functionals, etc.). The PyTorch Core team is working on a series of APIs that would make quantizing models easier. FX Graph Mode Quantization is a new API that would allow you to do this; see #5797 for details. The work on the API is still in progress and might take a while longer to complete, but it's worth keeping an eye on, as it should remove most of the past complexity.
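For illustration, here is a minimal sketch of the eager-mode pattern described above, using a toy model (illustrative only, not torchvision code): the class has to be written with QuantStub/DeQuantStub and nn modules so that prepare/convert can observe and swap them:

import torch
import torch.nn as nn

class QuantizableToyModel(nn.Module):  # hypothetical model, for illustration only
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()      # marks the fp32 -> int8 boundary
        self.conv = nn.Conv2d(3, 16, 3)
        self.relu = nn.ReLU()  # nn.ReLU instead of F.relu so it can be observed/replaced
        self.dequant = torch.ao.quantization.DeQuantStub()  # marks the int8 -> fp32 boundary

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

model = QuantizableToyModel().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
prepared = torch.ao.quantization.prepare(model)
prepared(torch.randn(1, 3, 32, 32))  # calibration pass with representative data
quantized = torch.ao.quantization.convert(prepared)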
@andrewor14 Do you have any idea how close we are to finalizing the API and reopening the above PR?
Hi @datumbox, the FX graph mode quantization API should more or less be finalized at this point, cc'ing @jerryzh168 just to confirm. However, due to recent priority shifts I no longer have the bandwidth to continue this work. I do think it's close to being done and it would be worth it to finish the remaining work. If it is high priority I can check with the team to see if we can prioritize this.
@andrewor14 We would definitely like to complete the work, expose the new API to users, and build trust in the solution. I appreciate, though, that you might be limited by bandwidth. Please do check with your team and let me know if that's something we could tackle in Q4.
@datumbox The FX Graph Mode Quantization tool is very convenient, but I found a bug when quantizing mobilenetv2.
If I set torch.backends.quantized.engine = 'qnnpack', the accuracy of the quantized mobilenetv2 drops to 0 after convert_fx, and evaluation is very slow for both PTQ and QAT. Some code follows:
import torch
import torch.ao.quantization.quantize_fx as quantize_fx
import torchvision
from torchvision.models import MobileNet_V2_Weights  # needed for the weights enum below

torch.backends.quantized.engine = 'qnnpack'
model = torchvision.models.mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1)
model.to('cuda')
model.eval()
qconfig_dict = {"": torch.ao.quantization.get_default_qconfig('qnnpack')}
example_inputs = (torch.randn(1, 3, 224, 224, device='cuda'),)  # required by prepare_fx
model_prepared = quantize_fx.prepare_fx(model, qconfig_dict, example_inputs)
# calibration...
quantized_model = quantize_fx.convert_fx(model_prepared)
# evaluate acc, acc: 0.0
But if I delete the torch.backends.quantized.engine = 'qnnpack' line, the accuracy of the quantized mobilenet is about 0.71 and evaluation is faster.
env:
torch 1.12.1
GPU A100
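For reference, deleting that line on an x86 host presumably just falls back to the default fbgemm engine (note also that the quantized kernels for both backends run on CPU, and qnnpack is tuned for ARM, which may explain the slow evaluation on this machine). The working variant would look like this, reusing model and example_inputs from the snippet above:

torch.backends.quantized.engine = 'fbgemm'  # x86-optimized backend, the usual default on x86 hosts
qconfig_dict = {"": torch.ao.quantization.get_default_qconfig('fbgemm')}
model_prepared = quantize_fx.prepare_fx(model, qconfig_dict, example_inputs)
# calibration...
quantized_model = quantize_fx.convert_fx(model_prepared)  # acc ~0.71 as reported above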
@andrewor14 Any thoughts on @WZMIAOMIAO's bug report above? Should he open an issue on Core for review?
🚀 The feature
Hi, thanks for your great work. I hope quantized ViT models can be added (for PTQ or QAT).
Motivation, pitch
In torchvision/models/quantization there are several quantized models (Eager Mode Quantization) that are very useful for learning quantization. In recent years Transformer models have become very popular, and I want to learn how to quantize them, e.g. Vision Transformer, Swin Transformer, etc., using official PyTorch tools such as Eager Mode Quantization. I also tried to modify the model myself, but failed: I don't know how to quantize pos_embedding (an nn.Parameter) or the nn.MultiheadAttention module; a sketch of the part I am stuck on follows below. Looking forward to your reply.
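For example, in eager mode a bare tensor add such as x + self.pos_embedding has no module form, so it would need to be routed through FloatFunctional for an observer to attach, roughly like this (illustrative fragment with hypothetical names, not torchvision code):

import torch
import torch.nn as nn

class PatchEmbedWithPos(nn.Module):  # hypothetical ViT fragment, for illustration only
    def __init__(self, seq_len=197, dim=768):
        super().__init__()
        self.pos_embedding = nn.Parameter(torch.empty(1, seq_len, dim).normal_(std=0.02))
        # a bare `x + self.pos_embedding` cannot be observed in eager mode,
        # so the add is routed through FloatFunctional instead
        self.add_pos = torch.nn.quantized.FloatFunctional()

    def forward(self, x):
        return self.add_pos.add(x, self.pos_embedding)

nn.MultiheadAttention would, I believe, similarly need to be swapped for torch.nn.quantizable.MultiheadAttention, which is exactly the kind of manual surgery that FX Graph Mode Quantization is meant to remove.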