microsoft / nni

An open source AutoML toolkit that automates the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

Large performance drop after using quantizer to quantize Transformer model. #3918

Open SefaZeng opened 3 years ago

SefaZeng commented 3 years ago

Describe the issue: I tried all four example quantizers to test the compression functionality. I applied these methods to an NMT model from OpenNMT: the BLEU score is 27.0 before quantization and 0.0 after. The quantized model produces the same output for the entire test set. Is there anything wrong with my code? Could you also provide a Transformer-based example, which I think would be helpful for NLP researchers?

Environment:

Configuration:

Log message:

How to reproduce it?: code here:

#coding:utf-8                                                                                                           
import torch
import nni
from nni.algorithms.compression.pytorch.quantization import NaiveQuantizer, QAT_Quantizer, DoReFaQuantizer, BNNQuantizer
import sys
import onmt.inputters as inputters
import onmt.modules
from onmt.encoders import str2enc

from onmt.decoders import str2dec

from onmt.modules import Embeddings, VecEmbedding, CopyGenerator
from onmt.modules.util_class import Cast
from onmt.utils.misc import use_gpu
from onmt.utils.logging import logger
from onmt.utils.parse import ArgumentParser
from onmt.model_builder import build_base_model

config_list = [{
    'quant_types': ['weight'],
    'quant_bits': {
        'weight': 8,
    }, # An `int` alone would also work here, since all `quant_types` use the same bit width; see the `ReLU6` config below.
    'op_types':['Linear']
}, #{
   # 'quant_types': ['output'],
   # 'quant_bits': 8,
   # 'quant_start_step': 7000,
   # 'op_types':['ReLU6']
#}
]
derefa_config_list = [{
    'quant_types': ['weight'],
    'quant_bits': 8,
    'op_types': ['default'],
}]

bnn_config_list = [{
    'quant_bits': 1,
    'quant_types': ['weight'],
    'op_types': ['Linear'],
}, #{
   # 'quant_bits': 1,
   # 'quant_types': ['output'],
   # 'op_types': ['Linear'],
#}
]
model_path="model.pt"
checkpoint = torch.load(model_path,
            map_location=lambda storage, loc: storage)

model_opt = ArgumentParser.ckpt_model_opts(checkpoint['opt'])
ArgumentParser.update_model_opts(model_opt)
ArgumentParser.validate_model_opts(model_opt)
vocab = checkpoint['vocab']
if inputters.old_style_vocab(vocab):
    fields = inputters.load_old_vocab(
        vocab, model_opt.data_type, dynamic_dict=model_opt.copy_attn
    )   
else:
    fields = vocab

use_gpu=True
gpu=0
fp32=False
model = build_base_model(model_opt, fields, use_gpu, checkpoint,
             gpu)
if fp32:
    model.float()
model.eval()
#print(model)
#print(checkpoint['generator'])                                                                                         

print(type(model))
naive_quantizer = NaiveQuantizer(model, config_list)  # NaiveQuantizer quantizes weights to 8 bits by default
qat_quantizer = QAT_Quantizer(model, config_list)
derefa_quant = DoReFaQuantizer(model, derefa_config_list)
bnn_quant = BNNQuantizer(model, bnn_config_list)

naive_model = naive_quantizer.compress()
QAT_model = qat_quantizer.compress()
derefa_model = derefa_quant.compress()
bnn_model = bnn_quant.compress()

#torch.save(model.state_dict(), "quantized_model.pth")
checkpoint['model'] = naive_model.state_dict()
torch.save(checkpoint, "naive_model.pt")
checkpoint['model'] = QAT_model.state_dict()
torch.save(checkpoint, "QAT_model.pt")
checkpoint['model'] = derefa_model.state_dict()
torch.save(checkpoint, "derefa_model.pt")
checkpoint['model'] = bnn_model.state_dict()
torch.save(checkpoint, "bnn_model.pt")
linbinskn commented 3 years ago

Have you finetuned the model after quantizer.compress()? If not, it is normal for accuracy to drop a lot, for two reasons:

  • Quantization parameters of activations are initialized to default values, which leads to a large accuracy drop without finetuning.
  • A finetuning algorithm such as QAT helps correct the quantization error and generally achieves better accuracy than post-training quantization (quantizing the model without finetuning).

Quantizing Transformer models will be an important feature of the NNI model compression framework. Please look forward to the related examples in upcoming releases.
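
For reference, a rough sketch of what finetuning after compress() could look like with the QAT_Quantizer from the repro script above (not code from this thread): model and config_list are reused from that script, while train_loader and compute_loss are hypothetical placeholders for the NMT training pipeline, and the exact QAT_Quantizer constructor arguments may differ between NNI versions.

import torch
from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer

# `model` and `config_list` as defined in the repro script above;
# `train_loader` and `compute_loss` are hypothetical placeholders.
quantizer = QAT_Quantizer(model, config_list)
quantized_model = quantizer.compress()

optimizer = torch.optim.Adam(quantized_model.parameters(), lr=1e-5)
num_finetune_epochs = 3  # a few epochs of QAT finetuning, tuned per task

quantized_model.train()
for epoch in range(num_finetune_epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = compute_loss(quantized_model, batch)  # the usual NMT training loss
        loss.backward()
        optimizer.step()

quantized_model.eval()
torch.save(quantized_model.state_dict(), "qat_finetuned.pt")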

SefaZeng commented 3 years ago

Have you finetuned the model after quantizer.compress()? If not, it is normal for accuracy to drop a lot, for two reasons:

  • Quantization parameters of activations are initialized to default values, which leads to a large accuracy drop without finetuning.
  • A finetuning algorithm such as QAT helps correct the quantization error and generally achieves better accuracy than post-training quantization (quantizing the model without finetuning).

Quantizing Transformer models will be an important feature of the NNI model compression framework. Please look forward to the related examples in upcoming releases.

Thank you so much for your reply! I am wondering whether you have tested the effect of all these quantization methods on Transformer models?

chenbohua3 commented 3 years ago

@SefaZeng we have tested the QAT and LSQ quantizers on Transformer models (used in our production) and they behave well. There are many tricks for quantizing them, such as:

  1. Use a post-training quantizer to calculate the initial scales for the QAT quantizer. (This step is important.)
  2. The learning rate should be equal to or less than that used in the final stage of training. And so on.

We are working on improving the QAT/LSQ quantizers for production use. The code will be ready at the end of this NNI iteration.
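
As an illustration of the first trick (a sketch under assumptions, not the NNI API): one simple way to derive symmetric per-tensor scales from observed statistics, which could then seed the QAT quantizer instead of its default initialization. The helper name is made up for the sketch, model is the model from the repro script, and activation scales would be collected in the same spirit from a few calibration batches.

import torch

def init_scale_from_tensor(tensor: torch.Tensor, bits: int = 8) -> float:
    # Symmetric per-tensor scale from the observed max-abs value,
    # e.g. qmax = 127 for signed 8-bit quantization.
    qmax = 2 ** (bits - 1) - 1
    max_abs = tensor.detach().abs().max().item()
    return max_abs / qmax if max_abs > 0 else 1.0

# Per-layer weight scales for the Linear layers targeted by config_list.
weight_scales = {
    name: init_scale_from_tensor(module.weight)
    for name, module in model.named_modules()
    if isinstance(module, torch.nn.Linear)
}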

SefaZeng commented 3 years ago

@SefaZeng we have tested the QAT and LSQ quantizers on Transformer models (used in our production) and they behave well. There are many tricks for quantizing them, such as:

  1. Use a post-training quantizer to calculate the initial scales for the QAT quantizer. (This step is important.)
  2. The learning rate should be equal to or less than that used in the final stage of training. And so on.

We are working on improving the QAT/LSQ quantizers for production use. The code will be ready at the end of this NNI iteration.

Hi @chenbohua3, thanks for your reply. Is there a code example yet for the methods you mentioned above, or should I wait for the next iteration?

Lijiaoa commented 2 years ago

Hi @chenbohua3, thanks for your great work on quantization. And @SefaZeng, do you still see this issue with the latest NNI (v2.9)? Looking forward to your reply, thanks.

ashutosh96 commented 1 year ago

@SefaZeng we have tested the QAT and LSQ quantizers on Transformer models (used in our production) and they behave well. There are many tricks for quantizing them, such as:

  1. Use a post-training quantizer to calculate the initial scales for the QAT quantizer. (This step is important.)
  2. The learning rate should be equal to or less than that used in the final stage of training. And so on.

We are working on improving the QAT/LSQ quantizers for production use. The code will be ready at the end of this NNI iteration.

Hi @chenbohua3, I have a question regarding your first suggestion. How can I test post-training quantization at various precisions using the NNI API? The NaiveQuantizer only allows setting the precision to 8 bits, and the QAT_Quantizer shows a big drop in accuracy before retraining, so neither method works if I want to test PTQ across the precision range [1, 32]. I've also tried using the QAT quantizer on several models, and for many of them, such as ShuffleNetV2, the loss always diverges to NaN, regardless of the learning rate/momentum settings for the SGD optimizer. Do you have any suggestions about what to do in this case? Any help would be greatly appreciated. You are building a great library for DNN compression!
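
A sketch of what such a bit-width sweep could look like with the config-list format used earlier in this thread (an untested assumption, not a confirmed NNI recipe; whether a given quantizer accepts every bit width, and how meaningful very low precisions are without finetuning, depends on the NNI version). Here model and evaluate are placeholders for the model under test and its accuracy/BLEU evaluation.

import copy
from nni.algorithms.compression.pytorch.quantization import QAT_Quantizer

results = {}
for bits in [2, 4, 6, 8]:  # example precisions; adjust as needed
    cfg = [{
        'quant_types': ['weight'],
        'quant_bits': {'weight': bits},
        'op_types': ['Linear'],
    }]
    m = copy.deepcopy(model)             # quantize a fresh copy each time
    quantizer = QAT_Quantizer(m, cfg)
    quantized = quantizer.compress()
    results[bits] = evaluate(quantized)  # hypothetical evaluation (e.g. BLEU)
print(results)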