kssteven418 / I-BERT

[ICML'21 Oral] I-BERT: Integer-only BERT Quantization
https://arxiv.org/abs/2101.01321
MIT License

Quantization on trained model #6

Open shon-otmazgin opened 3 years ago

shon-otmazgin commented 3 years ago

❓ Questions and Help

Hello, great paper, kudos! After reading it, I was wondering: is it possible to use these quantization methods on an already trained model from one of the Hugging Face Transformers checkpoints, or do we need to re-train the model and use I-BERT?

kssteven418 commented 3 years ago

Thanks for your interest!

First of all, HF and Fairseq (the current repo) are two different implementations of I-BERT and are independent of each other; you can use either one. In both cases, you start with a pre-trained RoBERTa model (we do not currently support other models, but you can easily adapt the code to whatever your target model is by referring to our implementation!), which you then have to finetune on your target downstream task (e.g., MNLI, SQuAD, etc.). After that, you quantize the model and recover accuracy via quantization-aware retraining. That is to say, there are no checkpoints provided for quantized models.

Hope this answers your question, and please let me know if it doesn't.
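For reference, here is a minimal sketch of that two-stage workflow on the HF side (full-precision finetuning first, then quantization-aware finetuning). The checkpoint name and the quant_mode flag come from the Hugging Face I-BERT integration; the training loops and the saved checkpoint path are placeholders.

from transformers import IBertForSequenceClassification

# Stage 1: finetune the full-precision model on the downstream task.
model = IBertForSequenceClassification.from_pretrained(
    "kssteven/ibert-roberta-base",  # pre-trained RoBERTa weights in I-BERT form
    num_labels=2,
    quant_mode=False,               # keep everything in full precision for now
)
# ... finetune `model` on your task, then save it,
# e.g. model.save_pretrained("finetuned-ckpt")

# Stage 2: reload the finetuned weights with quantization enabled and run
# quantization-aware retraining to recover accuracy.
qat_model = IBertForSequenceClassification.from_pretrained(
    "finetuned-ckpt",               # hypothetical path to the finetuned checkpoint
    quant_mode=True,                # integer-only (quantized) mode
)
# ... finetune again (typically with a smaller learning rate) on the same task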

shon-otmazgin commented 3 years ago

Hello @kssteven418, OK, I want to finetune it on my custom task. Is that possible, or are only the GLUE datasets supported?

kssteven418 commented 3 years ago

It is not restricted to specific tasks, so you can finetune it on your own task.

shon-otmazgin commented 3 years ago

Let me rephrase my question. Basically, what I am trying to do is quantize each layer in my pretrained model with your suggested quant modules. For simplicity, let's try to quantize only the nn.Linear layers for now.

@kssteven418 can you give me a hint where to look and how to convert these layers? Let's ignore quantization-aware finetuning for now; I want to see how much accuracy degrades while inference speed increases.

My task is fast coreference resolution, and combining it with quantization may make it practical to use.

Thanks !

bdalal commented 3 years ago

You can use it on any model. I'm currently evaluating the quantized modules applied to DistilBERT from HF, and so far it seems to be working. You essentially need to replace the various layers with their QAT counterparts and then make sure that your activations are correctly requantized where needed (the details can be found in the paper or in the I-BERT code).
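For illustration, here is a minimal sketch of that replacement step for the linear layers only, using the QuantLinear module from this repo (usage follows the snippet later in this thread; bit-widths are illustrative, and the activation requantization still has to be wired in separately):

import torch.nn as nn
from fairseq.quantization.utils.quant_modules import QuantLinear

def swap_linears(module, weight_bit=8, bias_bit=32):
    # Recursively replace every nn.Linear with a QuantLinear that copies its weights.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            qlinear = QuantLinear(weight_bit=weight_bit, bias_bit=bias_bit,
                                  quant_mode='symmetric')
            qlinear.set_param(child)        # copy weight/bias from the FP32 layer
            setattr(module, name, qlinear)  # swap it into the parent module
        else:
            swap_linears(child, weight_bit, bias_bit)

# swap_linears(model)  # model: any torch.nn.Module, e.g. a finetuned encoder

Note that the swapped layers also change the forward interface (they expect an activation scaling factor), so the surrounding forward pass has to be adapted as well; that is what the requantization point above refers to.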

shon-otmazgin commented 3 years ago

@bdalal here is my example:

from fairseq.quantization.utils.quant_modules import QuantLinear

linear = model.layer1

qlinear = QuantLinear(weight_bit=8, bias_bit=8, quant_mode='symmetric')
qlinear.set_param(linear)

Now I have a QuantLinear. What I can't understand is that when calling forward I need to pass prev_act_scaling_factor.

@kssteven418 what does it do, and what should I pass there?

bdalal commented 3 years ago

You'd need to start with the embedding layer. The way I did it in HF was to pull the DistilBERT code next to their IBert code and then replace every Embedding, Linear, Softmax, GELU, and LayerNorm layer with its corresponding quantized module. Not sure if this helps. I'd suggest looking at their HF code because it's much easier to understand how the QAT works there.
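On the prev_act_scaling_factor question above: in the I-BERT quant modules, each quantized layer consumes the scaling factor produced by the preceding activation quantizer and returns one for the next layer. Here is a minimal sketch, assuming the QuantAct module from the same quant_modules file; the constructor arguments and return values are best-effort guesses, so check fairseq/quantization/utils/quant_modules.py for the exact signatures.

import torch
import torch.nn as nn
from fairseq.quantization.utils.quant_modules import QuantAct, QuantLinear

quant_input = QuantAct(activation_bit=8, quant_mode='symmetric')  # quantizes activations
qlinear = QuantLinear(weight_bit=8, bias_bit=32, quant_mode='symmetric')
qlinear.set_param(nn.Linear(768, 768))  # hypothetical FP32 layer to wrap

x = torch.randn(1, 128, 768)  # e.g. a batch of hidden states

# 1) Quantize the incoming activation: returns the (fake-)quantized tensor
#    together with the scaling factor that maps it to integer values.
x_q, act_scaling_factor = quant_input(x)

# 2) Pass both to the quantized linear layer: it uses prev_act_scaling_factor
#    to requantize internally and returns a new scaling factor for the next layer.
out, out_scaling_factor = qlinear(x_q, act_scaling_factor)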

shon-otmazgin commented 3 years ago

@bdalal Can you share your DistilBERT implementation?

bdalal commented 3 years ago

I'll be pushing it to github early next week and I'll share the link once I do.

bdalal commented 3 years ago

@shon-otmazgin I've pushed my implementation. You can find it here

There's some instability during training but I haven't gotten around to troubleshooting it.