kssteven418 / I-BERT

[ICML'21 Oral] I-BERT: Integer-only BERT Quantization
https://arxiv.org/abs/2101.01321
MIT License

Can the CPU be used for inference? #1

Open · luoling1993 opened 3 years ago

luoling1993 commented 3 years ago

Excellent work!

Can the CPU be used for inference? And how much faster is it than the baseline?

kssteven418 commented 3 years ago

Thanks for your interest! I should first mention that this PyTorch implementation of I-BERT only searches for the integer parameters (i.e., it performs quantization-aware training) that minimize accuracy degradation relative to the full-precision counterpart. As far as I know, PyTorch does not support integer-only operations (apart from its own quantization library, whose functionality is quite limited), so the current PyTorch implementation does not by itself achieve any latency reduction on real hardware. To deploy I-BERT on a GPU or CPU and obtain a speedup, you additionally need to export the integer parameters obtained from this implementation, along with the model architecture, to a framework that supports deployment on integer processing units; TVM and TensorRT are two such examples.
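To make the distinction concrete, here is a minimal sketch (not taken from this repository; `fake_quantize` is a hypothetical helper) of the quantize-dequantize pattern that quantization-aware training simulates. The forward pass still runs in fp32, which is why PyTorch alone yields no speedup; only the integer representation and scale would be exported to a backend like TVM or TensorRT:

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8):
    """Symmetric uniform quantization simulated in fp32 (quantize-dequantize)."""
    qmax = 2 ** (num_bits - 1) - 1                          # e.g. 127 for int8
    scale = x.abs().max() / qmax                            # per-tensor scale factor
    q = torch.clamp(torch.round(x / scale), -qmax, qmax)    # values on the integer grid
    # Return: dequantized fp32 tensor (used in the QAT forward pass),
    # the integer representation, and the scale (what you would export).
    return q * scale, q.to(torch.int8), scale

x = torch.randn(4, 4)
x_dq, x_int, s = fake_quantize(x)
# During QAT the model computes with x_dq, so the hardware still executes
# ordinary floating-point kernels; x_int and s are the pieces a deployment
# framework needs to run true integer-only inference.
print((x - x_dq).abs().max())  # quantization error, bounded by roughly scale / 2
```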

Hope this answers your question!