mit-han-lab / lite-transformer

[ICLR 2020] Lite Transformer with Long-Short Range Attention
https://arxiv.org/abs/2004.11886

Model Compression #4

Closed kalyangvs closed 4 years ago

kalyangvs commented 4 years ago

Hi, can you please provide the code used to compress the model by 18.2× with pruning and quantization? Thanks.

kalyangvs commented 4 years ago

@Michaelvll Does the quantized-plus-pruned model size include other data such as last_optimizer_state, optimizer_history, etc.?

Michaelvll commented 4 years ago

Thank you for asking! We are still cleaning up the compression code. We quantized the model parameters to 8 bits and applied sensitivity pruning with NervanaSystems/distiller. We only counted the model parameters toward the reported size, since the optimizer states are not used at inference.
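Since the cleaned-up code isn't posted yet, here is a minimal, hypothetical PyTorch sketch of how the size accounting could work under this setup: pruned weights are exact zeros, only the surviving values are counted at 8 bits each, and the baseline is every parameter stored as a 32-bit float. The function names and the crude 50% pruning step are illustrative assumptions, not the authors' pipeline.

```python
import torch
import torch.nn as nn

def compressed_size_bytes(model: nn.Module, bits: int = 8) -> int:
    """Rough model size after fine-grained pruning + low-bit quantization.

    Assumes pruned weights are exact zeros and only surviving values are
    stored at `bits` bits each (sparse-index overhead is ignored here).
    """
    total_bits = sum(int(p.ne(0).sum().item()) * bits for p in model.parameters())
    return total_bits // 8

def dense_fp32_size_bytes(model: nn.Module) -> int:
    """Baseline size: every parameter stored as a 32-bit float."""
    return sum(p.numel() for p in model.parameters()) * 4

# Toy example: prune ~50% of a layer's weights, then compare sizes.
model = nn.Linear(512, 512)
with torch.no_grad():
    mask = model.weight.abs() > model.weight.abs().median()
    model.weight.mul_(mask)

ratio = dense_fp32_size_bytes(model) / compressed_size_bytes(model)
print(f"compression ratio ~= {ratio:.1f}x")
```

The reported 18.2× figure would come from the actual per-layer sparsities and quantization of the released model, not from this toy calculation.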

usamec commented 4 years ago

@Michaelvll Can you please provide the distiller config that was used?

Also, do you prune individual weights, or whole channels/filters/heads?

Michaelvll commented 4 years ago

For simplicity, we use sensitivity pruning for our model, which is fine-grained pruning, i.e., it prunes individual weights. You can try it on the configuration for the WMT En-Fr model with 527M #Mult-Adds.
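Since the actual distiller schedule isn't shared in this thread, here is a small Python sketch of the sensitivity-pruning rule that distiller's SensitivityPruner is based on (Han et al.'s magnitude pruning with a per-layer threshold proportional to the weight standard deviation). The layer names and sensitivity values below are made up for illustration and are not taken from the WMT En-Fr configuration.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def sensitivity_prune_(module: nn.Module, sensitivity: float) -> float:
    """Fine-grained (per-weight) sensitivity pruning for one layer.

    Zeroes every weight whose magnitude falls below
    sensitivity * std(weights), and returns the resulting sparsity.
    """
    w = module.weight
    threshold = sensitivity * w.std()
    mask = w.abs() > threshold
    w.mul_(mask)
    return 1.0 - mask.float().mean().item()

# Hypothetical per-layer sensitivities; real values would be tuned per layer,
# e.g. via a sensitivity analysis on the validation set.
sensitivities = {"fc1": 0.4, "fc2": 0.6}

model = nn.Sequential()
model.add_module("fc1", nn.Linear(512, 2048))
model.add_module("fc2", nn.Linear(2048, 512))

for name, layer in model.named_children():
    if name in sensitivities:
        sparsity = sensitivity_prune_(layer, sensitivities[name])
        print(f"{name}: sparsity {sparsity:.2%}")
```

In practice distiller drives this from a YAML schedule that lists each prunable parameter and its sensitivity, and interleaves pruning with fine-tuning epochs.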

zilunpeng commented 3 years ago

Could you share some more information on how you quantize the model? Did you use NervanaSystems/distiller for quantization?