microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

ZeroQuant not compressing and making BERT slower #2239

Open K2triinK opened 2 years ago

K2triinK commented 2 years ago

Describe the bug I was expecting a compressed & faster BERT model after running the BERT ZeroQuant example in DeepSpeedExamples. However, the clean model isn't any smaller (still 417.7 MB) or faster (in fact, it's slower) than the original.
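For context, a back-of-the-envelope size estimate (my own illustration, not from the repo) shows why a genuinely int8-quantized BERT-base should be much smaller than 417.7 MB: size is roughly parameter count times bytes per parameter, and BERT-base has about 110M parameters.

```python
# Rough model-size arithmetic for BERT-base (~110M parameters).
# These are illustrative estimates, not measurements from DeepSpeedExamples.
NUM_PARAMS = 110_000_000  # approximate BERT-base parameter count

def model_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Storage size in MiB for a model with uniform parameter dtype."""
    return num_params * bytes_per_param / (1024 ** 2)

fp32_mb = model_size_mb(NUM_PARAMS, 4)  # fp32: ~420 MB, close to the 417.7 MB observed
int8_mb = model_size_mb(NUM_PARAMS, 1)  # int8 weights: ~105 MB, about a 4x reduction

print(f"fp32: {fp32_mb:.1f} MB, int8: {int8_mb:.1f} MB")
```

The observed 417.7 MB matches the fp32 estimate, which is consistent with the "clean" checkpoint still storing full-precision weights rather than int8 ones.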

To Reproduce

  1. Go to Google Colab and change to a GPU runtime.
  2. Run the following:

     ```shell
     pip install deepspeed==0.7.0
     git clone https://github.com/microsoft/DeepSpeedExamples
     cd DeepSpeedExamples/model_compression/bert
     ```

  3. In the zero_quant.sh file, change master_port (e.g. to 9995) and set task to sst2 and eval_batch_size to 32 (otherwise you'll get CUDA out of memory).
  4. Run: bash bash_script/ZeroQuant/zero_quant.sh

Expected behavior I expected the final clean model to be a compressed version of the original one, and thus smaller & faster, but it isn't.
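One possible explanation (an assumption on my part, not confirmed by the thread): quantization-aware pipelines often train with "fake-quantized" weights, i.e. weights that are quantized and immediately dequantized back to fp32, so the saved checkpoint stays full size until a separate export step stores real int8 tensors. A minimal symmetric int8 quantize/dequantize round trip, as a sketch of that idea (not DeepSpeed's actual implementation):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: map [-max|w|, max|w|] to [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover fp32 values; this is what a fake-quantized checkpoint stores."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)  # fp32 again: same dtype and size as the original weights
```

If only `w_hat` (fp32) is written to disk rather than `q` (int8) plus `s`, the file size and inference speed are unchanged, matching what the reporter observed.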

ds_report output: [screenshot attached]

System info (please complete the following information):

RezaYazdaniAminabadi commented 2 years ago

Hey @K2triinK

I am wrapping up this PR, which addresses part of your question, such as the model-size reduction. Regarding the kernels, we are working on a plan to release them soon so that you can give them a try. Thanks, Reza