microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

ZeroQuant not compressing and making BERT slower #2239

Open K2triinK opened 2 years ago

K2triinK commented 2 years ago

Describe the bug I was expecting a compressed & faster BERT model after running the BERT ZeroQuant example in DeepSpeedExamples. However, the clean model isn't any smaller (still 417.7 MB) or faster (in fact, it's slower) than the original.
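For context, a back-of-the-envelope size estimate (my own illustration, not from the repo) shows why a genuinely int8-quantized BERT-base should be much smaller than 417.7 MB: size is roughly parameter count times bytes per parameter, and BERT-base has about 110M parameters.

```python
# Rough model-size arithmetic for BERT-base (~110M parameters).
# These are illustrative estimates, not measurements from DeepSpeedExamples.
NUM_PARAMS = 110_000_000  # approximate BERT-base parameter count

def model_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Storage size in MiB for a model with uniform parameter dtype."""
    return num_params * bytes_per_param / (1024 ** 2)

fp32_mb = model_size_mb(NUM_PARAMS, 4)  # fp32: ~420 MB, close to the 417.7 MB observed
int8_mb = model_size_mb(NUM_PARAMS, 1)  # int8 weights: ~105 MB, about a 4x reduction

print(f"fp32: {fp32_mb:.1f} MB, int8: {int8_mb:.1f} MB")
```

The observed 417.7 MB matches the fp32 estimate, which is consistent with the "clean" checkpoint still storing full-precision weights rather than int8 ones.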

To Reproduce

  1. Go to Google Colab and change to a GPU runtime.
  2. Run the following:

     ```shell
     pip install deepspeed==0.7.0
     git clone https://github.com/microsoft/DeepSpeedExamples
     cd DeepSpeedExamples/model_compression/bert
     ```

  3. In the zero_quant.sh file, change master_port (e.g. to 9995) and set task to sst2 and eval_batch_size to 32 (otherwise you'll get CUDA out of memory).
  4. Run: bash bash_script/ZeroQuant/zero_quant.sh

Expected behavior I expected the final clean model to be a compressed version of the original one, and thus smaller & faster, but it isn't.
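One possible explanation (an assumption on my part, not confirmed by the thread): quantization-aware pipelines often train with "fake-quantized" weights, i.e. weights that are quantized and immediately dequantized back to fp32, so the saved checkpoint stays full size until a separate export step stores real int8 tensors. A minimal symmetric int8 quantize/dequantize round trip, as a sketch of that idea (not DeepSpeed's actual implementation):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: map [-max|w|, max|w|] to [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover fp32 values; this is what a fake-quantized checkpoint stores."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)  # fp32 again: same dtype and size as the original weights
```

If only `w_hat` (fp32) is written to disk rather than `q` (int8) plus `s`, the file size and inference speed are unchanged, matching what the reporter observed.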

ds_report output: [screenshot attached]

System info (please complete the following information):

RezaYazdaniAminabadi commented 2 years ago

Hey @K2triinK

I am wrapping up this PR, which addresses part of your question, such as the model-size reduction. Regarding the kernels, we are working on a plan to release them soon so that you can give them a try. Thanks, Reza