Thanks for your great work! But I'm a bit puzzled.
I ran `smoothquant_opt_demo.ipynb` and printed the sizes of `model_FP16`, `model_w8a8`, and `model_smoothquant_w8a8` using `print_model_size(model)` from `smoothquant_opt_real_int8_demo.ipynb`.
I understand that you simulate INT8 inference in FP16 with `fake_quant.py`, so the model size should be the same as `model_FP16`.
But even when I changed `w.div(scales).round().mul(scales)` to `w.div(scales).round()` in `fake_quant.py`, I still got a similar model size.
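To make my experiment concrete, here is a minimal sketch of what I tried (not the exact code in `fake_quant.py`; the per-tensor symmetric scale below is my own assumption):

```python
import torch

# Toy FP16 weight, just to check dtypes and storage size.
w = torch.randn(4096, 4096).half()
scales = w.abs().max() / 127  # assumed per-tensor symmetric scale

w_fake = w.div(scales).round().mul(scales)  # original quantize-dequantize
w_int = w.div(scales).round()               # my modified version

# Both results keep the FP16 dtype, so each weight still takes 2 bytes,
# which would explain why the reported model size barely changes.
print(w_fake.dtype, w_int.dtype)  # torch.float16 torch.float16
```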
I'm confused about how `fake_quant.py` works and how to achieve real INT8.
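If I understand correctly, getting a real size reduction would require storing the weights as `torch.int8` plus a scale for dequantization at runtime, something like this sketch (the function name and per-tensor scale are my own, not from the repo):

```python
import torch

@torch.no_grad()
def real_int8_quantize(w: torch.Tensor):
    # Hypothetical sketch: keep an int8 weight tensor plus an FP16 scale,
    # which is what actually shrinks the weight memory.
    scales = w.abs().max() / 127
    w_q = w.div(scales).round().clamp(-128, 127).to(torch.int8)
    return w_q, scales.half()

w = torch.randn(4096, 4096).half()
w_q, s = real_int8_quantize(w)
print(w.element_size(), w_q.element_size())  # 2 bytes vs. 1 byte per weight
```

Is this the right way to think about the difference between the fake-quant demo and `smoothquant_opt_real_int8_demo.ipynb`?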
Looking forward to hearing from you soon. 🤩