Why use the 8-bit floating numbers to compute the original cost?

mit-han-lab / haq

[CVPR 2019, Oral] HAQ: Hardware-Aware Automated Quantization with Mixed Precision

https://hanlab.mit.edu/projects/haq/

MIT License


Why use the 8-bit floating numbers to compute the original cost? #16

Open shiyuetianqiang opened 4 years ago

shiyuetianqiang commented 4 years ago

Hi, this work is amazing. While reading through the code, I found that you use 8-bit floating-point numbers to compute the original cost and store it as a lookup table. Why not use 32-bit floating point (without the "--half" flag during pretraining) or 16-bit floating point (with the "--half" flag during pretraining)? Could you please clarify? Thanks a lot!

shiyuetianqiang commented 4 years ago

Sorry, I got it. It seems that you use 8-bit as the baseline.
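
For anyone else reading, here is a minimal sketch of what an 8-bit baseline could look like. It is not the repository's code. The layer names and weight counts are hypothetical, and it assumes model size in bits as the cost. The cost of each layer is computed once at 8 bits and stored in a lookup table, and a mixed-precision policy is then measured relative to that total.

```python
# Hypothetical per-layer weight counts; not taken from the HAQ repository.
layer_params = {"conv1": 864, "conv2": 18432, "fc": 1280000}

BASELINE_BITS = 8  # assumption: uniform 8-bit quantization is the reference point


def size_cost(bits_per_layer):
    """Total weight storage in bits for a given per-layer bit-width policy."""
    return sum(layer_params[name] * bits for name, bits in bits_per_layer.items())


# Lookup table: the baseline cost of each layer, computed once at 8 bits.
baseline_table = {name: n * BASELINE_BITS for name, n in layer_params.items()}
baseline_cost = sum(baseline_table.values())

# A mixed-precision policy is then judged relative to the 8-bit baseline.
policy = {"conv1": 8, "conv2": 6, "fc": 4}
ratio = size_cost(policy) / baseline_cost
print(f"cost relative to 8-bit baseline: {ratio:.3f}")
```

With this framing, a ratio below 1 means the policy is cheaper than uniform 8-bit quantization, so the 32-bit or 16-bit pretraining precision never enters the comparison.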