pytorch / glow

Compiler for Neural Network hardware accelerators
Apache License 2.0

[Quantization] MatMul result precision #4772

Open psyhtest opened 4 years ago

psyhtest commented 4 years ago

When inspecting a quantized graph for a simple FC network, I was a bit surprised that the result of MatMul nodes is coerced into i8, possibly resulting in a huge accuracy loss. I would have thought that the result of MatMul should remain i32, especially if an i32 bias is to be added. What am I missing here?

It may be easy enough to update the result type of MatMul nodes and propagate the change down the graph, but I'm not sure how this would play with collecting profiling info. Any hints would be greatly appreciated!

/cc: @mciprian13

mciprian13 commented 4 years ago

@psyhtest Most of the quantized FullyConnected implementations I have seen do have an int8 output precision, not int32. The actual problem with the current FullyConnected implementation in Glow for the CPU backend is that the FullyConnected is computed using two layers, MatMul and BatchedAdd, so the precision is reduced to int8 before the int32 bias is added. A more accurate implementation would be an atomic one, where the int32 bias is added into the int32 accumulator of the MatMul before reducing to the final int8 precision. This problem has been raised multiple times, and there are other issues suggesting an atomic implementation of the FullyConnected.
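The accuracy gap between the two strategies can be seen numerically. Below is a minimal sketch in plain Python (not Glow code); all scales and tensor values are invented for illustration, and zero points are assumed to be 0:

```python
# Hedged sketch of the two quantized FC strategies discussed above.
# All scales and values are made up for illustration; zero points are 0.

def quantize(x, scale, lo=-128, hi=127):
    """Round x to the nearest multiple of `scale` and clamp to int8 range."""
    return max(lo, min(hi, round(x / scale)))

in_scale, w_scale = 0.05, 0.02
acc_scale = in_scale * w_scale   # scale of the int32 MatMul accumulator
out_scale = 0.1                  # scale of the int8 FC output

inputs  = [quantize(v, in_scale) for v in (1.0, 2.0)]    # int8 inputs
weights = [quantize(v, w_scale)  for v in (0.24, 0.1)]   # int8 weights
bias_f  = 0.44
bias_i32 = round(bias_f / acc_scale)                     # int32 bias

acc = sum(i * w for i, w in zip(inputs, weights))        # int32 accumulator

# (a) Lowered path (MatMul + BatchedAdd): requantize to int8 BEFORE the bias add.
mm_i8   = quantize(acc * acc_scale, out_scale)
bias_i8 = quantize(bias_f, out_scale)
lowered = max(-128, min(127, mm_i8 + bias_i8))

# (b) Atomic path: add the int32 bias in the int32 accumulator, THEN requantize.
atomic = quantize((acc + bias_i32) * acc_scale, out_scale)

ref = 1.0 * 0.24 + 2.0 * 0.1 + bias_f                    # float reference
print(lowered * out_scale, atomic * out_scale, ref)
# With these values the lowered path yields 0.8 while the atomic path
# yields 0.9, closer to the float reference (~0.88): the int8 rounding
# before the bias add is where the extra error comes in.
```

The error in path (a) comes from rounding the accumulator to the coarse int8 output grid before the bias is added, so the fractional parts of the product and the bias can no longer combine.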

psyhtest commented 4 years ago

Thank you so much for the explanation @mciprian13! I have a prototype graph pass to fold MatMul and BatchedAdd into a new atomic node (inspired by your PR https://github.com/pytorch/glow/pull/4280). But I guess I should implement it in the CPU backend for the quantization flow to work properly?

mciprian13 commented 4 years ago

@psyhtest There is no need for any graph transformation pass, since a backend can choose not to lower an FC into a MatMul + BatchedAdd. The only thing missing is an atomic implementation of the FullyConnected in the CPU backend (specifically, in the libjit library).

psyhtest commented 4 years ago

@mciprian13 I see. Thanks for the hint! My source has Gemms instead of FCs, but maybe those can be folded into FCs.
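For reference, the folding mentioned here works because an ONNX Gemm computes Y = alpha * A @ B + beta * C (ignoring the transpose attributes), and with alpha = beta = 1 and a 1-D broadcast bias C this is exactly the FullyConnected expression Y = X @ W + b. A small plain-Python sketch of the equivalence (not Glow code; the function names are mine):

```python
# Hedged sketch: a Gemm with alpha = beta = 1, no transposes and a 1-D
# bias is the same computation as a FullyConnected, which is why such
# Gemm nodes can be folded into FCs.

def gemm(A, B, C, alpha=1.0, beta=1.0):
    """Dense Y = alpha * A @ B + beta * C, with a 1-D row-broadcast C."""
    return [[alpha * sum(A[i][k] * B[k][j] for k in range(len(B)))
             + beta * C[j]
             for j in range(len(B[0]))]
            for i in range(len(A))]

def fully_connected(X, W, b):
    """FC is the alpha = beta = 1 special case of Gemm."""
    return gemm(X, W, b)

X = [[1.0, 2.0], [3.0, 4.0]]    # batch of 2, 2 input features
W = [[0.5, -1.0], [0.25, 2.0]]  # 2 inputs -> 2 outputs
b = [0.1, 0.2]

assert gemm(X, W, b) == fully_connected(X, W, b)
```

Gemm nodes with non-default alpha, beta, or transpose attributes would need those folded away (or rejected) before such a rewrite applies.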