The NVDLA documentation doesn't clearly describe how the scaling converters need to be programmed for INT8 quantized DNN inference. My question/confusion specifically is: how are the scales (i.e., the calibration table) computed for passing to the NVDLA compiler? The documentation recommends using TensorRT for calibration but doesn't say exactly what the scale means. Here is my understanding; consider:
quantizedLayerInput = S1 * Input
quantizedWeights    = S2 * W
resultTensor        = S1 * S2 * R                      // R = Input * W, the float result
INT8ResultTensor    = resultTensor * S3 / (S1 * S2)    // S3 computed from the layer output distribution
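To sanity-check that reading, here is a minimal NumPy sketch of the pipeline (toy shapes and made-up scales, not NVDLA code; the scales are powers of two and the inputs exact multiples of 1/S1 and 1/S2 so that rounding introduces no error in the demo):

```python
import numpy as np

# Hypothetical scales, chosen so the toy inputs quantize exactly.
# Real calibration scales come from the observed distributions.
S1, S2, S3 = 8.0, 4.0, 2.0

rng = np.random.default_rng(0)
# Float tensors whose values are exact multiples of 1/S1 and 1/S2.
Input = rng.integers(-8, 9, size=(4, 8)) / S1
W = rng.integers(-4, 5, size=(8, 3)) / S2

# Quantize activations and weights to INT8.
q_input = np.clip(np.round(S1 * Input), -128, 127).astype(np.int8)
q_weights = np.clip(np.round(S2 * W), -128, 127).astype(np.int8)

# INT8 x INT8 with INT32 accumulation: this equals S1 * S2 * R,
# where R = Input @ W is the float result.
acc = q_input.astype(np.int32) @ q_weights.astype(np.int32)

# Converter step: rescale the accumulator into the output's INT8
# domain with the combined factor S3 / (S1 * S2).
q_out = np.clip(np.round(acc * (S3 / (S1 * S2))), -128, 127).astype(np.int8)

# Reference: quantize the float result R directly with S3.
ref = np.clip(np.round(S3 * (Input @ W)), -128, 127).astype(np.int8)
```

With exact quantization, `q_out` and `ref` agree element-for-element, which is what makes S3 / (S1 * S2) the right rescale factor for the accumulator.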
Each scale is computed from its tensor's distribution as follows:
S_dist = 256 / (dist_max - dist_min)
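In code, that formula would look like the following (a small sketch with a made-up activation sample; the function name is mine, not from any NVDLA/TensorRT API):

```python
import numpy as np

def dist_scale(tensor):
    """Scale from the tensor's observed range: 256 / (max - min)."""
    return 256.0 / (tensor.max() - tensor.min())

# Hypothetical sampled activations with range [-2.0, 2.0].
acts = np.array([-2.0, -0.5, 0.0, 1.5, 2.0])
print(dist_scale(acts))  # 256 / 4.0 = 64.0
```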
If this understanding is correct, the scale passed to the NVDLA compiler should be:
S3 / (S1 * S2)
Guidance is very much appreciated.
Thanks, Hashim