Automatically select the relative quantization precision of each layer, i.e., whether to use fp16, int8, or as little as one bit per layer.
Not all layers have the same distribution of floating-point values, so the network can be significantly more sensitive to quantization in some layers than in others.
First-order gradients are not enough to measure this sensitivity, so the second-order Hessian is used: the top Hessian eigenvalue \lambda of a layer indicates how sensitive that layer is to quantization. Taking the layer size into account as well (number of parameters, denoted n), the sensitivity of a layer is defined as \lambda / n.
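As a toy illustration with made-up numbers: a large layer with \lambda = 60 and n = 3,000,000 parameters gets sensitivity 60 / 3e6 = 2e-5, while a small layer with \lambda = 4 and n = 10,000 gets 4 / 1e4 = 4e-4, so the small layer is ranked as more sensitive and should be kept at higher precision despite its smaller eigenvalue.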
The Hessian information is computed on the pre-trained network with a matrix-free power iteration algorithm: it only needs Hessian-vector products and avoids explicitly forming the Hessian, which would be far too large to materialize.
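A minimal sketch of matrix-free power iteration via Hessian-vector products with PyTorch autograd, not the paper's released code; the model, loss, data batch, and iteration/tolerance settings are placeholder assumptions.

```python
import torch

def top_hessian_eigenvalue(loss, params, iters=50, tol=1e-4):
    """Power iteration on the Hessian of `loss` w.r.t. `params` (a list of tensors).
    Only Hessian-vector products are used; the Hessian is never formed explicitly."""
    # First-order gradients, keeping the graph so we can differentiate a second time.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit-norm starting vector with the same shapes as the parameters.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]

    eigenvalue = None
    for _ in range(iters):
        # Hessian-vector product: differentiate (grad . v) w.r.t. the parameters.
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)

        # Rayleigh quotient v^T H v as the current top-eigenvalue estimate (v is unit norm).
        new_eig = sum((h * x).sum() for h, x in zip(hv, v)).item()

        # Normalize Hv to obtain the next iterate.
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / (norm + 1e-12) for h in hv]

        if eigenvalue is not None and abs(new_eig - eigenvalue) / (abs(eigenvalue) + 1e-12) < tol:
            eigenvalue = new_eig
            break
        eigenvalue = new_eig
    return eigenvalue

# Hypothetical per-layer usage, following the notes' sensitivity metric \lambda / n:
# loss = loss_fn(model(x), y)
# lam = top_hessian_eigenvalue(loss, list(layer.parameters()))
# n = sum(p.numel() for p in layer.parameters())
# sensitivity = lam / n
```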
Multi-stage fine-tuning: define an order in which layers are re-trained by sorting them in descending order of the product of \lambda and the weight difference caused by quantization (how far the quantized weights move from the original ones).
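A small sketch (my own illustration, not the paper's code) of deriving such a fine-tuning order, assuming the "weight difference" is measured as the squared L2 norm ||Q(W) - W||^2; the layer dictionaries, their `lam` values, and the quantized weights are placeholders.

```python
import torch

def finetune_order(layers):
    """Sort layers for multi-stage fine-tuning.

    `layers` is a list of dicts {"name": str, "lam": float, "w": Tensor, "w_quant": Tensor},
    where `lam` is the layer's top Hessian eigenvalue and `w_quant` its quantized weights.
    Returns layer names in descending order of lam * ||Q(W) - W||^2 (squared L2 is an assumption).
    """
    def score(layer):
        perturbation = (layer["w_quant"] - layer["w"]).pow(2).sum().item()
        return layer["lam"] * perturbation

    return [layer["name"] for layer in sorted(layers, key=score, reverse=True)]

# Hypothetical usage: the most sensitive / most perturbed layers are fine-tuned first.
# order = finetune_order(layer_stats)
```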
ICCV 2019
Contribution
Hessian-aware automatic selection of each layer's quantization precision (mixed precision across layers, e.g., fp16, int8, or lower bit-widths).
A multi-stage fine-tuning scheme with a Hessian-guided fine-tuning order for re-training.