hey-yahei / Quantization.MXNet

Simulate quantization and quantization aware training for MXNet-Gluon models.
MIT License
46 stars 7 forks

How to avoid "self.current_input_max = F.max(F.abs(x), axis=(1, 2, 3)).mean().asscalar()" to be a node in graph ? #1

Open yang2640 opened 5 years ago

yang2640 commented 5 years ago

"self.current_input_max = F.max(F.abs(x), axis=(1, 2, 3)).mean().asscalar()" inside "def _conv2d_forward", this will create symbol node, and should cause error because asscalar() is not supported in symbol.

How can I prevent "self.current_input_max = F.max(F.abs(x), axis=(1, 2, 3)).mean().asscalar()" from becoming a node in the graph?
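For context, a minimal sketch that reproduces the behaviour being described (the block name Demo is made up for illustration):

import mxnet as mx
from mxnet.gluon import nn

class Demo(nn.HybridBlock):
    def hybrid_forward(self, F, x):
        # Fine when F is mxnet.nd, but once the block is hybridized F becomes
        # mxnet.sym, and Symbol has no asscalar(), so this line raises an error.
        self.current_input_max = F.max(F.abs(x), axis=(1, 2, 3)).mean().asscalar()
        return x

net = Demo()
net(mx.nd.ones((1, 3, 32, 32)))    # imperative mode: works
net.hybridize()
# net(mx.nd.ones((1, 3, 32, 32)))  # symbolic mode: fails on asscalar()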

hey-yahei commented 5 years ago

Hi, @zhouyang2640 I'm sorry, this is only a simulation of quantization to help with understanding, and the converted model can't be hybridized to a symbol directly. Try MXNet's built-in quantization if you need to accelerate your model with quantization.

Thanks

yang2640 commented 5 years ago

Thanks. I am using it for my own highly customized implementation, and I have figured out the error I encountered. It seems the quantization-aware-training implementation may have some errors when training from scratch (deferred initialization); loading parameters first and then training avoids a series of errors.

hey-yahei commented 5 years ago

Hi, @zhouyang2640 Oh.. yes, you're right. I had only considered quantization-aware training for pretrained models before. To do quantization-aware training from scratch, you can construct your custom model and run one forward pass first to trigger the deferred initialization. For example,

from mxnet import nd
from gluoncv.model_zoo import cifar_resnet56_v1

# Construct model and initialize
net = cifar_resnet56_v1()   # pretrained=False
net.initialize()

# Forward once to trigger the deferred initialization
in_ = nd.ones(shape=(1, 3, 32, 32))
_ = net(in_)

# Convert model and initialize quantized parameters as usual
# (convert.convert_model, qparams_init, exclude and converter are this repository's utilities, set up as in its README)
convert.convert_model(net, exclude=exclude, convert_fn=converter)
qparams_init(net)

Thanks, YaHei

yang2640 commented 5 years ago

The way the quantized convolution is simulated is an approximation, which is different from the TensorFlow paper. I think the big difference between the approximation and TensorFlow is where the rounding happens, i.e. where the information loss occurs. I am not sure how big the difference is?

hey-yahei commented 5 years ago

Hi, @zhouyang2640 Is the paper you mentioned Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference? In theory there is only a little difference between the simulation and the real thing -- when doing a simulated quantized convolution,

w'_fp32 = round(w_fp32 * w_scale) / w_scale
x'_fp32 = round(x_fp32 * x_scale) / x_scale
y = w'_fp32 * x'_fp32 + b_fp32

and when doing a real quantized convolution,

w_int = round(w_fp32 * w_scale)
x_int = round(x_fp32 * x_scale)
y = w_int * x_int / w_scale / x_scale + b_fp32

or

w_int = round(w_fp32 * w_scale)
x_int = round(x_fp32 * x_scale)
b_int = round(b_fp32 * w_scale * x_scale)
y = (w_int * x_int + b_int) / w_scale / x_scale

you will find that there is only a little difference between them for direct convolution. But,

  1. Besides round, ceil and floor may also be used to convert fp32 data to integers, so small differences will occur depending on which function is used.
  2. Only direct convolution and im2col-GEMM convolution are considered here. As for FFT convolution or Winograd convolution, because the data may be amplified greatly after the transform, a real quantized convolution will introduce a lot of quantization error.
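To illustrate the equivalence of the two formulations numerically, here is a minimal sketch (plain NumPy on an im2col-GEMM view of convolution; the shapes and scales are made up for illustration):

import numpy as np

rng = np.random.RandomState(0)
w_fp32 = rng.uniform(-1, 1, size=(8, 27)).astype(np.float32)    # weights in im2col layout
x_fp32 = rng.uniform(-1, 1, size=(27, 64)).astype(np.float32)   # input patches
b_fp32 = rng.uniform(-1, 1, size=(8, 1)).astype(np.float32)

w_scale = 127.0 / np.abs(w_fp32).max()   # symmetric int8-style scales
x_scale = 127.0 / np.abs(x_fp32).max()

# Simulated quantized convolution: quantize then dequantize, all in fp32
w_sim = np.round(w_fp32 * w_scale) / w_scale
x_sim = np.round(x_fp32 * x_scale) / x_scale
y_sim = w_sim @ x_sim + b_fp32

# Real quantized convolution: integer GEMM, rescale afterwards
w_int = np.round(w_fp32 * w_scale)
x_int = np.round(x_fp32 * x_scale)
y_real = (w_int @ x_int) / w_scale / x_scale + b_fp32

print(np.abs(y_sim - y_real).max())   # only floating-point rounding error apart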

Thanks, YaHei

yang2640 commented 5 years ago

I am doing experiments on the customized model and hope this approximation will work. I found that for my customized model, fake_bn significantly slows down the loss descent; once I disable fake_bn, the loss goes down as usual.

yang2640 commented 5 years ago

A follow-up question to my last one. Assuming fused batch norm is faster, and assuming training can reach the same performance, why do all the frameworks directly add an option, say "fused batch norm", or support an operation called "fused batch norm"?

hey-yahei commented 5 years ago

Hi, @zhouyang2640

  1. Note that you should bypass BatchNorm when you use fake_bn, otherwise the model will do batch normalization twice, which is harmful for training.
  2. fake_bn does much more computation than a normal bn, so it slows down training significantly.
  3. Since quantization-aware training introduces Gaussian-like noise into the convolution weights, it helps the model avoid overfitting but slows down training as well.
  4. BatchNormalization only makes sense during training and can be merged into the convolution for prediction. For more details, refer to Real-time object detection with YOLO | machinethink in English or Mobilenet-SSD网络解析 - BN层合并 | Hey~YaHei! in Chinese.
  5. Merging bn into the convolution accelerates prediction, but it is bad to train a model without bn, so fused batch norm is only used in the prediction phase. What's more, since we usually merge bn for prediction, quantization should be applied to the weights after the bn merge, not to the original ones; that is why we need fake_bn (see the sketch after this list).
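For reference, a minimal sketch of the merge (plain NumPy; the function name merge_bn and the argument layout are illustrative, not this repository's API):

import numpy as np

def merge_bn(weight, bias, gamma, beta, running_mean, running_var, eps=1e-5):
    # Fold BatchNorm parameters into the preceding convolution for prediction.
    # weight: (out_channels, in_channels, kh, kw), bias: (out_channels,)
    scale = gamma / np.sqrt(running_var + eps)            # per-output-channel scale
    merged_weight = weight * scale[:, None, None, None]
    merged_bias = (bias - running_mean) * scale + beta
    return merged_weight, merged_bias

In quantization-aware training, fake_bn then quantizes these merged weights rather than the raw convolution weights.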

Thanks YaHei

yang2640 commented 5 years ago

Yes, I have been careful with fake_bn, e.g. bypassing bn. My practical finding is that it is harmful for training: it indeed slows down the model's convergence (in terms of the loss descent). Is it a well-known fact that fused-batch-norm training is harmful, and that fused batch norm is typically not used for training, only for the forward (inference) pass?

hey-yahei commented 5 years ago

Hi, @zhouyang2640 Is the fused batchnorm you mentioned above equivalent to fake bn? In my understanding, fake bn does batch normalization inside the convolution layer (it does both batch normalization and convolution), while fused bn merges the parameters of the bn layer (running_mean, running_var, gamma, beta) directly into the convolution layer's parameters (weight, bias).

In the prediction phase, fake bn, merged bn and the original model with bn all have the same performance.

But in the training phase, fake bn and the original model with bn are both affected by bn, while fused bn is just a model without bn. Some papers show that bn accelerates the convergence of the model, so fused bn indeed slows down convergence. It seems that, in theory, fake bn should converge at the same speed as the original model with bn (but I have no practical evidence yet).

Thanks YaHei

yang2640 commented 5 years ago

Yes, I am referring to "fused batchnorm" as in the TensorFlow paper on quantization-aware training. But your implementation of fake_bn is actually "fused batchnorm".

yang2640 commented 5 years ago

I have found some TensorFlow quantization-aware-training code here: "https://github.com/tensorflow/tensorflow/blob/e4262fb2fbf1cb33aaea79ff81754d1e92e99af1/tensorflow/contrib/quantize/python/fold_batch_norms.py#L344"

" """Computes batch norm correction params. Before batch normalization is frozen: We use batch statistics for batch norm. correction_scale = sigma_b/sigma_mv correction_recip = 1/correction_scale correction_offset = 0 After batch normalization is frozen: correction_scale = sigma_b/sigma_mv correction_recip = 1 correction_offset = gamma*(mu_b/sigma_b-mu_mv/sigma_mv). Batch norm is frozen if global_step > bn_freeze_delay. The corrections ensure that: a) The weights are quantized after scaling by gamma/sigma_mv. This enables smoother training as the scaling on the weights changes slowly, rather than jump across mini-batches b) Changing the values of the corrections allows for one to switch between using batch statistics to using moving mean and average, without requiring changes to batch_norm "

Looking at the above trick, I have no idea why it is doing this.
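Reading only from the docstring above, the arithmetic seems to work out roughly as in the following sketch (plain NumPy for a single output channel; all names are illustrative and this is one reading of the docstring, not the actual TensorFlow implementation):

import numpy as np

def batch_norm_corrections(w, gamma, mu_b, sigma_b, mu_mv, sigma_mv, frozen):
    # Weights are always folded with the *moving* std, so the values seen by the
    # weight quantizer (w * gamma / sigma_mv) change slowly across mini-batches.
    w_folded = w * gamma / sigma_mv

    correction_scale = sigma_b / sigma_mv
    if not frozen:
        # Before freezing: rescale the conv output so the effective normalization
        # still uses the batch statistics (sigma_b), matching an ordinary bn.
        correction_recip = 1.0 / correction_scale
        correction_offset = 0.0
    else:
        # After freezing: keep the moving-std scaling and shift the bias so the
        # normalization switches to the moving statistics (mu_mv, sigma_mv).
        correction_recip = 1.0
        correction_offset = gamma * (mu_b / sigma_b - mu_mv / sigma_mv)
    return w_folded, correction_scale, correction_recip, correction_offset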