Hi,
Gradient mismatch happens when almost all of the activations A satisfy -1 < A < 1. Yes, you could use other activation functions to eliminate the mismatch, but usually at the cost of increased saturation and/or degeneration.
Please let me know if you have any questions!
Follow-up: a key intuition in this paper is that we don't force the activation distribution to follow some particular "shape"; instead, we just avoid the really undesired shapes.
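To make the "undesired shape" concrete, here is a minimal PyTorch sketch of a penalty of the form [ReLU(1 - |μ| - k·σ)]^2 (the formula quoted in the original question further down). The per-unit batch statistics and the final averaging over units are my assumptions for illustration, not necessarily the exact reduction used in the paper's code:

```python
import torch
import torch.nn.functional as F

def gradient_mismatch_penalty(pre_activations, k=1.0):
    """Sketch of a penalty of the form [ReLU(1 - |mu| - k*sigma)]^2.

    `pre_activations` is assumed to be a (batch, units) tensor; mu and
    sigma are the per-unit mean and standard deviation over the batch.
    """
    mu = pre_activations.mean(dim=0)      # per-unit mean
    sigma = pre_activations.std(dim=0)    # per-unit standard deviation
    # Positive only when |mu| + k*sigma < 1, i.e. the distribution is
    # concentrated inside (-1, 1); zero otherwise.
    return (F.relu(1.0 - mu.abs() - k * sigma) ** 2).mean()
```

When |μ| + k·σ < 1 (the distribution fits inside (-1, 1) up to k standard deviations), the penalty is positive; otherwise it is exactly zero, so distributions that already avoid the undesired shape are left untouched.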
Thank you for your reply. What is the undesired shape you want to prevent through the gradient mismatch loss? I think it may be a small-sigma distribution (a sharp normal distribution) within [-1, +1] (the 2nd and 3rd graphs of Figure 7 in the paper). Is that right?
An undesired distribution for gradient mismatch is one in which almost all the activations fall within the range [-1, 1]. Consider the activation function in the forward phase: it is just a sign function, so theoretically the gradient should be 0 at every point except A = 0, where the gradient is infinite. We use the clipped straight-through estimator to approximate the gradient, but at the cost of gradient mismatch. If all activations fall in [-1, 1], then the gradients w.r.t. all the activations are somewhat "inaccurate", whereas if all activations fall in (-inf, -1) or (1, inf), then the gradients are "accurate" (but suffer from the saturation problem in that case). The gradient mismatch loss penalizes the former case.
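To illustrate that trade-off, here is a minimal, generic sketch of a sign activation with the clipped straight-through estimator, a standard BNN construction (not claimed to be the exact code in this repository):

```python
import torch

class SignSTE(torch.autograd.Function):
    """Sign activation with the clipped straight-through estimator."""

    @staticmethod
    def forward(ctx, a):
        ctx.save_for_backward(a)
        return a.sign()

    @staticmethod
    def backward(ctx, grad_output):
        (a,) = ctx.saved_tensors
        # Pass the gradient through where |A| <= 1 (a non-zero but
        # "mismatched" estimate of the true zero/infinite gradient),
        # and block it where |A| > 1 (exact, but saturated).
        return grad_output * (a.abs() <= 1).to(grad_output.dtype)
```

With this estimator, a pre-activation inside [-1, 1] receives the approximate (mismatched) gradient, while one outside receives exactly zero gradient (saturation), which is the trade-off described above.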
Hi, BNN_DL is a very nice paper, solving a fundamental problem of BNNs. However, I have some trouble understanding degeneration, saturation, and gradient mismatch.
Based on my intuitive understanding so far: what, then, is the role of the gradient mismatch loss? It is difficult for me to understand the meaning of the formula [ReLU(1 - |μ| - k·σ)]^2. I thought the only way to avoid the gradient mismatch problem was to change the activation function (as in BNN+ or Self-Binarizing Networks).
Could you explain in more detail?
Thanks.