Hi,
Gradient mismatch happens when almost all of the activations A satisfy -1 < A < 1. Yes, you could use other activation functions to eliminate the mismatch, but usually at the cost of increased saturation and/or degeneration.
Please let me know if you have any questions!
Follow-up: a key intuition in this paper is that we don't force the activation distribution to follow some particular "shape"; instead, we just avoid the really undesired shapes.
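To make the "undesired shape" concrete, here is a minimal PyTorch sketch of a penalty of the form [ReLU(1 - |μ| - k·σ)]^2 (the formula quoted in the original question further down). The per-unit batch statistics and the final averaging over units are my assumptions for illustration, not necessarily the exact reduction used in the paper's code:

```python
import torch
import torch.nn.functional as F

def gradient_mismatch_penalty(pre_activations, k=1.0):
    """Sketch of a penalty of the form [ReLU(1 - |mu| - k*sigma)]^2.

    `pre_activations` is assumed to be a (batch, units) tensor; mu and
    sigma are the per-unit mean and standard deviation over the batch.
    """
    mu = pre_activations.mean(dim=0)      # per-unit mean
    sigma = pre_activations.std(dim=0)    # per-unit standard deviation
    # Positive only when |mu| + k*sigma < 1, i.e. the distribution is
    # concentrated inside (-1, 1); zero otherwise.
    return (F.relu(1.0 - mu.abs() - k * sigma) ** 2).mean()
```

When |μ| + k·σ < 1 (the distribution fits inside (-1, 1) up to k standard deviations), the penalty is positive; otherwise it is exactly zero, so distributions that already avoid the undesired shape are left untouched.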
Thank you for your reply. What is the undesired shape you want to prevent through the gradient mismatch loss? I think it may be a small-sigma distribution (a sharp normal distribution) within [-1, +1] (the 2nd and 3rd graphs of Figure 7 in the paper). Is that right?
An undesired distribution for gradient mismatch is one in which almost all the activations fall within the range [-1, 1]. Consider the activation function in the forward phase: it is just a sign function, so theoretically the gradient should be 0 at every point except A = 0, where the gradient is infinite. We use the clipped straight-through estimator to approximate the gradient, but at the cost of gradient mismatch. If all activations fall in [-1, 1], then the gradients w.r.t. all the activations are somewhat "inaccurate", whereas if all activations fall in (-inf, -1) or (1, inf), then the gradients are "accurate" (but suffer from the saturation problem in that case). The gradient mismatch loss penalizes the former case.
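To illustrate that trade-off, here is a minimal, generic sketch of a sign activation with the clipped straight-through estimator, a standard BNN construction (not claimed to be the exact code in this repository):

```python
import torch

class SignSTE(torch.autograd.Function):
    """Sign activation with the clipped straight-through estimator."""

    @staticmethod
    def forward(ctx, a):
        ctx.save_for_backward(a)
        return a.sign()

    @staticmethod
    def backward(ctx, grad_output):
        (a,) = ctx.saved_tensors
        # Pass the gradient through where |A| <= 1 (a non-zero but
        # "mismatched" estimate of the true zero/infinite gradient),
        # and block it where |A| > 1 (exact, but saturated).
        return grad_output * (a.abs() <= 1).to(grad_output.dtype)
```

With this estimator, a pre-activation inside [-1, 1] receives the approximate (mismatched) gradient, while one outside receives exactly zero gradient (saturation), which is the trade-off described above.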
Hi, BNN_DL is a very nice paper, solving a fundamental problem of BNNs. However, I have some trouble understanding degeneration, saturation, and gradient mismatch.
Based on my intuitive understanding so far: what, then, is the role of the gradient mismatch loss? It is difficult for me to understand the meaning of the formula [ReLU(1 - |μ| - k·σ)]^2. I thought the only way to avoid the gradient mismatch problem was to change the activation function (as in BNN+ or Self-Binarizing Networks).
Could you explain in more detail?
Thanks.