Dear author,
Thanks for publishing this inspiring paper.
Q1: I found a slight mismatch between the variance of the initial noise in Equation 5 of the paper,
sigma^2 = p * (1-p) * a_i^2,
and the snippet in your code. Why does the initial variance p * (1-p) * a_i^2 from the paper become a_i^2 * p / (1-p) in the implementation?
Looking forward to your answer ;-D Thanks.
Q2: How can I find the reference equation for propagating variance through a batch normalization layer? I looked into the paper "Bayesian Uncertainty Estimation for Batch Normalized Deep Networks" but failed to identify it.
Btw, the return values seem to differ between training and evaluation mode in your batchnorm code. Could you also explain why that is?
Thanks again.
Hi,
thanks for the questions and the interest.
Q1: Definitely a justified question :) I should have made my intention clearer at that point. I deviate from the formula in the paper in order to achieve comparability with the keras dropout implementation. When I sample using dropout, I run models in training mode in keras. Keras scales activations during the training phase by 1/(1-p) (see here). In order to compare my results with the sampling-based results obtained with keras, I need to "imitate" keras training behaviour and substitute a_i^2 with (a_i / (1 - p))^2 in the original formula. This leads to the deviation.
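For concreteness, here is a minimal numpy sketch (with made-up example values, not code from the repo) showing that the two expressions coincide once the keras scaling is plugged in:
```python
import numpy as np

p = 0.5                           # dropout rate
a = np.array([0.3, -1.2, 0.8])    # example pre-dropout activations

# Paper, Eq. 5: variance of the initial noise for activation a_i
var_paper = p * (1 - p) * a ** 2

# Keras training mode rescales kept activations by 1/(1 - p), so the
# same formula applied to the rescaled activations a_i / (1 - p) ...
var_keras = p * (1 - p) * (a / (1 - p)) ** 2

# ... is exactly the expression found in the code: a_i^2 * p / (1 - p)
assert np.allclose(var_keras, a ** 2 * p / (1 - p))
```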
Q2: I believe that code snippet for propagating through Batchnorm is an artifact from the past. It is probably not correct. We never ran experiments with it, as we considered it most important to compare with MC dropout. Also, the paper you mentioned only considers sampling-based uncertainty estimation using Batchnorm, so it does not provide such a formula.
If you still want to try to apply our method to Batchnorm, you can do the following: train a network with Batchnorm, then determine the variance of the mean and variance AFTER training by predicting on several batches from the training data distribution (this determines what noise the network is "used to" from training). Then apply our mechanism. If you are interested in the necessary formulas, I can provide them. Just let me know :)
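As a rough sketch of what I mean by determining that noise (the `batch_stats` helper here is hypothetical; how you extract the per-batch statistics depends on your setup):
```python
import numpy as np

def estimate_bn_statistic_noise(batches, batch_stats):
    """Estimate how noisy the Batchnorm statistics are after training.

    `batch_stats(batch)` is a hypothetical helper that returns the
    (mean, variance) of the activations entering a given Batchnorm
    layer for one batch of training data.
    """
    means, variances = [], []
    for batch in batches:
        mu, var = batch_stats(batch)
        means.append(mu)
        variances.append(var)
    # The spread of the per-batch statistics is the noise the network
    # is "used to"; it can serve as the initial variance that gets
    # propagated through the rest of the network.
    return np.var(means, axis=0), np.var(variances, axis=0)
```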
Best, Janis
Dear @janisgp,
Thanks for answering Q1 & Q2. For Q2, I would be interested to see the formulas for variance propagation through batchnorm. I would very much appreciate it if you could provide them! Additionally, I have a few more questions.
Q3: As mentioned in the paper, you refer to the a_i in the variance of the initial noise, p * (1-p) * a_i^2, as the "mean activation of node i". However, the a_i implemented in your code seems to be just the activation of the input, instead of a "mean" activation (I thought a "mean" activation would be the average of activations over multiple inputs?).
```python
if self.initial_noise:
    # x is `layer.input`, as specified in `uncertainty_propagator.py #L26`.
    out = x ** 2 * self.rate / (1 - self.rate)
```
Could you explain the discrepancy here?
Q4: In section 3.3, it is assumed that the activations are Gaussian distributed in order to approximate the variance of the ReLU activation. I looked into the reference paper "Fast Dropout Training" and still did not get why this assumption is valid (since your dropout implementation is not GaussianDropout?). Could you also explain this?
Appreciate your precious time again!
Hey,
regarding variance propagation through Batchnorm: with our approximation, you need to multiply the (full or diagonal) covariance matrix from the left and right by the Jacobian of the Batchnorm layer. You can actually find many websites that derive the Jacobians (or the partial derivatives) of Batchnorm, e.g. see here: https://kevinzakka.github.io/2016/09/14/batch_normalization/
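For the simplest case, a minimal sketch assuming the Batchnorm layer runs in inference mode with frozen statistics (the linked derivation covers the full training-mode Jacobian, which additionally depends on the batch statistics):
```python
import numpy as np

def propagate_var_through_bn(var_in, gamma, running_var, eps=1e-3):
    # In inference mode, Batchnorm applies the affine map
    #   y = gamma * (x - mu) / sqrt(running_var + eps) + beta
    # with frozen statistics, so its Jacobian w.r.t. x is the diagonal
    # matrix J = diag(gamma / sqrt(running_var + eps)). For a diagonal
    # input covariance Sigma, J Sigma J^T stays diagonal:
    jac_diag = gamma / np.sqrt(running_var + eps)
    return var_in * jac_diag ** 2
```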
Q3: Well spotted. That is actually a mistake in the paper. Of course, the variance of a Bernoulli random variable X scaled by a constant C is Var(C*X) = C^2 * Var(X). So the implementation is correct and the wording in the paper is incorrect.
Q4: Please check section 2.2 in the fast dropout paper. The argument is based on the central limit theorem. Of course, in practice many things can go wrong, e.g. the magnitude of one activation dominating everything. But often we observe that the individual summands contributing to an activation are distributed "similarly" enough, rendering the activation approximately Gaussian. I suggest you try it out yourself on a small network to convince yourself. Note that the stochasticity should then come only from sampling the Bernoulli distribution.
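A minimal numpy experiment along those lines (illustrative values; the only randomness is the dropout mask):
```python
import numpy as np

rng = np.random.default_rng(0)
p, n_in, n_samples = 0.5, 512, 10_000

w = rng.normal(size=n_in)   # fixed weights of a single unit
a = rng.normal(size=n_in)   # fixed incoming activations

# The only stochasticity comes from sampling the (scaled) Bernoulli masks.
masks = rng.binomial(1, 1 - p, size=(n_samples, n_in)) / (1 - p)
pre_act = (masks * a) @ w   # one pre-activation per sampled mask

# The sum of many "similarly" distributed terms is roughly Gaussian (CLT);
# its empirical moments match the analytic ones from the initial-noise formula:
print(pre_act.mean(), a @ w)                                # means agree
print(pre_act.var(), (p / (1 - p)) * np.sum((a * w) ** 2))  # variances agree
```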
Please let me know if there are further questions!
Many thanks and best, Janis
Hi @janisgp,
Thanks for the above great answers!
I am still wondering (Q5) why the variance propagation method works as an approximation of MC dropout, since the paper does not give much intuitive or theoretical explanation.
To my understanding, when training with dropout layers, the model implicitly learns a distribution over weights. When doing MC dropout during inference, we sample model weights from the learned distribution, and can thus obtain the approximate uncertainty over the weights. However, when doing variance propagation, we do not sample during inference. We analytically compute the variance at the first dropout layer in the network and then propagate it up to the final prediction layer, presenting the final variance as uncertainty. To me, this seems like we are still computing the variance based on a point estimate of the model weights (i.e., a single inference pass) and not explicitly computing the variance from a distribution over model weights. This leads to the question of why this analytically computed variance can be presented as model uncertainty (or as an approximation of MC dropout).
Looking forward to a more intuitive explanation of Q5!
Best, Howard
Hey,
You are questioning the very central idea of the paper there. I'm really sorry if it is not clear enough in the paper.
Let me try again: assume you trained a neural network with dropout layers. Then, at inference time, you could simply sample from the "learned" distribution, i.e. the Bernoulli distribution at each dropout layer, to obtain the variance of your prediction, which gives you some hint of how certain your network is. However, consider e.g. the first dropout layer. The (scaled) Bernoulli distribution your activations originate from after this dropout layer (a Bernoulli random variable scaled with the incoming activations) is fully defined by its first two moments, which we happen to know (-> p*a and p(1-p)*a^2, see previous answer). We can thus propagate these moments (approximately) using error propagation, instead of drawing several samples, and hope that the propagated moments are still a good description of the distribution of the activations. Equivalently, and very simply: imagine adding two Gaussian random variables. To get the variance and mean of the result, you could sample each variable and compute them from the population. However, you could also simply add the means and the variances to obtain the final result and avoid the computationally expensive sampling ;-)
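A quick numerical sanity check of that last analogy (illustrative values):
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=100_000)    # X ~ N(1, 2^2)
y = rng.normal(-0.5, 0.5, size=100_000)   # Y ~ N(-0.5, 0.5^2)

# Sampling-based moments of X + Y ...
print((x + y).mean(), (x + y).var())
# ... versus simply adding means and variances (no sampling needed):
print(1.0 - 0.5, 2.0 ** 2 + 0.5 ** 2)
```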
However, note that there are some caveats.
I hope I was able to give some clarity :)
If not just let me know.
Best, Janis
@howardyclo Does this answer your question? If so, can the issue be closed?
Hi @janisgp,
Still trying to fully understand why it works. The quality of the final variance seems to depend largely on the first dropout layer, which just scales the activations: p * (1-p) * a_i^2. Thinking from a numerical perspective: if the model is given an out-of-domain input, we want it to output high variance. Does that just mean that a_i will be large before the first dropout layer (and, vice versa, small for an in-domain input)? If yes, why does a_i have this kind of property? Is it because dropout training makes it so? Or... maybe I am over-simplifying, because the final variance doesn't depend only on the first variance but also on the weights and Jacobians of the intermediate layers (but what they do is also just scaling the initial variance, right?). Sorry, I think the question here is vague, but... I hope you understand my concern.
Howard
Hi @janisgp
Is there going to be a further response to the question? If the question is not clear, please let me know.
Thanks!
Hey,
sorry for the delay.
"Or... maybe I might just over-simplify this because the final variance doesn't just fully depend on the first variance, but also the weights and Jacobians of intermediate layers" Here you basically answer the question yourself.
"but what they do is also just scaling the initial variance, right?" No, it is not just scaling. The correlate random variables (activations) with each other. This can reduce or magnify variance.
If your mind drifts towards OOD samples, maybe think of it this way: when you train a NN with dropout, due to maximum-likelihood training, your NN basically learns how to minimize the influence of the noise on the predictions on the training data distribution. However, once you leave the training data distribution at test time, your network no longer knows how to minimize the influence of the noise. Consequently, the probability that you observe high variance in the output increases.
Does this clarify it?
Best, Janis
Hi @janisgp,
"When you train a NN with dropout, due to maximum likelihood training..." Yeah, that is one of the perspective that I thought before.
I think currently the issue can be closed. Very appreciated for your time for the interesting discussion! And actually, I am going to try to apply your work on our company's product. If there is any feedback I'll let you know ;-D.
Howard.
Hi @howardyclo ,
Good luck with applying our work :) Hope it can be helpful!
Contact me any time with feedback and any further questions that may arise! You are encouraged to use the email correspondence referenced in the paper!
Best, Janis
Hey Janis,
Regarding Q1, I am still a bit confused. I don't get the "imitating behaviour" you mean. As far as I understand, inference and training are independent of each other. That is, whether to use a_i^2 or (a_i/(1 - p))^2 in that formula depends on what the inputs are when executing the forward pass at that moment.
In keras, when model.fit() is used, the inputs after a dropout layer are scaled up by 1/(1 - p). But they are not when model.predict() is used (which you used for the UCI experiments); refer to this link. That means that if you used model.predict(), there would be no need to substitute a_i^2 with (a_i/(1 - p))^2 in the formula, because the magnitudes would be on the same scale as during training.
What do you think?
Best, Jianxiang
Hey Jianxiang :)
Sorry for my late response!
I am using model.predict() in the UCI experiment only for the reference RMSE of a standard neural network. When computing the dropout predictions, I call the model with training=True (see here). It's a bit confusing, but I first define the keras.backend.function that takes as its second input whether to execute in training mode, which I later set to True.
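For reference, the usual Keras 2.x idiom looks roughly like this (a sketch assuming a trained `model` with dropout layers and an input batch `x_test` exist; the exact code is in the linked repo):
```python
import numpy as np
from keras import backend as K

# The second input of the backend function is the learning phase;
# passing 1 runs the forward pass in training mode, so dropout masks
# are sampled at prediction time.
stochastic_forward = K.function(
    [model.input, K.learning_phase()], [model.output]
)
samples = np.stack([stochastic_forward([x_test, 1])[0] for _ in range(50)])
mean, var = samples.mean(axis=0), samples.var(axis=0)  # MC dropout estimate
```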
Does this make sense?
Best, Janis
Hey Janis,
thanks for the explanation, it's clear to me now :)!
Best, Jianxiang