janisgp / Sampling-free-Epistemic-Uncertainty

Code for the ICCV 2019 paper "Sampling-free Epistemic Uncertainty Estimation Using Approximated Variance Propagation"

Noise injection scaling after the first noise layer #3

Closed · jhss closed this issue 4 years ago

jhss commented 4 years ago

Dear author,

I read your paper and code with great interest, but some parts don't quite make sense to me. I would be grateful if you could answer the questions below.

When a noise layer is not the first one, I suspect that the noise injection is implemented like this:

class DropoutVarPropagationLayer(VarPropagationLayer):
    ...
    def _call_full_cov(self, x):
        ...
        # mean of the incoming activations (x holds their covariance matrix)
        mean = self.layer.input
        mean_shape = [mean.get_shape().as_list()[-1]]
        # per-unit mean and variance of the Bernoulli dropout mask
        new_mean = tf.ones(mean_shape, dtype=tf.float32) * (1 - self.rate_tensor)
        new_var = tf.ones(mean_shape, dtype=tf.float32) * self.rate_tensor * (1 - self.rate_tensor)
        out = covariance_elementwise_product_rnd_vec(mean, x, new_mean, new_var)

and its corresponding equation in the paper is (19) in the supplementary material.

My question is: why don't you scale the 'out' variable in this case? The first noise layer's output is out = x**2 * self.rate / (1 - self.rate), so I would expect a subsequent noise layer's output to be scaled like the first noise layer's, but it seems that it is not.

Would you elaborate on this?

janisgp commented 4 years ago

Hi,

thanks for your interest in our work :)

This is a valid question: in hindsight it looks confusing to me as well, and it probably could have been implemented more transparently. Nevertheless, let me try to elaborate:

The first thing to note is that the initial noise layer receives the prior activations as input, while subsequent noise layers receive the prior variance/covariance as input. Both actually implement the same formula ((19) in the supplementary material).

For the initial noise layer the formula simplifies a lot, because we only need the second term in (19). This is what the scaling implements (if you are confused about why the scaling is the way it is, first check out this issue: https://github.com/janisgp/Sampling-free-Epistemic-Uncertainty/issues/2).

For subsequent noise layers we also need to compute the other two terms in (19). If you check out the code of covariance_elementwise_product_rnd_vec, you will see that the three terms are calculated individually. Keeping in mind that the mean of the random variable describing the noise layer also incorporates this "scaling" (new_mean = tf.ones(mean_shape, dtype=tf.float32) * (1 - self.rate_tensor)), one can see that nothing is missing.
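For readers without the supplementary material at hand: based on this discussion, (19) is presumably the standard covariance of an element-wise product of two independent random vectors x and z (the dropout mask). The term ordering below is a guess from context, not a quote from the paper:

\mathrm{Cov}(x \odot z)_{ij} = \mathbb{E}[z_i]\,\mathbb{E}[z_j]\,\mathrm{Cov}(x)_{ij} + \mathbb{E}[x_i]\,\mathbb{E}[x_j]\,\mathrm{Cov}(z)_{ij} + \mathrm{Cov}(x)_{ij}\,\mathrm{Cov}(z)_{ij}

For the initial noise layer the input is deterministic, so Cov(x) = 0 and only the middle term survives, which is exactly the scaled diagonal computed in _call_full_cov.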

Does this make sense and clarify things? If not, please let me know :)

Best, Janis

jhss commented 4 years ago

Dear @janisgp,

I'm sorry for the late reply, and thank you for answering. Would you confirm whether my understanding is correct?

When "_call_full_cov" is called in the initial noise layer, x is activations, and corresponding noise is calculated as follows

def _call_full_cov(self, x):
    if self.initial_noise:
        # variance of Keras-style (inverted) dropout applied to activations x
        out = x**2 * self.rate / (1 - self.rate)
        # promote the per-unit variances to a diagonal covariance matrix
        out = tf.linalg.diag(out)

My understanding of the scaling is that the mathematical expression for the initial noise is x**2 * p * (1-p); but since "Keras scales activations during the training phase with 1/(1-p)", you multiply the initial noise by 1/(1-p)**2 and get x**2 * p / (1-p). In this case, you scale the covariance matrix of the activations. (Is this right?)
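As a quick Monte Carlo sanity check of that algebra (my own sketch, not code from the repository; all names here are made up):

import numpy as np

rng = np.random.default_rng(0)
p = 0.3              # dropout rate
x = 1.7              # a single deterministic activation
n = 1_000_000        # number of Monte Carlo samples

# Keras-style inverted dropout: keep with probability 1 - p, rescale by 1/(1 - p)
mask = rng.binomial(1, 1 - p, size=n) / (1 - p)
samples = x * mask

print(samples.var())         # empirical variance of the dropped-out activation
print(x**2 * p / (1 - p))    # analytic value discussed above

The two printed values should agree to about three decimal places, matching the x**2 * p / (1 - p) expression in the code.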

When "_call_full_cov" is called in subsequent noise layers, I think 'x' is a covariance matrix of activations, and 'self.layer.input' is mean of activations. (is this right?)

def _call_full_cov(self, x):
    if self.initial_noise:
        ...
    else:
        # mean of the activations; x is their (already scaled) covariance matrix
        mean = self.layer.input
        mean_shape = [mean.get_shape().as_list()[-1]]
        # per-unit mean and variance of the Bernoulli dropout mask
        new_mean = tf.ones(mean_shape, dtype=tf.float32) * (1 - self.rate_tensor)
        new_var = tf.ones(mean_shape, dtype=tf.float32) * self.rate_tensor * (1 - self.rate_tensor)
        out = covariance_elementwise_product_rnd_vec(mean, x, new_mean, new_var)

Since the covariance matrix of the activations was already scaled in the first noise layer, it doesn't need to be scaled again in a subsequent noise layer, so you just pass x to covariance_elementwise_product_rnd_vec and calculate a new covariance matrix. (Is this right?)
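For reference, a minimal sketch of what covariance_elementwise_product_rnd_vec plausibly computes, assuming the three-term formula for independent x and z quoted earlier in this thread (this is a reconstruction, not the repository's actual code; the signature merely mirrors the call site above, and batch dimensions are ignored):

import tensorflow as tf

def covariance_elementwise_product_rnd_vec(mean_x, cov_x, mean_z, var_z):
    # Cov(x ⊙ z) for independent random vectors x and z, where
    # mean_x: (d,), cov_x: (d, d), mean_z: (d,), var_z: (d,),
    # and Cov(z) = diag(var_z) because the dropout mask entries are independent.
    # term 1: E[z] E[z]^T ⊙ Cov(x) -- the only term with off-diagonal entries
    term1 = tf.einsum('i,j->ij', mean_z, mean_z) * cov_x
    # term 2: E[x_i]^2 Var(z_i), diagonal only since Cov(z)_{ij} = 0 for i != j
    term2 = tf.linalg.diag(mean_x**2 * var_z)
    # term 3: Cov(x)_{ii} Var(z_i), diagonal for the same reason
    term3 = tf.linalg.diag(tf.linalg.diag_part(cov_x) * var_z)
    return term1 + term2 + term3

Since mean_z is the constant vector (1 - p), term 1 reduces to (1 - p)**2 * cov_x, which is why no extra rescaling of x is needed here.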

Thank you very much for the detailed answer.

janisgp commented 4 years ago

Hi @jhss

You are correct in each of your statements :)

Please let me know if there are any other problems!

Best, Janis