waychin-weiqin / ITSA

ITSA: An Information-Theoretic Approach to Automatic Shortcut Avoidance and Domain Generalization in Stereo Matching Networks (CVPR 2022)

About the normalization of the gradients #3

Closed: tholmb closed this issue 2 years ago

tholmb commented 2 years ago

First of all, thanks for the great work! I'm a little bit confused about why the gradients are squared in the grad_norm() function:

  def grad_norm(self, grad):
      grad = grad.pow(2)   # <-----------------
      grad = F.normalize(grad, p=2, dim=1) 
      grad = grad * self.eps
      return grad

Equation (9) in the paper states $x^{*(i)} = x^{(i)} + \epsilon \frac{\nabla_{x} z^{(i)}}{\left\| \nabla_{x} z^{(i)} \right\|_{2}}$. In my opinion, F.normalize(grad, p=2, dim=1) without the power of 2 corresponds to the equation.
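
For context, here is a minimal sketch of how I read Eq. (9): perturb the input along its L2-normalized gradient, without any squaring. The encoder, eps value, and tensor shapes are placeholders I made up, not taken from the repository code.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  eps = 0.5                                  # hypothetical perturbation magnitude
  encoder = nn.Conv2d(3, 8, 3, padding=1)    # stand-in for the feature extractor

  x = torch.randn(4, 3, 64, 128, requires_grad=True)  # input images (N, C, H, W)
  z = encoder(x)                                       # latent features z = f(x)
  grad = torch.autograd.grad(z.sum(), x)[0]            # gradient of z w.r.t. x

  # Eq. (9): x* = x + eps * grad / ||grad||_2  (no squaring of the gradient)
  x_star = x + eps * F.normalize(grad, p=2, dim=1)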

What have I understood incorrectly?

waychin-weiqin commented 2 years ago

Hi @tholmb,

apologies for the late reply. There is a mistake in the provided code, and it will be corrected. You are right; according to eq 9 in the paper, the correct implementation should be just F.normalize(grad, p=2, dim=1).
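
Concretely, the corrected function would look something like this, keeping the epsilon scaling unchanged:

  def grad_norm(self, grad):
      # Follow Eq. (9): L2-normalize the gradient (no squaring), then scale by epsilon
      grad = F.normalize(grad, p=2, dim=1)
      grad = grad * self.eps
      return grad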

We also ran additional experiments in which we multiplied the epsilon parameter by either the absolute or the squared normalized gradients. We found that squaring the normalized gradients gives slightly better performance, and we think this is because squaring emphasizes the pixels that are more likely to contribute to shortcuts.

I hope this is helpful.

tholmb commented 2 years ago

Thanks for the reply! This is also what I thought: squaring the normalized gradients simply amplifies the effect and might actually lead to better results. I will close the issue.