Closed — tholmb closed this issue 2 years ago
Hi @tholmb,
apologies for the late reply. There is a mistake in the provided code, and it will be corrected. You are right; according to eq 9 in the paper, the correct implementation should be just F.normalize(grad, p=2, dim=1).
We also ran additional experiments multiplying the epsilon parameter by either the absolute or the squared normed gradients, and found that squaring the normed gradients gives slightly better performance. We believe this is because squaring emphasizes the pixels that are most likely to contribute to shortcuts.
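To make the two options concrete, here is a minimal sketch in NumPy (rather than PyTorch, for a dependency-free illustration). It assumes the "squared" variant means element-wise squaring of the L2-normalized gradient before scaling by epsilon; the function names `l2_normalize` and `perturb` are mine, not from the repository.

```python
import numpy as np

def l2_normalize(grad, axis=1, eps=1e-12):
    """NumPy analogue of F.normalize(grad, p=2, dim=1): divide each row
    by its L2 norm (clamped to avoid division by zero)."""
    norm = np.maximum(np.linalg.norm(grad, ord=2, axis=axis, keepdims=True), eps)
    return grad / norm

def perturb(x, grad, epsilon, squared=False):
    """Eq. (9): x* = x + epsilon * (normalized gradient).
    With squared=True, the normalized gradient is squared element-wise
    (the variant reported to work slightly better); note this emphasizes
    large-magnitude entries but also discards the gradient's sign."""
    g = l2_normalize(grad)
    if squared:
        g = g ** 2
    return x + epsilon * g

grad = np.array([[3.0, 4.0]])
print(l2_normalize(grad))                                  # [[0.6 0.8]]
print(perturb(np.zeros((1, 2)), grad, 1.0, squared=True))  # [[0.36 0.64]]
```

The plain `l2_normalize` path matches Eq. (9) exactly; the `squared=True` path is the empirical variant discussed above.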
I hope this is helpful.
Thanks for the reply! That matches my intuition: squaring the normed gradients just amplifies the effect and may actually lead to better results. I'll close the issue.
First of all, thanks for the great work! I'm a little confused about why the gradients are squared in the `grad_norm()` function.
Equation (9) in the paper states $x^{*(i)} = x^{(i)} + \epsilon \frac{\nabla_x z^{(i)}}{\left\| \nabla_x z^{(i)} \right\|_{2}}$. In my opinion, `F.normalize(grad, p=2, dim=1)` without the power of 2 corresponds to this equation.
What have I understood incorrectly?