jhjacobsen / invertible-resnet

Official Code for Invertible Residual Networks
MIT License

Spectral norm causes gradient signal to be lost when sigma exceeds coeff #4

Closed jarrelscy closed 5 years ago

jarrelscy commented 5 years ago

Something I was struggling with in my own implementation of Gouk's spectral norm is that a spectral-normalized layer seems to get stuck once its sigma reaches coeff.

What I mean by this is: take a spectral-normalized FC layer with 2 inputs and 1 output, feed it normally distributed random numbers, and ask it to maximize the output. The weights grow until sigma exceeds coeff.

Then take the same layer, feed it the same normally distributed random numbers, and ask it to minimize the output. You'd expect the weights to shrink until sigma reaches 0, but because its sigma starts above coeff, nothing happens! In fact the weights receive almost no gradient signal at all.

I think this might be because this line:

`sigma = torch.dot(u, torch.mv(weight_mat, v))`

runs with grad enabled, meaning the gradient is propagated along this pathway, forcing sigma to stay at 1.
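A minimal sketch of the effect being described (my own reconstruction, not the repo's code; `soft_normalized` and the exact `u`, `v` values are illustrative assumptions, with `u`, `v` chosen as the true singular vectors so `sigma = 5 > coeff`):

```python
import torch

coeff = 1.0  # assumed coefficient value for illustration

def soft_normalized(weight, u, v, detach_sigma=False):
    # sigma = u^T W v estimates the largest singular value
    sigma = torch.dot(u, torch.mv(weight, v))
    if detach_sigma:
        # proposed fix: treat sigma as a constant in the backward pass
        sigma = sigma.detach()
    # Gouk-style soft constraint: rescale only when sigma exceeds coeff
    return weight / torch.clamp(sigma / coeff, min=1.0)

# 1-output, 2-input FC layer with spectral norm 5
w = torch.tensor([[3.0, 4.0]], requires_grad=True)
u = torch.tensor([1.0])
v = torch.tensor([0.6, 0.8])

results = {}
for detach in (False, True):
    w.grad = None
    # "minimize the output" for an all-ones input
    loss = soft_normalized(w, u, v, detach_sigma=detach).sum()
    loss.backward()
    # component of the gradient along w itself, i.e. the only
    # direction that could actually shrink sigma
    radial = torch.dot(w.grad.flatten(), w.detach().flatten()) / w.detach().norm()
    results[detach] = radial.item()
    print(f"detach={detach}: radial gradient = {radial.item():.3f}")
```

With the gradient flowing through sigma, the radial component is exactly zero (the rescaled weight is invariant to the scale of `w`), so sigma can never decrease; detaching sigma restores a nonzero radial gradient.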

I have made a notebook to demonstrate this problem, and my 'fix' Gouk-jhjacobsen.zip

I'm not sure if this is the expected behaviour. I'd have thought it analogous to the dying ReLU problem: as layers' sigmas saturate, they drop out and stop learning, which might be suboptimal.

jhjacobsen commented 5 years ago

Thanks very much for sharing your observations, we are looking into it and will get back to you here.

jarrelscy commented 5 years ago

Not a problem. To clarify further: I think all that happens is that once a layer reaches sigma > coeff, its sigma becomes fixed at coeff, just like in the PyTorch Miyato implementation.

The layer weights can still update, but they would then be constrained, which I think defeats the purpose of the soft normalisation.
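To make the comparison concrete, here is a sketch of the two normalization schemes (hypothetical helper names; assumes sigma has already been estimated, e.g. by power iteration): once sigma > coeff, the soft version degenerates into the hard one.

```python
import torch

def miyato_norm(weight, sigma):
    # Miyato-style hard normalization: spectral norm is always forced to 1
    return weight / sigma

def gouk_norm(weight, sigma, coeff=1.0):
    # Gouk-style soft normalization: leave the weight untouched while
    # sigma <= coeff; rescale so the spectral norm equals coeff otherwise
    return weight / torch.clamp(sigma / coeff, min=1.0)

w = torch.tensor([[3.0, 4.0]])  # spectral norm 5
big, small = torch.tensor(5.0), torch.tensor(0.5)

print(gouk_norm(w, small))  # below coeff: weight is returned unchanged
print(gouk_norm(w, big))    # above coeff: same result as the hard constraint
```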

I have a hunch this may not be such a big problem in most cases, since only a few layers will run into sigma > coeff; as you pointed out, MNIST training seems to be reversible even without the spectral norm.

jhjacobsen commented 5 years ago

Now I understand what you mean: this is expected, and in fact desired, behaviour.