jonasrauber / eagerpy

PyTorch, TensorFlow, JAX and NumPy — all of them natively using the same code
https://eagerpy.jonasrauber.de
MIT License

norms should delegate to the backend where possible #6

Open mglisse opened 4 years ago

mglisse commented 4 years ago

Hello, with a PyTorch tensor t I can call t.norm(p, dim). This gives a result similar to eagerpy's lp, but it makes a huge difference when it comes to the gradient. PyTorch has this behavior where the derivative of sqrt at 0 is infinity (mathematically sensible), which often turns the gradient into NaN. However, functions like norm are handled specially, similarly to abs, and return a suitable subgradient (0) instead. Could you please make l2/lp/... call norm for the PyTorch backend?
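Roughly, what I have in mind for the pytorch backend is a delegation like the sketch below (the wrapper details are only a guess at eagerpy's internals; x.raw and the re-wrapping via type(x)(...) are illustrative, not the actual API):

import torch

def l2(x, axis=None, keepdims=False):
    # hypothetical PyTorch-backend version of norms.l2 that defers to
    # torch.norm instead of computing sqrt(sum(x**2)) by hand;
    # x.raw is assumed to hold the underlying torch.Tensor and type(x)(...)
    # to wrap the result back into the eagerpy tensor type (illustrative only)
    raw = x.raw
    if axis is None:
        result = torch.norm(raw, p=2)
    else:
        result = torch.norm(raw, p=2, dim=axis, keepdim=keepdims)
    return type(x)(result)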

mglisse commented 4 years ago

Here is an example to demonstrate the issue:

import torch
import eagerpy

a = torch.tensor([0.], requires_grad=True)

# torch.norm uses a special subgradient at 0:
torch.norm(a, p=2).backward()
print(a.grad)  # tensor([0.])

a.grad = None  # reset so the second gradient is not accumulated onto the first

# eagerpy's l2 does not, so the gradient becomes NaN:
eagerpy.astensor(a).norms.l2().raw.backward()
print(a.grad)  # tensor([nan])

This prints:

tensor([0.])
tensor([nan])
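Presumably the generic implementation boils down to something like sqrt(sum(x**2)); writing that out by hand gives the same NaN, which is where I suspect the problem comes from:

import torch

# sqrt-of-sum-of-squares: the derivative of sqrt at 0 is inf, and the
# chain rule then produces inf * 0 = NaN for the gradient
a = torch.tensor([0.], requires_grad=True)
a.square().sum().sqrt().backward()
print(a.grad)  # tensor([nan])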

jonasrauber commented 4 years ago

Hi @mglisse, thanks for the request and the example code. That makes a lot of sense, and I think this might be doable. May I ask how you use EagerPy? Do you just use it as an alternative API for PyTorch, without needing the ability to run the same code using different frameworks, or why is this only a problem with PyTorch?

mglisse commented 4 years ago

Hi, thanks for the reply. I use eagerpy so I can write the code only once and have it work with several frameworks. It is true that I currently mostly experiment with pytorch, though.

The problem isn't limited to pytorch. The first time I hit this NaN issue with pytorch, jax was giving good numbers, so I assumed they were doing something different. I didn't keep the exact code, and now that I try to reproduce it, I seem to get NaN from jax and pytorch in the same cases. So I don't know if my experiment at the time was bogus, or hit a very special case...

A good thing is that all frameworks seem to provide a norm function (at least for p not 0?). A bad thing is that the one in jax (I did not check tensorflow) does not seem to have a special (sub)gradient implementation: it also gives a NaN gradient for jax.numpy.linalg.norm(x, 2) at 0. But I could go ask them about that. Another bad thing is that the frameworks don't agree on the definition. On a matrix [[1,2],[3,4]] with p=1, numpy/jax return 6 (the matrix 1-norm, i.e. the maximum column sum) while torch/tensorflow return 10 (the sum of absolute values), which complicates things a bit...

Of course there are workarounds. I could compute the norms manually and add a tiny constant (working through the various dtype/finfo combinations to get it) before taking the square root. Or I could let eagerpy compute the norm and, if the result is 0, replace it with a constant via result = result.from_numpy(0.) (or actually some better formulation to get the right dtype; also, with pytorch this constant does not have requires_grad, so calling .raw.backward() directly on it without combining it with other numbers fails).
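For reference, here is the definition mismatch and the add-tiny workaround written out (just a sketch; picking eps from the dtype's finfo is only one option):

import numpy as np
import torch

m = [[1., 2.], [3., 4.]]

# numpy (and jax.numpy) treat ord=1 on a 2-D input as the matrix 1-norm
# (maximum column sum), while torch.norm(p=1) sums the absolute values:
print(np.linalg.norm(np.array(m), 1))    # 6.0
print(torch.norm(torch.tensor(m), p=1))  # tensor(10.)

# workaround for the NaN gradient: add a tiny constant before the sqrt,
# choosing it from the dtype's finfo
a = torch.tensor([0.], requires_grad=True)
eps = torch.finfo(a.dtype).tiny
(a.square().sum() + eps).sqrt().backward()
print(a.grad)  # tensor([0.]) instead of tensor([nan])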