lucidrains / reformer-pytorch

Reformer, the efficient Transformer, in Pytorch
MIT License

NaN value (not too often) from self-attention of Q and K #100

Closed muiPomeranian closed 4 years ago

muiPomeranian commented 4 years ago

This is not strictly related to this repo, so if it doesn't belong in your issues, please let me know and feel free to remove it!

Q0) When we do the einsum in the forward pass, is a gradient computed and updated for this Q and K? I assume not, since there is no weight here and it is just a pure calculation on vectors. Am I right?
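
For reference, a minimal sketch in plain PyTorch (not code from this repo): einsum is an ordinary differentiable op, so gradients do flow back to q and k during backprop even though no weight matrix is involved; whether anything gets "updated" depends on whether those tensors are themselves produced by learnable parameters.

import torch

# toy shapes (batch=1, heads=2, seq=4, head dim=8), chosen arbitrarily for this example
q = torch.randn(1, 2, 4, 8, requires_grad=True)
k = torch.randn(1, 2, 4, 8, requires_grad=True)

# same contraction pattern as the attention-score einsum
dots = torch.einsum('bhie,bhje->bhij', q, k)
dots.sum().backward()

print(q.grad.shape)  # torch.Size([1, 2, 4, 8]) -- the gradient reaches q
print(k.grad.shape)  # torch.Size([1, 2, 4, 8]) -- the gradient reaches k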

Q1) I tried an experiment with a customized self-attention calculation step, and I got NaN values at this line: https://github.com/lucidrains/reformer-pytorch/blob/master/reformer_pytorch/reformer_pytorch.py#L318

What I did is (for instance):

import torch

q = q[0, :, :, :] * 50   # scale q up
k = look_back(q)          # without normalization!! (look_back and dim come from my setup)

bq = torch.cosh(torch.log(q ** 2 + 1) ** 0.5)
bk = torch.cosh(torch.log(k ** 2 + 1) ** 0.5)
res = torch.einsum('bhie,bhje->bhij', bq, bk) * (dim ** -0.5)

and some of the values here (res) came out as NaN.

Do you have any guess as to why I get the NaN values? Or, in general, when do we get NaN here?

Q2) Should I assume that this NaN value is produced because the res values are either too small or too large? If it is because they are too SMALL, should I substitute -inf instead?

Q2-1) If it is because they are too LARGE, should I substitute 1 at the level of the softmax result?
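
One way to narrow this down (a generic PyTorch diagnostic sketch, not something from this repo) is to check each intermediate tensor for NaN/inf and look at its value range, so you can tell whether the problem already starts in bq/bk or only appears in res:

import torch

def report(name, t):
    # print whether a tensor contains NaN or inf, and its value range
    print(f'{name}: nan={torch.isnan(t).any().item()} '
          f'inf={torch.isinf(t).any().item()} '
          f'min={t.min().item():.3e} max={t.max().item():.3e}')

# e.g. call it on the intermediates from the snippet above:
# report('bq', bq); report('bk', bk); report('res', res)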

Sorry for the dumb question, Lucid!

lucidrains commented 4 years ago

@muiPomeranian Ahh, I'm not entirely sure, but I believe it has to do with your torch.log. As a rule of thumb, whenever you pass a tensor to log, make sure it is strictly positive and never 0 (most code just adds a small epsilon).
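
A minimal sketch of that rule of thumb (the epsilon value and the clamp are arbitrary choices for illustration, not code from this repo): log of 0 gives -inf and log of a negative number gives NaN, and either will poison everything downstream, so the argument is usually clamped away from zero first.

import torch

eps = 1e-9  # small constant; the exact value is an arbitrary choice here

x = torch.tensor([-1.0, 0.0, 2.0])
print(torch.log(x))                 # tensor([   nan,   -inf, 0.6931])
print(torch.log(x.clamp(min=eps)))  # finite everywhere: tensor([-20.7233, -20.7233, 0.6931])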