lucidrains / reformer-pytorch

Reformer, the efficient Transformer, in Pytorch
MIT License

NaN value (not too often) from self-attention of Q and K #100

Closed muiPomeranian closed 4 years ago

muiPomeranian commented 4 years ago

This is not strictly related to this repo, so if it doesn't belong in your issues, please let me know and feel free to remove it!

Q0) When we do the einsum in the forward pass, is a gradient computed and updated for this Q and K? I assume not, since there is no weight here and it is just a pure calculation on vectors. Am I right?
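
For reference, a minimal sketch in plain PyTorch (not code from this repo): einsum is an ordinary differentiable op, so gradients do flow back to q and k during backprop even though no weight matrix is involved; whether anything gets "updated" depends on whether those tensors are themselves produced by learnable parameters.

import torch

# toy shapes (batch=1, heads=2, seq=4, head dim=8), chosen arbitrarily for this example
q = torch.randn(1, 2, 4, 8, requires_grad=True)
k = torch.randn(1, 2, 4, 8, requires_grad=True)

# same contraction pattern as the attention-score einsum
dots = torch.einsum('bhie,bhje->bhij', q, k)
dots.sum().backward()

print(q.grad.shape)  # torch.Size([1, 2, 4, 8]) -- the gradient reaches q
print(k.grad.shape)  # torch.Size([1, 2, 4, 8]) -- the gradient reaches k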

Q1) I tried an experiment with a customized self-attention calculation step, and I got NaN values at this line: https://github.com/lucidrains/reformer-pytorch/blob/master/reformer_pytorch/reformer_pytorch.py#L318

What I did is (for instance):

import torch

q = q[0, :, :, :] * 50   # scale q up
k = look_back(q)          # without normalization!! (look_back and dim come from my setup)

bq = torch.cosh(torch.log(q ** 2 + 1) ** 0.5)
bk = torch.cosh(torch.log(k ** 2 + 1) ** 0.5)
res = torch.einsum('bhie,bhje->bhij', bq, bk) * (dim ** -0.5)

and some of the values here (res) came out as NaN.

Do you have any guess as to why I get the NaN values? Or, in general, when do we get NaN here?

Q2) Should I assume that this NaN value is produced because the res values are either too small or too large? If it is because they are too SMALL, should I substitute -inf instead?

Q2-1) If it is because they are too LARGE, should I substitute 1 at the level of the softmax result?
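
One way to narrow this down (a generic PyTorch diagnostic sketch, not something from this repo) is to check each intermediate tensor for NaN/inf and look at its value range, so you can tell whether the problem already starts in bq/bk or only appears in res:

import torch

def report(name, t):
    # print whether a tensor contains NaN or inf, and its value range
    print(f'{name}: nan={torch.isnan(t).any().item()} '
          f'inf={torch.isinf(t).any().item()} '
          f'min={t.min().item():.3e} max={t.max().item():.3e}')

# e.g. call it on the intermediates from the snippet above:
# report('bq', bq); report('bk', bk); report('res', res)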

Sorry for the dumb question, Lucid!

lucidrains commented 4 years ago

@muiPomeranian Ahh, I'm not entirely sure, but I believe it has to do with your torch.log. As a rule of thumb, whenever you pass a tensor to log, make sure it is strictly positive and never 0 (most code just adds a small epsilon).
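
A minimal sketch of that rule of thumb (the epsilon value and the clamp are arbitrary choices for illustration, not code from this repo): log of 0 gives -inf and log of a negative number gives NaN, and either will poison everything downstream, so the argument is usually clamped away from zero first.

import torch

eps = 1e-9  # small constant; the exact value is an arbitrary choice here

x = torch.tensor([-1.0, 0.0, 2.0])
print(torch.log(x))                 # tensor([   nan,   -inf, 0.6931])
print(torch.log(x.clamp(min=eps)))  # finite everywhere: tensor([-20.7233, -20.7233, 0.6931])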