I noticed that it is not possible to use `loss.backward(retain_graph=True)` with any of the RNN-T losses (which is useful when training with multiple optimizers). It fails because of the in-place multiplication on the gradients, saying:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [2, 8, 17]] is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
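For reference, here is a minimal, self-contained reproduction of the failure mode. `_ToyRNNTLoss` is a made-up stand-in (names and shapes are illustrative, not this repo's API): like the real losses, it computes the gradients during the forward pass, saves them, and scales them in-place in `backward`, which bumps the saved tensor's version counter and breaks the second backward pass:

```python
import torch

class _ToyRNNTLoss(torch.autograd.Function):
    """Illustrative stand-in for an RNN-T loss autograd Function."""

    @staticmethod
    def forward(ctx, logits):
        grads = torch.softmax(logits, dim=-1)  # placeholder for the real gradients
        ctx.save_for_backward(grads)
        return logits.sum()  # placeholder for the real loss value

    @staticmethod
    def backward(ctx, grad_output):
        (grads,) = ctx.saved_tensors
        # In-place scaling mutates the saved tensor, bumping its version to 1.
        return grads.mul_(grad_output)

logits = torch.randn(2, 8, 17, requires_grad=True)
loss = _ToyRNNTLoss.apply(logits)
loss.backward(retain_graph=True)  # ok: the saved tensor is still at version 0
loss.backward()                   # RuntimeError: saved tensor is now at version 1
```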
This PR fixes the issue.
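In terms of the toy Function above, the fix amounts to scaling out-of-place so the saved tensor is never mutated (a sketch only; the actual change in this PR may differ in detail, e.g. it could clone the saved tensor instead):

```python
    @staticmethod
    def backward(ctx, grad_output):
        (grads,) = ctx.saved_tensors
        # Out-of-place multiply leaves the saved tensor at version 0, so the
        # retained graph can be backpropagated through a second time.
        return grads * grad_output
```

With this change, calling `loss.backward(retain_graph=True)` followed by another `backward()` accumulates gradients twice instead of raising the `RuntimeError`.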