raphaelreinauer opened this issue 2 years ago
Should it be this, according to the paper?
min_loss.backward(retain_graph=True)
max_loss.backward()
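For context, here is a rough sketch of the training step I have in mind. The tiny linear model and the loss terms are placeholders chosen only to make the snippet runnable; they are not the repo's actual model or losses:

```python
import torch
import torch.nn as nn

# Rough sketch of one training step using the two backward calls above.
# The model and loss terms are illustrative stand-ins, not the repo's code.
model = nn.Linear(16, 16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 16)   # dummy batch
x_hat = model(x)

# stand-ins for the repo's min_loss / max_loss (each would carry its own detach)
min_loss = (x_hat - x).pow(2).mean()
max_loss = (x_hat - x).abs().mean()

optimizer.zero_grad()
# retain_graph=True keeps the saved buffers alive so the second backward can
# reuse the same graph; both calls accumulate into the parameters' .grad.
min_loss.backward(retain_graph=True)
max_loss.backward()
optimizer.step()
```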
Hi @raphaelreinauer, thanks for the suggestion; the training script is certainly incomplete. Your comment makes sense, and I agree my loss scheme is incorrect.
With regards to the stop-gradient, I believe this is handled within the loss functions, since I detach the S tensors in min_loss and the P tensors in max_loss respectively. Please let me know if this is what you had in mind. https://github.com/spencerbraun/anomaly_transformer_pytorch/blob/6d15200911260eee910a3664d70f07886c47708b/model.py#L128-L138
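Roughly, this is what I mean by handling the stop-gradient via detach inside the two losses (a simplified, self-contained sketch; the mean absolute difference here only stands in for the real association discrepancy):

```python
import torch

def min_loss_term(series: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
    # S is detached: gradients from this term flow only into the prior association P.
    return (series.detach() - prior).abs().mean()

def max_loss_term(series: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
    # P is detached: gradients from this term flow only into the series association S.
    return (series - prior.detach()).abs().mean()
```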
@sappersapper Thank you for the suggestion. I believe that will work, though I might need to think about whether the memory overhead can be avoided.
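One possible way around the overhead of retain_graph: since backward() accumulates into .grad and gradients are linear, a single backward over the summed losses should give the same parameter gradients as the two separate calls (assuming one optimizer step per batch, as in the sketch above):

```python
# Replaces the two backward calls from the earlier sketch with one pass;
# the stop-gradients are already baked into each loss via detach().
optimizer.zero_grad()
(min_loss + max_loss).backward()
optimizer.step()
```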
When I tried the following:
min_loss.backward(retain_graph=True)
max_loss.backward()
it worked well. However, on the second input, I got the following error.
one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [8, 1]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
I think it is mainly because self.P_list and self.S_list in AnomalyAttention are declared in the constructor.
Should they be declared in forward instead?
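As the hint in the error suggests, anomaly detection can be enabled to locate the offending operation (standard PyTorch API):

```python
import torch

# Records forward-pass stack traces so the backward-pass error points at the
# operation that produced the tensor modified in place.
torch.autograd.set_detect_anomaly(True)
```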
@syoungbaak I am facing the same problem (one of the variables needed for gradient computation has been modified by an inplace operation). Did you solve this problem?
It is solved if self.P_list and self.S_list in AnomalyAttention are defined in forward rather than in the initialization.
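Roughly like this (a much simplified sketch of the idea, not the repo's actual AnomalyAttention code):

```python
import torch
import torch.nn as nn

class AnomalyAttentionSketch(nn.Module):
    """Illustration only: the P/S lists are rebuilt on every forward call
    instead of being created once in __init__, so no tensors from a previous
    batch's graph are kept around and modified in place."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        # note: no self.P_list / self.S_list here

    def forward(self, x: torch.Tensor):
        P_list, S_list = [], []              # fresh lists for this batch only
        scores = self.proj(x)                # stand-in for the real attention math
        P_list.append(torch.softmax(scores, dim=-1))
        S_list.append(torch.sigmoid(scores))
        return scores, P_list, S_list
```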
@syoungbaak OK, thank you!
There is an issue with the loss at https://github.com/spencerbraun/anomaly_transformer_pytorch/blob/6d15200911260eee910a3664d70f07886c47708b/train.py#L53: the reconstruction loss cancels out, and we end up with only -2 * the discrepancy loss. Additionally, the stop-gradient described in the paper is not implemented.
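Concretely, assuming the two losses have the form below (my reading of the linked code, not a verbatim copy):

$$
\mathcal{L}_{\min} = \lVert X - \hat{X} \rVert_F^2 - \lambda \cdot \mathrm{Dis},
\qquad
\mathcal{L}_{\max} = \lVert X - \hat{X} \rVert_F^2 + \lambda \cdot \mathrm{Dis}
$$

then combining them as min_loss - max_loss gives

$$
\mathcal{L}_{\min} - \mathcal{L}_{\max} = -2\lambda \cdot \mathrm{Dis},
$$

so the reconstruction term drops out entirely and the model is never trained to reconstruct the input. The paper instead keeps the reconstruction term in both phases of the minimax scheme and applies the stop-gradient (detaching S in one phase and P in the other) so that each phase only updates one side of the association.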