Closed: yanboliang closed this issue 1 year ago.
cc @bdhirsh @anijain2305
I talked to Yanbo offline - the above repro actually fails in eager mode too. In order to have the error show up in eager though, you have to actually run out.sum().backward(), which gives:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4, 2]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
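For reference, here is a minimal sketch of the kind of pattern that produces this exact error in eager mode. This is not the original repro - the module, shapes, and in-place op are assumptions chosen to match the [4, 2] tensor and ReluBackward0 mentioned in the error message:

```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(2, 2)

    def forward(self, x):
        # relu saves its output for the backward pass (ReluBackward0).
        y = torch.relu(self.linear(x))
        # The in-place op bumps the tensor's version counter from 0 to 1,
        # invalidating the saved value ReluBackward0 needs for its gradient.
        y.mul_(2)
        return y

device = "cuda" if torch.cuda.is_available() else "cpu"
m = M().to(device)
x = torch.randn(4, 2, device=device)

out = m(x)            # the forward pass runs fine in eager mode
out.sum().backward()  # RuntimeError: ... output 0 of ReluBackward0 is at version 1; expected version 0
```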
What's slightly different when we use the compile stack is that we eagerly trace the forward and backward into a joint graph when the user calls their module's forward(), and tracing through the backward graph causes the error to show up at that point.
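Roughly, the timing difference looks like this (a hedged sketch, not the original repro; the function below is an assumption, and the exact compiled behavior depends on the PyTorch version, since AOTAutograd's functionalization can rewrite in-place ops):

```python
import torch

def fn(x):
    y = torch.relu(x)
    y.mul_(2)  # in-place mutation of a tensor saved for backward
    return y

x = torch.randn(4, 2, requires_grad=True)

# Eager mode: the forward call succeeds; the version-counter check only
# fires once backward() actually runs.
out = fn(x)
# out.sum().backward()  # would raise the RuntimeError shown above

# Compile stack: AOTAutograd traces the forward and backward into a joint
# graph at the first call, which is why, in this report, the error surfaced
# here before the user ever called backward().
compiled_fn = torch.compile(fn)
out = compiled_fn(x)
```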
Thanks @bdhirsh for pointing out the real issue behind this. I checked several other similar failures that occurred in the 7k GitHub models benchmark, and all of them are caused by the reason @bdhirsh mentioned above. This is not a real error, so I'll close this.
🐛 Describe the bug
This may be the same bug as pytorch/pytorch#93440, but we have a minimized repro here. More evidence:
Minimized repro:
Error logs
Minified repro
No response