Open shenh10 opened 5 months ago
hmm @bhavya01 Can you take a look at this issue since you are off-call this week? I am not sure why torch._foreach_add_(device_params, device_grads, alpha=-lr) would trigger the execution of the graph instead of recording the operation in the graph. My best guess is that we didn't lower _foreach_add_.
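For context, a minimal sketch of the expected lazy-tensor behavior (this is an illustration, not the reporter's script; the setup and shapes are assumptions): with torch_xla, ops on XLA tensors should only be recorded into the pending graph and executed at the next mark_step, so an op that forces execution earlier suggests a missing lowering.

```python
# Minimal sketch of the lazy-tensor expectation (assumed setup, not the
# reporter's script). Ops on XLA tensors are normally recorded into a
# pending graph and only executed at xm.mark_step().
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
device_params = [torch.randn(4, 4, device=device) for _ in range(2)]
device_grads = [torch.randn(4, 4, device=device) for _ in range(2)]
lr = 0.1

# If _foreach_add_ is lowered, this line should only record the op in the
# graph; if it is not lowered, it can force an early graph execution, which
# is the behavior being asked about.
torch._foreach_add_(device_params, device_grads, alpha=-lr)

xm.mark_step()  # execution is normally expected to happen here, not above
```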
@shenh10 Can you help reproduce the issue with a simpler demo? I tried running a simpler script but didn't see this happening.
This gist contains the script that I used and the output. https://gist.github.com/bhavya01/0346b0d47931ba60751dbe79b01268a0
@shenh10 Gentle ping if you can help reproduce the issue.
I apologize for the delay in my response as I have not been able to dedicate time to this task recently. I will provide you with an update by the end of next week.
🐛 Bug
The torch/benchmarks/dynamo testing suite sets SGD as the optimizer and sets the foreach flag to True, to leverage the fast foreach CUDA implementation. However, calling multi_tensor_sgd seems to trigger lazy tensor execution with the openxla backend. From the XLA debug info, it appears multi_tensor_sgd is called and early results are expected from torch._foreach_add_(device_params, device_grads, alpha=-lr).

To Reproduce

Use any demo and change the optimizer to SGD as described above (a rough sketch follows below).
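For concreteness, a reproduction could look roughly like this sketch (the model, shapes, and training-step wiring are assumptions; the dynamo benchmark suite plumbs this differently). With no momentum or weight decay, SGD's foreach path reduces to exactly the torch._foreach_add_ call quoted above.

```python
# Hypothetical minimal reproduction (model, shapes, and training-step wiring
# are assumptions made for illustration).
import torch
import torch_xla.core.xla_model as xm  # importing torch_xla registers the openxla backend

device = xm.xla_device()
model = torch.nn.Linear(16, 16).to(device)

# foreach=True selects the multi-tensor (multi_tensor_sgd) update path.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, foreach=True)

@torch.compile(backend="openxla")
def train_step(x):
    optimizer.zero_grad()
    loss = model(x).sum()
    loss.backward()
    # For plain SGD, the foreach path boils down to
    # torch._foreach_add_(device_params, device_grads, alpha=-lr).
    optimizer.step()
    return loss

train_step(torch.randn(8, 16, device=device))
```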
Environment