linnealovespie opened 1 year ago
Just updating some hardware and driver specs for my machine at the time:
- CPU: AMD 7950X (non-3D)
- GPU: AMD Radeon 6900XT, driver 22.11.2
- OS: 1) Windows 22H2 (I forget the specific build), 2) WSL Ubuntu 20.04 LTS
UPDATE:
With the newest version of torch_dml==0.1.13.1.dev230301, the issue appears to be present in both WSL and Windows.
Upon further testing, it seems that the gradients of the parameters are never populated even after loss.backward(); param.grad returns None. The grads are correctly populated when using torch.device('cpu') or CUDA (tested on cloud machines).
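A minimal sketch of the check being described (device name and model are hypothetical stand-ins; the report used the torch-directml device, replaced here with CPU so the snippet is self-contained):

```python
import torch
import torch.nn as nn

# Stand-in for the DirectML device from the report
# (there it would be torch_directml.device()).
device = torch.device("cpu")

model = nn.Linear(4, 2).to(device)
x = torch.randn(8, 4, device=device)
loss = model(x).sum()

# Before backward(), grads are always None.
assert all(p.grad is None for p in model.parameters())

loss.backward()

# On CPU/CUDA this passes; per the report, on torch-directml the
# grads stayed None at this point.
assert all(p.grad is not None for p in model.parameters())
```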
Hi @ianlamfar,
We could not reproduce the issue with our resnet sample on a Radeon RX Vega. The validation accuracy looks good; it reaches 80% after 20 epochs.
> it seems that the gradients of the parameters were never populated even after loss.backward(). It will return None for param.grad
It would help us diagnose the issue if you could share your source code; I'm curious how the optimizer is defined, specifically.
Hi @zhangxiang1993,
Thanks for getting back to me.
I have included an ipynb demonstrating the None grads problem.
With your hint about the optimizer definition, I realised that defining the optimizer outside train_part() caused this problem. When I defined the optimizer inside train_part(), all problems were solved.
However, when this exact code is executed on other devices (CUDA and CPU), this problem does not exist. It might be worth investigating why optimizer definition behaves so differently on DirectML.
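A hedged sketch of the working pattern described above: create the optimizer *inside* train_part(), after the model has been moved to the target device, so the optimizer holds references to the device tensors rather than stale copies. train_part and the hyperparameters are hypothetical names, and CPU stands in for the DirectML device so the snippet runs anywhere:

```python
import torch
import torch.nn as nn

def train_part(model, data, device, epochs=1, lr=1e-2):
    model = model.to(device)
    # Defining the optimizer here, after .to(device), is what resolved
    # the None-grad problem in the report.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in data:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = nn.functional.mse_loss(model(x), y)
            loss.backward()
            optimizer.step()
    return model

# Stand-in for torch_directml.device() from the report.
model = train_part(nn.Linear(3, 1),
                   [(torch.randn(4, 3), torch.randn(4, 1))],
                   torch.device("cpu"))
```

The design point is that optimizers capture parameter tensor references at construction time; constructing one before the model is moved can leave it pointing at tensors that are no longer the ones the backward pass writes gradients into.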
Many thanks.
I tried reinstalling torch/torch-directml as well as creating a new env from scratch, but nothing worked. The torch-dml version is 0.1.13.1.dev230119.
Originally posted by @ianlamfar in https://github.com/microsoft/DirectML/issues/359#issuecomment-1407713346