microsoft / DirectML

DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.

resnet fails to converge on DML #404

Open linnealovespie opened 1 year ago

linnealovespie commented 1 year ago
I'm facing similar issues on ResNet. Accuracy is very low and does not improve with epochs on DML, but when I switch to CPU or CUDA the network behaves normally.

I tried reinstalling torch/torch-directml as well as creating a new env from scratch, but nothing worked. The torch-directml version is 0.1.13.1.dev230119.

Originally posted by @ianlamfar in https://github.com/microsoft/DirectML/issues/359#issuecomment-1407713346

ianlamfar commented 1 year ago

Just updating some hardware and driver specs for my machine at the time:
CPU: AMD 7950X (non-3D)
GPU: AMD Radeon 6900XT, driver 22.11.2
OS: 1) Windows 22H2 (I forgot the specific build at the time) 2) WSL Ubuntu 20.04 LTS

UPDATE: With the newest version of torch_dml==0.1.13.1.dev230301, the issue is present in both WSL and Windows. Upon further testing, it seems that the parameters' gradients are never populated, even after loss.backward(): param.grad is None for every parameter. The grads are populated correctly when using torch.device('cpu') or CUDA (tested on cloud machines). A minimal sketch of the check is below.
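For reference, the pattern that shows the behaviour is roughly this (a minimal sketch with a toy model, random data and a placeholder learning rate, not my actual ResNet code):

```python
import torch
import torch.nn as nn
import torch_directml

device = torch_directml.device()

# Toy stand-ins for the real model/optimizer; the pattern matches my setup:
# the optimizer is constructed outside the training function, before training starts.
model = nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model = model.to(device)

x = torch.randn(4, 8, device=device)
y = torch.randint(0, 2, (4,), device=device)

optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

# With this pattern I see None printed for every parameter on DML,
# while the identical script on CPU/CUDA prints gradient tensors.
for name, param in model.named_parameters():
    print(name, param.grad)
```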

zhangxiang1993 commented 1 year ago

Hi @ianlamfar,

We didn't repro the issue with our resnet sample on Radeon RX Vega. The validation accuracy looks good; it reaches 80% after 20 epochs.

it seems that the gradients of the parameters were never populated even after loss.backward(). It will return None for param.grad

It would help us diagnose this if you could share your source code; I'm curious how the optimizer is defined, specifically.

ianlamfar commented 1 year ago

Hi @zhangxiang1993,

Thanks for getting back to me.

I have included an ipynb demonstrating the None grads problem.

With your hint about the optimiser definition, I realised that defining the optimizer outside train_part() was what created this problem. When I define the optimizer inside train_part(), everything works.

However, when the exact same code is executed on other devices (CUDA and CPU), this problem does not exist. It might be worth investigating why optimiser definition behaves differently on DML; a sketch of the working layout is below.
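For completeness, the shape of the version that works for me is roughly this (again a sketch; train_part's signature, the loader and the hyperparameters are placeholders rather than my real code):

```python
import torch
import torch.nn as nn
import torch_directml

device = torch_directml.device()

def train_part(model, loader, epochs=20, lr=1e-3):
    """Workaround: build the optimizer inside the training function,
    after the model has been moved to the DML device."""
    model = model.to(device)
    # Optimizer is created here, from the parameters of the model that is already on DML.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = nn.functional.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()
    return model
```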

Many thanks.