microsoft / tensorflow-directml-plugin

DirectML PluggableDevice plugin for TensorFlow 2

Performance issues on Nvidia GPUs with mixed precision and accuracy issues #350

Open aliencaocao opened 1 year ago

aliencaocao commented 1 year ago

Discussed in https://github.com/microsoft/tensorflow-directml-plugin/discussions/315

Originally posted by **aliencaocao** October 19, 2022

I ran a simple benchmark of ResNetRS50 on an RTX 3080 Ti, comparing DirectML plugin 0.1.1.dev221004 against CUDA 11.8 + cuDNN 8.6.0, and found that DML is very slow compared to CUDA and uses only about 50% of the GPU while training, whereas CUDA constantly uses 100%. Both tests were run with mixed precision off and a batch size of 64. Training 10 epochs on DML took 416 seconds, while on CUDA it took only 164 seconds. Both used TF 2.10 (the CPU build for DML) and Python 3.9.13.

This raises the big performance question: is DML optimized for Nvidia GPUs at all, especially their Tensor Cores and the TensorFloat-32 datatype? And what could cause it to not use 100% of my GPU? I tried increasing the batch size, but it just OOMs, so 64 is definitely large enough to fully utilize the GPU (as shown by the 100% usage on CUDA). Or is this something that will be optimized in the future, just not yet?
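
For context, here is a minimal sketch of the kind of benchmark described above. The exact training script is not included in the issue; the synthetic dataset, 224x224 input size, and class count are assumptions for illustration only.

```python
# Minimal benchmark sketch (assumed setup, not the original script):
# ResNetRS50, batch size 64, 10 epochs, mixed precision off.
import time
import tensorflow as tf

BATCH_SIZE = 64
EPOCHS = 10
IMAGE_SIZE = 224   # assumed input resolution
NUM_CLASSES = 10   # assumed number of classes

# Synthetic in-memory data so the timing reflects device throughput, not input I/O.
images = tf.random.uniform((BATCH_SIZE * 8, IMAGE_SIZE, IMAGE_SIZE, 3))
labels = tf.random.uniform((BATCH_SIZE * 8,), maxval=NUM_CLASSES, dtype=tf.int32)
dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
           .batch(BATCH_SIZE)
           .prefetch(tf.data.AUTOTUNE))

# Same model on both backends; the plugin (or CUDA) picks up the GPU automatically.
model = tf.keras.applications.ResNetRS50(
    weights=None,
    input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3),
    classes=NUM_CLASSES,
)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

start = time.perf_counter()
model.fit(dataset, epochs=EPOCHS)
print(f"Training took {time.perf_counter() - start:.1f}s")
```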

UPDATE TL;DR: the performance issues described above have been partially resolved in 0.2.0, but the fix introduced a model accuracy loss that has yet to be resolved. See https://github.com/microsoft/tensorflow-directml-plugin/discussions/315#discussioncomment-3930093. This makes the plugin not worth switching to on Nvidia Ampere GPUs (and potentially other Nvidia GPUs). Mixed precision is now able to run, but with poor performance as of now (on 0.1.1 it was unable to run at all).
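
For reference, the mixed precision runs mentioned in the update can be reproduced with the standard Keras policy API; the snippet below is a sketch of that setup, not code taken from the issue.

```python
# Enable mixed precision via the standard Keras global policy (assumed setup).
import tensorflow as tf

# float16 compute with float32 variables. On CUDA this engages Tensor Cores;
# on the DirectML plugin it runs as of 0.2.0 but, per the report above, with
# poor performance and an accuracy regression.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Keras applies loss scaling automatically when a model is compiled under
# this policy, so the training code itself stays unchanged.
```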