microsoft / DirectML

DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.
MIT License

Very low validation and testing accuracy on CNN #359

Open AtiqurRahmanAni opened 1 year ago

AtiqurRahmanAni commented 1 year ago

Hello everyone. I am facing an issue, so let me explain what I am trying to do.

I have a traffic and road sign dataset that contains 43 classes, and I am trying to classify the images using the pre-trained resnet34 model. I run the model on my AMD RX 6600 GPU through PyTorch DirectML. Up to this point everything works fine: training speed is fast enough, GPU utilization is near 100%, and training loss decreases every epoch.

But when I check the model against the validation data after a training phase, validation loss increases and validation accuracy is very low, even though training looks fine. When I run the same code on my friend's PC, which has an NVIDIA GPU, everything is fine: validation loss decreases, the model converges, and I get an accuracy of 98%.

I cannot figure out what the problem is. I also tuned the hyperparameters but had no luck. One strange thing is that this problem arises when I use a CNN-based model. I ran the pre-trained NLP model BERT on my AMD GPU and there was no issue: validation loss decreased and it converged.

Can anyone help me with this issue? I am giving the code below. Thanks in advance.

Screenshot 2023-01-03 221733

ianlamfar commented 1 year ago

I'm facing similar issues with ResNet. On DML, accuracy was very low and did not improve over epochs, but when I switched to CPU or CUDA the network behaved normally. I tried reinstalling torch/torch-directml as well as creating a new env from scratch, but nothing worked. My torch-directml version is 0.1.13.1.dev230119.
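
One way to narrow down a CPU-versus-DML difference like this is to run the same weights on the same batch in eval mode on both devices and compare the logits. The following is only a minimal sketch under stated assumptions: `make_small_cnn`, `pick_device`, and `max_eval_difference` are hypothetical helper names, the tiny network is a stand-in for resnet34 (it includes a BatchNorm layer, as ResNet does), and the input is random data rather than the real dataset. It assumes `torch_directml.device()` from the torch-directml package and falls back to CPU when that package is not installed.

```python
import torch
import torch.nn as nn


def pick_device():
    """Return the DirectML device if torch-directml is installed, else CPU."""
    try:
        import torch_directml  # assumption: provided by the torch-directml package
        return torch_directml.device()
    except ImportError:
        return torch.device("cpu")


def make_small_cnn(num_classes=43):
    """Hypothetical stand-in for resnet34: conv + BatchNorm + linear head."""
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.BatchNorm2d(8),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(8, num_classes),
    )


def max_eval_difference(model, batch, device):
    """Largest absolute gap between CPU logits and `device` logits in eval mode."""
    model = model.eval()  # use BatchNorm running statistics, no dropout
    with torch.no_grad():
        cpu_out = model.to("cpu")(batch.to("cpu"))
        dev_out = model.to(device)(batch.to(device)).to("cpu")
    model.to("cpu")  # leave the model where it started
    return (cpu_out - dev_out).abs().max().item()


torch.manual_seed(0)
model = make_small_cnn()
batch = torch.randn(4, 3, 32, 32)  # random images, not the real dataset
diff = max_eval_difference(model, batch, pick_device())
print(f"max |cpu - device| logit difference: {diff:.6f}")
```

If the gap is around floating-point noise, the eval-mode forward pass agrees across devices and the difference more likely comes from training. A large gap would point at the forward pass on the DML device. Running the same comparison with a trained checkpoint and a real validation batch would be the more telling test.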

linnealovespie commented 1 year ago

Hi @ianlamfar, I've created a separate issue for your comment. Could you comment in #404 what GPU you're using?

ianlamfar commented 1 year ago

#404