microsoft / tensorflow-directml-plugin

DirectML PluggableDevice plugin for TensorFlow 2
Apache License 2.0

Training VGG13 net with RX6600 is slow #326

Open thinksmert opened 1 year ago

thinksmert commented 1 year ago

My environment: Windows 11 64-bit, Python 3.9 64-bit, TensorFlow 2.10, tensorflow-directml-plugin 0.2.0.dev221020, AMD Radeon RX 6600, Nvidia GTX 1060, Conda 22.9.0.

I'm training a VGG13 network in a Miniconda environment. I have two configurations:

  1. Nvidia GTX 1060 + tensorflow-gpu
  2. RX 6600 (more powerful than the GTX 1060) + tensorflow-cpu + tensorflow-directml-plugin

The first configuration is very fast, about 6 s per training period. The second configuration is much slower, about 30 s per training period. Is the second configuration slower because it uses tensorflow-cpu instead of tensorflow-gpu? Is there any way to improve the training speed with the second configuration? And when will tensorflow-directml-plugin support tensorflow-gpu?

Thanks

PatriceVignola commented 1 year ago

Hi @thinksmert,

  1. Could you try running the model on your GTX 1060 with tensorflow-cpu + tensorflow-directml-plugin and give us the numbers?
  2. tensorflow-cpu + tensorflow-directml-plugin is supposed to use your GPU, so it's not clear why it is much slower. It's possible that some operators are falling back to the CPU, which can considerably slow down the execution.

Can you send us the device placement logs? Just add the following snippet at the start of your script:

import tensorflow as tf
tf.debugging.set_log_device_placement(True)

and then, redirect the output to a file. For example:

python script.py > log.txt
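
Once log.txt exists, one quick way to look for CPU fallback is to tally how many ops landed on each device type. The sketch below is not part of the plugin. It assumes the eager-mode placement lines look like `Executing op <OpName> in device <device string>`; that format, the helper name `count_ops_per_device`, and the sample lines are assumptions made for illustration, not content from the attached logs.

```python
import re
from collections import Counter

# Assumed eager-mode log line format, e.g.:
#   Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
PLACEMENT_RE = re.compile(r"Executing op (\S+) in device (\S+)")

def count_ops_per_device(lines):
    """Return {device_type: Counter(op_name -> count)} for matching lines."""
    per_device = {}
    for line in lines:
        match = PLACEMENT_RE.search(line)
        if match is None:
            continue  # skip output that is not a placement line
        op_name, device = match.groups()
        # Keep only the device type, e.g. "GPU" from ".../device:GPU:0".
        device_type = device.split("device:")[-1].split(":")[0]
        per_device.setdefault(device_type, Counter())[op_name] += 1
    return per_device

# Illustrative sample lines (hypothetical, not taken from this issue).
sample = [
    "Executing op Conv2D in device /job:localhost/replica:0/task:0/device:GPU:0",
    "Executing op Conv2D in device /job:localhost/replica:0/task:0/device:GPU:0",
    "Executing op Cast in device /job:localhost/replica:0/task:0/device:CPU:0",
]
for device_type, ops in count_ops_per_device(sample).items():
    print(device_type, sum(ops.values()), dict(ops))
```

To run it on a real log, pass an open file object instead of `sample`. A large share of compute-heavy ops under `CPU` would be consistent with the fallback described above.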

thinksmert commented 1 year ago

Hi, OK, I will try it later.

thinksmert commented 1 year ago

I also wonder whether the plugin will support tensorflow-gpu.

thinksmert commented 1 year ago

Hi, I have added your snippet to my script. This is the log file from running my script for about 30 s with the second configuration, for your reference. Thanks. log.txt

PatriceVignola commented 1 year ago

@thinksmert Thanks! Can you do the same thing with the tensorflow-gpu package (and without tensorflow-directml-plugin) on your Nvidia card? This will help us compare what is supposed to happen versus what is actually happening.

thinksmert commented 1 year ago

Hi, OK, I will try to do that, maybe in a few days, because I need to take out my RX 6600 and install the Nvidia card again. It will take some time.

thinksmert commented 1 year ago

Hi, I ran two tests with my Nvidia card. The first used tensorflow-gpu and took about 7 s per training period. The second used tensorflow-cpu and tensorflow-directml-plugin. It took longer (about 20 s per training period) but was still faster than the RX 6600 with tensorflow-cpu. Here are the logs: the first is from the GTX 1060 with tensorflow-gpu, and the second is from the GTX 1060 with tensorflow-cpu and tensorflow-directml-plugin.

For your reference. Thanks

log_gpu_GTX1060.txt log_cpu_GTX1060.txt
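
For reference, the timings reported so far can be put side by side. The snippet below only does arithmetic on the approximate numbers quoted in this thread, using the 7 s tensorflow-gpu run as the baseline; the dictionary layout and the function name `slowdown_vs` are made up for illustration.

```python
# Approximate seconds per training period, as reported in this thread.
timings_s = {
    "GTX 1060, tensorflow-gpu": 7.0,
    "GTX 1060, tensorflow-cpu + directml plugin": 20.0,
    "RX 6600, tensorflow-cpu + directml plugin": 30.0,
}

def slowdown_vs(baseline_key, timings):
    """Return each timing divided by the baseline timing."""
    base = timings[baseline_key]
    return {name: seconds / base for name, seconds in timings.items()}

ratios = slowdown_vs("GTX 1060, tensorflow-gpu", timings_s)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
# GTX 1060, tensorflow-gpu: 1.0x
# GTX 1060, tensorflow-cpu + directml plugin: 2.9x
# RX 6600, tensorflow-cpu + directml plugin: 4.3x
```

So on the same GTX 1060 the DirectML path is roughly 3x slower than CUDA, and the RX 6600 run is about 1.5x slower again than the GTX 1060 under DirectML.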

thinksmert commented 1 year ago

Hi, are there any ideas?

PatriceVignola commented 1 year ago

The logs are identical between DML and CUDA, so it's hard to say just from that. Can I ask where you got that VGG13 script from? Running the exact same script would help us investigate this on our end.
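
One way to check whether two placement logs agree is to reduce each log to a set of (op, device type) pairs and compare the sets. This is a minimal sketch, assuming eager-mode lines of the form `Executing op <OpName> in device .../device:<TYPE>:<N>`; the function names `placements` and `diff_placements` are hypothetical, and the sample lines are illustrative rather than taken from the attached logs.

```python
import re

# Assumed line format, e.g.:
#   Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
PLACEMENT_RE = re.compile(r"Executing op (\S+) in device \S*device:(\w+):\d+")

def placements(lines):
    """Collect the set of (op_name, device_type) pairs found in a log."""
    return {m.groups() for m in map(PLACEMENT_RE.search, lines) if m}

def diff_placements(lines_a, lines_b):
    """Return (pairs only in log A, pairs only in log B), each sorted."""
    a, b = placements(lines_a), placements(lines_b)
    return sorted(a - b), sorted(b - a)

# Illustrative sample lines (hypothetical).
log_a = ["Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0"]
log_b = ["Executing op MatMul in device /job:localhost/replica:0/task:0/device:CPU:0"]
only_a, only_b = diff_placements(log_a, log_b)
print("only in A:", only_a)
print("only in B:", only_b)

# With real files (names as attached in this thread):
# with open("log_gpu_GTX1060.txt") as fa, open("log_cpu_GTX1060.txt") as fb:
#     only_a, only_b = diff_placements(fa, fb)
```

Two empty lists mean every op was placed on the same device type in both runs, in which case the slowdown has to come from somewhere other than placement.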

thinksmert commented 1 year ago

Hi, this script is just an exercise from an online tutorial I am following to study ML. I coded it by following the tutorial step by step. All the logs I gave you come from running the same script. Thanks