Low performance on RX 580 with TF benchmarks

microsoft / tensorflow-directml

Fork of TensorFlow accelerated by DirectML

Apache License 2.0


Low performance on RX 580 with TF benchmarks #42

Open MatPoliquin opened 4 years ago

MatPoliquin commented 4 years ago

I get low performance on TF benchmarks with my RX 580: https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks

using their example command: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server

I get this error and performance result: 2020-06-19 16:01:17.369204: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:533] remapper failed: Not found: Op type not registered '_CopyFromGpuToHost'

Step    Img/sec total_loss
1       images/sec: 4.8 +/- 0.0 (jitter = 0.0)  8.169
10      images/sec: 4.7 +/- 0.0 (jitter = 0.1)  7.593
20      images/sec: 4.8 +/- 0.0 (jitter = 0.2)  7.696
30      images/sec: 4.8 +/- 0.0 (jitter = 0.2)  7.753
40      images/sec: 4.8 +/- 0.0 (jitter = 0.2)  8.007
50      images/sec: 4.8 +/- 0.0 (jitter = 0.1)  7.520
60      images/sec: 4.8 +/- 0.0 (jitter = 0.2)  7.989
70      images/sec: 4.8 +/- 0.0 (jitter = 0.1)  8.028
80      images/sec: 4.9 +/- 0.0 (jitter = 0.1)  7.932
90      images/sec: 4.9 +/- 0.0 (jitter = 0.1)  7.850
100     images/sec: 4.9 +/- 0.0 (jitter = 0.1)  7.798

total images/sec: 4.90

Note:

GPU and VRAM usage are at 100%, so it's not using the CPU
I get around 88 images/sec on the latest version of ROCm (Ubuntu 20.04) with this computer

Info:

RX 580 8GB driver 26.20.12028.2
Dual Intel Xeon 2680 v2
64 GB ram
Windows 10 2004
OSbuild 19041.329
python 3.7

sunshinejnjn commented 4 years ago

GTX1080Ti got total 32.48 images/sec.

E tensorflow/core/grappler/optimizers/meta_optimizer.cc:533] remapper failed: Not found: Op type not registered '_CopyFromGpuToHost' in binary running on DT. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.

Done warm up

Step    Img/sec total_loss
1       images/sec: 35.7 +/- 0.0 (jitter = 0.0) 8.169
10      images/sec: 32.2 +/- 1.1 (jitter = 4.0) 7.593
20      images/sec: 32.7 +/- 0.7 (jitter = 3.3) 7.696
30      images/sec: 32.3 +/- 0.7 (jitter = 3.7) 7.753
40      images/sec: 32.3 +/- 0.6 (jitter = 4.5) 8.007
50      images/sec: 32.7 +/- 0.5 (jitter = 4.2) 7.520
60      images/sec: 32.8 +/- 0.5 (jitter = 3.9) 7.988
70      images/sec: 32.5 +/- 0.5 (jitter = 3.9) 8.028
80      images/sec: 32.6 +/- 0.4 (jitter = 3.7) 7.932
90      images/sec: 32.4 +/- 0.4 (jitter = 4.0) 7.850
100     images/sec: 32.5 +/- 0.4 (jitter = 3.9) 7.795

total images/sec: 32.48

Tensorflow-GPU 1.15.3 (official) with cuda got 174.89 images/sec.

The system is an AMD R7 1700X with 64GB RAM. Windows 10 20H1. Also, half of the 1st screen flashes a little bit for a few times throughout the benchmark.

PatriceVignola commented 4 years ago

Thank you for reporting your benchmark results! This is a preview and we only have a limited set of operators implemented at the moment, so results like this are not totally unexpected. As operator support gets closer to what CUDA/ROCm supports, we expect performance to get better and we'll be able to focus on it a lot more. We'll definitely look into this benchmark though and see where the bottlenecks are.

ashaver commented 4 years ago

First, absolutely thank you! Being able to do this from any OS that supports DirectX12, that is amazing. Second, if I can help, let me know.

Summary:

Seeing a comparison right now of about 128 images/sec on ROCm versus 21 images/sec on DirectML.
Also, the jitter on ROCm is about an order of magnitude larger (roughly 2.5 versus 0.2). (Maybe that helps make a fair comparison.)
The optimal batch size for my hardware for ROCm is 32 and DirectML is 16. Using a batch of 32 on DirectML was 3x slower.
The stack for DirectML (Windows/AMD Driver/DirectML) is so much more stable than the ROCm stack (Linux/AMD Driver/ROCm). ROCm is sometimes unable even to set clock frequencies (and has never been able to control fans); see the post I made in the ROCm speed comparison. I cannot tell you how much I appreciate a stable stack. Mean time to failure using ROCm was around an hour, which precludes any significant work (especially when the only recovery from failure is to reboot). I have not had any issues with DirectML.

Hardware: Stock laptop Acer Predator Helios 500 PH517-61-R0GX Gaming Laptop, AMD Ryzen 7 2700 Desktop Processor, AMD Radeon RX Vega 56

DirectML Results (python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50)

Step    Img/sec total_loss
1       images/sec: 20.3 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 20.6 +/- 0.1 (jitter = 0.3) 7.854
20      images/sec: 20.6 +/- 0.1 (jitter = 0.2) 7.726
30      images/sec: 20.5 +/- 0.1 (jitter = 0.2) 7.360
40      images/sec: 20.6 +/- 0.0 (jitter = 0.3) 7.526
50      images/sec: 20.6 +/- 0.0 (jitter = 0.2) 8.171
60      images/sec: 20.6 +/- 0.0 (jitter = 0.2) 7.999
70      images/sec: 20.6 +/- 0.0 (jitter = 0.2) 7.978
80      images/sec: 20.6 +/- 0.0 (jitter = 0.2) 7.884
90      images/sec: 20.6 +/- 0.0 (jitter = 0.2) 7.924
100     images/sec: 20.6 +/- 0.0 (jitter = 0.2) 7.848
----------------------------------------------------------------
total images/sec: 20.65
----------------------------------------------------------------

Results with --enable_optimizations=0 (python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --enable_optimizations=0):

Step    Img/sec total_loss
1       images/sec: 30.1 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 30.1 +/- 0.1 (jitter = 0.3) 7.854
20      images/sec: 30.1 +/- 0.1 (jitter = 0.1) 7.726
30      images/sec: 30.2 +/- 0.1 (jitter = 0.2) 7.360
40      images/sec: 30.1 +/- 0.0 (jitter = 0.2) 7.527
50      images/sec: 30.1 +/- 0.0 (jitter = 0.2) 8.171
60      images/sec: 30.0 +/- 0.0 (jitter = 0.2) 7.999
70      images/sec: 30.0 +/- 0.0 (jitter = 0.2) 7.978
80      images/sec: 30.0 +/- 0.0 (jitter = 0.2) 7.884
90      images/sec: 30.0 +/- 0.0 (jitter = 0.2) 7.925
100     images/sec: 30.1 +/- 0.0 (jitter = 0.2) 7.848
----------------------------------------------------------------
total images/sec: 27.27
----------------------------------------------------------------

ROCm Results (python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50):

Step    Img/sec total_loss
1   images/sec: 131.4 +/- 0.0 (jitter = 0.0)    8.458
10  images/sec: 130.0 +/- 0.9 (jitter = 2.9)    7.997
20  images/sec: 129.1 +/- 0.6 (jitter = 2.2)    8.260
30  images/sec: 128.6 +/- 0.5 (jitter = 2.0)    8.338
40  images/sec: 128.4 +/- 0.4 (jitter = 2.3)    8.190
50  images/sec: 128.0 +/- 0.4 (jitter = 2.7)    7.742
60  images/sec: 128.2 +/- 0.4 (jitter = 2.4)    8.061
70  images/sec: 128.3 +/- 0.3 (jitter = 2.4)    inf
80  images/sec: 128.3 +/- 0.3 (jitter = 2.5)    inf
90  images/sec: 128.2 +/- 0.3 (jitter = 2.5)    inf
100 images/sec: 128.2 +/- 0.3 (jitter = 2.5)    inf
----------------------------------------------------------------
total images/sec: 128.13
----------------------------------------------------------------
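
For anyone comparing runs like the ones above, a small parser can pull the numbers out of the console output instead of copying them by hand. This is only a minimal sketch, assuming the log lines have exactly the shape pasted in this thread; `parse_benchmark_log` and the dictionary keys are names made up for illustration and are not part of tf_cnn_benchmarks.

```python
import math
import re

# Per-step lines look like:
#   "10      images/sec: 20.6 +/- 0.1 (jitter = 0.3) 7.854"
STEP_RE = re.compile(
    r"^\s*(\d+)\s+images/sec:\s*([\d.]+)\s*\+/-\s*([\d.]+)\s*"
    r"\(jitter\s*=\s*([\d.]+)\)\s*(\S+)\s*$"
)
# The summary line looks like: "total images/sec: 20.65"
TOTAL_RE = re.compile(r"^\s*total images/sec:\s*([\d.]+)\s*$")


def parse_benchmark_log(text):
    """Return (steps, total) parsed from pasted benchmark console output."""
    steps, total = [], None
    for line in text.splitlines():
        match = STEP_RE.match(line)
        if match:
            steps.append({
                "step": int(match.group(1)),
                "images_per_sec": float(match.group(2)),
                "uncertainty": float(match.group(3)),
                "jitter": float(match.group(4)),
                # float() also accepts "inf", which shows up in the ROCm run.
                "total_loss": float(match.group(5)),
            })
            continue
        match = TOTAL_RE.match(line)
        if match:
            total = float(match.group(1))
    return steps, total


# Two rows and the summary copied from the ROCm results above.
SAMPLE = """\
Step    Img/sec total_loss
1   images/sec: 131.4 +/- 0.0 (jitter = 0.0)    8.458
70  images/sec: 128.3 +/- 0.3 (jitter = 2.4)    inf
----------------------------------------------------------------
total images/sec: 128.13
----------------------------------------------------------------
"""

steps, total = parse_benchmark_log(SAMPLE)
diverged = [s["step"] for s in steps if math.isinf(s["total_loss"])]
print(len(steps), total, diverged)
```

Flagging the steps whose loss is `inf` is worth doing before comparing throughput, since a run whose loss has diverged may not be doing comparable work.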

adtsai commented 4 years ago

It's great to hear that the DirectML stack is working well for you! These results are interesting, and it's good to hear that it's behaving in a stable manner because stability and correctness is something we invest a lot of time on.

As @PatriceVignola mentioned this is a super early preview and we're still working hard on it, so you can definitely expect the performance to improve as time goes on. For example I suspect one of the reasons why --batch_size 32 is so much slower on DML is because we haven't optimized our memory allocator yet, which means that at high batch sizes we end up using more VRAM than necessary in some circumstances, which leads to a performance cliff. But rest assured we're working on it. :)

PatriceVignola commented 4 years ago

@MatPoliquin , @sunshinejnjn , @ashaver , we just uploaded a new package that improves the performance of TensorFlow DirectML devices across the board. The package (1.15.3.dev200626) is now on PyPI and can be installed with

pip install tensorflow-directml

if it's your first time installing it or

pip install tensorflow-directml --upgrade

if you installed the previous 1.15.3.dev200619 release.
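
Before re-running the benchmark it can help to confirm which build pip actually installed. The snippet below is a rough sketch, not an official tool: `installed_version` and `dev_build_tag` are hypothetical helper names, and it assumes the package is registered under the name `tensorflow-directml` used in the pip commands above.

```python
import re

try:
    from importlib import metadata as importlib_metadata  # Python 3.8+
except ImportError:
    import importlib_metadata  # backport package, needed on Python 3.7


def installed_version(package="tensorflow-directml"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return importlib_metadata.version(package)
    except importlib_metadata.PackageNotFoundError:
        return None


def dev_build_tag(version):
    """Extract the trailing dev tag, e.g. '200626' from '1.15.3.dev200626'."""
    match = re.search(r"\.dev(\d+)$", version)
    return match.group(1) if match else None


version = installed_version()
if version is None:
    print("tensorflow-directml is not installed in this environment")
else:
    print("installed:", version, "dev tag:", dev_build_tag(version))
```

If the dev tag printed is still 200619 after upgrading, the new package did not replace the old one in the environment the benchmark runs in.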

On a Radeon RX Vega, we see a ~63% performance increase for batch_size=16 and a ~47% performance increase for batch_size=32. These improvements are not limited to AMD cards though, so we are expecting similar improvements for Nvidia and Intel graphics.

We realize that there is still a lot of room for improvement to catch up with ROCm and CUDA, but we aim to release packages regularly and keep the community updated on our progress. All feedback and data that we receive is very helpful as we work on closing the performance and functionality gap.

Here are the full results for a Radeon RX Vega with a batch size of 16:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 36.6 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 35.7 +/- 0.2 (jitter = 0.0) 7.854
20      images/sec: 35.6 +/- 0.1 (jitter = 0.0) 7.726
30      images/sec: 35.6 +/- 0.1 (jitter = 0.0) 7.360
40      images/sec: 35.5 +/- 0.1 (jitter = 0.0) 7.526
50      images/sec: 35.6 +/- 0.1 (jitter = 0.0) 8.171
60      images/sec: 35.5 +/- 0.1 (jitter = 0.0) 7.999
70      images/sec: 35.5 +/- 0.1 (jitter = 0.0) 7.978
80      images/sec: 35.5 +/- 0.1 (jitter = 0.0) 7.884
90      images/sec: 35.5 +/- 0.1 (jitter = 1.7) 7.924
100     images/sec: 35.5 +/- 0.1 (jitter = 1.7) 7.848
----------------------------------------------------------------
total images/sec: 35.48
----------------------------------------------------------------

And here are the full results for a Radeon RX Vega with a batch size of 32:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 8.8 +/- 0.0 (jitter = 0.0)  8.169
10      images/sec: 9.3 +/- 0.1 (jitter = 0.7)  7.593
20      images/sec: 9.3 +/- 0.1 (jitter = 0.4)  7.696
30      images/sec: 9.3 +/- 0.1 (jitter = 0.5)  7.753
40      images/sec: 9.3 +/- 0.1 (jitter = 0.4)  8.007
50      images/sec: 9.3 +/- 0.1 (jitter = 0.4)  7.520
60      images/sec: 9.3 +/- 0.0 (jitter = 0.4)  7.990
70      images/sec: 9.3 +/- 0.0 (jitter = 0.4)  8.028
80      images/sec: 9.3 +/- 0.0 (jitter = 0.4)  7.931
90      images/sec: 9.3 +/- 0.0 (jitter = 0.4)  7.851
100     images/sec: 9.3 +/- 0.0 (jitter = 0.4)  7.797
----------------------------------------------------------------
total images/sec: 9.26
----------------------------------------------------------------

Edit: Clarify package release timelines.

MatPoliquin commented 4 years ago

Just tried the new 1.15.3.dev200626 version, and I actually get worse performance on the RX 580

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 3.9 +/- 0.0 (jitter = 0.0)  8.169
10      images/sec: 4.0 +/- 0.0 (jitter = 0.1)  7.593
20      images/sec: 4.0 +/- 0.0 (jitter = 0.1)  7.696
30      images/sec: 4.0 +/- 0.0 (jitter = 0.1)  7.753
40      images/sec: 4.0 +/- 0.0 (jitter = 0.1)  8.007
50      images/sec: 4.0 +/- 0.0 (jitter = 0.1)  7.520
60      images/sec: 4.0 +/- 0.0 (jitter = 0.1)  7.988
70      images/sec: 4.0 +/- 0.0 (jitter = 0.1)  8.029
80      images/sec: 4.0 +/- 0.0 (jitter = 0.1)  7.932
90      images/sec: 4.0 +/- 0.0 (jitter = 0.1)  7.850
100     images/sec: 4.0 +/- 0.0 (jitter = 0.1)  7.799
----------------------------------------------------------------
total images/sec: 4.04
----------------------------------------------------------------

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 10.3 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 10.5 +/- 0.1 (jitter = 0.3) 7.854
20      images/sec: 10.5 +/- 0.1 (jitter = 0.3) 7.726
30      images/sec: 10.7 +/- 0.1 (jitter = 0.4) 7.360
40      images/sec: 10.7 +/- 0.1 (jitter = 0.4) 7.527
50      images/sec: 10.7 +/- 0.1 (jitter = 0.3) 8.171
60      images/sec: 10.7 +/- 0.1 (jitter = 0.4) 7.999
70      images/sec: 10.7 +/- 0.0 (jitter = 0.4) 7.978
80      images/sec: 10.8 +/- 0.0 (jitter = 0.4) 7.884
90      images/sec: 10.8 +/- 0.0 (jitter = 0.5) 7.924
100     images/sec: 10.9 +/- 0.0 (jitter = 0.5) 7.847
----------------------------------------------------------------
total images/sec: 10.88
----------------------------------------------------------------

PatriceVignola commented 4 years ago

This is interesting. I don't have access to an RX 580 at the moment, but we tried with 3 different AMD cards (Radeon VII, Radeon RX Vega and Radeon RX 5700 XT) and saw a 50% performance increase on average. I have a few questions to help me understand the issue:

Are you running the benchmark on WSL or on Windows?
Do you have other graphics cards on your machine?
If you run by disabling grappler optimizations (python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --enable_optimizations=0), does it get better or worse?

Also, if you don't mind, could you take a trace, upload it somewhere and send us the link?

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --trace_file=trace.json

MatPoliquin commented 4 years ago

EDIT: for some reason I did not install the 1.15.3.dev200626 version properly, I reinstalled it and now I get better performance

Windows
only one GPU

here is the result:

Step    Img/sec total_loss
1       images/sec: 4.9 +/- 0.0 (jitter = 0.0)  8.169
10      images/sec: 4.8 +/- 0.0 (jitter = 0.1)  7.593
20      images/sec: 4.9 +/- 0.0 (jitter = 0.1)  7.696
30      images/sec: 4.9 +/- 0.0 (jitter = 0.1)  7.753
40      images/sec: 5.0 +/- 0.0 (jitter = 0.1)  8.007
50      images/sec: 5.0 +/- 0.0 (jitter = 0.2)  7.520
60      images/sec: 5.0 +/- 0.0 (jitter = 0.2)  7.988
70      images/sec: 5.0 +/- 0.0 (jitter = 0.2)  8.029
80      images/sec: 5.0 +/- 0.0 (jitter = 0.2)  7.932
90      images/sec: 5.0 +/- 0.0 (jitter = 0.2)  7.850
100     images/sec: 5.0 +/- 0.0 (jitter = 0.2)  7.799
----------------------------------------------------------------
total images/sec: 4.98
----------------------------------------------------------------

Here is the zipped trace.json file trace.zip

Note: The performance increase is more noticeable with --batch_size=16:

Step    Img/sec total_loss
1       images/sec: 20.2 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 20.1 +/- 0.0 (jitter = 0.2) 7.854
20      images/sec: 20.1 +/- 0.1 (jitter = 0.2) 7.726
30      images/sec: 20.1 +/- 0.0 (jitter = 0.2) 7.360
40      images/sec: 20.1 +/- 0.0 (jitter = 0.2) 7.527
50      images/sec: 20.2 +/- 0.0 (jitter = 0.2) 8.171
60      images/sec: 20.2 +/- 0.0 (jitter = 0.2) 7.999
70      images/sec: 20.2 +/- 0.0 (jitter = 0.2) 7.978
80      images/sec: 20.2 +/- 0.0 (jitter = 0.2) 7.884
90      images/sec: 20.2 +/- 0.0 (jitter = 0.2) 7.924
100     images/sec: 20.2 +/- 0.0 (jitter = 0.2) 7.847
----------------------------------------------------------------
total images/sec: 20.18
----------------------------------------------------------------

sunshinejnjn commented 4 years ago

I'm testing this on various devices, including Intel iGPUs, AMD iGPUs (AMD dGPUs on native Linux are not tested for now), and NVIDIA 9xx, 10xx and 20xx systems. The iGPUs are the most interesting part.

An AMD Ryzen 4500U with Vega 6 failed to run the benchmark with build 200615. The system froze while running it; after a minute or so there seemed to be a GPU reset, with a beep, and then an error was reported as output. With build 200626 the situation is similar: no beep or reset, but it is still unable to run.

System version: Windows 10 2020H1
Driver version: AMD 27.20.1017.1011 (dated 20200525, the newest AMD GPU driver at present, Adrenalin 2020 Edition 20.5.1)

Another device is a Dell XPS 15 9550 (i7-6700HQ with Intel HD 530) running the latest Intel beta driver. It ran the benchmark at 1.8 images/sec on the Intel iGPU. Windows 10 2020H1, TF-DML build 200626.

Later, I'm going to test this on an Intel i5-4000 with iGPU to see if it can run.

PatriceVignola commented 4 years ago

@MatPoliquin Ah, these numbers make more sense. Thank you for double checking! Like @adtsai said, we didn't optimize our memory allocator yet so the performance increase for larger batch sizes is less noticeable and we end up utilizing more memory than necessary, but we're working on improving it.

@sunshinejnjn What are the models of the iGPUs/dGPUs that crashed or froze while running the benchmark?

oscarbg commented 4 years ago

with 26-6-2020 package (tensorflow_directml-1.15.3.dev200626-cp37-cp37m-win_amd64)

my results on Titan V (451.58):

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 95.8 +/- 0.0 (jitter = 0.0) 8.169
10      images/sec: 94.8 +/- 0.4 (jitter = 1.1) 7.593
20      images/sec: 95.1 +/- 0.2 (jitter = 0.9) 7.696
30      images/sec: 94.8 +/- 0.2 (jitter = 1.1) 7.753
40      images/sec: 94.9 +/- 0.2 (jitter = 0.7) 8.006
50      images/sec: 94.7 +/- 0.1 (jitter = 1.0) 7.520
60      images/sec: 94.6 +/- 0.1 (jitter = 0.9) 7.989
70      images/sec: 94.5 +/- 0.1 (jitter = 0.8) 8.028
80      images/sec: 94.5 +/- 0.1 (jitter = 0.8) 7.930
90      images/sec: 94.4 +/- 0.1 (jitter = 0.8) 7.849
100     images/sec: 94.3 +/- 0.1 (jitter = 0.9) 7.795
----------------------------------------------------------------
total images/sec: 94.29
----------------------------------------------------------------

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 80.1 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 80.0 +/- 0.3 (jitter = 0.6) 7.854
20      images/sec: 80.1 +/- 0.1 (jitter = 0.3) 7.726
30      images/sec: 80.0 +/- 0.1 (jitter = 0.3) 7.360
40      images/sec: 80.0 +/- 0.1 (jitter = 0.4) 7.527
50      images/sec: 79.9 +/- 0.1 (jitter = 0.4) 8.171
60      images/sec: 79.9 +/- 0.1 (jitter = 0.4) 7.999
70      images/sec: 79.9 +/- 0.1 (jitter = 0.4) 7.978
80      images/sec: 79.9 +/- 0.1 (jitter = 0.5) 7.884
90      images/sec: 79.8 +/- 0.1 (jitter = 0.5) 7.924
100     images/sec: 79.7 +/- 0.1 (jitter = 0.5) 7.847
----------------------------------------------------------------
total images/sec: 79.72
----------------------------------------------------------------

on vega56 (20.20 branch 27.20.2001.5002) similar to others

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 36.9 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 37.0 +/- 0.0 (jitter = 0.1) 7.854
20      images/sec: 36.9 +/- 0.0 (jitter = 0.1) 7.726
30      images/sec: 36.9 +/- 0.0 (jitter = 0.1) 7.360
40      images/sec: 36.8 +/- 0.0 (jitter = 0.1) 7.526
50      images/sec: 36.8 +/- 0.0 (jitter = 0.2) 8.171
60      images/sec: 36.7 +/- 0.1 (jitter = 0.2) 7.999
70      images/sec: 36.4 +/- 0.1 (jitter = 0.3) 7.978
80      images/sec: 36.2 +/- 0.1 (jitter = 0.3) 7.884
90      images/sec: 36.0 +/- 0.1 (jitter = 0.4) 7.924
100     images/sec: 35.9 +/- 0.1 (jitter = 0.6) 7.848
----------------------------------------------------------------
total images/sec: 35.91
----------------------------------------------------------------

EDIT: adding CUDA Titan V scores. CUDA seems to be about 3x faster than current DirectML:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
Step    Img/sec total_loss
1       images/sec: 285.6 +/- 0.0 (jitter = 0.0)        7.765
10      images/sec: 289.8 +/- 1.4 (jitter = 3.9)        8.049
20      images/sec: 289.9 +/- 0.8 (jitter = 1.9)        7.808
30      images/sec: 289.3 +/- 0.7 (jitter = 3.7)        7.976
40      images/sec: 289.8 +/- 0.6 (jitter = 3.8)        7.591
50      images/sec: 289.8 +/- 0.5 (jitter = 3.7)        7.549
60      images/sec: 289.5 +/- 0.4 (jitter = 3.7)        7.819
70      images/sec: 289.4 +/- 0.4 (jitter = 3.8)        7.821
80      images/sec: 289.5 +/- 0.4 (jitter = 3.8)        7.849
90      images/sec: 289.3 +/- 0.4 (jitter = 3.8)        8.027
100     images/sec: 289.4 +/- 0.4 (jitter = 3.8)        8.030
----------------------------------------------------------------
total images/sec: 289.27
----------------------------------------------------------------

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 235.5 +/- 0.0 (jitter = 0.0)        8.034
10      images/sec: 237.0 +/- 1.2 (jitter = 5.0)        7.686
20      images/sec: 236.5 +/- 0.9 (jitter = 5.0)        7.657
30      images/sec: 236.7 +/- 0.7 (jitter = 5.0)        8.194
40      images/sec: 237.0 +/- 0.6 (jitter = 5.1)        7.897
50      images/sec: 236.9 +/- 0.5 (jitter = 5.0)        7.999
60      images/sec: 236.9 +/- 0.5 (jitter = 4.9)        7.912
70      images/sec: 236.9 +/- 0.4 (jitter = 4.9)        8.180
80      images/sec: 236.9 +/- 0.4 (jitter = 4.9)        8.351
90      images/sec: 236.8 +/- 0.4 (jitter = 4.9)        8.115
100     images/sec: 237.1 +/- 0.4 (jitter = 5.0)        7.822
----------------------------------------------------------------
total images/sec: 237.04
----------------------------------------------------------------
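
The "about 3x" figure can be checked directly from the totals posted above. This is just a small worked calculation; `speedup` and `seconds_per_step` are illustrative helper names, and the numbers are the Titan V totals copied from this comment.

```python
def speedup(baseline_images_per_sec, candidate_images_per_sec):
    """How many times faster the candidate is than the baseline."""
    if baseline_images_per_sec <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate_images_per_sec / baseline_images_per_sec


def seconds_per_step(batch_size, images_per_sec):
    """Wall-clock time for one training step at the given throughput."""
    return batch_size / images_per_sec


# Totals (images/sec) copied from the Titan V runs above, keyed by batch size.
titan_v = {
    32: {"directml": 94.29, "cuda": 289.27},
    16: {"directml": 79.72, "cuda": 237.04},
}

for batch_size, totals in titan_v.items():
    ratio = speedup(totals["directml"], totals["cuda"])
    dml_step = seconds_per_step(batch_size, totals["directml"])
    print(f"batch_size={batch_size}: CUDA/DirectML = {ratio:.2f}x, "
          f"DirectML step time = {dml_step:.3f} s")
```

Comparing step times as well as images/sec makes it easier to see whether a larger batch is actually paying off on a given backend.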

sofiageo commented 4 years ago

Adding my AMD 5700XT results (no overclock) in case it helps verify the expected outcome

Windows 10 2004 - AMD Drivers 20.5.1-ghs-beta

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 36.3 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 37.6 +/- 0.4 (jitter = 1.4) 7.854
20      images/sec: 38.2 +/- 0.3 (jitter = 2.0) 7.726
30      images/sec: 38.5 +/- 0.2 (jitter = 2.0) 7.360
40      images/sec: 38.4 +/- 0.2 (jitter = 2.0) 7.526
50      images/sec: 38.3 +/- 0.2 (jitter = 2.0) 8.171
60      images/sec: 38.1 +/- 0.2 (jitter = 2.0) 7.999
70      images/sec: 38.0 +/- 0.2 (jitter = 2.0) 7.978
80      images/sec: 37.9 +/- 0.1 (jitter = 2.0) 7.884
90      images/sec: 38.0 +/- 0.1 (jitter = 2.0) 7.924
100     images/sec: 37.9 +/- 0.1 (jitter = 2.0) 7.848
----------------------------------------------------------------
total images/sec: 37.94
----------------------------------------------------------------

and updated drivers 20.7.1

Step    Img/sec total_loss
1       images/sec: 39.4 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 39.3 +/- 0.1 (jitter = 0.1) 7.854
20      images/sec: 38.9 +/- 0.1 (jitter = 0.6) 7.726
30      images/sec: 38.7 +/- 0.1 (jitter = 0.6) 7.360
40      images/sec: 38.6 +/- 0.1 (jitter = 0.6) 7.526
50      images/sec: 38.7 +/- 0.1 (jitter = 0.6) 8.171
60      images/sec: 38.8 +/- 0.1 (jitter = 0.6) 7.999
70      images/sec: 38.8 +/- 0.1 (jitter = 0.5) 7.978
80      images/sec: 38.8 +/- 0.1 (jitter = 0.4) 7.884
90      images/sec: 38.8 +/- 0.0 (jitter = 0.4) 7.924
100     images/sec: 38.8 +/- 0.0 (jitter = 0.4) 7.848
----------------------------------------------------------------
total images/sec: 38.80
----------------------------------------------------------------

limyz commented 4 years ago

Windows 10 2004/AMD Driver v27.20.1017.1011/DirectML

AMD R9 290


python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server

1       images/sec: 4.9 +/- 0.0 (jitter = 0.0)  8.169
10      images/sec: 4.9 +/- 0.0 (jitter = 0.0)  7.593
20      images/sec: 4.9 +/- 0.0 (jitter = 0.1)  7.696
30      images/sec: 5.0 +/- 0.0 (jitter = 0.1)  7.753
40      images/sec: 5.0 +/- 0.0 (jitter = 0.1)  8.007
50      images/sec: 5.0 +/- 0.0 (jitter = 0.1)  7.520
60      images/sec: 5.0 +/- 0.0 (jitter = 0.1)  7.990
70      images/sec: 5.0 +/- 0.0 (jitter = 0.1)  8.028
80      images/sec: 5.0 +/- 0.0 (jitter = 0.1)  7.931
90      images/sec: 5.0 +/- 0.0 (jitter = 0.1)  7.851
100     images/sec: 5.0 +/- 0.0 (jitter = 0.1)  7.797

total images/sec: 4.96

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --enable_optimizations=0 --trace_file=trace.json

Step    Img/sec total_loss
1       images/sec: 5.3 +/- 0.0 (jitter = 0.0)  8.169
10      images/sec: 5.3 +/- 0.0 (jitter = 0.1)  7.593
20      images/sec: 5.3 +/- 0.0 (jitter = 0.1)  7.696
30      images/sec: 5.2 +/- 0.0 (jitter = 0.1)  7.753
40      images/sec: 5.3 +/- 0.0 (jitter = 0.1)  8.007
50      images/sec: 5.3 +/- 0.0 (jitter = 0.1)  7.519
60      images/sec: 5.3 +/- 0.0 (jitter = 0.1)  7.989
70      images/sec: 5.3 +/- 0.0 (jitter = 0.1)  8.028
80      images/sec: 5.3 +/- 0.0 (jitter = 0.1)  7.933
90      images/sec: 5.3 +/- 0.0 (jitter = 0.1)  7.851
100     images/sec: 5.3 +/- 0.0 (jitter = 0.1)  7.795

total images/sec: 5.27

[trace_r290.zip](https://github.com/microsoft/DirectML/files/4881411/trace_r290.zip)

AMD Ryzen 4700U with Vega 7 Graphics

python .\tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50 --variable_update=parameter_server

Step    Img/sec total_loss
1       images/sec: 5.8 +/- 0.0 (jitter = 0.0)  7.993
10      images/sec: 5.8 +/- 0.0 (jitter = 0.0)  7.854
20      images/sec: 5.8 +/- 0.0 (jitter = 0.0)  7.726
30      images/sec: 5.8 +/- 0.0 (jitter = 0.0)  7.360
40      images/sec: 5.7 +/- 0.0 (jitter = 0.0)  7.526
50      images/sec: 5.7 +/- 0.0 (jitter = 0.0)  8.171
60      images/sec: 5.7 +/- 0.0 (jitter = 0.0)  7.999
70      images/sec: 5.7 +/- 0.0 (jitter = 0.0)  7.978
80      images/sec: 5.8 +/- 0.0 (jitter = 0.0)  7.884
90      images/sec: 5.8 +/- 0.0 (jitter = 0.0)  7.924
100     images/sec: 5.8 +/- 0.0 (jitter = 0.0)  7.848

total images/sec: 5.77

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50 --enable_optimizations=0 --trace_file=trace.json

Step    Img/sec total_loss
1       images/sec: 5.7 +/- 0.0 (jitter = 0.0)  7.993
10      images/sec: 6.2 +/- 0.1 (jitter = 0.0)  7.854
20      images/sec: 6.2 +/- 0.0 (jitter = 0.0)  7.726
30      images/sec: 6.2 +/- 0.0 (jitter = 0.0)  7.360
40      images/sec: 6.2 +/- 0.0 (jitter = 0.0)  7.527
50      images/sec: 6.3 +/- 0.0 (jitter = 0.0)  8.171
60      images/sec: 6.3 +/- 0.0 (jitter = 0.0)  7.999
70      images/sec: 6.3 +/- 0.0 (jitter = 0.0)  7.978
80      images/sec: 6.3 +/- 0.0 (jitter = 0.0)  7.884
90      images/sec: 6.3 +/- 0.0 (jitter = 0.0)  7.925
100     images/sec: 6.3 +/- 0.0 (jitter = 0.0)  7.848

total images/sec: 6.27


[trace_vega7.zip](https://github.com/microsoft/DirectML/files/4881375/trace_vega7.zip)

oscarbg commented 4 years ago

Hi, performance is worse with the new update (tensorflow-directml 1.15.3.dev200911). For example, on a Titan V (460.15) with: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

now I get:

Step    Img/sec total_loss
1       images/sec: 55.7 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 56.2 +/- 0.1 (jitter = 0.3) 7.854
20      images/sec: 56.3 +/- 0.2 (jitter = 0.4) 7.726
30      images/sec: 56.2 +/- 0.1 (jitter = 0.4) 7.360
40      images/sec: 56.1 +/- 0.1 (jitter = 0.5) 7.527
50      images/sec: 56.1 +/- 0.1 (jitter = 0.5) 8.171
60      images/sec: 55.5 +/- 0.3 (jitter = 0.5) 7.999
70      images/sec: 55.5 +/- 0.2 (jitter = 0.7) 7.978
80      images/sec: 55.5 +/- 0.2 (jitter = 0.7) 7.884
90      images/sec: 55.4 +/- 0.2 (jitter = 0.8) 7.924
100     images/sec: 55.4 +/- 0.2 (jitter = 0.8) 7.847
----------------------------------------------------------------
total images/sec: 55.39
----------------------------------------------------------------

In June I was getting:

Step    Img/sec total_loss
1       images/sec: 80.1 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 80.0 +/- 0.3 (jitter = 0.6) 7.854
20      images/sec: 80.1 +/- 0.1 (jitter = 0.3) 7.726
30      images/sec: 80.0 +/- 0.1 (jitter = 0.3) 7.360
40      images/sec: 80.0 +/- 0.1 (jitter = 0.4) 7.527
50      images/sec: 79.9 +/- 0.1 (jitter = 0.4) 8.171
60      images/sec: 79.9 +/- 0.1 (jitter = 0.4) 7.999
70      images/sec: 79.9 +/- 0.1 (jitter = 0.4) 7.978
80      images/sec: 79.9 +/- 0.1 (jitter = 0.5) 7.884
90      images/sec: 79.8 +/- 0.1 (jitter = 0.5) 7.924
100     images/sec: 79.7 +/- 0.1 (jitter = 0.5) 7.847
----------------------------------------------------------------
total images/sec: 79.72
----------------------------------------------------------------

I get lots of messages like the one below in the console output. This seems to point to the performance issue, since it now seems to do some work on the DirectML CPU backend (DML CPU):

2020-09-20 21:29:44.368137: W tensorflow/core/common_runtime/colocation_graph.cc:983] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
  /job:localhost/replica:0/task:0/device:DML:0
  /job:localhost/replica:0/task:0/device:DML:1
  /job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='/device:GPU:0' supported_device_types_=[DML, CPU] possible_devices_=[]
Assign: DML CPU
Const: DML CPU
VariableV2: DML CPU
Identity: DML CPU
ApplyGradientDescent: DML CPU
IsVariableInitialized: DML CPU

It seems I wasn't getting this in my June testing. Full log of the run: fullogdmlcpu.txt
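
When a log contains many of these colocation warnings, it can be easier to summarize which op types list which device types than to read them one by one. The following is a rough sketch that only understands the `OpType: DEVICES` lines in the excerpt above; `supported_devices_by_op` and `cpu_only_ops` are names invented here, and the real log may contain line shapes this does not handle.

```python
import re

# Lines inside "Colocation Debug Info" look like "Assign: DML CPU".
OP_LINE_RE = re.compile(
    r"^\s*([A-Za-z_][A-Za-z0-9_]*):\s+([A-Z]+(?:\s+[A-Z]+)*)\s*$"
)


def supported_devices_by_op(log_text):
    """Map each op type to the list of device types printed next to it."""
    result = {}
    for line in log_text.splitlines():
        match = OP_LINE_RE.match(line)
        if match:
            result[match.group(1)] = match.group(2).split()
    return result


def cpu_only_ops(devices_by_op):
    """Op types that list CPU but not DML, i.e. candidates for CPU fallback."""
    return sorted(op for op, devs in devices_by_op.items()
                  if "CPU" in devs and "DML" not in devs)


# Excerpt copied from the warning above.
EXCERPT = """\
Colocation Debug Info:
Colocation group had the following types and supported devices:
Assign: DML CPU
Const: DML CPU
VariableV2: DML CPU
Identity: DML CPU
ApplyGradientDescent: DML CPU
IsVariableInitialized: DML CPU
"""

ops = supported_devices_by_op(EXCERPT)
print(sorted(ops), cpu_only_ops(ops))
```

In this excerpt every op type lists DML as well as CPU, so the warning on its own does not show which ops, if any, actually ran on the CPU; a trace (`--trace_file=trace.json`) is the more direct way to check that.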

PatriceVignola commented 4 years ago

@oscarbg Were you not getting these logs with the previous package? I think these logs are expected when running tf_cnn_benchmarks since it doesn't know anything about DML, but it should still try to fallback to DML instead of the CPU when possible. We'll investigate the performance regression.

metemadi commented 4 years ago

Wanted to add my results! First off, this is some amazing work - it really helps increase the accessibility of ML tools! I am running a Zephyrus G14, which has two GPUs: the integrated Radeon and a 2060 Max-Q. I'm running in Windows for now (will try WSL 2 soon as well). I can confirm that I can use bigger batch sizes with the Radeon (which has access to 40 GB of installed system RAM: 8 GB soldered plus a 32 GB stick) than I can with the Max-Q (6 GB VRAM only). This really opens up a lot of possibilities, but it's all just pretty slow compared to the CPU (the 4900HS).

First.. the CPU:

python tf_cnn_benchmarks.py --batch_size=16 --model=resnet50 --enable_optimizations=0 --device='cpu' --data_format='NHWC' --num_batches=30
Step    Img/sec total_loss
1       images/sec: 3.8 +/- 0.0 (jitter = 0.0)  7.780
10      images/sec: 3.7 +/- 0.0 (jitter = 0.0)  7.877
20      images/sec: 3.7 +/- 0.0 (jitter = 0.1)  7.744
30      images/sec: 3.7 +/- 0.0 (jitter = 0.1)  7.672
----------------------------------------------------------------
total images/sec: 3.69
----------------------------------------------------------------

Next - the Radeon (I had to disable the Max-Q in device manager.. couldn't find an easy way to make the gigantic Tensorflow benchmark pick /dml:1 which is the Radeon, /dml:0 being the Max-Q):

python tf_cnn_benchmarks.py --batch_size=16 --model=resnet50 --enable_optimizations=0 --data_format='NHWC' --num_batches=30
Step    Img/sec total_loss
1       images/sec: 3.2 +/- 0.0 (jitter = 0.0)  7.780
10      images/sec: 3.2 +/- 0.0 (jitter = 0.0)  7.877
20      images/sec: 3.2 +/- 0.0 (jitter = 0.0)  7.744
30      images/sec: 3.2 +/- 0.0 (jitter = 0.0)  7.672
----------------------------------------------------------------
total images/sec: 3.20
----------------------------------------------------------------

Sadly slower than the CPU, with access to the same amount of memory so no difference in batch size there. Last but not least, the nvidia 2060-MaxQ:

#same exact command as above
Done warm up
Step    Img/sec total_loss
1       images/sec: 17.6 +/- 0.0 (jitter = 0.0) 7.780
10      images/sec: 17.6 +/- 0.0 (jitter = 0.1) 7.877
20      images/sec: 17.5 +/- 0.0 (jitter = 0.1) 7.744
30      images/sec: 17.5 +/- 0.0 (jitter = 0.1) 7.672
----------------------------------------------------------------
total images/sec: 17.51
----------------------------------------------------------------
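
For anyone comparing runs like the three above, here is a minimal sketch (not part of tf_cnn_benchmarks) that pulls the `total images/sec` line out of saved benchmark output and expresses each device relative to a baseline. The function names and dictionary labels are made up for illustration; the numbers are the totals posted above.

```python
import re

# Matches the summary line printed at the end of a tf_cnn_benchmarks run,
# e.g. "total images/sec: 3.69".
TOTAL_RE = re.compile(r"total images/sec:\s*([0-9]+(?:\.[0-9]+)?)")


def parse_total(log_text):
    """Return the total images/sec from benchmark output, or None if absent."""
    match = TOTAL_RE.search(log_text)
    return float(match.group(1)) if match else None


def relative_to(baseline, results):
    """Map each label to its throughput divided by the baseline throughput."""
    return {label: value / baseline for label, value in results.items()}


if __name__ == "__main__":
    # Totals copied from the runs above (batch_size=16, resnet50, NHWC).
    totals = {
        "CPU (4900HS)": 3.69,
        "integrated Radeon": 3.20,
        "2060-MaxQ": 17.51,
    }
    for label, ratio in relative_to(totals["CPU (4900HS)"], totals).items():
        print(f"{label}: {ratio:.2f}x the CPU")
```

With these totals the integrated Radeon comes out at roughly 0.87x the CPU and the Max-Q at roughly 4.7x.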

I expect CUDA performance with WSL-2/CUDA to be a lot better, but that's not the point here! All in all, as performance improves this will really change the game for ML beginners and pros alike! Excited to be able to access the entire system memory for optimized GPU computations, AC-922 style. Thanks again @Microsoft for this important work.
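
On the device-selection problem mentioned above (disabling the Max-Q in Device Manager so the benchmark would pick the Radeon), one possible alternative is to hide adapters through an environment variable before launching the script. The sketch below assumes the installed tensorflow-directml build honors `DML_VISIBLE_DEVICES`; that is an assumption to check against the release notes of the build in use, and older builds may simply ignore it. The helper names are hypothetical.

```python
import os
import subprocess


def benchmark_env(adapter_index, base_env=None):
    """Return a copy of the environment with only one DirectML adapter visible.

    Assumption: the installed tensorflow-directml build honors
    DML_VISIBLE_DEVICES; builds that predate it will ignore the variable.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["DML_VISIBLE_DEVICES"] = str(adapter_index)
    return env


# Same flags as the Radeon run above.
BENCHMARK_CMD = [
    "python", "tf_cnn_benchmarks.py",
    "--batch_size=16", "--model=resnet50", "--enable_optimizations=0",
    "--data_format=NHWC", "--num_batches=30",
]


def run_on_adapter(adapter_index):
    """Launch the benchmark with only the given adapter exposed."""
    return subprocess.run(BENCHMARK_CMD, env=benchmark_env(adapter_index), check=True)
```

Calling `run_on_adapter(1)` from the tf_cnn_benchmarks directory would target the adapter that showed up as /dml:1 (the Radeon on this machine), if the variable is supported.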

PatriceVignola commented 4 years ago

@metemadi Thank you for the additional data points - your setup looks awesome for ML!

We're currently focused on improving stability and coverage, but the next step is obviously to be way more competitive with CUDA. So far we've been focused on coverage from the ai-benchmark models, but the TF benchmarks repo is something we're starting to look into.

oscarbg commented 3 years ago

Just pointing out that the new 202104 release is much faster. On a Titan V it is nearly 2x faster than last year's (June) release, so CUDA is now only 50% faster.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50


Step    Img/sec total_loss
1       images/sec: 162.1 +/- 0.0 (jitter = 0.0)        8.169
10      images/sec: 177.8 +/- 23.1 (jitter = 2.0)       7.593
20      images/sec: 177.5 +/- 11.7 (jitter = 2.3)       7.696
30      images/sec: 177.2 +/- 11.3 (jitter = 1.7)       7.753
40      images/sec: 177.1 +/- 9.0 (jitter = 1.5)        8.007
50      images/sec: 176.7 +/- 7.2 (jitter = 1.4)        7.520
60      images/sec: 176.7 +/- 6.0 (jitter = 1.3)        7.988
70      images/sec: 176.3 +/- 5.2 (jitter = 1.6)        8.027
80      images/sec: 176.4 +/- 4.6 (jitter = 1.5)        7.931
90      images/sec: 176.3 +/- 4.1 (jitter = 1.6)        7.851
100     images/sec: 176.2 +/- 3.7 (jitter = 1.9)        7.794
----------------------------------------------------------------
total images/sec: 176.09
----------------------------------------------------------------

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 166.5 +/- 0.0 (jitter = 0.0)        7.993
10      images/sec: 161.9 +/- 1.3 (jitter = 3.3)        7.854
20      images/sec: 161.6 +/- 1.1 (jitter = 3.0)        7.726
30      images/sec: 161.7 +/- 6.9 (jitter = 2.1)        7.360
40      images/sec: 161.6 +/- 5.2 (jitter = 2.9)        7.527
50      images/sec: 161.5 +/- 4.2 (jitter = 2.8)        8.171
60      images/sec: 161.4 +/- 3.5 (jitter = 2.4)        7.999
70      images/sec: 161.4 +/- 3.0 (jitter = 2.5)        7.978
80      images/sec: 161.3 +/- 2.7 (jitter = 2.9)        7.883
90      images/sec: 161.3 +/- 3.2 (jitter = 2.9)        7.924
100     images/sec: 161.1 +/- 2.9 (jitter = 2.9)        7.847
----------------------------------------------------------------
total images/sec: 161.00
----------------------------------------------------------------

maybe this issue can be closed?

thanks..

oscarbg commented 3 years ago

I can confirm that the RX Vega also gets a 2x speedup compared with the previous results:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 76.2 +/- 0.0 (jitter = 0.0) 8.169
10      images/sec: 76.0 +/- 0.2 (jitter = 0.4) 7.593
20      images/sec: 75.9 +/- 0.1 (jitter = 0.4) 7.696
30      images/sec: 75.8 +/- 0.1 (jitter = 0.5) 7.753
40      images/sec: 75.7 +/- 0.1 (jitter = 0.5) 8.007
50      images/sec: 75.6 +/- 0.1 (jitter = 0.6) 7.520
60      images/sec: 75.5 +/- 0.1 (jitter = 0.5) 7.988
70      images/sec: 75.5 +/- 0.1 (jitter = 0.5) 8.027
80      images/sec: 75.4 +/- 0.1 (jitter = 0.5) 7.932
90      images/sec: 75.3 +/- 0.1 (jitter = 0.7) 7.850
100     images/sec: 75.1 +/- 0.1 (jitter = 0.8) 7.797
----------------------------------------------------------------
total images/sec: 75.11
----------------------------------------------------------------

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50

Step    Img/sec total_loss
1       images/sec: 69.7 +/- 0.0 (jitter = 0.0) 7.993
10      images/sec: 70.4 +/- 0.2 (jitter = 0.3) 7.854
20      images/sec: 70.3 +/- 0.1 (jitter = 0.5) 7.726
30      images/sec: 68.1 +/- 2.2 (jitter = 0.5) 7.360
40      images/sec: 68.4 +/- 1.9 (jitter = 0.6) 7.527
50      images/sec: 68.7 +/- 1.5 (jitter = 0.6) 8.171
60      images/sec: 68.9 +/- 1.3 (jitter = 0.5) 7.999
70      images/sec: 69.0 +/- 1.1 (jitter = 0.5) 7.978
80      images/sec: 69.0 +/- 1.0 (jitter = 0.5) 7.884
90      images/sec: 69.1 +/- 0.9 (jitter = 0.5) 7.924
100     images/sec: 69.1 +/- 0.8 (jitter = 0.5) 7.848
----------------------------------------------------------------
total images/sec: 69.12
----------------------------------------------------------------
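
To make the batch-size effect in the two posts above easier to see, here is a small sketch that converts the reported totals into time per training step (batch size divided by images/sec). The table layout and function name are just for illustration; the figures are the totals quoted above.

```python
def seconds_per_step(batch_size, images_per_sec):
    """Time for one training step, given throughput in images/sec."""
    return batch_size / images_per_sec


# (gpu, batch_size) -> total images/sec, copied from the runs above.
RESULTS = {
    ("Titan V", 32): 176.09,
    ("Titan V", 16): 161.00,
    ("RX Vega", 32): 75.11,
    ("RX Vega", 16): 69.12,
}

if __name__ == "__main__":
    for (gpu, batch), rate in RESULTS.items():
        ms = 1000 * seconds_per_step(batch, rate)
        print(f"{gpu} batch={batch}: {rate:.2f} img/s, {ms:.1f} ms/step")
```

On both cards, going from batch 32 to batch 16 brings the step time down to a bit more than half while costing about 8 to 9% of throughput.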