MatPoliquin opened this issue 4 years ago
The GTX 1080 Ti got a total of 32.48 images/sec.
Importing tf.contrib.resampler should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
Done warm up
Step Img/sec total_loss
1 images/sec: 35.7 +/- 0.0 (jitter = 0.0) 8.169
10 images/sec: 32.2 +/- 1.1 (jitter = 4.0) 7.593
20 images/sec: 32.7 +/- 0.7 (jitter = 3.3) 7.696
30 images/sec: 32.3 +/- 0.7 (jitter = 3.7) 7.753
40 images/sec: 32.3 +/- 0.6 (jitter = 4.5) 8.007
50 images/sec: 32.7 +/- 0.5 (jitter = 4.2) 7.520
60 images/sec: 32.8 +/- 0.5 (jitter = 3.9) 7.988
70 images/sec: 32.5 +/- 0.5 (jitter = 3.9) 8.028
80 images/sec: 32.6 +/- 0.4 (jitter = 3.7) 7.932
90 images/sec: 32.4 +/- 0.4 (jitter = 4.0) 7.850
100 images/sec: 32.5 +/- 0.4 (jitter = 3.9) 7.795
TensorFlow-GPU 1.15.3 (official) with CUDA got 174.89 images/sec.
The system is an AMD R7 1700X with 64 GB RAM, running Windows 10 20H1. Also, half of the first screen flashes a few times throughout the benchmark.
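As an aside on the tf.contrib.resampler warning in that log: here is a minimal sketch of the import ordering it asks for, with a hypothetical frozen-graph path:

```python
# Touch tf.contrib.resampler BEFORE importing a frozen graph; contrib ops are
# registered lazily on first module access, so this import's side effect is
# what makes the graph's contrib ops resolvable.
import tensorflow as tf
import tensorflow.contrib.resampler  # noqa: F401 (side effect: registers ops)

graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile("frozen_model.pb", "rb") as f:  # hypothetical path
    graph_def.ParseFromString(f.read())
tf.compat.v1.import_graph_def(graph_def, name="")
```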
Thank you for reporting your benchmark results! This is a preview and we only have a limited set of operators implemented at the moment, so results like this are not totally unexpected. As operator support gets closer to what CUDA/ROCm supports, we expect performance to get better and we'll be able to focus on it a lot more. We'll definitely look into this benchmark though and see where the bottlenecks are.
First, absolutely thank you! Being able to do this from any OS that supports DirectX 12 is amazing. Second, if I can help, let me know.
Summary:
Hardware: stock Acer Predator Helios 500 PH517-61-R0GX gaming laptop, AMD Ryzen 7 2700 desktop processor, AMD Radeon RX Vega 56
DirectML Results (python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50)
Step Img/sec total_loss
1 images/sec: 20.3 +/- 0.0 (jitter = 0.0) 7.993
10 images/sec: 20.6 +/- 0.1 (jitter = 0.3) 7.854
20 images/sec: 20.6 +/- 0.1 (jitter = 0.2) 7.726
30 images/sec: 20.5 +/- 0.1 (jitter = 0.2) 7.360
40 images/sec: 20.6 +/- 0.0 (jitter = 0.3) 7.526
50 images/sec: 20.6 +/- 0.0 (jitter = 0.2) 8.171
60 images/sec: 20.6 +/- 0.0 (jitter = 0.2) 7.999
70 images/sec: 20.6 +/- 0.0 (jitter = 0.2) 7.978
80 images/sec: 20.6 +/- 0.0 (jitter = 0.2) 7.884
90 images/sec: 20.6 +/- 0.0 (jitter = 0.2) 7.924
100 images/sec: 20.6 +/- 0.0 (jitter = 0.2) 7.848
----------------------------------------------------------------
total images/sec: 20.65
----------------------------------------------------------------
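For readers parsing these tables: a sketch of how tf_cnn_benchmarks derives the "+/-" and "jitter" columns from per-step times, modeled on my reading of its get_perf_timing helper; treat the exact formula and constant as an assumption about the benchmark's internals:

```python
import numpy as np

def perf_stats(step_times, batch_size):
    # images/sec for each step, from wall-clock seconds per step
    speeds = batch_size / np.asarray(step_times, dtype=np.float64)
    mean = batch_size / np.mean(step_times)              # headline images/sec
    uncertainty = np.std(speeds) / np.sqrt(len(speeds))  # the "+/-" column
    # robust spread estimate: scaled median absolute deviation
    jitter = 1.4826 * np.median(np.abs(speeds - np.median(speeds)))
    return mean, uncertainty, jitter

print(perf_stats([0.50, 0.52, 0.49, 0.51], batch_size=16))
```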
Results with --enable_optimizations=0 (python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --enable_optimizations=0):
Step Img/sec total_loss
1 images/sec: 30.1 +/- 0.0 (jitter = 0.0) 7.993
10 images/sec: 30.1 +/- 0.1 (jitter = 0.3) 7.854
20 images/sec: 30.1 +/- 0.1 (jitter = 0.1) 7.726
30 images/sec: 30.2 +/- 0.1 (jitter = 0.2) 7.360
40 images/sec: 30.1 +/- 0.0 (jitter = 0.2) 7.527
50 images/sec: 30.1 +/- 0.0 (jitter = 0.2) 8.171
60 images/sec: 30.0 +/- 0.0 (jitter = 0.2) 7.999
70 images/sec: 30.0 +/- 0.0 (jitter = 0.2) 7.978
80 images/sec: 30.0 +/- 0.0 (jitter = 0.2) 7.884
90 images/sec: 30.0 +/- 0.0 (jitter = 0.2) 7.925
100 images/sec: 30.1 +/- 0.0 (jitter = 0.2) 7.848
----------------------------------------------------------------
total images/sec: 27.27
----------------------------------------------------------------
ROCm Results (python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50):
Step Img/sec total_loss
1 images/sec: 131.4 +/- 0.0 (jitter = 0.0) 8.458
10 images/sec: 130.0 +/- 0.9 (jitter = 2.9) 7.997
20 images/sec: 129.1 +/- 0.6 (jitter = 2.2) 8.260
30 images/sec: 128.6 +/- 0.5 (jitter = 2.0) 8.338
40 images/sec: 128.4 +/- 0.4 (jitter = 2.3) 8.190
50 images/sec: 128.0 +/- 0.4 (jitter = 2.7) 7.742
60 images/sec: 128.2 +/- 0.4 (jitter = 2.4) 8.061
70 images/sec: 128.3 +/- 0.3 (jitter = 2.4) inf
80 images/sec: 128.3 +/- 0.3 (jitter = 2.5) inf
90 images/sec: 128.2 +/- 0.3 (jitter = 2.5) inf
100 images/sec: 128.2 +/- 0.3 (jitter = 2.5) inf
----------------------------------------------------------------
total images/sec: 128.13
----------------------------------------------------------------
It's great to hear that the DirectML stack is working well for you! These results are interesting, and it's good to hear that it's behaving stably, because stability and correctness are things we invest a lot of time in.
As @PatriceVignola mentioned, this is a super early preview and we're still working hard on it, so you can definitely expect the performance to improve as time goes on. For example, I suspect one of the reasons --batch_size=32 is so much slower on DML is that we haven't optimized our memory allocator yet, which means that at high batch sizes we end up using more VRAM than necessary in some circumstances, leading to a performance cliff. But rest assured, we're working on it. :)
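A rough illustration of why batch size hits the memory ceiling first; the per-image activation figure below is an assumed number for ResNet-50 training in fp32, not a measurement:

```python
# Back-of-the-envelope VRAM estimate for ResNet-50 training.
ACT_BYTES_PER_IMAGE = 110e6   # assumption: ~110 MB of fp32 activations per image
WEIGHT_BYTES = 25.6e6 * 4     # ~25.6M parameters in fp32

for batch in (16, 32):
    total_gb = (WEIGHT_BYTES + batch * ACT_BYTES_PER_IMAGE) / 1e9
    print(f"batch {batch}: ~{total_gb:.1f} GB before allocator overhead")

# An unoptimized allocator that pads or duplicates buffers can push the
# batch-32 footprint past an 8 GB card (Vega 56 / RX 580) while batch 16
# still fits, which would look exactly like the cliff described above.
```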
@MatPoliquin, @sunshinejnjn, @ashaver, we just uploaded a new package that improves the performance of TensorFlow DirectML devices across the board. The package (1.15.3.dev200626) is now on PyPI and can be installed with
pip install tensorflow-directml
if it's your first time installing it or
pip install tensorflow-directml --upgrade
if you installed the previous 1.15.3.dev200619 release.
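To confirm the upgrade actually took effect (which matters; a mis-installed package surfaces later in this thread), here is a quick sanity check. The exact device string is what we would expect from a DirectML build:

```python
# Print the installed version and enumerate the devices TensorFlow can see.
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)  # should report the 1.15.3 DirectML build
print([d.name for d in device_lib.list_local_devices()])
# A working install should list something like '/device:DML:0' next to '/device:CPU:0'.
```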
On a Radeon RX Vega, we see a ~63% performance increase for batch_size=16 and a ~47% performance increase for batch_size=32. These improvements are not limited to AMD cards, though; we expect similar gains on Nvidia and Intel graphics.
We realize that there is still a lot of room for improvement to catch up with ROCm and CUDA, but we aim to release packages regularly and keep the community updated on our progress. All feedback and data that we receive is very helpful as we work on closing the performance and functionality gap.
Here are the full results for a Radeon RX Vega with a batch size of 16:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50
Step Img/sec total_loss
1 images/sec: 36.6 +/- 0.0 (jitter = 0.0) 7.993
10 images/sec: 35.7 +/- 0.2 (jitter = 0.0) 7.854
20 images/sec: 35.6 +/- 0.1 (jitter = 0.0) 7.726
30 images/sec: 35.6 +/- 0.1 (jitter = 0.0) 7.360
40 images/sec: 35.5 +/- 0.1 (jitter = 0.0) 7.526
50 images/sec: 35.6 +/- 0.1 (jitter = 0.0) 8.171
60 images/sec: 35.5 +/- 0.1 (jitter = 0.0) 7.999
70 images/sec: 35.5 +/- 0.1 (jitter = 0.0) 7.978
80 images/sec: 35.5 +/- 0.1 (jitter = 0.0) 7.884
90 images/sec: 35.5 +/- 0.1 (jitter = 1.7) 7.924
100 images/sec: 35.5 +/- 0.1 (jitter = 1.7) 7.848
----------------------------------------------------------------
total images/sec: 35.48
----------------------------------------------------------------
And here are the full results for a Radeon RX Vega with a batch size of 32:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
Step Img/sec total_loss
1 images/sec: 8.8 +/- 0.0 (jitter = 0.0) 8.169
10 images/sec: 9.3 +/- 0.1 (jitter = 0.7) 7.593
20 images/sec: 9.3 +/- 0.1 (jitter = 0.4) 7.696
30 images/sec: 9.3 +/- 0.1 (jitter = 0.5) 7.753
40 images/sec: 9.3 +/- 0.1 (jitter = 0.4) 8.007
50 images/sec: 9.3 +/- 0.1 (jitter = 0.4) 7.520
60 images/sec: 9.3 +/- 0.0 (jitter = 0.4) 7.990
70 images/sec: 9.3 +/- 0.0 (jitter = 0.4) 8.028
80 images/sec: 9.3 +/- 0.0 (jitter = 0.4) 7.931
90 images/sec: 9.3 +/- 0.0 (jitter = 0.4) 7.851
100 images/sec: 9.3 +/- 0.0 (jitter = 0.4) 7.797
----------------------------------------------------------------
total images/sec: 9.26
----------------------------------------------------------------
Edit: Clarify package release timelines.
Just tried the new 1.15.3.dev200626 version; I actually get worse performance on my RX 580.
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
Step Img/sec total_loss
1 images/sec: 3.9 +/- 0.0 (jitter = 0.0) 8.169
10 images/sec: 4.0 +/- 0.0 (jitter = 0.1) 7.593
20 images/sec: 4.0 +/- 0.0 (jitter = 0.1) 7.696
30 images/sec: 4.0 +/- 0.0 (jitter = 0.1) 7.753
40 images/sec: 4.0 +/- 0.0 (jitter = 0.1) 8.007
50 images/sec: 4.0 +/- 0.0 (jitter = 0.1) 7.520
60 images/sec: 4.0 +/- 0.0 (jitter = 0.1) 7.988
70 images/sec: 4.0 +/- 0.0 (jitter = 0.1) 8.029
80 images/sec: 4.0 +/- 0.0 (jitter = 0.1) 7.932
90 images/sec: 4.0 +/- 0.0 (jitter = 0.1) 7.850
100 images/sec: 4.0 +/- 0.0 (jitter = 0.1) 7.799
----------------------------------------------------------------
total images/sec: 4.04
----------------------------------------------------------------
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50
Step Img/sec total_loss
1 images/sec: 10.3 +/- 0.0 (jitter = 0.0) 7.993
10 images/sec: 10.5 +/- 0.1 (jitter = 0.3) 7.854
20 images/sec: 10.5 +/- 0.1 (jitter = 0.3) 7.726
30 images/sec: 10.7 +/- 0.1 (jitter = 0.4) 7.360
40 images/sec: 10.7 +/- 0.1 (jitter = 0.4) 7.527
50 images/sec: 10.7 +/- 0.1 (jitter = 0.3) 8.171
60 images/sec: 10.7 +/- 0.1 (jitter = 0.4) 7.999
70 images/sec: 10.7 +/- 0.0 (jitter = 0.4) 7.978
80 images/sec: 10.8 +/- 0.0 (jitter = 0.4) 7.884
90 images/sec: 10.8 +/- 0.0 (jitter = 0.5) 7.924
100 images/sec: 10.9 +/- 0.0 (jitter = 0.5) 7.847
----------------------------------------------------------------
total images/sec: 10.88
----------------------------------------------------------------
This is interesting. I don't have access to an RX 580 at the moment, but we tried with 3 different AMD cards (Radeon VII, Radeon RX Vega and Radeon RX 5700 XT) and saw a 50% performance increase on average. I have a few questions to help me understand the issue:
If you run with optimizations disabled (python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --enable_optimizations=0), does it get better or worse?
Also, if you don't mind, could you take a trace, upload it somewhere, and send us the link?
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --trace_file=trace.json
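For context, the --trace_file flag uses TF1's Chrome-trace machinery. Below is a self-contained sketch of the same mechanism on a toy graph (it mirrors the flag's behavior rather than reusing benchmark code); open the resulting JSON in chrome://tracing:

```python
# Capture a Chrome-format timeline for a single session run.
import tensorflow.compat.v1 as tf
from tensorflow.python.client import timeline

tf.disable_v2_behavior()
a = tf.random.uniform([1024, 1024])
b = tf.matmul(a, a)

with tf.Session() as sess:
    opts = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    meta = tf.RunMetadata()
    sess.run(b, options=opts, run_metadata=meta)
    with open("trace.json", "w") as f:
        f.write(timeline.Timeline(meta.step_stats).generate_chrome_trace_format())
```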
EDIT: for some reason I had not installed the 1.15.3.dev200626 version properly; I reinstalled it and now I get better performance.
Windows, only one GPU. Here is the result:
Step Img/sec total_loss
1 images/sec: 4.9 +/- 0.0 (jitter = 0.0) 8.169
10 images/sec: 4.8 +/- 0.0 (jitter = 0.1) 7.593
20 images/sec: 4.9 +/- 0.0 (jitter = 0.1) 7.696
30 images/sec: 4.9 +/- 0.0 (jitter = 0.1) 7.753
40 images/sec: 5.0 +/- 0.0 (jitter = 0.1) 8.007
50 images/sec: 5.0 +/- 0.0 (jitter = 0.2) 7.520
60 images/sec: 5.0 +/- 0.0 (jitter = 0.2) 7.988
70 images/sec: 5.0 +/- 0.0 (jitter = 0.2) 8.029
80 images/sec: 5.0 +/- 0.0 (jitter = 0.2) 7.932
90 images/sec: 5.0 +/- 0.0 (jitter = 0.2) 7.850
100 images/sec: 5.0 +/- 0.0 (jitter = 0.2) 7.799
----------------------------------------------------------------
total images/sec: 4.98
----------------------------------------------------------------
Here is the zipped trace.json file trace.zip
Note: the performance increase is more noticeable with --batch_size=16:
Step Img/sec total_loss
1 images/sec: 20.2 +/- 0.0 (jitter = 0.0) 7.993
10 images/sec: 20.1 +/- 0.0 (jitter = 0.2) 7.854
20 images/sec: 20.1 +/- 0.1 (jitter = 0.2) 7.726
30 images/sec: 20.1 +/- 0.0 (jitter = 0.2) 7.360
40 images/sec: 20.1 +/- 0.0 (jitter = 0.2) 7.527
50 images/sec: 20.2 +/- 0.0 (jitter = 0.2) 8.171
60 images/sec: 20.2 +/- 0.0 (jitter = 0.2) 7.999
70 images/sec: 20.2 +/- 0.0 (jitter = 0.2) 7.978
80 images/sec: 20.2 +/- 0.0 (jitter = 0.2) 7.884
90 images/sec: 20.2 +/- 0.0 (jitter = 0.2) 7.924
100 images/sec: 20.2 +/- 0.0 (jitter = 0.2) 7.847
----------------------------------------------------------------
total images/sec: 20.18
----------------------------------------------------------------
I'm testing this on various devices, including Intel iGPUs, AMD iGPUs (AMD dGPUs on native Linux not tested for now), and Nvidia 9xx/10xx/20xx systems. The iGPUs are the most interesting part. An AMD Ryzen 4500U with Vega 6 failed to run the benchmark with build 200615: the system froze when running it, seemed to hit a GPU reset with a beep after a minute or so, then reported some errors as output. With build 200626 the situation is similar: no beep/reset, but it is still unable to run. System version: Windows 10 2020H1. Driver version: AMD 27.20.1017.1011 (dated 2020-05-25, the newest AMD GPU driver at present, Adrenalin 2020 Edition 20.5.1).
Another device, a Dell XPS 15 9550 (i7-6700HQ with Intel HD 530 running the latest Intel beta driver), ran this at 1.8 images/sec on the Intel iGPU. Windows 10 2020H1, TF-DML build 200626.
Later, I'm going to test this on an Intel i5-4000 with an iGPU to see if it can run.
@MatPoliquin Ah, these numbers make more sense. Thank you for double-checking! As @adtsai said, we haven't optimized our memory allocator yet, so the performance increase for larger batch sizes is less noticeable and we end up using more memory than necessary, but we're working on improving it.
@sunshinejnjn What are the models of the iGPUs/dGPUs that crashed or froze while running the benchmark?
With the 2020-06-26 package (tensorflow_directml-1.15.3.dev200626-cp37-cp37m-win_amd64), here are my results on a Titan V (driver 451.58):
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
Step Img/sec total_loss
1 images/sec: 95.8 +/- 0.0 (jitter = 0.0) 8.169
10 images/sec: 94.8 +/- 0.4 (jitter = 1.1) 7.593
20 images/sec: 95.1 +/- 0.2 (jitter = 0.9) 7.696
30 images/sec: 94.8 +/- 0.2 (jitter = 1.1) 7.753
40 images/sec: 94.9 +/- 0.2 (jitter = 0.7) 8.006
50 images/sec: 94.7 +/- 0.1 (jitter = 1.0) 7.520
60 images/sec: 94.6 +/- 0.1 (jitter = 0.9) 7.989
70 images/sec: 94.5 +/- 0.1 (jitter = 0.8) 8.028
80 images/sec: 94.5 +/- 0.1 (jitter = 0.8) 7.930
90 images/sec: 94.4 +/- 0.1 (jitter = 0.8) 7.849
100 images/sec: 94.3 +/- 0.1 (jitter = 0.9) 7.795
----------------------------------------------------------------
total images/sec: 94.29
----------------------------------------------------------------
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50
Step Img/sec total_loss
1 images/sec: 80.1 +/- 0.0 (jitter = 0.0) 7.993
10 images/sec: 80.0 +/- 0.3 (jitter = 0.6) 7.854
20 images/sec: 80.1 +/- 0.1 (jitter = 0.3) 7.726
30 images/sec: 80.0 +/- 0.1 (jitter = 0.3) 7.360
40 images/sec: 80.0 +/- 0.1 (jitter = 0.4) 7.527
50 images/sec: 79.9 +/- 0.1 (jitter = 0.4) 8.171
60 images/sec: 79.9 +/- 0.1 (jitter = 0.4) 7.999
70 images/sec: 79.9 +/- 0.1 (jitter = 0.4) 7.978
80 images/sec: 79.9 +/- 0.1 (jitter = 0.5) 7.884
90 images/sec: 79.8 +/- 0.1 (jitter = 0.5) 7.924
100 images/sec: 79.7 +/- 0.1 (jitter = 0.5) 7.847
----------------------------------------------------------------
total images/sec: 79.72
----------------------------------------------------------------
On a Vega 56 (20.20-branch driver 27.20.2001.5002), results are similar to others':
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50
Step Img/sec total_loss
1 images/sec: 36.9 +/- 0.0 (jitter = 0.0) 7.993
10 images/sec: 37.0 +/- 0.0 (jitter = 0.1) 7.854
20 images/sec: 36.9 +/- 0.0 (jitter = 0.1) 7.726
30 images/sec: 36.9 +/- 0.0 (jitter = 0.1) 7.360
40 images/sec: 36.8 +/- 0.0 (jitter = 0.1) 7.526
50 images/sec: 36.8 +/- 0.0 (jitter = 0.2) 8.171
60 images/sec: 36.7 +/- 0.1 (jitter = 0.2) 7.999
70 images/sec: 36.4 +/- 0.1 (jitter = 0.3) 7.978
80 images/sec: 36.2 +/- 0.1 (jitter = 0.3) 7.884
90 images/sec: 36.0 +/- 0.1 (jitter = 0.4) 7.924
100 images/sec: 35.9 +/- 0.1 (jitter = 0.6) 7.848
----------------------------------------------------------------
total images/sec: 35.91
----------------------------------------------------------------
EDIT: adding CUDA Titan V scores; CUDA seems to be about 3x faster than current DirectML:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
Step Img/sec total_loss
1 images/sec: 285.6 +/- 0.0 (jitter = 0.0) 7.765
10 images/sec: 289.8 +/- 1.4 (jitter = 3.9) 8.049
20 images/sec: 289.9 +/- 0.8 (jitter = 1.9) 7.808
30 images/sec: 289.3 +/- 0.7 (jitter = 3.7) 7.976
40 images/sec: 289.8 +/- 0.6 (jitter = 3.8) 7.591
50 images/sec: 289.8 +/- 0.5 (jitter = 3.7) 7.549
60 images/sec: 289.5 +/- 0.4 (jitter = 3.7) 7.819
70 images/sec: 289.4 +/- 0.4 (jitter = 3.8) 7.821
80 images/sec: 289.5 +/- 0.4 (jitter = 3.8) 7.849
90 images/sec: 289.3 +/- 0.4 (jitter = 3.8) 8.027
100 images/sec: 289.4 +/- 0.4 (jitter = 3.8) 8.030
----------------------------------------------------------------
total images/sec: 289.27
----------------------------------------------------------------
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50
Step Img/sec total_loss
1 images/sec: 235.5 +/- 0.0 (jitter = 0.0) 8.034
10 images/sec: 237.0 +/- 1.2 (jitter = 5.0) 7.686
20 images/sec: 236.5 +/- 0.9 (jitter = 5.0) 7.657
30 images/sec: 236.7 +/- 0.7 (jitter = 5.0) 8.194
40 images/sec: 237.0 +/- 0.6 (jitter = 5.1) 7.897
50 images/sec: 236.9 +/- 0.5 (jitter = 5.0) 7.999
60 images/sec: 236.9 +/- 0.5 (jitter = 4.9) 7.912
70 images/sec: 236.9 +/- 0.4 (jitter = 4.9) 8.180
80 images/sec: 236.9 +/- 0.4 (jitter = 4.9) 8.351
90 images/sec: 236.8 +/- 0.4 (jitter = 4.9) 8.115
100 images/sec: 237.1 +/- 0.4 (jitter = 5.0) 7.822
----------------------------------------------------------------
total images/sec: 237.04
----------------------------------------------------------------
Adding my AMD 5700 XT results (no overclock) in case they help verify the expected outcome.
Windows 10 2004, AMD drivers 20.5.1-ghs-beta
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50
Step Img/sec total_loss
1 images/sec: 36.3 +/- 0.0 (jitter = 0.0) 7.993
10 images/sec: 37.6 +/- 0.4 (jitter = 1.4) 7.854
20 images/sec: 38.2 +/- 0.3 (jitter = 2.0) 7.726
30 images/sec: 38.5 +/- 0.2 (jitter = 2.0) 7.360
40 images/sec: 38.4 +/- 0.2 (jitter = 2.0) 7.526
50 images/sec: 38.3 +/- 0.2 (jitter = 2.0) 8.171
60 images/sec: 38.1 +/- 0.2 (jitter = 2.0) 7.999
70 images/sec: 38.0 +/- 0.2 (jitter = 2.0) 7.978
80 images/sec: 37.9 +/- 0.1 (jitter = 2.0) 7.884
90 images/sec: 38.0 +/- 0.1 (jitter = 2.0) 7.924
100 images/sec: 37.9 +/- 0.1 (jitter = 2.0) 7.848
----------------------------------------------------------------
total images/sec: 37.94
----------------------------------------------------------------
And with updated drivers (20.7.1):
Step Img/sec total_loss
1 images/sec: 39.4 +/- 0.0 (jitter = 0.0) 7.993
10 images/sec: 39.3 +/- 0.1 (jitter = 0.1) 7.854
20 images/sec: 38.9 +/- 0.1 (jitter = 0.6) 7.726
30 images/sec: 38.7 +/- 0.1 (jitter = 0.6) 7.360
40 images/sec: 38.6 +/- 0.1 (jitter = 0.6) 7.526
50 images/sec: 38.7 +/- 0.1 (jitter = 0.6) 8.171
60 images/sec: 38.8 +/- 0.1 (jitter = 0.6) 7.999
70 images/sec: 38.8 +/- 0.1 (jitter = 0.5) 7.978
80 images/sec: 38.8 +/- 0.1 (jitter = 0.4) 7.884
90 images/sec: 38.8 +/- 0.0 (jitter = 0.4) 7.924
100 images/sec: 38.8 +/- 0.0 (jitter = 0.4) 7.848
----------------------------------------------------------------
total images/sec: 38.80
----------------------------------------------------------------
Windows 10 2004/AMD Driver v27.20.1017.1011/DirectML
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --enable_optimizations=0 --trace_file=trace.json
[trace_r290.zip](https://github.com/microsoft/DirectML/files/4881411/trace_r290.zip)
* AMD Ryzen 4700U with Vega 7 graphics
python .\tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50 --variable_update=parameter_server
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50 --enable_optimizations=0 --trace_file=trace.json
[trace_vega7.zip](https://github.com/microsoft/DirectML/files/4881375/trace_vega7.zip)
Hi, performance is worse with the new update (tensorflow-directml 1.15.3.dev200911). For example, on a Titan V (driver 460.15), running python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50,
I now get:
Step Img/sec total_loss
1 images/sec: 55.7 +/- 0.0 (jitter = 0.0) 7.993
10 images/sec: 56.2 +/- 0.1 (jitter = 0.3) 7.854
20 images/sec: 56.3 +/- 0.2 (jitter = 0.4) 7.726
30 images/sec: 56.2 +/- 0.1 (jitter = 0.4) 7.360
40 images/sec: 56.1 +/- 0.1 (jitter = 0.5) 7.527
50 images/sec: 56.1 +/- 0.1 (jitter = 0.5) 8.171
60 images/sec: 55.5 +/- 0.3 (jitter = 0.5) 7.999
70 images/sec: 55.5 +/- 0.2 (jitter = 0.7) 7.978
80 images/sec: 55.5 +/- 0.2 (jitter = 0.7) 7.884
90 images/sec: 55.4 +/- 0.2 (jitter = 0.8) 7.924
100 images/sec: 55.4 +/- 0.2 (jitter = 0.8) 7.847
----------------------------------------------------------------
total images/sec: 55.39
----------------------------------------------------------------
In June I was getting:
Step Img/sec total_loss
1 images/sec: 80.1 +/- 0.0 (jitter = 0.0) 7.993
10 images/sec: 80.0 +/- 0.3 (jitter = 0.6) 7.854
20 images/sec: 80.1 +/- 0.1 (jitter = 0.3) 7.726
30 images/sec: 80.0 +/- 0.1 (jitter = 0.3) 7.360
40 images/sec: 80.0 +/- 0.1 (jitter = 0.4) 7.527
50 images/sec: 79.9 +/- 0.1 (jitter = 0.4) 8.171
60 images/sec: 79.9 +/- 0.1 (jitter = 0.4) 7.999
70 images/sec: 79.9 +/- 0.1 (jitter = 0.4) 7.978
80 images/sec: 79.9 +/- 0.1 (jitter = 0.5) 7.884
90 images/sec: 79.8 +/- 0.1 (jitter = 0.5) 7.924
100 images/sec: 79.7 +/- 0.1 (jitter = 0.5) 7.847
----------------------------------------------------------------
total images/sec: 79.72
----------------------------------------------------------------
The console output contains lots of messages like the one posted below. This seems to point to the performance issue, as it now appears to do some work on the DirectML CPU backend (DML CPU):
2020-09-20 21:29:44.368137: W tensorflow/core/common_runtime/colocation_graph.cc:983] Failed to place the graph without changing the devices of some resources. Some of the operations (that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
/job:localhost/replica:0/task:0/device:DML:0
/job:localhost/replica:0/task:0/device:DML:1
/job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='/device:GPU:0' supported_device_types_=[DML, CPU] possible_devices_=[]
Assign: DML CPU
Const: DML CPU
VariableV2: DML CPU
Identity: DML CPU
ApplyGradientDescent: DML CPU
IsVariableInitialized: DML CPU
It seems I wasn't getting this in the June testing. Full log of the run: fullogdmlcpu.txt
@oscarbg Were you not getting these logs with the previous package? I think these logs are expected when running tf_cnn_benchmarks since it doesn't know anything about DML, but it should still try to fall back to DML instead of the CPU when possible. We'll investigate the performance regression.
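For background on those colocation warnings: tf_cnn_benchmarks requests '/device:GPU:0', which doesn't exist in a DML build, so placement relies on TF1's soft-placement fallback. A minimal sketch of that mechanism on a toy variable graph (not the benchmark itself):

```python
# allow_soft_placement lets TF move a node whose requested device
# ('/device:GPU:0') doesn't exist onto a supported one (DML or CPU);
# log_device_placement shows where each op actually landed.
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)

with tf.device("/device:GPU:0"):  # what tf_cnn_benchmarks asks for
    v = tf.Variable(tf.zeros([10]), name="v")

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(v))
```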
Wanted to add my results! First off, this is some amazing work; it really helps increase the accessibility of ML tools!!! I am running a Zephyrus G14, which has two GPUs: the integrated Radeon and a 2060 Max-Q. Running in Windows for now (will try WSL 2 soon as well). I can confirm that I can get bigger batch sizes using the Radeon (which has access to 40 GB of installed system RAM: yes, 8 GB soldered plus a 32 GB stick, yay!) than I can with the Max-Q (6 GB VRAM only). This really opens up a lot of possibilities, but it's all just pretty slow compared to the CPU (the 4900HS).
First.. the CPU:
python tf_cnn_benchmarks.py --batch_size=16 --model=resnet50 --enable_optimizations=0 --device='cpu' --data_format='NHWC' --num_batches=30
Step Img/sec total_loss
1 images/sec: 3.8 +/- 0.0 (jitter = 0.0) 7.780
10 images/sec: 3.7 +/- 0.0 (jitter = 0.0) 7.877
20 images/sec: 3.7 +/- 0.0 (jitter = 0.1) 7.744
30 images/sec: 3.7 +/- 0.0 (jitter = 0.1) 7.672
----------------------------------------------------------------
total images/sec: 3.69
----------------------------------------------------------------
Next, the Radeon (I had to disable the Max-Q in Device Manager; I couldn't find an easy way to make the gigantic TensorFlow benchmark pick /dml:1, which is the Radeon, /dml:0 being the Max-Q; see the sketch after these results):
python tf_cnn_benchmarks.py --batch_size=16 --model=resnet50 --enable_optimizations=0 --data_format='NHWC' --num_batches=30
Step Img/sec total_loss
1 images/sec: 3.2 +/- 0.0 (jitter = 0.0) 7.780
10 images/sec: 3.2 +/- 0.0 (jitter = 0.0) 7.877
20 images/sec: 3.2 +/- 0.0 (jitter = 0.0) 7.744
30 images/sec: 3.2 +/- 0.0 (jitter = 0.0) 7.672
----------------------------------------------------------------
total images/sec: 3.20
----------------------------------------------------------------
Sadly, it's slower than the CPU, and with access to the same amount of memory there's no difference in batch size there. Last but not least, the Nvidia 2060 Max-Q:
#same exact command as above
Done warm up
Step Img/sec total_loss
1 images/sec: 17.6 +/- 0.0 (jitter = 0.0) 7.780
10 images/sec: 17.6 +/- 0.0 (jitter = 0.1) 7.877
20 images/sec: 17.5 +/- 0.0 (jitter = 0.1) 7.744
30 images/sec: 17.5 +/- 0.0 (jitter = 0.1) 7.672
----------------------------------------------------------------
total images/sec: 17.51
----------------------------------------------------------------
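On the device-selection point above: a hedged sketch of two ways to steer work to a specific adapter without disabling it in Device Manager. The DML_VISIBLE_DEVICES variable is an assumption about tensorflow-directml's device-masking support; the tf.device pinning is standard TF1:

```python
import os
# Assumption: tensorflow-directml honors this mask if set before import,
# hiding adapter 0 so the Radeon becomes the only (and first) DML device.
os.environ["DML_VISIBLE_DEVICES"] = "1"

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Alternatively, pin ops explicitly instead of hiding devices:
with tf.device("/device:DML:0"):  # after masking, the Radeon shows up as DML:0
    x = tf.random.uniform([4, 4])
    y = tf.matmul(x, x)

with tf.Session() as sess:
    print(sess.run(y))
```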
I expect performance with WSL 2/CUDA to be a lot better, but that's not the point here! All in all, as performance improves this will really change the game for ML beginners and pros alike!! I'm excited to be able to, AC-922 style, access the entire system memory for optimized GPU computations. Thanks again @Microsoft for this important work.
@metemadi Thank you for the additional data points - your setup looks awesome for ML!
We're currently focused on improving stability and coverage, but the next step is obviously to be way more competitive with CUDA. So far we've been focused on coverage from the ai-benchmark models, but the TF benchmarks repo is something we're starting to look into.
Just pointing out that the new 202104 release is much faster: on a Titan V it's nearly 2x faster than last year's June release, and with that, CUDA is now only ~50% faster.
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
Step Img/sec total_loss
1 images/sec: 162.1 +/- 0.0 (jitter = 0.0) 8.169
10 images/sec: 177.8 +/- 23.1 (jitter = 2.0) 7.593
20 images/sec: 177.5 +/- 11.7 (jitter = 2.3) 7.696
30 images/sec: 177.2 +/- 11.3 (jitter = 1.7) 7.753
40 images/sec: 177.1 +/- 9.0 (jitter = 1.5) 8.007
50 images/sec: 176.7 +/- 7.2 (jitter = 1.4) 7.520
60 images/sec: 176.7 +/- 6.0 (jitter = 1.3) 7.988
70 images/sec: 176.3 +/- 5.2 (jitter = 1.6) 8.027
80 images/sec: 176.4 +/- 4.6 (jitter = 1.5) 7.931
90 images/sec: 176.3 +/- 4.1 (jitter = 1.6) 7.851
100 images/sec: 176.2 +/- 3.7 (jitter = 1.9) 7.794
----------------------------------------------------------------
total images/sec: 176.09
----------------------------------------------------------------
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50
Step Img/sec total_loss
1 images/sec: 166.5 +/- 0.0 (jitter = 0.0) 7.993
10 images/sec: 161.9 +/- 1.3 (jitter = 3.3) 7.854
20 images/sec: 161.6 +/- 1.1 (jitter = 3.0) 7.726
30 images/sec: 161.7 +/- 6.9 (jitter = 2.1) 7.360
40 images/sec: 161.6 +/- 5.2 (jitter = 2.9) 7.527
50 images/sec: 161.5 +/- 4.2 (jitter = 2.8) 8.171
60 images/sec: 161.4 +/- 3.5 (jitter = 2.4) 7.999
70 images/sec: 161.4 +/- 3.0 (jitter = 2.5) 7.978
80 images/sec: 161.3 +/- 2.7 (jitter = 2.9) 7.883
90 images/sec: 161.3 +/- 3.2 (jitter = 2.9) 7.924
100 images/sec: 161.1 +/- 2.9 (jitter = 2.9) 7.847
----------------------------------------------------------------
total images/sec: 161.00
----------------------------------------------------------------
maybe this issue can be closed?
thanks..
I can confirm the RX Vega also gets a 2x speedup over the earlier results:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
Step Img/sec total_loss
1 images/sec: 76.2 +/- 0.0 (jitter = 0.0) 8.169
10 images/sec: 76.0 +/- 0.2 (jitter = 0.4) 7.593
20 images/sec: 75.9 +/- 0.1 (jitter = 0.4) 7.696
30 images/sec: 75.8 +/- 0.1 (jitter = 0.5) 7.753
40 images/sec: 75.7 +/- 0.1 (jitter = 0.5) 8.007
50 images/sec: 75.6 +/- 0.1 (jitter = 0.6) 7.520
60 images/sec: 75.5 +/- 0.1 (jitter = 0.5) 7.988
70 images/sec: 75.5 +/- 0.1 (jitter = 0.5) 8.027
80 images/sec: 75.4 +/- 0.1 (jitter = 0.5) 7.932
90 images/sec: 75.3 +/- 0.1 (jitter = 0.7) 7.850
100 images/sec: 75.1 +/- 0.1 (jitter = 0.8) 7.797
----------------------------------------------------------------
total images/sec: 75.11
----------------------------------------------------------------
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --model=resnet50
Step Img/sec total_loss
1 images/sec: 69.7 +/- 0.0 (jitter = 0.0) 7.993
10 images/sec: 70.4 +/- 0.2 (jitter = 0.3) 7.854
20 images/sec: 70.3 +/- 0.1 (jitter = 0.5) 7.726
30 images/sec: 68.1 +/- 2.2 (jitter = 0.5) 7.360
40 images/sec: 68.4 +/- 1.9 (jitter = 0.6) 7.527
50 images/sec: 68.7 +/- 1.5 (jitter = 0.6) 8.171
60 images/sec: 68.9 +/- 1.3 (jitter = 0.5) 7.999
70 images/sec: 69.0 +/- 1.1 (jitter = 0.5) 7.978
80 images/sec: 69.0 +/- 1.0 (jitter = 0.5) 7.884
90 images/sec: 69.1 +/- 0.9 (jitter = 0.5) 7.924
100 images/sec: 69.1 +/- 0.8 (jitter = 0.5) 7.848
----------------------------------------------------------------
total images/sec: 69.12
----------------------------------------------------------------
I get low performance on TF benchmarks with my RX 580: https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks
using their example command:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server
I get this error:
2020-06-19 16:01:17.369204: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:533] remapper failed: Not found: Op type not registered '_CopyFromGpuToHost'
and this performance result:
Step Img/sec total_loss
1 images/sec: 4.8 +/- 0.0 (jitter = 0.0) 8.169
10 images/sec: 4.7 +/- 0.0 (jitter = 0.1) 7.593
20 images/sec: 4.8 +/- 0.0 (jitter = 0.2) 7.696
30 images/sec: 4.8 +/- 0.0 (jitter = 0.2) 7.753
40 images/sec: 4.8 +/- 0.0 (jitter = 0.2) 8.007
50 images/sec: 4.8 +/- 0.0 (jitter = 0.1) 7.520
60 images/sec: 4.8 +/- 0.0 (jitter = 0.2) 7.989
70 images/sec: 4.8 +/- 0.0 (jitter = 0.1) 8.028
80 images/sec: 4.9 +/- 0.0 (jitter = 0.1) 7.932
90 images/sec: 4.9 +/- 0.0 (jitter = 0.1) 7.850
100 images/sec: 4.9 +/- 0.0 (jitter = 0.1) 7.798
total images/sec: 4.90