microsoft / DirectML

DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.

Failed to fuse leaky ReLU with convolution on RTX 3090 #138

Open jb2020-super opened 3 years ago

jb2020-super commented 3 years ago

This is my code https://github.com/jb2020-super/test-DirectML.git

According to the PIX analysis, the convolution with FusedActivation set to DML_OPERATOR_ACTIVATION_LEAKY_RELU is split into two convolution ops. But when I replace it with DML_OPERATOR_ACTIVATION_RELU, the fusion succeeds. How can I solve this?

[Image: PIX capture]
adtsai commented 3 years ago

Hi,

DirectML fuses operators opportunistically - that is, when it is both possible to fuse and there is a performance benefit to doing so. Unfortunately in this case it appears it wasn't possible to fuse the LEAKY_RELU with the metacommand (as the level of metacommand support can vary by hardware and driver version). You might be able to achieve the fusion by using the DISABLE_METACOMMANDS flag, but that's likely to result in worse performance. Let us know if you have an end-to-end scenario that's impacted by this - if there's data that shows a substantial performance difference, this is something we can raise with hardware vendors as a potential optimization in future.
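To make concrete what "fusing" an activation into a convolution means, here is a minimal, portable sketch. It is not DirectML code and does not reflect how DirectML or any vendor metacommand is implemented. The function names, the 1-D "valid" convolution, and the `alpha` parameter are assumptions chosen for illustration. The unfused path writes the convolution result to an intermediate buffer and runs a second pass for the activation; the fused path applies leaky ReLU to each accumulator before storing it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Leaky ReLU: non-negative values pass through, negatives are scaled by alpha.
inline float LeakyRelu(float x, float alpha) {
    return x >= 0.0f ? x : alpha * x;
}

// Unfused (hypothetical helper): 1-D "valid" convolution written to an
// intermediate buffer, followed by a separate activation pass.
std::vector<float> ConvThenLeakyRelu(const std::vector<float>& input,
                                     const std::vector<float>& kernel,
                                     float alpha) {
    assert(!kernel.empty() && input.size() >= kernel.size());
    const std::size_t outSize = input.size() - kernel.size() + 1;

    std::vector<float> conv(outSize, 0.0f);
    for (std::size_t i = 0; i < outSize; ++i) {
        for (std::size_t k = 0; k < kernel.size(); ++k) {
            conv[i] += input[i + k] * kernel[k];
        }
    }

    // Second pass: re-read every intermediate value to apply the activation.
    std::vector<float> out(outSize);
    for (std::size_t i = 0; i < outSize; ++i) {
        out[i] = LeakyRelu(conv[i], alpha);
    }
    return out;
}

// Fused (hypothetical helper): the activation is applied to the accumulator
// before it is stored, so no intermediate buffer or second pass is needed.
std::vector<float> FusedConvLeakyRelu(const std::vector<float>& input,
                                      const std::vector<float>& kernel,
                                      float alpha) {
    assert(!kernel.empty() && input.size() >= kernel.size());
    const std::size_t outSize = input.size() - kernel.size() + 1;

    std::vector<float> out(outSize);
    for (std::size_t i = 0; i < outSize; ++i) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < kernel.size(); ++k) {
            acc += input[i + k] * kernel[k];
        }
        out[i] = LeakyRelu(acc, alpha);
    }
    return out;
}
```

Both functions return the same values for the same inputs. The difference is only in how much memory traffic and how many passes are needed, which is why a failed fusion shows up as extra work in a capture.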

jb2020-super commented 3 years ago

Hi @adtsai, DISABLE_METACOMMANDS results in bad performance. I replaced the model in the DirectMLSuperResolution sample with a seven-layer CNN and tested it. The results are as follows.

Environment

Test Result

| Model | AMD RX 5700 XT (frame time) | NVIDIA RTX 3090 (frame time) |
| --- | --- | --- |
| Demo | 38.41 ms | 10.975 ms |
| 7-layer CNN | 41.10 ms | 33.254 ms |
| 7-layer CNN (disable metacommand) | 133.69 ms | 115.50 ms |
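The frame times above are easier to compare as ratios. The sketch below uses only the figures reported in the table. The struct, constant, and function names are hypothetical and exist just to make the arithmetic explicit.

```cpp
#include <cassert>

// Frame times in milliseconds, copied from the reported results.
struct FrameTimes {
    double demoMs;               // Demo model
    double cnnMs;                // 7-layer CNN
    double cnnNoMetacommandMs;   // 7-layer CNN with metacommands disabled
};

const FrameTimes kRx5700Xt{38.41, 41.10, 133.69};
const FrameTimes kRtx3090{10.975, 33.254, 115.50};

// How many times slower a frame is relative to a baseline frame time.
inline double Slowdown(double frameMs, double baselineMs) {
    assert(baselineMs > 0.0);
    return frameMs / baselineMs;
}
```

For example, `Slowdown(kRtx3090.cnnMs, kRtx3090.demoMs)` is roughly 3.0, while `Slowdown(kRx5700Xt.cnnMs, kRx5700Xt.demoMs)` is roughly 1.07. On both GPUs, disabling metacommands makes the 7-layer CNN more than 3x slower than running it with metacommands enabled.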

Summary

PIX Analysis

Demo model on 5700XT

[Image: 5700xt_demo]

7-layer CNN on 5700XT

[Image: 5700xt_upconv7]

Demo model on 3090

[Image: demo_model]

7-layer CNN on 3090

[Image: leaky_relu]

7-layer CNN on 3090 with metacommands disabled

[Image: disable_metacommand]