Unfortunately, the `DEVICE_DEFAULT` int32 registration for `Fill` in TensorFlow core is mistakenly trapped inside a CUDA `#ifdef`, so we cannot leverage it. To work around this, we emulate it in our plugin, as we did for `Pack` and `StridedSlice`.
By registering this operator on DML but keeping its int32 tensors in host memory, we force some elementwise operators to run on the GPU, which yields a significant performance improvement when running the model described here: https://github.com/microsoft/tensorflow-directml-plugin/discussions/315
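The registration described above can be sketched with TensorFlow's kernel C API. This is a minimal, hypothetical illustration only: the callback names (`FillInt32_*`) and the empty compute body are placeholders, not the plugin's actual implementation, and the device string is assumed to be the plugin's registered device name.

```cpp
// Sketch: register an int32 Fill kernel on the plugin device with all of
// its tensors pinned to host memory, mirroring the DEVICE_DEFAULT
// registration in TensorFlow core that is trapped behind the CUDA #ifdef.
#include "tensorflow/c/kernels.h"
#include "tensorflow/c/tf_status.h"

// Placeholder callbacks; the real kernel would compute the fill on the host.
static void* FillInt32_Create(TF_OpKernelConstruction* ctx) { return nullptr; }
static void FillInt32_Compute(void* kernel, TF_OpKernelContext* ctx) {
  // Read "dims" and "value" from host memory and write "output" on the CPU.
}
static void FillInt32_Delete(void* kernel) {}

void RegisterFillInt32HostKernel(const char* device_name) {
  TF_Status* status = TF_NewStatus();
  TF_KernelBuilder* builder = TF_NewKernelBuilder(
      "Fill", device_name, &FillInt32_Create, &FillInt32_Compute,
      &FillInt32_Delete);
  TF_KernelBuilder_TypeConstraint(builder, "T", TF_INT32, status);
  // Pin the int32 inputs and output to host memory. Because the op is
  // still *placed* on the plugin device, the placer keeps downstream
  // elementwise consumers on the GPU instead of falling back to the CPU.
  TF_KernelBuilder_HostMemory(builder, "dims");
  TF_KernelBuilder_HostMemory(builder, "value");
  TF_KernelBuilder_HostMemory(builder, "output");
  TF_RegisterKernelBuilder("Fill", builder, status);
  TF_DeleteStatus(status);
}
```

The key point is the combination: the device placement goes to DML while `TF_KernelBuilder_HostMemory` keeps the small int32 tensors on the CPU, which is what nudges the surrounding elementwise ops onto the GPU.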