Our team is working on optimizing TensorFlow on CPU, and we recently found that the Softmax operation in the dot_product_attention layer can be very slow in some situations.
We may get an input to Softmax with shape [20, 8, 45, 45], which looks more like NCHW format in TensorFlow. The current implementation of Softmax chooses the last dimension [45] (the depth) as the axis, and this becomes very slow; in general we would prefer to choose the 'channel' dimension as the axis. Since the paper does not say how to choose the axis, does it make more sense to choose the 'head' dimension [8] as the Softmax axis here?
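For reference, a minimal sketch of how the Softmax axis is selected (the [20, 8, 45, 45] shape and the axis=1 alternative are taken from the description above; this only illustrates the API, it is not a claim that the two axis choices are numerically equivalent):

```python
import tensorflow as tf

# Attention logits in the shape reported above:
# [batch, heads, query_length, key_length] = [20, 8, 45, 45]
logits = tf.random_normal([20, 8, 45, 45])

# Current behaviour: tf.nn.softmax normalizes over the last axis
# (the key_length / depth dimension of size 45).
weights_depth_axis = tf.nn.softmax(logits)          # equivalent to axis=-1

# The axis can be overridden explicitly, e.g. the 'head' dimension:
weights_head_axis = tf.nn.softmax(logits, axis=1)
```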
Environment information
OS: Linux version 3.10.0-862.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC))
CPU: Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
$ pip freeze | grep tensor
mesh-tensorflow==0.0.4
tensor2tensor==1.11.0
tensorboard==1.12.0
tensorflow==1.12.0rc0
tensorflow-estimator==1.10.12
tensorflow-metadata==0.9.0
tensorflow-probability==0.5.0
$ python -V
Python 3.4.9
### For bugs: reproduction and error logs
Steps to reproduce:
Shell script
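(The original shell script is not included here. As a rough stand-in, a standalone reproduction of the Softmax call with the shape reported above might look like the sketch below; the iteration count and timing method are only illustrative.)

```python
import time
import numpy as np
import tensorflow as tf

# Standalone reproduction of the Softmax call made inside dot_product_attention,
# using the [20, 8, 45, 45] shape reported above.
logits = tf.placeholder(tf.float32, shape=[20, 8, 45, 45])
weights = tf.nn.softmax(logits)  # normalizes over the last axis (size 45)

data = np.random.rand(20, 8, 45, 45).astype(np.float32)

with tf.Session() as sess:
    sess.run(weights, feed_dict={logits: data})  # warm-up
    start = time.time()
    for _ in range(1000):
        sess.run(weights, feed_dict={logits: data})
    print("average Softmax time: %.6f s" % ((time.time() - start) / 1000))
```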
Error logs:
TensorFlow profiling data shows that Softmax took too much time.
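(The raw profiling data is not reproduced here. For reference, one way such per-op timings can be collected in TF 1.x is a full-trace session run dumped as a Chrome timeline; this is only a sketch of a measurement setup, not necessarily how the data above was gathered.)

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Capture per-op timings for a single Softmax run and dump a Chrome trace.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

logits = tf.random_normal([20, 8, 45, 45])
weights = tf.nn.softmax(logits)

with tf.Session() as sess:
    sess.run(weights, options=run_options, run_metadata=run_metadata)
    trace = timeline.Timeline(run_metadata.step_stats)
    with open("softmax_timeline.json", "w") as f:
        f.write(trace.generate_chrome_trace_format())
```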