tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Which dimension is the correct axis for the Softmax in the dot_product_attention layer? #1251

Closed Zantares closed 5 years ago

Zantares commented 5 years ago

Description

Our team is optimizing TensorFlow on CPU. We recently found that the Softmax operation in the dot_product_attention layer can be very slow in some situations (see dot_product_attention).

The input to the Softmax can have shape [20, 8, 45, 45], which looks like NCHW format in TensorFlow terms. In the current implementation the Softmax is taken over the last dimension (size 45, i.e. the depth), which turns out to be very slow; in general we would prefer to use the 'channel' dimension as the axis. Since the paper does not say how the axis should be chosen, does it make more sense to use the 'head' dimension (size 8) as the Softmax axis here?
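For reference, a minimal sketch of where that Softmax sits in scaled dot-product attention (illustrative only, not the exact tensor2tensor code; the function name and shapes are assumptions based on the [20, 8, 45, 45] input above):

```python
import tensorflow as tf

def dot_product_attention_sketch(q, k, v):
  """Scaled dot-product attention reduced to the essentials.

  q, k, v: [batch, heads, length, depth_per_head]. With batch=20, heads=8
  and length=45, the attention logits come out as [20, 8, 45, 45], i.e.
  [batch, heads, query_length, memory_length].
  """
  depth = tf.cast(tf.shape(q)[-1], q.dtype)
  logits = tf.matmul(q, k, transpose_b=True) / tf.sqrt(depth)  # [20, 8, 45, 45]
  # The softmax is applied over the last axis (memory_length), so every
  # query position gets a probability distribution over the key positions.
  weights = tf.nn.softmax(logits, axis=-1)
  return tf.matmul(weights, v)  # [batch, heads, query_length, depth_per_head]
```

Under this reading the last axis of the [20, 8, 45, 45] logits is the memory (key) length, which is what the last-axis Softmax reduces over.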

Environment information

OS:
Linux version 3.10.0-862.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) )
Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz

$ pip freeze | grep tensor
mesh-tensorflow==0.0.4
tensor2tensor==1.11.0
tensorboard==1.12.0
tensorflow==1.12.0rc0
tensorflow-estimator==1.10.12
tensorflow-metadata==0.9.0
tensorflow-probability==0.5.0

$ python -V
Python 3.4.9

### For bugs: reproduction and error logs

Steps to reproduce:

Shell script

export KMP_BLOCKTIME=1
export OMP_NUM_THREADS=28
export KMP_AFFINITY=granularity=fine,verbose,compact,1,0

python3 ./tensor2tensor/bin/t2t-datagen --problem=translate_ende_wmt32k  --data_dir=~/t2t_data --tmp_dir=~/t2t_data/tmp
rm out_dir/*

numactl --cpunodebind=0 --membind=0 python3 ./tensor2tensor/bin/t2t-trainer  --data_dir=~/t2t_data --problem=translate_ende_wmt32k --model=transformer --hparams_set=transformer_base_single_gpu --output_dir=./out_dir --hparams='batch_size=1024 ' --train_steps=300 --inter_op_parallelism_threads=1 --intra_op_parallelism_threads=28
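
To isolate the Softmax cost from the rest of training, a standalone micro-benchmark along these lines can also be used (a rough sketch assuming TF 1.12-style sessions; the shape and iteration count are only illustrative):

```python
import time
import numpy as np
import tensorflow as tf

# Shape seen in the profile: [batch, heads, query_length, memory_length].
x = tf.placeholder(tf.float32, shape=[20, 8, 45, 45])
y_last = tf.nn.softmax(x, axis=-1)  # last axis, as in dot_product_attention
y_head = tf.nn.softmax(x, axis=1)   # 'head' axis, for comparison
data = np.random.rand(20, 8, 45, 45).astype(np.float32)

with tf.Session() as sess:
  for name, y in [("axis=-1", y_last), ("axis=1", y_head)]:
    sess.run(y, feed_dict={x: data})  # warm-up run
    start = time.time()
    for _ in range(100):
      sess.run(y, feed_dict={x: data})
    print(name, (time.time() - start) / 100.0, "sec per run")
```

Running this under the same KMP/OMP settings as above should show whether the last-axis Softmax on this shape is the slow path.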

Error logs:

TensorFlow profiling data showing that Softmax takes too much time:

Profile:
node name | requested bytes | total execution time | accelerator execution time | cpu execution time | op occurrence (run|defined)
SaveV2                                0B (0.00%, 0.00%),      7.83sec (100.00%, 83.77%),             0us (0.00%, 0.00%),      7.83sec (100.00%, 83.77%),        2|2
Softmax                        23.67MB (100.00%, 0.58%),       828.23ms (16.23%, 8.86%),             0us (0.00%, 0.00%),       828.23ms (16.23%, 8.86%),      18|18
MatMul                        848.05MB (99.42%, 20.93%),        195.63ms (7.36%, 2.09%),             0us (0.00%, 0.00%),        195.63ms (7.36%, 2.09%),    291|291
RandomUniform                  343.65MB (78.49%, 8.48%),         74.13ms (5.27%, 0.79%),             0us (0.00%, 0.00%),         74.13ms (5.27%, 0.79%),    159|159
Mul                           753.12MB (70.01%, 18.59%),         50.08ms (4.47%, 0.54%),             0us (0.00%, 0.00%),         50.08ms (4.47%, 0.54%),    652|812
ResourceApplyAdam               10.78MB (51.42%, 0.27%),         37.84ms (3.94%, 0.41%),             0us (0.00%, 0.00%),         37.84ms (3.94%, 0.41%),    201|201
Zantares commented 5 years ago

Found the reason: the Intel MKL-DNN library has a performance issue.