tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.

[Question] Low TPU Utilization for transformer_big_enfr_tpu #1205

Closed: Boffee closed this issue 5 years ago

Boffee commented 5 years ago

Description

I tried training a transformer model on translate_enfr_wmt32k_packed on a single TPU-v2 (8 cores) using the default transformer_big_enfr_tpu hparams set and noticed that the TPU is heavily underutilized. The cloud_tpu_profiler tool reported only 18% TPU utilization because 69% of the time was spent on OutfeedEnqueueTuple operations. Looking through the trace viewer, OutfeedEnqueueTuple appears to mark some sort of transition from the loss to the gradient computation, since it is always preceded by a cross-entropy op and followed by a softmax gradient op.

How can I reduce the time spent in OutfeedEnqueueTuple, and what is it actually doing that takes so long? This is the only documentation I was able to find on it: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/tpu/ops/outfeed_ops.cc#L67.

Environment information

Distributor ID: Ubuntu
Description:    Ubuntu 16.04.5 LTS
Release:    16.04
Codename:   xenial
Base Image: tensorflow/tensorflow:1.11.0-py3
GKE Version:    1.11.2-gke.9
Machine Type:   n1-standard-8
TPU Version:    v2

$ pip freeze | grep tensor
tensor2tensor==1.9.0
tensorboard==1.11.0
tensorflow==1.11.0

$ python -V
Python 2.7.12

$ python3 -V
Python 3.5.2

Steps to reproduce:

t2t-trainer \
  --model=transformer \
  --problem=translate_enfr_wmt32k_packed \
  --hparams_set=transformer_big_enfr_tpu \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR \
  --use_tpu=True \
  --cloud_tpu_name=$KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS

capture_tpu_profile \
  --service_addr $KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS \
  --logdir $OUT_DIR \
  --duration_ms 60000

Logs:

Input pipeline analysis: [screenshot]

TPU utilization: [screenshot]

Full trace: [screenshot]

Op before OutfeedEnqueue: [screenshot]

Op after OutfeedEnqueue: [screenshot]

Boffee commented 5 years ago

Looks like this was caused by the host_call function in the TPUEstimator, which was always enabled in tensor2tensor version 1.9.0 and now defaults to disabled in version 1.10.0.

From my basic understanding, the outfeed queue is used to store the output of XLA-compiled graphs so that it can be accessed by other ops and XLA graphs. The host_call function is slow because it copies data from the TPU to the host machine (CPU) on every iteration, which is very expensive.
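For illustration, here is a minimal sketch of how host_call is wired into a TF 1.x TPUEstimatorSpec; this is not tensor2tensor's actual code, and the model_fn, host_call_fn, and log directory below are hypothetical. Every training step, the tensors listed in host_call are enqueued on the TPU outfeed (the OutfeedEnqueueTuple ops in the profile) and dequeued on the host to write summaries; passing host_call=None keeps the step entirely on the TPU.

import tensorflow as tf

def model_fn(features, labels, mode, params):
  # Toy linear model, just enough to produce a loss and a train_op.
  logits = tf.layers.dense(features["inputs"], 2)
  loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
  optimizer = tf.contrib.tpu.CrossShardOptimizer(tf.train.AdamOptimizer(1e-3))
  train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())

  def host_call_fn(gs, loss_t):
    # Runs on the host CPU with tensors dequeued from the TPU outfeed.
    gs = gs[0]
    with tf.contrib.summary.create_file_writer(
        "/tmp/host_call_demo").as_default():  # hypothetical log dir
      with tf.contrib.summary.always_record_summaries():
        tf.contrib.summary.scalar("loss", loss_t[0], step=gs)
        return tf.contrib.summary.all_summary_ops()

  # host_call tensors must be at least rank 1 so they can be concatenated
  # across TPU cores before being handed to host_call_fn.
  gs_t = tf.reshape(tf.train.get_global_step(), [1])
  loss_t = tf.reshape(loss, [1])

  return tf.contrib.tpu.TPUEstimatorSpec(
      mode=mode,
      loss=loss,
      train_op=train_op,
      # Pass host_call=None (the 1.10.0 default behavior described above)
      # to avoid the per-step TPU -> host copy and its OutfeedEnqueueTuple cost.
      host_call=(host_call_fn, [gs_t, loss_t]))

The trade-off is losing per-step training summaries on the host; dropping host_call (or calling it less frequently) removes the per-iteration TPU-to-CPU transfer that was dominating the step time here.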