Closed: Boffee closed this issue 5 years ago
Looks like this was caused by the `host_call` function in the `TPUEstimator`, which was always enabled in tensor2tensor version 1.9.0 and now defaults to disabled in version 1.10.0.

From my basic understanding, the outfeed queue is used to store the output of XLA-compiled graphs so that it can be accessed by other ops and XLA graphs. The `host_call` function is slow because it copies the data from the TPU to the host machine (CPU) on every iteration, which is very expensive.
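For reference, `host_call` is the optional hook on `tf.contrib.tpu.TPUEstimatorSpec`: every tensor handed to it is enqueued on the TPU outfeed each step so the function can run on the host CPU (typically to write summaries). A minimal sketch of where it plugs in, with `build_model` and `SUMMARY_DIR` as hypothetical placeholders:

```python
# Minimal sketch (TF 1.x / tf.contrib.tpu). The tensors passed to host_call
# are copied off the TPU via the outfeed every step; omitting host_call
# avoids that per-step TPU -> CPU copy.
import tensorflow as tf

SUMMARY_DIR = '/tmp/tpu_summaries'  # placeholder


def model_fn(features, labels, mode, params):
  loss, train_op = build_model(features, labels, params)  # hypothetical helper

  def host_call_fn(global_step, loss_t):
    # Runs on the host; its arguments arrive via the outfeed dequeue.
    with tf.contrib.summary.create_file_writer(SUMMARY_DIR).as_default():
      with tf.contrib.summary.always_record_summaries():
        tf.contrib.summary.scalar('loss', loss_t[0], step=global_step[0])
        return tf.contrib.summary.all_summary_ops()

  # host_call tensors need a leading dimension, so scalars are reshaped to [1].
  gs = tf.reshape(tf.train.get_or_create_global_step(), [1])
  host_call = (host_call_fn, [gs, tf.reshape(loss, [1])])

  return tf.contrib.tpu.TPUEstimatorSpec(
      mode=mode,
      loss=loss,
      train_op=train_op,
      host_call=host_call)  # drop this kwarg to skip the per-step outfeed copy
```

With `host_call` omitted, the per-step summary tensors are no longer shipped to the host, which is why disabling it in 1.10.0 should make `OutfeedEnqueueTuple` largely disappear from the profile.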
Description
I tried training a transformer model for `translate_enfr_wmt32k_packed` on a single TPU-v2 (8 cores), using the default `transformer_big_enfr_tpu` hparams for the `transformer` model, and noticed that the TPU is heavily underutilized. The `cloud_tpu_profiler` tool reported only 18% TPU utilization because 69% of the time was spent on `OutfeedEnqueueTuple` operations. Looking through the trace_viewer, it seems that `OutfeedEnqueueTuple` is some sort of transition from loss to gradient computation, because it is always preceded by a cross entropy op and followed by a softmax gradient op.

How can I reduce the time spent in `OutfeedEnqueueTuple`, and what is it actually doing that takes so long? This is the only documentation I was able to find on it: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/tpu/ops/outfeed_ops.cc#L67.
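One way to check what `OutfeedEnqueueTuple` is actually shipping off the device is to look at the training graph that Estimator writes to the model directory (`graph.pbtxt`) and list the outfeed ops and their inputs. A small sketch, with the model_dir path as a placeholder:

```python
# List the outfeed ops in the graph Estimator dumps to <model_dir>/graph.pbtxt
# and print their inputs (the path below is a placeholder).
import tensorflow as tf
from google.protobuf import text_format

graph_def = tf.GraphDef()
with tf.gfile.GFile('/tmp/t2t_train_dir/graph.pbtxt', 'r') as f:
  text_format.Parse(f.read(), graph_def)

for node in graph_def.node:
  if 'Outfeed' in node.op:
    print(node.op, node.name)
    for inp in node.input:
      print('  input:', inp)
```

If the slowdown really is the `host_call`, the enqueued inputs should line up with the summary tensors (loss, learning rate, etc.) rather than anything in the forward or backward pass itself.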
Environment information

Steps to reproduce:
Logs:
Input pipeline analysis:
TPU Utilization:
Full trace:
Op before OutfeedEnqueue:
Op after OutfeedEnqueue: