tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.

[Question] Low TPU Utilization for transformer_big_enfr_tpu #1205

Closed: Boffee closed this issue 5 years ago

Boffee commented 5 years ago

Description

I tried training a transformer model on translate_enfr_wmt32k_packed on a single TPU-v2 (8 cores) using the default transformer_big_enfr_tpu hparams set and noticed that the TPU is heavily underutilized. The cloud_tpu_profiler tool reported only 18% TPU utilization because 69% of the time was spent on OutfeedEnqueueTuple operations. Looking through the trace viewer, OutfeedEnqueueTuple appears to mark some sort of transition from the loss to the gradient computation, since it is always preceded by a cross-entropy op and followed by a softmax gradient op.

How can I reduce the time spent in OutfeedEnqueueTuple, and what is it actually doing that takes so long? This is the only documentation I was able to find on it: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/tpu/ops/outfeed_ops.cc#L67.

Environment information

Distributor ID: Ubuntu
Description:    Ubuntu 16.04.5 LTS
Release:    16.04
Codename:   xenial
Base Image: tensorflow/tensorflow:1.11.0-py3
GKE Version:    1.11.2-gke.9
Machine Type:   n1-standard-8
TPU Version:    v2

$ pip freeze | grep tensor
tensor2tensor==1.9.0
tensorboard==1.11.0
tensorflow==1.11.0

$ python -V
Python 2.7.12

$ python3 -V
Python 3.5.2

Steps to reproduce:

t2t-trainer \
  --model=transformer \
  --problem=translate_enfr_wmt32k_packed \
  --hparams_set=transformer_big_enfr_tpu \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR \
  --use_tpu=True \
  --cloud_tpu_name=$KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS

capture_tpu_profile \
  --service_addr $KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS \
  --logdir $OUT_DIR \
  --duration_ms 60000

Logs:

Input pipeline analysis: [screenshot]

TPU utilization: [screenshot]

Full trace: [screenshot]

Op before OutfeedEnqueue: [screenshot]

Op after OutfeedEnqueue: [screenshot]

Boffee commented 5 years ago

Looks like this was caused by the host_call function in the TPUEstimator, which was always enabled in tensor2tensor version 1.9.0 and now defaults to disabled in version 1.10.0.

From my basic understanding, the outfeed queue is used to store the output of XLA-compiled graphs so that it can be accessed by other ops and XLA graphs. The host_call function is slow because it copies data from the TPU to the host machine (CPU) on every iteration, which is very expensive.
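For illustration, here is a minimal sketch of how host_call is wired into a TF 1.x TPUEstimatorSpec; this is not tensor2tensor's actual code, and the model_fn, host_call_fn, and log directory below are hypothetical. Every training step, the tensors listed in host_call are enqueued on the TPU outfeed (the OutfeedEnqueueTuple ops in the profile) and dequeued on the host to write summaries; passing host_call=None keeps the step entirely on the TPU.

import tensorflow as tf

def model_fn(features, labels, mode, params):
  # Toy linear model, just enough to produce a loss and a train_op.
  logits = tf.layers.dense(features["inputs"], 2)
  loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
  optimizer = tf.contrib.tpu.CrossShardOptimizer(tf.train.AdamOptimizer(1e-3))
  train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())

  def host_call_fn(gs, loss_t):
    # Runs on the host CPU with tensors dequeued from the TPU outfeed.
    gs = gs[0]
    with tf.contrib.summary.create_file_writer(
        "/tmp/host_call_demo").as_default():  # hypothetical log dir
      with tf.contrib.summary.always_record_summaries():
        tf.contrib.summary.scalar("loss", loss_t[0], step=gs)
        return tf.contrib.summary.all_summary_ops()

  # host_call tensors must be at least rank 1 so they can be concatenated
  # across TPU cores before being handed to host_call_fn.
  gs_t = tf.reshape(tf.train.get_global_step(), [1])
  loss_t = tf.reshape(loss, [1])

  return tf.contrib.tpu.TPUEstimatorSpec(
      mode=mode,
      loss=loss,
      train_op=train_op,
      # Pass host_call=None (the 1.10.0 default behavior described above)
      # to avoid the per-step TPU -> host copy and its OutfeedEnqueueTuple cost.
      host_call=(host_call_fn, [gs_t, loss_t]))

The trade-off is losing per-step training summaries on the host; dropping host_call (or calling it less frequently) removes the per-iteration TPU-to-CPU transfer that was dominating the step time here.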