Decoding with transformer_moe hangs before it starts on libcudnn.so.7

Description

Decoding hangs right after the log line "Successfully opened dynamic library libcudnn.so.7". This happens even on a fresh Azure VM instance with all the requirements newly installed. I am using CUDA 10.1 and a Tesla K80 GPU. I reduced batch_size to 1: on CPU, decoding takes a couple of minutes, but on GPU it seems to run indefinitely (I waited at least an hour and a half, no longer).
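To bound the wait and record where the run stalls, something like the following could wrap the decode command. This is only a minimal sketch using the Python standard library; `run_with_timeout` is a hypothetical helper name, not part of tensor2tensor, and it assumes the process writes its log to stderr, as the logs in this report do.

```python
import subprocess
import tempfile


def run_with_timeout(cmd, timeout_s):
    """Run cmd, kill it after timeout_s seconds, and report where it stopped.

    Returns (timed_out, last_line), where last_line is the final non-empty
    line the process wrote to stderr before it exited or was killed.
    """
    # stderr goes to a temporary file so the partial log survives a kill.
    with tempfile.TemporaryFile(mode="w+") as log:
        proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=log)
        try:
            proc.wait(timeout=timeout_s)
            timed_out = False
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()  # reap the killed process
            timed_out = True
        log.seek(0)
        lines = [line.strip() for line in log if line.strip()]
    return timed_out, (lines[-1] if lines else "")
```

Calling it with the t2t-decoder argument list from the reproduction steps and a limit of, say, 600 seconds would return the last log line seen before the cut-off, which makes it easier to compare CPU and GPU runs.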

Environment information

OS: Ubuntu 18.04

$ pip freeze | grep tensor

mesh-tensorflow==0.0.5
tensor2tensor==1.13.4
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-datasets==1.2.0
tensorflow-estimator==1.14.0
tensorflow-metadata==0.14.0
tensorflow-probability==0.7.0

$ python -V

Python 3.7.3

For bugs: reproduction and error logs

# Steps to reproduce:

CUDA_VISIBLE_DEVICES=0 t2t-decoder \
  --t2t_usr_dir=$USR_DIR \
  --data_dir=$DATA_DIR \
  --problem=mimic_discharge_summaries \
  --model=transformer_moe \
  --hparams_set=transformer_moe_base \
  --output_dir=$OUTPUT_DIR \
  --decode_hparams="beam_size=3,alpha=0.6,batch_size=1" \
  --decode_from_file=$DIR/src-test.txt \
  --decode_to_file=output.txt &

# Error logs:

WARNING: Logging before flag parsing goes to stderr.
W0821 20:58:49.202146 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/expert_utils.py:68: The name tf.variable_scope is deprecated. Please\
 use tf.compat.v1.variable_scope instead.

W0821 20:58:49.964024 140600132396864 lazy_loader.py:50]
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0821 20:58:51.827588 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/adafactor.py:27: The name tf.train.Optimizer is deprecated. Please u\
se tf.compat.v1.train.Optimizer instead.

W0821 20:58:51.827997 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/multistep_optimizer.py:32: The name tf.train.AdamOptimizer is deprec\
ated. Please use tf.compat.v1.train.AdamOptimizer instead.

W0821 20:58:51.838359 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/mesh_tensorflow/ops.py:4237: The name tf.train.CheckpointSaverListener is deprecated. Pl\
ease use tf.estimator.CheckpointSaverListener instead.

W0821 20:58:51.838499 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/mesh_tensorflow/ops.py:4260: The name tf.train.SessionRunHook is deprecated. Please use \
tf.estimator.SessionRunHook instead.

W0821 20:58:51.867746 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/rl/gym_utils.py:219: The name tf.logging.info is deprecated. Please use tf\
.compat.v1.logging.info instead.

W0821 20:58:51.894030 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/trainer_lib.py:109: The name tf.OptimizerOptions is deprecated. Plea\
se use tf.compat.v1.OptimizerOptions instead.

W0821 20:58:52.270897 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/bin/t2t-decoder:16: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity i\
nstead.

W0821 20:58:52.271061 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/bin/t2t-decoder:16: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

W0821 20:58:52.271208 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/bin/t2t-decoder:17: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.

W0821 20:58:52.271745 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/trainer_lib.py:780: The name tf.set_random_seed is deprecated. Pleas\
e use tf.compat.v1.set_random_seed instead.

I0821 20:58:52.272333 140600132396864 usr_dir.py:43] Importing user module transformer_moe from path /home/aa5118/project/text-generation
W0821 20:58:52.273171 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/data_generators/text_encoder.py:938: The name tf.gfile.Exists is deprecate\
d. Please use tf.io.gfile.exists instead.

W0821 20:58:52.273323 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/data_generators/text_encoder.py:940: The name tf.gfile.Open is deprecated.\
 Please use tf.io.gfile.GFile instead.

W0821 20:58:52.382671 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/trainer_lib.py:121: The name tf.GraphOptions is deprecated. Please u\
se tf.compat.v1.GraphOptions instead.

W0821 20:58:52.382863 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/trainer_lib.py:127: The name tf.GPUOptions is deprecated. Please use\
 tf.compat.v1.GPUOptions instead.

W0821 20:58:52.383047 140600132396864 deprecation.py:323] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/trainer_lib.py:240: RunConfig.__init__ (from tensorflow.contrib.learn.python\
.learn.estimators.run_config) is deprecated and will be removed in a future version.
Instructions for updating:
When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead.
I0821 20:58:52.383207 140600132396864 trainer_lib.py:263] Configuring DataParallelism to replicate the model.
I0821 20:58:52.383281 140600132396864 devices.py:76] schedule=continuous_train_and_eval
I0821 20:58:52.383338 140600132396864 devices.py:77] worker_gpu=1
I0821 20:58:52.383389 140600132396864 devices.py:78] sync=False
W0821 20:58:52.383465 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/devices.py:139: The name tf.logging.warn is deprecated. Please use t\
f.compat.v1.logging.warn instead.

W0821 20:58:52.383523 140600132396864 devices.py:141] Schedule=continuous_train_and_eval. Assuming that training is running on a single machine.
I0821 20:58:52.383700 140600132396864 devices.py:170] datashard_devices: ['gpu:0']
I0821 20:58:52.383759 140600132396864 devices.py:171] caching_devices: None
I0821 20:58:52.383882 140600132396864 devices.py:172] ps_devices: ['gpu:0']
I0821 20:58:52.384373 140600132396864 estimator.py:209] Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fdf92277cc0>, '_master'\
: '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay\
_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': None, '_log_step_count_steps': 100, '_protocol': None, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.95
}
allow_soft_placement: true
graph_options {
  optimizer_options {
    global_jit_level: OFF
  }
}
isolate_session_state: true
, '_save_checkpoints_steps': 1000, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '../data/t2t_experiments/transformer_moe/full_context/data', 'use_tpu': False, 't2t_device_i\
nfo': {'num_async_replicas': 1}, 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7fdf92277e48>}
W0821 20:58:52.384535 140600132396864 model_fn.py:630] Estimator's model_fn (<function T2TModel.make_estimator_model_fn.<locals>.wrapping_model_fn at 0x7fdf92c0ae18>) includes params argument, but params are no\
t passed to Estimator.
I0821 20:58:52.384735 140600132396864 decoding.py:415] Performing decoding from file (../data/preprocessed/src-test.txt).
I0821 20:58:52.384802 140600132396864 decoding.py:860] Getting sorted inputs
I0821 20:58:52.535063 140600132396864 estimator.py:612] Could not find trained model in model_dir: ../data/t2t_experiments/transformer_moe/full_context/data, running initialization to predict.
I0821 20:58:52.539939 140600132396864 decoding.py:673]  batch 5727
I0821 20:58:52.540030 140600132396864 decoding.py:675] Decoding batch 0
W0821 20:58:52.551722 140600132396864 deprecation.py:323] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/decoding.py:617: py_func (from tensorflow.python.ops.script_ops) is deprecat\
ed and will be removed in a future version.

Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two
    options available in V2.
    - tf.py_function takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
    - tf.numpy_function maintains the semantics of the deprecated tf.py_func
    (it is not differentiable, and manipulates numpy arrays). It drops the
    stateful argument making all functions stateful.

W0821 20:58:52.555120 140600132396864 deprecation.py:323] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/decoding.py:950: to_int32 (from tensorflow.python.ops.math_ops) is deprecate\
d and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W0821 20:58:52.560892 140600132396864 estimator.py:1000] Input graph does not use tf.data.Dataset or contain a QueueRunner. That means predict yields forever. This is probably a mistake.
I0821 20:58:52.561223 140600132396864 estimator.py:1145] Calling model_fn.
I0821 20:58:52.562110 140600132396864 t2t_model.py:2172] Setting T2TModel mode to 'infer'
I0821 20:58:52.562382 140600132396864 t2t_model.py:2172] Setting hparams.dropout to 0.0
I0821 20:58:52.562463 140600132396864 t2t_model.py:2172] Setting hparams.label_smoothing to 0.0
I0821 20:58:52.562537 140600132396864 t2t_model.py:2172] Setting hparams.layer_prepostprocess_dropout to 0.0
I0821 20:58:52.562601 140600132396864 t2t_model.py:2172] Setting hparams.symbol_dropout to 0.0
I0821 20:58:52.562671 140600132396864 t2t_model.py:2172] Setting hparams.attention_dropout to 0.0
I0821 20:58:52.562731 140600132396864 t2t_model.py:2172] Setting hparams.relu_dropout to 0.0
W0821 20:58:52.626073 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/t2t_model.py:243: The name tf.summary.text is deprecated. Please use\
 tf.compat.v1.summary.text instead.

I0821 20:58:52.789595 140600132396864 t2t_model.py:2172] Beam Decoding with beam size 3
W0821 20:58:52.852554 140600132396864 deprecation.py:323] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/beam_search.py:744: to_float (from tensorflow.python.ops.math_ops) is deprec\
ated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
I0821 20:58:53.508420 140600132396864 api.py:255] Using variable initializer: uniform_unit_scaling
W0821 20:58:53.544848 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensorflow/python/autograph/converters/directives.py:117: The name tf.summary.scalar is \
deprecated. Please use tf.compat.v1.summary.scalar instead.

I0821 20:58:53.820600 140600132396864 t2t_model.py:2172] Transforming feature 'inputs' with symbol_modality_32895_512.bottom
I0821 20:58:53.841698 140600132396864 t2t_model.py:2172] Transforming feature 'targets' with symbol_modality_32895_512.targets_bottom
W0821 20:58:53.926082 140600132396864 deprecation.py:506] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/models/research/transformer_moe.py:194: calling dropout (from tensorflow.python.op\
s.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
W0821 20:58:54.023731 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/layers/common_layers.py:3106: The name tf.layers.Dense is deprecated. Plea\
se use tf.compat.v1.layers.Dense instead.

W0821 20:58:54.594940 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/layers/common_layers.py:556: The name tf.layers.Conv2D is deprecated. Plea\
se use tf.compat.v1.layers.Conv2D instead.

I0821 20:58:58.462035 140600132396864 t2t_model.py:2172] Transforming body output with symbol_modality_32895_512.top
W0821 20:58:58.584813 140600132396864 deprecation.py:323] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:2403: add_dispatch_support.<locals>.wrapper (from tensorflow.p\
ython.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0821 20:58:58.689076 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/t2t_model.py:1734: The name tf.saved_model.signature_constants.DEFAU\
LT_SERVING_SIGNATURE_DEF_KEY is deprecated. Please use tf.saved_model.DEFAULT_SERVING_SIGNATURE_DEF_KEY instead.

I0821 20:58:58.689570 140600132396864 estimator.py:1147] Done calling model_fn.
I0821 20:58:59.073214 140600132396864 monitored_session.py:240] Graph was finalized.
2019-08-21 20:58:59.073555: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-08-21 20:58:59.082555: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2596990000 Hz
2019-08-21 20:58:59.085153: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d8a72a1be0 executing computations on platform Host. Devices:
2019-08-21 20:58:59.085179: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-08-21 20:58:59.087430: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-08-21 20:59:05.611072: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d8a9ba6c40 executing computations on platform CUDA. Devices:
2019-08-21 20:59:05.611116: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2019-08-21 20:59:05.611986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 3a40:00:00.0
2019-08-21 20:59:05.612276: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-08-21 20:59:05.614128: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-08-21 20:59:05.615971: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2019-08-21 20:59:05.616280: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2019-08-21 20:59:05.618149: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2019-08-21 20:59:05.619244: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2019-08-21 20:59:05.623303: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-08-21 20:59:05.624801: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-08-21 20:59:05.624856: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-08-21 20:59:05.627788: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-21 20:59:05.627809: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-08-21 20:59:05.627817: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-08-21 20:59:05.629403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11596 MB memory) -> physical GPU (device: 0, nam\
e: Tesla K80, pci bus id: 3a40:00:00.0, compute capability: 3.7)
2019-08-21 20:59:06.866570: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If yo\
u want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLA\
GS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
I0821 20:59:06.885300 140600132396864 session_manager.py:500] Running local_init_op.
I0821 20:59:06.933789 140600132396864 session_manager.py:502] Done running local_init_op.
2019-08-21 20:59:08.180917: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-08-21 20:59:08.816787: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
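
One line in the log above may be worth ruling out: estimator.py reports "Could not find trained model in model_dir ... running initialization to predict", which suggests that decoding may be running with freshly initialized weights. Whether that contributes to the apparent hang is an assumption, not something confirmed here. A minimal sketch for checking the directory is below; `find_checkpoints` is a hypothetical helper, and it assumes the usual TensorFlow 1.x Estimator naming of `model.ckpt-<step>.index` files.

```python
import os
import re

# Assumed naming scheme: model.ckpt-<step>.index (TensorFlow 1.x Estimator).
_INDEX_RE = re.compile(r"model\.ckpt-(\d+)\.index")


def find_checkpoints(model_dir):
    """Return (step, prefix) pairs for checkpoints in model_dir, oldest first."""
    found = []
    if not os.path.isdir(model_dir):
        return found
    for name in os.listdir(model_dir):
        match = _INDEX_RE.fullmatch(name)
        if match:
            # The checkpoint prefix is the file name without ".index".
            prefix = os.path.join(model_dir, name[: -len(".index")])
            found.append((int(match.group(1)), prefix))
    return sorted(found)
```

An empty result for the directory passed as --output_dir would be consistent with the "Could not find trained model" message.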

tensorflow / tensor2tensor

Decoding with transformer_moe hangs before it starts on libcudnn.so.7 #1670