tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Decoding with transformer_moe hangs before it starts on libcudnn.so.7 #1670

Closed amin-nejad closed 3 years ago

amin-nejad commented 5 years ago

Description

Decoding hangs right after the log line "Successfully opened dynamic library libcudnn.so.7". This happens even on a fresh VM instance (Azure) with all the requirements newly installed, using CUDA 10.1 and a Tesla K80 GPU. I reduced batch_size to 1: decoding then takes a couple of minutes on CPU, but on GPU it appears to run indefinitely (I waited at least an hour and a half, no longer).
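
One way to tell a genuine hang from a slow start is to run the command under a hard timeout. The following is a minimal sketch using only the Python standard library; `run_with_timeout` is a hypothetical helper name, and the stand-in command at the bottom is an assumption that would be replaced by the real t2t-decoder invocation.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, returncode).

    finished is False if the process was still running after timeout_s
    seconds; subprocess.run kills the child in that case.
    """
    try:
        proc = subprocess.run(cmd, timeout=timeout_s)
        return True, proc.returncode
    except subprocess.TimeoutExpired:
        return False, None


# Stand-in command (assumption): swap in the t2t-decoder argument list.
finished, code = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=30)
print("finished" if finished else "timed out", code)
```

Running the same command once with CUDA_VISIBLE_DEVICES set to the GPU and once with it empty would show whether only the GPU run exceeds the timeout.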

Environment information

OS: Ubuntu 18.04

$ pip freeze | grep tensor

mesh-tensorflow==0.0.5
tensor2tensor==1.13.4
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-datasets==1.2.0
tensorflow-estimator==1.14.0
tensorflow-metadata==0.14.0
tensorflow-probability==0.7.0

$ python -V

Python 3.7.3
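
To compare another machine against the package versions listed above, the `pip freeze` output can be parsed and diffed. This is a minimal sketch; `parse_freeze` and `diff_versions` are hypothetical helper names and not part of tensor2tensor or pip.

```python
def parse_freeze(text):
    """Map package name -> version for every 'name==version' line."""
    versions = {}
    for line in text.splitlines():
        name, sep, version = line.strip().partition("==")
        if sep:  # skip blank lines and anything not pinned with '=='
            versions[name.lower()] = version
    return versions


def diff_versions(reported, local):
    """Return {package: (reported_version, local_version)} where they differ."""
    return {
        name: (version, local.get(name))
        for name, version in reported.items()
        if local.get(name) != version
    }


# Versions copied from the report above.
REPORTED = parse_freeze("""\
mesh-tensorflow==0.0.5
tensor2tensor==1.13.4
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-datasets==1.2.0
tensorflow-estimator==1.14.0
tensorflow-metadata==0.14.0
tensorflow-probability==0.7.0
""")
```

Passing the parsed output of a local `pip freeze` as the second argument of `diff_versions` lists only the packages whose versions differ from the report.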

For bugs: reproduction and error logs

# Steps to reproduce:

CUDA_VISIBLE_DEVICES=0 t2t-decoder \
  --t2t_usr_dir=$USR_DIR \
  --data_dir=$DATA_DIR \
  --problem=mimic_discharge_summaries \
  --model=transformer_moe \
  --hparams_set=transformer_moe_base \
  --output_dir=$OUTPUT_DIR \
  --decode_hparams="beam_size=3,alpha=0.6,batch_size=1" \
  --decode_from_file=$DIR/src-test.txt \
  --decode_to_file=output.txt &

# Error logs:

WARNING: Logging before flag parsing goes to stderr.
W0821 20:58:49.202146 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/expert_utils.py:68: The name tf.variable_scope is deprecated. Please\
 use tf.compat.v1.variable_scope instead.

W0821 20:58:49.964024 140600132396864 lazy_loader.py:50]
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0821 20:58:51.827588 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/adafactor.py:27: The name tf.train.Optimizer is deprecated. Please u\
se tf.compat.v1.train.Optimizer instead.

W0821 20:58:51.827997 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/multistep_optimizer.py:32: The name tf.train.AdamOptimizer is deprec\
ated. Please use tf.compat.v1.train.AdamOptimizer instead.

W0821 20:58:51.838359 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/mesh_tensorflow/ops.py:4237: The name tf.train.CheckpointSaverListener is deprecated. Pl\
ease use tf.estimator.CheckpointSaverListener instead.

W0821 20:58:51.838499 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/mesh_tensorflow/ops.py:4260: The name tf.train.SessionRunHook is deprecated. Please use \
tf.estimator.SessionRunHook instead.

W0821 20:58:51.867746 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/rl/gym_utils.py:219: The name tf.logging.info is deprecated. Please use tf\
.compat.v1.logging.info instead.

W0821 20:58:51.894030 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/trainer_lib.py:109: The name tf.OptimizerOptions is deprecated. Plea\
se use tf.compat.v1.OptimizerOptions instead.

W0821 20:58:52.270897 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/bin/t2t-decoder:16: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity i\
nstead.

W0821 20:58:52.271061 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/bin/t2t-decoder:16: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

W0821 20:58:52.271208 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/bin/t2t-decoder:17: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.

W0821 20:58:52.271745 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/trainer_lib.py:780: The name tf.set_random_seed is deprecated. Pleas\
e use tf.compat.v1.set_random_seed instead.

I0821 20:58:52.272333 140600132396864 usr_dir.py:43] Importing user module transformer_moe from path /home/aa5118/project/text-generation
W0821 20:58:52.273171 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/data_generators/text_encoder.py:938: The name tf.gfile.Exists is deprecate\
d. Please use tf.io.gfile.exists instead.

W0821 20:58:52.273323 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/data_generators/text_encoder.py:940: The name tf.gfile.Open is deprecated.\
 Please use tf.io.gfile.GFile instead.

W0821 20:58:52.382671 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/trainer_lib.py:121: The name tf.GraphOptions is deprecated. Please u\
se tf.compat.v1.GraphOptions instead.

W0821 20:58:52.382863 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/trainer_lib.py:127: The name tf.GPUOptions is deprecated. Please use\
 tf.compat.v1.GPUOptions instead.

W0821 20:58:52.383047 140600132396864 deprecation.py:323] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/trainer_lib.py:240: RunConfig.__init__ (from tensorflow.contrib.learn.python\
.learn.estimators.run_config) is deprecated and will be removed in a future version.
Instructions for updating:
When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead.
I0821 20:58:52.383207 140600132396864 trainer_lib.py:263] Configuring DataParallelism to replicate the model.
I0821 20:58:52.383281 140600132396864 devices.py:76] schedule=continuous_train_and_eval
I0821 20:58:52.383338 140600132396864 devices.py:77] worker_gpu=1
I0821 20:58:52.383389 140600132396864 devices.py:78] sync=False
W0821 20:58:52.383465 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/devices.py:139: The name tf.logging.warn is deprecated. Please use t\
f.compat.v1.logging.warn instead.

W0821 20:58:52.383523 140600132396864 devices.py:141] Schedule=continuous_train_and_eval. Assuming that training is running on a single machine.
I0821 20:58:52.383700 140600132396864 devices.py:170] datashard_devices: ['gpu:0']
I0821 20:58:52.383759 140600132396864 devices.py:171] caching_devices: None
I0821 20:58:52.383882 140600132396864 devices.py:172] ps_devices: ['gpu:0']
I0821 20:58:52.384373 140600132396864 estimator.py:209] Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fdf92277cc0>, '_master'\
: '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay\
_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': None, '_log_step_count_steps': 100, '_protocol': None, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.95
}
allow_soft_placement: true
graph_options {
  optimizer_options {
    global_jit_level: OFF
  }
}
isolate_session_state: true
, '_save_checkpoints_steps': 1000, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '../data/t2t_experiments/transformer_moe/full_context/data', 'use_tpu': False, 't2t_device_i\
nfo': {'num_async_replicas': 1}, 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7fdf92277e48>}
W0821 20:58:52.384535 140600132396864 model_fn.py:630] Estimator's model_fn (<function T2TModel.make_estimator_model_fn.<locals>.wrapping_model_fn at 0x7fdf92c0ae18>) includes params argument, but params are no\
t passed to Estimator.
I0821 20:58:52.384735 140600132396864 decoding.py:415] Performing decoding from file (../data/preprocessed/src-test.txt).
I0821 20:58:52.384802 140600132396864 decoding.py:860] Getting sorted inputs
I0821 20:58:52.535063 140600132396864 estimator.py:612] Could not find trained model in model_dir: ../data/t2t_experiments/transformer_moe/full_context/data, running initialization to predict.
I0821 20:58:52.539939 140600132396864 decoding.py:673]  batch 5727
I0821 20:58:52.540030 140600132396864 decoding.py:675] Decoding batch 0
W0821 20:58:52.551722 140600132396864 deprecation.py:323] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/decoding.py:617: py_func (from tensorflow.python.ops.script_ops) is deprecat\
ed and will be removed in a future version.

Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two
    options available in V2.
    - tf.py_function takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
    - tf.numpy_function maintains the semantics of the deprecated tf.py_func
    (it is not differentiable, and manipulates numpy arrays). It drops the
    stateful argument making all functions stateful.

W0821 20:58:52.555120 140600132396864 deprecation.py:323] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/decoding.py:950: to_int32 (from tensorflow.python.ops.math_ops) is deprecate\
d and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W0821 20:58:52.560892 140600132396864 estimator.py:1000] Input graph does not use tf.data.Dataset or contain a QueueRunner. That means predict yields forever. This is probably a mistake.
I0821 20:58:52.561223 140600132396864 estimator.py:1145] Calling model_fn.
I0821 20:58:52.562110 140600132396864 t2t_model.py:2172] Setting T2TModel mode to 'infer'
I0821 20:58:52.562382 140600132396864 t2t_model.py:2172] Setting hparams.dropout to 0.0
I0821 20:58:52.562463 140600132396864 t2t_model.py:2172] Setting hparams.label_smoothing to 0.0
I0821 20:58:52.562537 140600132396864 t2t_model.py:2172] Setting hparams.layer_prepostprocess_dropout to 0.0
I0821 20:58:52.562601 140600132396864 t2t_model.py:2172] Setting hparams.symbol_dropout to 0.0
I0821 20:58:52.562671 140600132396864 t2t_model.py:2172] Setting hparams.attention_dropout to 0.0
I0821 20:58:52.562731 140600132396864 t2t_model.py:2172] Setting hparams.relu_dropout to 0.0
W0821 20:58:52.626073 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/t2t_model.py:243: The name tf.summary.text is deprecated. Please use\
 tf.compat.v1.summary.text instead.

I0821 20:58:52.789595 140600132396864 t2t_model.py:2172] Beam Decoding with beam size 3
W0821 20:58:52.852554 140600132396864 deprecation.py:323] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/beam_search.py:744: to_float (from tensorflow.python.ops.math_ops) is deprec\
ated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
I0821 20:58:53.508420 140600132396864 api.py:255] Using variable initializer: uniform_unit_scaling
W0821 20:58:53.544848 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensorflow/python/autograph/converters/directives.py:117: The name tf.summary.scalar is \
deprecated. Please use tf.compat.v1.summary.scalar instead.

I0821 20:58:53.820600 140600132396864 t2t_model.py:2172] Transforming feature 'inputs' with symbol_modality_32895_512.bottom
I0821 20:58:53.841698 140600132396864 t2t_model.py:2172] Transforming feature 'targets' with symbol_modality_32895_512.targets_bottom
W0821 20:58:53.926082 140600132396864 deprecation.py:506] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/models/research/transformer_moe.py:194: calling dropout (from tensorflow.python.op\
s.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
W0821 20:58:54.023731 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/layers/common_layers.py:3106: The name tf.layers.Dense is deprecated. Plea\
se use tf.compat.v1.layers.Dense instead.

W0821 20:58:54.594940 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/layers/common_layers.py:556: The name tf.layers.Conv2D is deprecated. Plea\
se use tf.compat.v1.layers.Conv2D instead.

I0821 20:58:58.462035 140600132396864 t2t_model.py:2172] Transforming body output with symbol_modality_32895_512.top
W0821 20:58:58.584813 140600132396864 deprecation.py:323] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:2403: add_dispatch_support.<locals>.wrapper (from tensorflow.p\
ython.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0821 20:58:58.689076 140600132396864 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/lib/python3.7/site-packages/tensor2tensor/utils/t2t_model.py:1734: The name tf.saved_model.signature_constants.DEFAU\
LT_SERVING_SIGNATURE_DEF_KEY is deprecated. Please use tf.saved_model.DEFAULT_SERVING_SIGNATURE_DEF_KEY instead.

I0821 20:58:58.689570 140600132396864 estimator.py:1147] Done calling model_fn.
I0821 20:58:59.073214 140600132396864 monitored_session.py:240] Graph was finalized.
2019-08-21 20:58:59.073555: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-08-21 20:58:59.082555: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2596990000 Hz
2019-08-21 20:58:59.085153: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d8a72a1be0 executing computations on platform Host. Devices:
2019-08-21 20:58:59.085179: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-08-21 20:58:59.087430: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-08-21 20:59:05.611072: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55d8a9ba6c40 executing computations on platform CUDA. Devices:
2019-08-21 20:59:05.611116: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2019-08-21 20:59:05.611986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 3a40:00:00.0
2019-08-21 20:59:05.612276: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-08-21 20:59:05.614128: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-08-21 20:59:05.615971: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2019-08-21 20:59:05.616280: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2019-08-21 20:59:05.618149: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2019-08-21 20:59:05.619244: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2019-08-21 20:59:05.623303: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-08-21 20:59:05.624801: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-08-21 20:59:05.624856: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-08-21 20:59:05.627788: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-21 20:59:05.627809: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-08-21 20:59:05.627817: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-08-21 20:59:05.629403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11596 MB memory) -> physical GPU (device: 0, nam\
e: Tesla K80, pci bus id: 3a40:00:00.0, compute capability: 3.7)
2019-08-21 20:59:06.866570: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If yo\
u want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLA\
GS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
I0821 20:59:06.885300 140600132396864 session_manager.py:500] Running local_init_op.
I0821 20:59:06.933789 140600132396864 session_manager.py:502] Done running local_init_op.
2019-08-21 20:59:08.180917: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-08-21 20:59:08.816787: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
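
With a log this long, it helps to pull out the library-loading lines and the final line to see where output stops. The following is a minimal sketch, assuming the log has been saved as text; `opened_libraries` and `last_nonempty_line` are hypothetical helper names, and the sample contains only the last two lines of the log above.

```python
import re

# Matches the dso_loader message that appears in the log above.
DSO_PATTERN = re.compile(r"Successfully opened dynamic library (\S+)")


def opened_libraries(log_text):
    """Return the library names from the dso_loader lines, in log order."""
    return DSO_PATTERN.findall(log_text)


def last_nonempty_line(log_text):
    """Return the final non-blank line of the log, or None if there is none."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    return lines[-1] if lines else None


SAMPLE = """\
2019-08-21 20:59:08.180917: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-08-21 20:59:08.816787: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
"""

print(opened_libraries(SAMPLE))
print(last_nonempty_line(SAMPLE))
```

On the sample, the last line is the libcudnn.so.7 message, which matches the point where the run above stops producing output.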

cantwbr commented 5 years ago

Might be related to tensorflow/tensorflow#32017

amin-nejad commented 5 years ago

Thanks @cantwbr, possibly. I will keep an eye on it. But this run doesn't even get as far as the checkpoint.

cantwbr commented 5 years ago

@amin-nejad: You are right! I think the title of issue tensorflow/tensorflow#32017 is a bit misleading. The execution reported there actually stalls after opening libcublas, just like the execution you reported.