tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Training won't start #1668

Open Victor-Almeida opened 5 years ago

Victor-Almeida commented 5 years ago

Description

Hello.

I'm trying to run tensor2tensor on Google Colab in a GPU environment, but training hangs right after the log reports that the dynamic library libcublas was opened.

!pip3 install --upgrade tensorflow-gpu
!pip3 install --upgrade tensor2tensor
!pip3 install pydub
!apt -qq install -y ffmpeg
!apt -qq install -y sox

from google.colab import drive
drive.mount('/content/gdrive/')

!t2t-trainer \
    --tmp_dir='/content/gdrive/My Drive/TCC/T2T LibriSpeech/tmp' \
    --problem='librispeech_clean_small' \
    --model='lstm_seq2seq' \
    --train_steps=100 \
    --hparams_set='lstm_seq2seq' \
    --data_dir='/content/gdrive/My Drive/TCC/T2T LibriSpeech/data/' \
    --output_dir='/content/gdrive/My Drive/TCC/T2T LibriSpeech/output' \
    --hparams="optimizer = rms_prop, learning_rate_schedule = rsqrt_decay" \
    --worker_gpu=1

Here's the terminal output:

WARNING: Logging before flag parsing goes to stderr.
W0821 05:27:27.318480 139658590254976 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/expert_utils.py:68: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

W0821 05:27:28.270877 139658590254976 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0821 05:27:29.911124 139658590254976 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/adafactor.py:27: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0821 05:27:29.911762 139658590254976 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/multistep_optimizer.py:32: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

W0821 05:27:29.923251 139658590254976 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/mesh_tensorflow/ops.py:4237: The name tf.train.CheckpointSaverListener is deprecated. Please use tf.estimator.CheckpointSaverListener instead.

W0821 05:27:29.923428 139658590254976 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/mesh_tensorflow/ops.py:4260: The name tf.train.SessionRunHook is deprecated. Please use tf.estimator.SessionRunHook instead.

W0821 05:27:29.952985 139658590254976 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/rl/gym_utils.py:219: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

W0821 05:27:29.984675 139658590254976 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/trainer_lib.py:109: The name tf.OptimizerOptions is deprecated. Please use tf.compat.v1.OptimizerOptions instead.

W0821 05:27:30.376930 139658590254976 deprecation_wrapper.py:119] From /usr/local/bin/t2t-trainer:32: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

W0821 05:27:30.377137 139658590254976 deprecation_wrapper.py:119] From /usr/local/bin/t2t-trainer:32: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

W0821 05:27:30.377253 139658590254976 deprecation_wrapper.py:119] From /usr/local/bin/t2t-trainer:33: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.

W0821 05:27:30.378021 139658590254976 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/hparams_lib.py:49: The name tf.gfile.Exists is deprecated. Please use tf.io.gfile.exists instead.

I0821 05:27:30.379424 139658590254976 hparams_lib.py:64] Loading hparams from existing json /content/gdrive/My Drive/TCC/T2T LibriSpeech/output/hparams.json
W0821 05:27:30.379606 139658590254976 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/hparams_lib.py:65: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.

I0821 05:27:30.381827 139658590254976 hparams_lib.py:85] Overwrite key batch_size: 1024 -> 1000
I0821 05:27:30.381956 139658590254976 hparams_lib.py:85] Overwrite key learning_rate_schedule: legacy -> rsqrt_decay
I0821 05:27:30.382053 139658590254976 hparams_lib.py:85] Overwrite key optimizer: adam -> rms_prop
I0821 05:27:30.382306 139658590254976 hparams_lib.py:55] Overriding hparams in lstm_seq2seq with optimizer = rms_prop, learning_rate_schedule = rsqrt_decay
W0821 05:27:30.382661 139658590254976 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/trainer_lib.py:780: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

W0821 05:27:30.383641 139658590254976 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/trainer_lib.py:121: The name tf.GraphOptions is deprecated. Please use tf.compat.v1.GraphOptions instead.

W0821 05:27:30.383837 139658590254976 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/trainer_lib.py:127: The name tf.GPUOptions is deprecated. Please use tf.compat.v1.GPUOptions instead.

W0821 05:27:30.384021 139658590254976 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/trainer_lib.py:240: RunConfig.__init__ (from tensorflow.contrib.learn.python.learn.estimators.run_config) is deprecated and will be removed in a future version.
Instructions for updating:
When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead.
I0821 05:27:30.384210 139658590254976 trainer_lib.py:263] Configuring DataParallelism to replicate the model.
I0821 05:27:30.384292 139658590254976 devices.py:76] schedule=continuous_train_and_eval
I0821 05:27:30.384358 139658590254976 devices.py:77] worker_gpu=1
I0821 05:27:30.384418 139658590254976 devices.py:78] sync=False
W0821 05:27:30.384511 139658590254976 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/devices.py:139: The name tf.logging.warn is deprecated. Please use tf.compat.v1.logging.warn instead.

W0821 05:27:30.384588 139658590254976 devices.py:141] Schedule=continuous_train_and_eval. Assuming that training is running on a single machine.
I0821 05:27:30.385225 139658590254976 devices.py:170] datashard_devices: ['gpu:0']
I0821 05:27:30.385297 139658590254976 devices.py:171] caching_devices: None
I0821 05:27:30.385771 139658590254976 devices.py:172] ps_devices: ['gpu:0']
I0821 05:27:30.386448 139658590254976 estimator.py:209] Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f04606f64a8>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': None, '_log_step_count_steps': 100, '_protocol': None, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.95
}
allow_soft_placement: true
graph_options {
  optimizer_options {
    global_jit_level: OFF
  }
}
isolate_session_state: true
, '_save_checkpoints_steps': 1000, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/content/gdrive/My Drive/TCC/T2T LibriSpeech/output', 'use_tpu': False, 't2t_device_info': {'num_async_replicas': 1}, 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7f04606f6518>}
W0821 05:27:30.386659 139658590254976 model_fn.py:630] Estimator's model_fn (<function T2TModel.make_estimator_model_fn.<locals>.wrapping_model_fn at 0x7f046076bd08>) includes params argument, but params are not passed to Estimator.
W0821 05:27:30.387291 139658590254976 trainer_lib.py:724] ValidationMonitor only works with --schedule=train_and_evaluate
I0821 05:27:30.399102 139658590254976 estimator_training.py:186] Not using Distribute Coordinator.
I0821 05:27:30.399344 139658590254976 training.py:612] Running training and evaluation locally (non-distributed).
I0821 05:27:30.399628 139658590254976 training.py:700] Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 1000 or save_checkpoints_secs None.
W0821 05:27:30.411955 139658590254976 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
I0821 05:27:30.422870 139658590254976 problem.py:644] Reading data files from /content/gdrive/My Drive/TCC/T2T LibriSpeech/data/librispeech_clean_small-train*
I0821 05:27:30.446713 139658590254976 problem.py:670] partition: 0 num_data_files: 100
W0821 05:27:30.448990 139658590254976 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/data_generators/problem.py:680: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
W0821 05:27:30.634850 139658590254976 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/layers/common_audio.py:92: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W0821 05:27:30.758503 139658590254976 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/layers/common_audio.py:115: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W0821 05:27:30.951467 139658590254976 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/data_reader.py:275: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
W0821 05:27:31.428636 139658590254976 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/data_reader.py:395: DatasetV1.output_shapes (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(dataset)`.
W0821 05:27:31.429009 139658590254976 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/data_reader.py:398: The name tf.logging.warning is deprecated. Please use tf.compat.v1.logging.warning instead.

W0821 05:27:31.429161 139658590254976 data_reader.py:399] Shapes are not fully defined. Assuming batch_size means tokens.
W0821 05:27:31.484967 139658590254976 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/experimental/ops/grouping.py:193: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0821 05:27:31.532430 139658590254976 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/data_reader.py:231: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

I0821 05:27:31.598924 139658590254976 estimator.py:1145] Calling model_fn.
I0821 05:27:31.611495 139658590254976 t2t_model.py:2172] Setting T2TModel mode to 'train'
W0821 05:27:31.684887 139658590254976 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py:243: The name tf.summary.text is deprecated. Please use tf.compat.v1.summary.text instead.

I0821 05:27:32.464440 139658590254976 api.py:255] Using variable initializer: uniform_unit_scaling
I0821 05:27:33.088927 139658590254976 t2t_model.py:2172] Transforming feature 'inputs' with speech_recognition_modality.bottom
W0821 05:27:33.090807 139658590254976 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/layers/modalities.py:439: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.keras.layers.Conv2D` instead.
I0821 05:27:33.388660 139658590254976 t2t_model.py:2172] Transforming feature 'targets' with symbol_modality_256_128.targets_bottom
I0821 05:27:33.406166 139658590254976 t2t_model.py:2172] Building model body
W0821 05:27:33.421513 139658590254976 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/models/lstm.py:33: The name tf.nn.rnn_cell.DropoutWrapper is deprecated. Please use tf.compat.v1.nn.rnn_cell.DropoutWrapper instead.

W0821 05:27:33.421710 139658590254976 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/models/lstm.py:34: LSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0.
W0821 05:27:33.432483 139658590254976 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/models/lstm.py:62: MultiRNNCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.StackedRNNCells, and will be replaced by that in Tensorflow 2.0.
W0821 05:27:33.432950 139658590254976 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/models/lstm.py:67: dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
W0821 05:27:33.783363 139658590254976 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/rnn_cell_impl.py:961: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
I0821 05:27:34.804927 139658590254976 t2t_model.py:2172] Transforming body output with symbol_modality_256_128.top
W0821 05:27:34.924930 139658590254976 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/learning_rate.py:107: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

I0821 05:27:34.932349 139658590254976 optimize.py:327] Trainable Variables Total size: 1677440
I0821 05:27:34.932615 139658590254976 optimize.py:327] Non-trainable variables Total size: 5
I0821 05:27:34.932751 139658590254976 optimize.py:182] Using optimizer rms_prop
I0821 05:27:34.934037 139658590254976 optimize.py:78] Clipping gradients, norm: 2.00000
W0821 05:27:36.683928 139658590254976 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/rmsprop.py:119: calling Ones.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
I0821 05:27:36.894444 139658590254976 estimator.py:1147] Done calling model_fn.
I0821 05:27:36.896018 139658590254976 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
I0821 05:27:37.520785 139658590254976 monitored_session.py:240] Graph was finalized.
2019-08-21 05:27:37.521225: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-21 05:27:37.526665: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-08-21 05:27:37.690962: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-21 05:27:37.691522: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x13459c0 executing computations on platform CUDA. Devices:
2019-08-21 05:27:37.691557: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2019-08-21 05:27:37.693788: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200000000 Hz
2019-08-21 05:27:37.694004: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1344a00 executing computations on platform Host. Devices:
2019-08-21 05:27:37.694041: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-08-21 05:27:37.694265: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-21 05:27:37.694626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
2019-08-21 05:27:37.695007: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-08-21 05:27:37.696442: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-08-21 05:27:37.697717: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-08-21 05:27:37.698160: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-08-21 05:27:37.699885: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-08-21 05:27:37.701143: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-08-21 05:27:37.704909: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-08-21 05:27:37.705064: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-21 05:27:37.705594: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-21 05:27:37.706305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-08-21 05:27:37.706401: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-08-21 05:27:37.707871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-21 05:27:37.707903: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-08-21 05:27:37.707917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2019-08-21 05:27:37.708211: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-21 05:27:37.708615: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-08-21 05:27:37.708965: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:40] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2019-08-21 05:27:37.709011: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14325 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
W0821 05:27:37.711665 139658590254976 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
I0821 05:27:37.713894 139658590254976 saver.py:1280] Restoring parameters from /content/gdrive/My Drive/TCC/T2T LibriSpeech/output/model.ckpt-0
W0821 05:27:38.503897 139658590254976 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1066: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
2019-08-21 05:27:38.591958: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
I0821 05:27:38.599505 139658590254976 session_manager.py:500] Running local_init_op.
I0821 05:27:38.638172 139658590254976 session_manager.py:502] Done running local_init_op.
I0821 05:27:40.499942 139658590254976 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /content/gdrive/My Drive/TCC/T2T LibriSpeech/output/model.ckpt.
2019-08-21 05:27:42.511778: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0

Any ideas on how to solve it?

lukaszkaiser commented 5 years ago

I think the best idea is to report this on the TF and Google Colab lists, as it does not look like an error specific to T2T.
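
When preparing such a report, it may help to strip the deprecation noise from the log so that the last informative line before the stall stands out. The following is only a minimal sketch, assuming the output has been saved as text. The names `condense_log`, `NOISY_FILES`, and `SAMPLE_LOG` are hypothetical and are not part of T2T or TensorFlow, and the regular expressions are inferred from the line layout visible in the output above rather than from any documented format.

```python
import re

# Header of a Python-side log line as it appears above, e.g.
# "W0821 05:27:30.634850 139658590254976 deprecation.py:323] ..."
ABSL_HEADER = re.compile(r"^[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ (\S+):\d+\]")
# Header of a C++-side log line, e.g.
# "2019-08-21 05:27:42.511778: I tensorflow/..."
CPP_HEADER = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+: [IWEF] ")
# Assumption: messages emitted from these files are deprecation notices only.
NOISY_FILES = {"deprecation.py", "deprecation_wrapper.py"}


def condense_log(text):
    """Return the non-blank log lines, minus deprecation notices.

    Continuation lines (such as "Instructions for updating:") are dropped
    together with the deprecation header line they follow.
    """
    kept = []
    skipping = False
    for line in text.splitlines():
        if not line.strip():
            continue  # drop blank separator lines
        absl = ABSL_HEADER.match(line)
        if absl:
            # A new Python-side record starts; skip it if it is a deprecation notice.
            skipping = absl.group(1) in NOISY_FILES
        elif CPP_HEADER.match(line):
            # A new C++-side record starts; these are always kept.
            skipping = False
        if not skipping:
            kept.append(line)
    return kept


# A few lines copied from the output above.
SAMPLE_LOG = """\
W0821 05:27:30.634850 139658590254976 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensor2tensor/layers/common_audio.py:92: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
I0821 05:27:31.598924 139658590254976 estimator.py:1145] Calling model_fn.
2019-08-21 05:27:42.511778: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
"""

if __name__ == "__main__":
    for line in condense_log(SAMPLE_LOG):
        print(line)
```

The last line returned by `condense_log` is the final message logged before the run stopped making progress, which is probably the most useful detail to quote in a report.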