google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0

CUDA_ERROR_OUT_OF_MEMORY: out of memory (on a GPU) #57

Closed (danyaljj closed this issue 4 years ago)

danyaljj commented 4 years ago

Here is the full log:

(env37_t5) danielk@aristo-server1 ~ $ t5_mesh_transformer \
>   --model_dir="danielk-files/models" \
>   --t5_tfds_data_dir="danielk-files" \
>   --gin_file="dataset.gin" \
>   --gin_param="utils.run.mesh_shape = 'model:2,batch:1'" \ 
>   --gin_param="utils.run.mesh_devices = ['gpu:0', 'gpu:1']" \
>   --gin_param="MIXTURE_NAME = 'glue_mrpc_v002'" \
>   --gin_file="gs://t5-data/pretrained_models/small/operative_config.gin" \
>   --gin_param="batch_size=2"

WARNING:tensorflow:From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/tensorflow_core/python/compat/v2_compat.py:68: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2020-01-09 11:11:34.259764: W tensorflow/core/platform/cloud/google_auth_provider.cc:178] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Aborted: All 10 retry attempts failed. The 
last failure: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'".
INFO:tensorflow:model_type=bitransformer
I0109 11:11:35.254748 139976006838016 utils.py:1664] model_type=bitransformer
INFO:tensorflow:mode=train
I0109 11:11:35.254887 139976006838016 utils.py:1665] mode=train
INFO:tensorflow:sequence_length={'inputs': 512, 'targets': 512}
I0109 11:11:35.254942 139976006838016 utils.py:1666] sequence_length={'inputs': 512, 'targets': 512}
INFO:tensorflow:batch_size=2048
I0109 11:11:35.254985 139976006838016 utils.py:1667] batch_size=2048
INFO:tensorflow:train_steps=1000000000
I0109 11:11:35.255030 139976006838016 utils.py:1668] train_steps=1000000000
INFO:tensorflow:mesh_shape=model:2,batch:1
I0109 11:11:35.255067 139976006838016 utils.py:1669] mesh_shape=model:2,batch:1
INFO:tensorflow:layout_rules=ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch
I0109 11:11:35.255102 139976006838016 utils.py:1670] layout_rules=ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch
INFO:tensorflow:Building TPUConfig with tpu_job_name=None
I0109 11:11:35.255166 139976006838016 utils.py:1685] Building TPUConfig with tpu_job_name=None
INFO:tensorflow:Using config: {'_model_dir': 'danielk-files/models', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorfl
ow.python.training.server_lib.ClusterSpec object at 0x7f4debe16710>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, per_host_input_for
_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
I0109 11:11:35.257782 139976006838016 estimator.py:212] Using config: {'_model_dir': 'danielk-files/models', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorfl
ow.python.training.server_lib.ClusterSpec object at 0x7f4debe16710>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, per_host_input_for
_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
I0109 11:11:35.258051 139976006838016 tpu_context.py:220] _TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
W0109 11:11:35.258131 139976006838016 tpu_context.py:222] eval_on_tpu ignored because use_tpu is False.
WARNING:tensorflow:From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0109 11:11:35.263432 139976006838016 deprecation.py:506] From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0109 11:11:35.263689 139976006838016 deprecation.py:323] From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
I0109 11:11:35.269644 139976006838016 dataset_builder.py:193] Overwrite dataset info from restored data version.
I0109 11:11:35.373311 139976006838016 dataset_builder.py:193] Overwrite dataset info from restored data version.
I0109 11:11:35.379834 139976006838016 dataset_builder.py:273] Reusing dataset glue (danielk-files/glue/mrpc/0.0.2)
I0109 11:11:35.380300 139976006838016 dataset_builder.py:434] Constructing tf.data.Dataset for split train, from danielk-files/glue/mrpc/0.0.2
2020-01-09 11:11:35.972400: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-01-09 11:11:36.034612: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:11:36.036360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Quadro GV100 major: 7 minor: 0 memoryClockRate(GHz): 1.627
pciBusID: 0000:01:00.0
2020-01-09 11:11:36.036418: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:11:36.038354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:02:00.0
2020-01-09 11:11:36.038504: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-09 11:11:36.039339: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-01-09 11:11:36.040050: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-01-09 11:11:36.040232: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-01-09 11:11:36.041161: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-01-09 11:11:36.041870: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-01-09 11:11:36.044076: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-09 11:11:36.044185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:11:36.045984: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:11:36.047956: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:11:36.049633: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:11:36.051551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
WARNING:tensorflow:From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/mesh_tensorflow-0.1.9-py3.7.egg/mesh_tensorflow/transformer/dataset.py:513: DatasetV1.output_shapes (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(dataset)`.
W0109 11:11:37.271278 139976006838016 deprecation.py:323] From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/mesh_tensorflow-0.1.9-py3.7.egg/mesh_tensorflow/transformer/dataset.py:513: DatasetV1.output_shapes (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(dataset)`.
INFO:tensorflow:Calling model_fn.
I0109 11:11:38.479681 139976006838016 estimator.py:1148] Calling model_fn.
INFO:tensorflow:Running train on CPU
I0109 11:11:38.479841 139976006838016 tpu_estimator.py:3124] Running train on CPU
INFO:tensorflow:feature inputs : Tensor("Reshape:0", shape=(1, 2048, 512), dtype=int32)
I0109 11:11:38.480923 139976006838016 utils.py:374] feature inputs : Tensor("Reshape:0", shape=(1, 2048, 512), dtype=int32)
WARNING:tensorflow:From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/mesh_tensorflow-0.1.9-py3.7.egg/mesh_tensorflow/transformer/utils.py:376: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensur
e tf.print executes in graph mode:

W0109 11:11:38.481014 139976006838016 deprecation.py:323] From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/mesh_tensorflow-0.1.9-py3.7.egg/mesh_tensorflow/transformer/utils.py:376: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensur
e tf.print executes in graph mode:

INFO:tensorflow:feature inputs_position : Tensor("Reshape_1:0", shape=(1, 2048, 512), dtype=int32)
I0109 11:11:38.482377 139976006838016 utils.py:374] feature inputs_position : Tensor("Reshape_1:0", shape=(1, 2048, 512), dtype=int32)
INFO:tensorflow:feature targets : Tensor("Reshape_2:0", shape=(1, 2048, 512), dtype=int32)
I0109 11:11:38.483691 139976006838016 utils.py:374] feature targets : Tensor("Reshape_2:0", shape=(1, 2048, 512), dtype=int32)
INFO:tensorflow:feature targets_position : Tensor("Reshape_3:0", shape=(1, 2048, 512), dtype=int32)
I0109 11:11:38.485010 139976006838016 utils.py:374] feature targets_position : Tensor("Reshape_3:0", shape=(1, 2048, 512), dtype=int32)
INFO:tensorflow:feature inputs_segmentation : Tensor("Reshape_4:0", shape=(1, 2048, 512), dtype=int32)
I0109 11:11:38.486300 139976006838016 utils.py:374] feature inputs_segmentation : Tensor("Reshape_4:0", shape=(1, 2048, 512), dtype=int32)
INFO:tensorflow:feature targets_segmentation : Tensor("Reshape_5:0", shape=(1, 2048, 512), dtype=int32)
I0109 11:11:38.487596 139976006838016 utils.py:374] feature targets_segmentation : Tensor("Reshape_5:0", shape=(1, 2048, 512), dtype=int32)
INFO:tensorflow:serialize_num_microbatches: tokens_per_microbatch_per_replica=8192 batch_dim=Dimension(name='batch', size=2048) sequence_length={'inputs': 512, 'targets': 512} batch_per_replica=2048 num_microbatches=128
I0109 11:11:38.488244 139976006838016 utils.py:1483] serialize_num_microbatches: tokens_per_microbatch_per_replica=8192 batch_dim=Dimension(name='batch', size=2048) sequence_length={'inputs': 512, 'targets': 512} batch_per_replica=2048 num_microbatches=128
WARNING:tensorflow:Using default tf glorot_uniform_initializer for variable encoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions  based on dimension order.
W0109 11:11:38.516117 139976006838016 ops.py:4022] Using default tf glorot_uniform_initializer for variable encoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions  based on dimension order.
WARNING:tensorflow:Using default tf glorot_uniform_initializer for variable decoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions  based on dimension order.
W0109 11:11:38.731440 139976006838016 ops.py:4022] Using default tf glorot_uniform_initializer for variable decoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions  based on dimension order.
INFO:tensorflow:Trainable Variables            count: 99      Total size: 60506624         Total slice_size: 30261504
I0109 11:12:02.097769 139976006838016 ops.py:5656] Trainable Variables            count: 99      Total size: 60506624         Total slice_size: 30261504
INFO:tensorflow:All Variables                  count: 105     Total size: 60691328         Total slice_size: 30386880
I0109 11:12:02.098776 139976006838016 ops.py:5656] All Variables                  count: 105     Total size: 60691328         Total slice_size: 30386880
INFO:tensorflow:Create CheckpointSaverHook.
I0109 11:12:02.323129 139976006838016 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
INFO:tensorflow:Done calling model_fn.
I0109 11:12:02.323394 139976006838016 estimator.py:1150] Done calling model_fn.
INFO:tensorflow:Starting the session.
I0109 11:12:05.723811 139976006838016 ops.py:5512] Starting the session.
WARNING:tensorflow:From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/tensorflow_core/python/ops/array_ops.py:1475: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0109 11:12:05.893454 139976006838016 deprecation.py:323] From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/tensorflow_core/python/ops/array_ops.py:1475: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
INFO:tensorflow:Graph was finalized.
I0109 11:12:06.051002 139976006838016 monitored_session.py:240] Graph was finalized.
2020-01-09 11:12:06.052934: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-01-09 11:12:06.061519: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600320000 Hz
2020-01-09 11:12:06.061715: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5622b19cd0e0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-01-09 11:12:06.061729: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-01-09 11:12:06.279132: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.292306: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.294242: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5622af2a7120 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-01-09 11:12:06.294260: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Quadro GV100, Compute Capability 7.0
2020-01-09 11:12:06.294267: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): Quadro RTX 8000, Compute Capability 7.5
2020-01-09 11:12:06.294765: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.296223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Quadro GV100 major: 7 minor: 0 memoryClockRate(GHz): 1.627
pciBusID: 0000:01:00.0
2020-01-09 11:12:06.296279: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.298011: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:02:00.0
2020-01-09 11:12:06.298046: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-09 11:12:06.298064: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-01-09 11:12:06.298080: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-01-09 11:12:06.298096: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-01-09 11:12:06.298111: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-01-09 11:12:06.298126: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-01-09 11:12:06.298142: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-09 11:12:06.298197: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.299798: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.301686: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.303165: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.304884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
2020-01-09 11:12:06.305241: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-09 11:12:06.310865: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-01-09 11:12:06.310878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 1
2020-01-09 11:12:06.310883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N N
2020-01-09 11:12:06.310886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 1:   N N
2020-01-09 11:12:06.311424: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.312928: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.314704: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.316186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30553 MB memory) -> physical GPU (device: 0, name: Quadro GV100, pci bus id: 0000:01:00.0, compute capability: 7.0)
2020-01-09 11:12:06.316632: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.318361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 45978 MB memory) -> physical GPU (device: 1, name: Quadro RTX 8000, pci bus id: 0000:02:00.0, compute capability: 7.5)
INFO:tensorflow:Restoring parameters from danielk-files/models/model.ckpt-0
I0109 11:12:06.319796 139976006838016 saver.py:1284] Restoring parameters from danielk-files/models/model.ckpt-0
WARNING:tensorflow:From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py:1069: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
W0109 11:12:09.913856 139976006838016 deprecation.py:323] From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py:1069: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
I0109 11:12:10.712485 139976006838016 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0109 11:12:11.262093 139976006838016 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Before copy master to slices.
I0109 11:12:12.044784 139976006838016 ops.py:5541] Before copy master to slices.
INFO:tensorflow:Done with copy master to slices.
I0109 11:12:13.903039 139976006838016 ops.py:5543] Done with copy master to slices.
INFO:tensorflow:Saving checkpoints for 0 into danielk-files/models/model.ckpt.
I0109 11:12:25.531368 139976006838016 basic_session_run_hooks.py:606] Saving checkpoints for 0 into danielk-files/models/model.ckpt.
INFO:tensorflow:Before Save.
I0109 11:12:25.541100 139976006838016 ops.py:5516] Before Save.
INFO:tensorflow:About to write a checkpoint
I0109 11:12:26.858080 139976006838016 ops.py:5518] About to write a checkpoint
INFO:tensorflow:Done writing checkpoint.
I0109 11:12:30.216168 139976006838016 ops.py:5521] Done writing checkpoint.

import feature targets[[[59 834 15 1169 15592 1 59 834 15 1169 15592 1 7072 1 7072 1 7072 1 7072 1 7072 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0][7072 1 7072 1 7072 1 7072 1 59 834 15 1169 15592 1 7072 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...]]...]import feature inputs_segmentation[[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0][1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...]]...]

import feature targets_segmentation[[[1 1 1 1 1 1 2 2 2 2 2 2 3 3 4 4 5 5 6 6 7 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0][1 1 2 2 3 3 4 4 5 5 5 5 5 5 6 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...]]...]
import feature inputs[[[3 51 52 102 75 7142 536 10 37 748 18 11706 26 10571 26 9 1824 27127 11507 3 5 196 4 4666 2490 4659 4848 1828 979 3 6 42 4097 2128 1093 3 6 12 1914 3840 11039 755 3 5 7142 357 10 37 10571 26 9 1824 13723 5538 6292 4357 927 1093 3 6 11 8 5150 3 184 21309 3 31 7 4728 3 27336 1093 3 5 1 3 51 52 102 75 7142 536 10 71 272 7075 13 944 42 756 19 1702 26676 3 117 604 42 756 19 1702 29329 3 5 7142 357 10 71 272 7075 344 209 19253 11 204 27336 19 1702 1389 3 6 147 944 19 1702 26676 11 604 42 2123 19 4802 38 29329 3 5 1 3 51 52 102 75 7142 536 10 451 47 3 10116 6962 26 16 368 1060 538 30 386 12052 13 7738 11 5563 1213 406 15794 3 5 7142 357 10 451 47 3 10116 6962 26 30 386 12052 13 511 18 19706 7738 11 5563 1213 406 15794 16 3 23748 1334 2215 173 3 5 1 3 51 52 102 75 7142 536 10 3 15944 2721 3 6 2449 28017 3 19448 12967 8 1149 4172 11675 6894 45 14617 3 5 7142 357 10 2721 1379 3 6 2449 28017 3 31 7 1476 12967 8 14617 240 1890 462 3 5 1 3 51 52 102 75 7142 536 10 216 243 8 962 13 5025 2298 53 12910 251 81 3 9 205 5 196 5 188 5 5502 47 96 3 9 182 2261 1052 96 24 225 36 96 6665 26 12 8 423 222 5996 96 57 8 6923 1775 3 5 7142 357 10 37 1945 1384 243 24 3 26177 12910 251 47 3 9 2261 1052 24 225 36 96 6665 26 12 8 423 222 5996 96 57 8 6923 1775 3 5 1 3 51 52 102 75 7142 536 10 9765 243 24 8 1025 18 19973 772 130 6737 57 5455 11 2289 1170 3 6 1101 1729 8175 11 3415 772 3 6 11 3798 4539 11 895 14609 3 5 7142 357 10 23686 11 2289 1170 3 6 1101 1729 8175 11 3415 772 11 1101 1170 16 4539 11 895 14609 10719 48 2893 3 31 7 772 3 5 1 3 51 52 102 75 7142 536 10 8979 53 7 81 750 268 1124 3798 15284 45 8 166 2893 3 6 15539 45 1283 12 6897 3 5 7142 357 10 15186 13 750 268 1124 3798 15284 3 6 8 4379 2086 243 3 6 15539 12 6897 45 1283 16 8 166 2893 3 5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0][3 51 52 102 75 7142 536 10 391 2999 687 3 6 3479 3 6 744 3 31 17 780 43 46 4917 16 8 7738 1567 3 6 5779 243 3 5 7142 357 10 391 2999 687 3 6 3479 3 6 405 59 43 3 9 6297 30 8 7738 1567 3 6 5779 243 3 5 1 3 51 52 102 75 7142 536 10 37 3202 2120 95 192 477 865 44 3 9 6440 1078 633 3119 550 3 6 8944 29 68 1346 3 6 11 2139 991 2095 12 160 3 12554 703 7472 127 3 5 7142 357 10 37 10319 764 716 227 8 3202 2120 95 44 3 9 6440 1078 633 3119 550 11 2139 991 2095 12 160 3 12554 703 7472 127 3 5 1 3 51 52 102 75 7142 536 10 12737 7 7048 47 5510 13139 28 2084 3 9094 45 3 9 142 904 3342 77 30 8 7584 3 31 7 2131 3010 3 5 7142 357 10 86 8388 3 6 227 4169 203 16 5714 3 6 12737 7 7048 47 13139 28 2084 3 9094 45 3 9 142 904 3342 77 30 8 7584 3 31 7 2131 3010 3 5 1 3 51 52 102 75 7142 536 10 6187 630 27575 3 6 113 4037 8 73 28062 26 239 2864 3 6 3725 2098 662 767 16 5714 30 1817 1778 15 152 127 12710 11 861 18339 3991 3 5 7142 357 10 6187 630 27575 3 6 113 4037 8 6016 239 124 3 6 2098 662 767 16 5714 30 1817 1778 15 152 127 12710 11 861 18 29 15 122 3437 3991 3 5 1 3 51 52 102 75 7142 536 10 1960 3 6 8 8183 25553 25093 2086 56 962 165 7469 30 125 2953 8 3125 3 5 7142 357 10 37 8183 25553 25093 2086 65 2681 8 9100 21 8 3125 2812 120 30 10571 9 3 5 1 3 51 52 102 75 7142 536 10 216 19 80 13 192 11882 30 8 874 18 12066 377 2823 3 6 11 3 88 19 3 9 1101 11223 13 11955 49 17524 581 2252 11 4390 6991 24 15108 16 221 75 4392 3786 3 5 7142 357 10 10400 102 7 3 6 80 13 192 11882 30 8 874 18 12066 5473 3 6 65 3 9951 21 11955 49 17524 581 2252 11 4390 6991 24 15108 16 221 75 4392 3786 3 5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...]]...]

2020-01-09 11:13:56.070937: I tensorflow/compiler/jit/xla_compilation_cache.cc:238] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
2020-01-09 11:15:58.087468: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.089614: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 34359738368
2020-01-09 11:15:58.090228: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 30923763712 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090249: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 30923763712
2020-01-09 11:15:58.090281: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 27831386112 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090289: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 27831386112
2020-01-09 11:15:58.090316: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 25048246272 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090325: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 25048246272
2020-01-09 11:15:58.090352: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 22543421440 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090361: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 22543421440
2020-01-09 11:15:58.090389: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 20289079296 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090397: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 20289079296
2020-01-09 11:15:58.090424: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 18260170752 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090433: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 18260170752
2020-01-09 11:15:58.090468: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 16434153472 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090477: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 16434153472
2020-01-09 11:15:58.090504: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 14790737920 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090512: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 14790737920
2020-01-09 11:15:58.090537: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 13311664128 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090545: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 13311664128
2020-01-09 11:15:58.090572: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 11980496896 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090581: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 11980496896
2020-01-09 11:15:58.090608: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 10782446592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090616: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 10782446592
2020-01-09 11:15:58.090643: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 9704201216 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090652: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 9704201216
2020-01-09 11:15:58.090679: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 8733780992 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090687: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 8733780992
2020-01-09 11:15:58.090715: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 7860402688 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090723: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 7860402688
2020-01-09 11:15:58.090749: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 7074362368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090758: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 7074362368

Here is some additional information about the environment:

(env37_t5) danielk@aristo-server1 ~ $ echo "$LD_LIBRARY_PATH"
:/home/danielk/anaconda3/pkgs/cudatoolkit-10.0.130-0/lib/

(env37_t5) danielk@aristo-server1 ~ $ nvidia-smi 
Fri Jan 24 14:24:09 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro GV100        On   | 00000000:01:00.0 Off |                  Off |
| 65%   82C    P2   159W / 250W |  11154MiB / 32478MiB |     74%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     On   | 00000000:02:00.0 Off |                  Off |
| 33%   49C    P8    14W / 260W |   6830MiB / 48571MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

FYI @nalourie-ai2

nalourie-ai2 commented 4 years ago

I had a similar issue. If I run the code as instructed on a machine with GPUs, it compiles with XLA but then fails with CUDA out-of-memory errors (even on large GPUs with 48 GB of memory).

I also tried passing --gin_param="serialize_num_microbatches.tokens_per_microbatch_per_replica = 512", but still without luck.
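For reference, with the closing quote included, that override looks like this when appended to the command in the original report. The second line is only a sketch for shrinking the sequence length: utils.run.sequence_length is an assumed binding name (based on the sequence_length value logged above) and should be checked against the operative config rather than taken as-is:

  --gin_param="serialize_num_microbatches.tokens_per_microbatch_per_replica = 512" \
  --gin_param="utils.run.sequence_length = {'inputs': 128, 'targets': 128}"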

Would it be possible to get the full environment (e.g., the pip freeze output, and possibly the OS and CUDA versions) where the GPU code was made to work?

adarob commented 4 years ago

Have you tried with model:1,batch:2?
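(Concretely, that would mean swapping the two mesh overrides in the command from the original report for something like the lines below, leaving the other flags unchanged. This is only a sketch of the suggestion, not a verified fix:)

  --gin_param="utils.run.mesh_shape = 'model:1,batch:2'" \
  --gin_param="utils.run.mesh_devices = ['gpu:0', 'gpu:1']"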

marcelgwerder commented 4 years ago

I also have memory issues when trying to fine-tune the small model on GPUs. No matter how I configure data/model parallelism and the batch size, the memory allocated on the GPU looks the same, and I always run into out-of-memory errors.
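One quick way to confirm that while changing the configuration is to watch the allocation as the job starts, using the same nvidia-smi tool shown earlier in the thread:

  watch -n 1 nvidia-smi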

Do you have any information about what kind of setup T5 was tested on for GPU support?

adarob commented 4 years ago

This should be fixed in #148. Please reopen if not.

mikechen66 commented 3 years ago

I have used the following section of code to solve the issue.

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 4GB of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)
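A note on this workaround: memory_limit is given in megabytes, so 4096 caps the first GPU at 4 GB, and the configuration must be applied before the GPUs are initialized (hence the RuntimeError guard). Depending on the workload, tf.config.experimental.set_memory_growth(gpus[0], True) may be worth trying instead of a hard cap, so that TensorFlow allocates GPU memory on demand rather than reserving it all up front.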