google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0

CUDA_ERROR_OUT_OF_MEMORY: out of memory (on a GPU) #57

Closed (danyaljj closed this issue 4 years ago)

danyaljj commented 4 years ago

Here is the full log:

(env37_t5) danielk@aristo-server1 ~ $ t5_mesh_transformer \
>   --model_dir="danielk-files/models" \
>   --t5_tfds_data_dir="danielk-files" \
>   --gin_file="dataset.gin" \
>   --gin_param="utils.run.mesh_shape = 'model:2,batch:1'" \ 
>   --gin_param="utils.run.mesh_devices = ['gpu:0', 'gpu:1']" \
>   --gin_param="MIXTURE_NAME = 'glue_mrpc_v002'" \
>   --gin_file="gs://t5-data/pretrained_models/small/operative_config.gin" \
>   --gin_param="batch_size=2"

WARNING:tensorflow:From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/tensorflow_core/python/compat/v2_compat.py:68: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2020-01-09 11:11:34.259764: W tensorflow/core/platform/cloud/google_auth_provider.cc:178] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Aborted: All 10 retry attempts failed. The 
last failure: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'".
INFO:tensorflow:model_type=bitransformer
I0109 11:11:35.254748 139976006838016 utils.py:1664] model_type=bitransformer
INFO:tensorflow:mode=train
I0109 11:11:35.254887 139976006838016 utils.py:1665] mode=train
INFO:tensorflow:sequence_length={'inputs': 512, 'targets': 512}
I0109 11:11:35.254942 139976006838016 utils.py:1666] sequence_length={'inputs': 512, 'targets': 512}
INFO:tensorflow:batch_size=2048
I0109 11:11:35.254985 139976006838016 utils.py:1667] batch_size=2048
INFO:tensorflow:train_steps=1000000000
I0109 11:11:35.255030 139976006838016 utils.py:1668] train_steps=1000000000
INFO:tensorflow:mesh_shape=model:2,batch:1
I0109 11:11:35.255067 139976006838016 utils.py:1669] mesh_shape=model:2,batch:1
INFO:tensorflow:layout_rules=ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch
I0109 11:11:35.255102 139976006838016 utils.py:1670] layout_rules=ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch
INFO:tensorflow:Building TPUConfig with tpu_job_name=None
I0109 11:11:35.255166 139976006838016 utils.py:1685] Building TPUConfig with tpu_job_name=None
INFO:tensorflow:Using config: {'_model_dir': 'danielk-files/models', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorfl
ow.python.training.server_lib.ClusterSpec object at 0x7f4debe16710>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, per_host_input_for
_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
I0109 11:11:35.257782 139976006838016 estimator.py:212] Using config: {'_model_dir': 'danielk-files/models', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorfl
ow.python.training.server_lib.ClusterSpec object at 0x7f4debe16710>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, per_host_input_for
_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
I0109 11:11:35.258051 139976006838016 tpu_context.py:220] _TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
W0109 11:11:35.258131 139976006838016 tpu_context.py:222] eval_on_tpu ignored because use_tpu is False.
WARNING:tensorflow:From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0109 11:11:35.263432 139976006838016 deprecation.py:506] From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0109 11:11:35.263689 139976006838016 deprecation.py:323] From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
I0109 11:11:35.269644 139976006838016 dataset_builder.py:193] Overwrite dataset info from restored data version.
I0109 11:11:35.373311 139976006838016 dataset_builder.py:193] Overwrite dataset info from restored data version.
I0109 11:11:35.379834 139976006838016 dataset_builder.py:273] Reusing dataset glue (danielk-files/glue/mrpc/0.0.2)
I0109 11:11:35.380300 139976006838016 dataset_builder.py:434] Constructing tf.data.Dataset for split train, from danielk-files/glue/mrpc/0.0.2
2020-01-09 11:11:35.972400: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-01-09 11:11:36.034612: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:11:36.036360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Quadro GV100 major: 7 minor: 0 memoryClockRate(GHz): 1.627
pciBusID: 0000:01:00.0
2020-01-09 11:11:36.036418: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:11:36.038354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:02:00.0
2020-01-09 11:11:36.038504: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-09 11:11:36.039339: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-01-09 11:11:36.040050: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-01-09 11:11:36.040232: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-01-09 11:11:36.041161: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-01-09 11:11:36.041870: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-01-09 11:11:36.044076: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-09 11:11:36.044185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:11:36.045984: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:11:36.047956: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:11:36.049633: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:11:36.051551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
WARNING:tensorflow:From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/mesh_tensorflow-0.1.9-py3.7.egg/mesh_tensorflow/transformer/dataset.py:513: DatasetV1.output_shapes (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(dataset)`.
W0109 11:11:37.271278 139976006838016 deprecation.py:323] From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/mesh_tensorflow-0.1.9-py3.7.egg/mesh_tensorflow/transformer/dataset.py:513: DatasetV1.output_shapes (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(dataset)`.
INFO:tensorflow:Calling model_fn.
I0109 11:11:38.479681 139976006838016 estimator.py:1148] Calling model_fn.
INFO:tensorflow:Running train on CPU
I0109 11:11:38.479841 139976006838016 tpu_estimator.py:3124] Running train on CPU
INFO:tensorflow:feature inputs : Tensor("Reshape:0", shape=(1, 2048, 512), dtype=int32)
I0109 11:11:38.480923 139976006838016 utils.py:374] feature inputs : Tensor("Reshape:0", shape=(1, 2048, 512), dtype=int32)
WARNING:tensorflow:From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/mesh_tensorflow-0.1.9-py3.7.egg/mesh_tensorflow/transformer/utils.py:376: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensur
e tf.print executes in graph mode:

W0109 11:11:38.481014 139976006838016 deprecation.py:323] From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/mesh_tensorflow-0.1.9-py3.7.egg/mesh_tensorflow/transformer/utils.py:376: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensur
e tf.print executes in graph mode:

INFO:tensorflow:feature inputs_position : Tensor("Reshape_1:0", shape=(1, 2048, 512), dtype=int32)
I0109 11:11:38.482377 139976006838016 utils.py:374] feature inputs_position : Tensor("Reshape_1:0", shape=(1, 2048, 512), dtype=int32)
INFO:tensorflow:feature targets : Tensor("Reshape_2:0", shape=(1, 2048, 512), dtype=int32)
I0109 11:11:38.483691 139976006838016 utils.py:374] feature targets : Tensor("Reshape_2:0", shape=(1, 2048, 512), dtype=int32)
INFO:tensorflow:feature targets_position : Tensor("Reshape_3:0", shape=(1, 2048, 512), dtype=int32)
I0109 11:11:38.485010 139976006838016 utils.py:374] feature targets_position : Tensor("Reshape_3:0", shape=(1, 2048, 512), dtype=int32)
INFO:tensorflow:feature inputs_segmentation : Tensor("Reshape_4:0", shape=(1, 2048, 512), dtype=int32)
I0109 11:11:38.486300 139976006838016 utils.py:374] feature inputs_segmentation : Tensor("Reshape_4:0", shape=(1, 2048, 512), dtype=int32)
INFO:tensorflow:feature targets_segmentation : Tensor("Reshape_5:0", shape=(1, 2048, 512), dtype=int32)
I0109 11:11:38.487596 139976006838016 utils.py:374] feature targets_segmentation : Tensor("Reshape_5:0", shape=(1, 2048, 512), dtype=int32)
INFO:tensorflow:serialize_num_microbatches: tokens_per_microbatch_per_replica=8192 batch_dim=Dimension(name='batch', size=2048) sequence_length={'inputs': 512, 'targets': 512} batch_per_replica=2048 num_microbatches=128
I0109 11:11:38.488244 139976006838016 utils.py:1483] serialize_num_microbatches: tokens_per_microbatch_per_replica=8192 batch_dim=Dimension(name='batch', size=2048) sequence_length={'inputs': 512, 'targets': 512} batch_per_replica=2048 num_microbatches=128
WARNING:tensorflow:Using default tf glorot_uniform_initializer for variable encoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions  based on dimension order.
W0109 11:11:38.516117 139976006838016 ops.py:4022] Using default tf glorot_uniform_initializer for variable encoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions  based on dimension order.
WARNING:tensorflow:Using default tf glorot_uniform_initializer for variable decoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions  based on dimension order.
W0109 11:11:38.731440 139976006838016 ops.py:4022] Using default tf glorot_uniform_initializer for variable decoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions  based on dimension order.
INFO:tensorflow:Trainable Variables            count: 99      Total size: 60506624         Total slice_size: 30261504
I0109 11:12:02.097769 139976006838016 ops.py:5656] Trainable Variables            count: 99      Total size: 60506624         Total slice_size: 30261504
INFO:tensorflow:All Variables                  count: 105     Total size: 60691328         Total slice_size: 30386880
I0109 11:12:02.098776 139976006838016 ops.py:5656] All Variables                  count: 105     Total size: 60691328         Total slice_size: 30386880
INFO:tensorflow:Create CheckpointSaverHook.
I0109 11:12:02.323129 139976006838016 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
INFO:tensorflow:Done calling model_fn.
I0109 11:12:02.323394 139976006838016 estimator.py:1150] Done calling model_fn.
INFO:tensorflow:Starting the session.
I0109 11:12:05.723811 139976006838016 ops.py:5512] Starting the session.
WARNING:tensorflow:From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/tensorflow_core/python/ops/array_ops.py:1475: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0109 11:12:05.893454 139976006838016 deprecation.py:323] From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/tensorflow_core/python/ops/array_ops.py:1475: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
INFO:tensorflow:Graph was finalized.
I0109 11:12:06.051002 139976006838016 monitored_session.py:240] Graph was finalized.
2020-01-09 11:12:06.052934: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-01-09 11:12:06.061519: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600320000 Hz
2020-01-09 11:12:06.061715: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5622b19cd0e0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-01-09 11:12:06.061729: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-01-09 11:12:06.279132: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.292306: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.294242: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5622af2a7120 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-01-09 11:12:06.294260: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Quadro GV100, Compute Capability 7.0
2020-01-09 11:12:06.294267: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): Quadro RTX 8000, Compute Capability 7.5
2020-01-09 11:12:06.294765: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.296223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Quadro GV100 major: 7 minor: 0 memoryClockRate(GHz): 1.627
pciBusID: 0000:01:00.0
2020-01-09 11:12:06.296279: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.298011: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:02:00.0
2020-01-09 11:12:06.298046: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-09 11:12:06.298064: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-01-09 11:12:06.298080: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-01-09 11:12:06.298096: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-01-09 11:12:06.298111: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-01-09 11:12:06.298126: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-01-09 11:12:06.298142: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-09 11:12:06.298197: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.299798: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.301686: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.303165: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.304884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
2020-01-09 11:12:06.305241: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-09 11:12:06.310865: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-01-09 11:12:06.310878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 1
2020-01-09 11:12:06.310883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N N
2020-01-09 11:12:06.310886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 1:   N N
2020-01-09 11:12:06.311424: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.312928: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.314704: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.316186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30553 MB memory) -> physical GPU (device: 0, name: Quadro GV100, pci bus id: 0000:01:00.0, compute capability: 7.0)
2020-01-09 11:12:06.316632: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-09 11:12:06.318361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 45978 MB memory) -> physical GPU (device: 1, name: Quadro RTX 8000, pci bus id: 0000:02:00.0, compute capability: 7.5)
INFO:tensorflow:Restoring parameters from danielk-files/models/model.ckpt-0
I0109 11:12:06.319796 139976006838016 saver.py:1284] Restoring parameters from danielk-files/models/model.ckpt-0
WARNING:tensorflow:From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py:1069: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
W0109 11:12:09.913856 139976006838016 deprecation.py:323] From /home/danielk/anaconda3/envs/env37_t5/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py:1069: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
I0109 11:12:10.712485 139976006838016 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0109 11:12:11.262093 139976006838016 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Before copy master to slices.
I0109 11:12:12.044784 139976006838016 ops.py:5541] Before copy master to slices.
INFO:tensorflow:Done with copy master to slices.
I0109 11:12:13.903039 139976006838016 ops.py:5543] Done with copy master to slices.
INFO:tensorflow:Saving checkpoints for 0 into danielk-files/models/model.ckpt.
I0109 11:12:25.531368 139976006838016 basic_session_run_hooks.py:606] Saving checkpoints for 0 into danielk-files/models/model.ckpt.
INFO:tensorflow:Before Save.
I0109 11:12:25.541100 139976006838016 ops.py:5516] Before Save.
INFO:tensorflow:About to write a checkpoint
I0109 11:12:26.858080 139976006838016 ops.py:5518] About to write a checkpoint
INFO:tensorflow:Done writing checkpoint.
I0109 11:12:30.216168 139976006838016 ops.py:5521] Done writing checkpoint.

import feature targets[[[59 834 15 1169 15592 1 59 834 15 1169 15592 1 7072 1 7072 1 7072 1 7072 1 7072 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0][7072 1 7072 1 7072 1 7072 1 59 834 15 1169 15592 1 7072 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...]]...]import feature inputs_segmentation[[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0][1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...]]...]

import feature targets_segmentation[[[1 1 1 1 1 1 2 2 2 2 2 2 3 3 4 4 5 5 6 6 7 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0][1 1 2 2 3 3 4 4 5 5 5 5 5 5 6 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...]]...]
import feature inputs[[[3 51 52 102 75 7142 536 10 37 748 18 11706 26 10571 26 9 1824 27127 11507 3 5 196 4 4666 2490 4659 4848 1828 979 3 6 42 4097 2128 1093 3 6 12 1914 3840 11039 755 3 5 7142 357 10 37 10571 26 9 1824 13723 5538 6292 4357 927 1093 3 6 11 8 5150 3 184 21309 3 31 7 4728 3 27336 1093 3 5 1 3 51 52 102 75 7142 536 10 71 272 7075 13 944 42 756 19 1702 26676 3 117 604 42 756 19 1702 29329 3 5 7142 357 10 71 272 7075 344 209 19253 11 204 27336 19 1702 1389 3 6 147 944 19 1702 26676 11 604 42 2123 19 4802 38 29329 3 5 1 3 51 52 102 75 7142 536 10 451 47 3 10116 6962 26 16 368 1060 538 30 386 12052 13 7738 11 5563 1213 406 15794 3 5 7142 357 10 451 47 3 10116 6962 26 30 386 12052 13 511 18 19706 7738 11 5563 1213 406 15794 16 3 23748 1334 2215 173 3 5 1 3 51 52 102 75 7142 536 10 3 15944 2721 3 6 2449 28017 3 19448 12967 8 1149 4172 11675 6894 45 14617 3 5 7142 357 10 2721 1379 3 6 2449 28017 3 31 7 1476 12967 8 14617 240 1890 462 3 5 1 3 51 52 102 75 7142 536 10 216 243 8 962 13 5025 2298 53 12910 251 81 3 9 205 5 196 5 188 5 5502 47 96 3 9 182 2261 1052 96 24 225 36 96 6665 26 12 8 423 222 5996 96 57 8 6923 1775 3 5 7142 357 10 37 1945 1384 243 24 3 26177 12910 251 47 3 9 2261 1052 24 225 36 96 6665 26 12 8 423 222 5996 96 57 8 6923 1775 3 5 1 3 51 52 102 75 7142 536 10 9765 243 24 8 1025 18 19973 772 130 6737 57 5455 11 2289 1170 3 6 1101 1729 8175 11 3415 772 3 6 11 3798 4539 11 895 14609 3 5 7142 357 10 23686 11 2289 1170 3 6 1101 1729 8175 11 3415 772 11 1101 1170 16 4539 11 895 14609 10719 48 2893 3 31 7 772 3 5 1 3 51 52 102 75 7142 536 10 8979 53 7 81 750 268 1124 3798 15284 45 8 166 2893 3 6 15539 45 1283 12 6897 3 5 7142 357 10 15186 13 750 268 1124 3798 15284 3 6 8 4379 2086 243 3 6 15539 12 6897 45 1283 16 8 166 2893 3 5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0][3 51 52 102 75 7142 536 10 391 2999 687 3 6 3479 3 6 744 3 31 17 780 43 46 4917 16 8 7738 1567 3 6 5779 243 3 5 7142 357 10 391 2999 687 3 6 3479 3 6 405 59 43 3 9 6297 30 8 7738 1567 3 6 5779 243 3 5 1 3 51 52 102 75 7142 536 10 37 3202 2120 95 192 477 865 44 3 9 6440 1078 633 3119 550 3 6 8944 29 68 1346 3 6 11 2139 991 2095 12 160 3 12554 703 7472 127 3 5 7142 357 10 37 10319 764 716 227 8 3202 2120 95 44 3 9 6440 1078 633 3119 550 11 2139 991 2095 12 160 3 12554 703 7472 127 3 5 1 3 51 52 102 75 7142 536 10 12737 7 7048 47 5510 13139 28 2084 3 9094 45 3 9 142 904 3342 77 30 8 7584 3 31 7 2131 3010 3 5 7142 357 10 86 8388 3 6 227 4169 203 16 5714 3 6 12737 7 7048 47 13139 28 2084 3 9094 45 3 9 142 904 3342 77 30 8 7584 3 31 7 2131 3010 3 5 1 3 51 52 102 75 7142 536 10 6187 630 27575 3 6 113 4037 8 73 28062 26 239 2864 3 6 3725 2098 662 767 16 5714 30 1817 1778 15 152 127 12710 11 861 18339 3991 3 5 7142 357 10 6187 630 27575 3 6 113 4037 8 6016 239 124 3 6 2098 662 767 16 5714 30 1817 1778 15 152 127 12710 11 861 18 29 15 122 3437 3991 3 5 1 3 51 52 102 75 7142 536 10 1960 3 6 8 8183 25553 25093 2086 56 962 165 7469 30 125 2953 8 3125 3 5 7142 357 10 37 8183 25553 25093 2086 65 2681 8 9100 21 8 3125 2812 120 30 10571 9 3 5 1 3 51 52 102 75 7142 536 10 216 19 80 13 192 11882 30 8 874 18 12066 377 2823 3 6 11 3 88 19 3 9 1101 11223 13 11955 49 17524 581 2252 11 4390 6991 24 15108 16 221 75 4392 3786 3 5 7142 357 10 10400 102 7 3 6 80 13 192 11882 30 8 874 18 12066 5473 3 6 65 3 9951 21 11955 49 17524 581 2252 11 4390 6991 24 15108 16 221 75 4392 3786 3 5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...]]...]

2020-01-09 11:13:56.070937: I tensorflow/compiler/jit/xla_compilation_cache.cc:238] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
2020-01-09 11:15:58.087468: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.089614: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 34359738368
2020-01-09 11:15:58.090228: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 30923763712 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090249: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 30923763712
2020-01-09 11:15:58.090281: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 27831386112 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090289: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 27831386112
2020-01-09 11:15:58.090316: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 25048246272 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090325: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 25048246272
2020-01-09 11:15:58.090352: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 22543421440 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090361: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 22543421440
2020-01-09 11:15:58.090389: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 20289079296 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090397: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 20289079296
2020-01-09 11:15:58.090424: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 18260170752 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090433: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 18260170752
2020-01-09 11:15:58.090468: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 16434153472 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090477: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 16434153472
2020-01-09 11:15:58.090504: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 14790737920 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090512: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 14790737920
2020-01-09 11:15:58.090537: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 13311664128 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090545: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 13311664128
2020-01-09 11:15:58.090572: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 11980496896 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090581: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 11980496896
2020-01-09 11:15:58.090608: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 10782446592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090616: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 10782446592
2020-01-09 11:15:58.090643: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 9704201216 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090652: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 9704201216
2020-01-09 11:15:58.090679: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 8733780992 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090687: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 8733780992
2020-01-09 11:15:58.090715: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 7860402688 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090723: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 7860402688
2020-01-09 11:15:58.090749: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to alloc 7074362368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-09 11:15:58.090758: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 7074362368

Here is some additional information about the environment:

(env37_t5) danielk@aristo-server1 ~ $ echo "$LD_LIBRARY_PATH"
:/home/danielk/anaconda3/pkgs/cudatoolkit-10.0.130-0/lib/

(env37_t5) danielk@aristo-server1 ~ $ nvidia-smi 
Fri Jan 24 14:24:09 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro GV100        On   | 00000000:01:00.0 Off |                  Off |
| 65%   82C    P2   159W / 250W |  11154MiB / 32478MiB |     74%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     On   | 00000000:02:00.0 Off |                  Off |
| 33%   49C    P8    14W / 260W |   6830MiB / 48571MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

FYI @nalourie-ai2

nalourie-ai2 commented 4 years ago

I had a similar issue. If I run the code as instructed on a machine with GPUs, it compiles with XLA but then fails with CUDA out-of-memory errors (even on large GPUs with 48 GB of memory).

I also tried passing --gin_param="serialize_num_microbatches.tokens_per_microbatch_per_replica = 512", but still without luck.
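For reference, with the closing quote included, that override looks like this when appended to the command in the original report. The second line is only a sketch for shrinking the sequence length: utils.run.sequence_length is an assumed binding name (based on the sequence_length value logged above) and should be checked against the operative config rather than taken as-is:

  --gin_param="serialize_num_microbatches.tokens_per_microbatch_per_replica = 512" \
  --gin_param="utils.run.sequence_length = {'inputs': 128, 'targets': 128}"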

Would it be possible to get the full environment (e.g., the pip freeze output, and possibly the OS and CUDA versions) where the GPU code was made to work?

adarob commented 4 years ago

Have you tried with model:1,batch:2?
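(Concretely, that would mean swapping the two mesh overrides in the command from the original report for something like the lines below, leaving the other flags unchanged. This is only a sketch of the suggestion, not a verified fix:)

  --gin_param="utils.run.mesh_shape = 'model:1,batch:2'" \
  --gin_param="utils.run.mesh_devices = ['gpu:0', 'gpu:1']"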

marcelgwerder commented 4 years ago

I also have memory issues when trying to fine-tune the small model on GPUs. No matter how I configure data/model parallelism and the batch size, the memory allocated on the GPU looks the same, and I always run into out-of-memory errors.
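One quick way to confirm that while changing the configuration is to watch the allocation as the job starts, using the same nvidia-smi tool shown earlier in the thread:

  watch -n 1 nvidia-smi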

Do you have any information about what kind of setup T5 was tested on for GPU support?

adarob commented 4 years ago

This should be fixed in #148. Please reopen if not.

mikechen66 commented 3 years ago

I have used the following section of code to solve the issue.

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 4GB of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)
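A note on this workaround: memory_limit is given in megabytes, so 4096 caps the first GPU at 4 GB, and the configuration must be applied before the GPUs are initialized (hence the RuntimeError guard). Depending on the workload, tf.config.experimental.set_memory_growth(gpus[0], True) may be worth trying instead of a hard cap, so that TensorFlow allocates GPU memory on demand rather than reserving it all up front.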