Closed danyaljj closed 4 years ago
It looks like it's falling back to CPU for some reason. Do you see anything earlier in the logs that might tell you why?
Hmm ... here is the top command output right before it crashes, which shows that the program is using lots of CPU:
28352 danielk 20 0 40.123g 0.022t 464504 S 789.7 36.3 7:15.07 t5_mesh_transfo
However, there are several GPU threads too, which baffles me (some of these might be stale, from previous tries):
In between the logs I see: Current candidate devices are [ /job:localhost/replica:0/task:0/device:CPU:0] and Assign: CPU.
Extended log:
2020-01-08 16:36:54.382904: W tensorflow/core/common_runtime/colocation_graph.cc:983] Failed to place the graph without changing the devices of some resources. Some of the operations
(that had to be colocated with resource generating operations) are not supported on the resources' devices. Current candidate devices are [
/job:localhost/replica:0/task:0/device:CPU:0].
See below for details of this colocation group:
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:1' assigned_device_name_='' resource_device_name_='/device:GPU:1' supported_device_types_=[CPU] possible_devices_=[]
Assign: CPU
RandomUniform: CPU XLA_CPU XLA_GPU
Const: CPU XLA_CPU XLA_GPU
Mul: CPU XLA_CPU XLA_GPU
Sub: CPU XLA_CPU XLA_GPU
Add: CPU XLA_CPU XLA_GPU
Identity: CPU XLA_CPU XLA_GPU
VariableV2: CPU
You can see from the errors you're getting that the ops were placed on CPU instead of GPU. For example:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
Assign: CPU
I saw this happen when I was testing and it seemed to be due to issues with my CUDA setup. I was on a VM so I basically started from scratch and got it to work.
I think it would be useful to see what's logged before you get to those colocations.
Log right at the beginning:
$ t5_mesh_transformer --model_dir="danielk-files/models" --t5_tfds_data_dir="danielk-files" --gin_file="dataset.gin" --gin_param="utils.run.mesh_shape = 'model:1,batch:1'" --gin_param="utils.run.mesh_devices = ['gpu:1']" --gin_param="MIXTURE_NAME = 'glue_mrpc_v002'" --gin_file="gs://t5-data/pretrained_models/small/operative_config.gin" --gin_param="batch_size=2"
WARNING:tensorflow:From /home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/compat/v2_compat.py:68: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
INFO:tensorflow:model_type=bitransformer
I0108 16:43:30.913287 139962466866944 utils.py:1664] model_type=bitransformer
INFO:tensorflow:mode=train
I0108 16:43:30.913425 139962466866944 utils.py:1665] mode=train
INFO:tensorflow:sequence_length={'inputs': 512, 'targets': 512}
I0108 16:43:30.913482 139962466866944 utils.py:1666] sequence_length={'inputs': 512, 'targets': 512}
INFO:tensorflow:batch_size=2048
I0108 16:43:30.913529 139962466866944 utils.py:1667] batch_size=2048
INFO:tensorflow:train_steps=1000000000
I0108 16:43:30.913570 139962466866944 utils.py:1668] train_steps=1000000000
INFO:tensorflow:mesh_shape=model:1,batch:1
I0108 16:43:30.913610 139962466866944 utils.py:1669] mesh_shape=model:1,batch:1
INFO:tensorflow:layout_rules=ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch
I0108 16:43:30.913662 139962466866944 utils.py:1670] layout_rules=ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch
INFO:tensorflow:Building TPUConfig with tpu_job_name=None
I0108 16:43:30.913738 139962466866944 utils.py:1685] Building TPUConfig with tpu_job_name=None
INFO:tensorflow:Using config: {'_model_dir': 'danielk-files/models', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f4abdf4b6a0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
I0108 16:43:30.916487 139962466866944 estimator.py:212] Using config: {'_model_dir': 'danielk-files/models', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f4abdf4b6a0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
I0108 16:43:30.916929 139962466866944 tpu_context.py:220] _TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
W0108 16:43:30.916996 139962466866944 tpu_context.py:222] eval_on_tpu ignored because use_tpu is False.
WARNING:tensorflow:From /home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
W0108 16:43:30.924340 139962466866944 deprecation.py:506] From /home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0108 16:43:30.924594 139962466866944 deprecation.py:323] From /home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
I0108 16:43:30.931804 139962466866944 dataset_builder.py:193] Overwrite dataset info from restored data version.
I0108 16:43:31.075816 139962466866944 dataset_builder.py:193] Overwrite dataset info from restored data version.
I0108 16:43:31.082559 139962466866944 dataset_builder.py:273] Reusing dataset glue (danielk-files/glue/mrpc/0.0.2)
I0108 16:43:31.083067 139962466866944 dataset_builder.py:434] Constructing tf.data.Dataset for split train, from danielk-files/glue/mrpc/0.0.2
2020-01-08 16:43:31.941356: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-01-08 16:43:31.995155: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-08 16:43:31.997136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:02:00.0
2020-01-08 16:43:31.997226: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2020-01-08 16:43:31.997275: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2020-01-08 16:43:31.997402: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2020-01-08 16:43:31.997453: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2020-01-08 16:43:31.997592: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2020-01-08 16:43:31.997642: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2020-01-08 16:43:32.042586: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-08 16:43:32.042640: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
WARNING:tensorflow:From /home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/mesh_tensorflow/transformer/dataset.py:513: DatasetV1.output_shapes (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(dataset)`.
W0108 16:43:33.304034 139962466866944 deprecation.py:323] From /home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/mesh_tensorflow/transformer/dataset.py:513: DatasetV1.output_shapes (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(dataset)`.
INFO:tensorflow:Calling model_fn.
I0108 16:43:34.641363 139962466866944 estimator.py:1148] Calling model_fn.
INFO:tensorflow:Running train on CPU
I0108 16:43:34.641528 139962466866944 tpu_estimator.py:3124] Running train on CPU
INFO:tensorflow:feature inputs : Tensor("Reshape:0", shape=(1, 2048, 512), dtype=int32)
I0108 16:43:34.642672 139962466866944 utils.py:374] feature inputs : Tensor("Reshape:0", shape=(1, 2048, 512), dtype=int32)
WARNING:tensorflow:From /home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py:376: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensure tf.print executes in graph mode:
W0108 16:43:34.642765 139962466866944 deprecation.py:323] From /home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py:376: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
I see some warnings like libcudart.so.10.0: cannot open shared object file: No such file or directory. Could that be it?
Yes, this is the problem I had as well. It may be because you have 10.1 installed...
ah okay!
Could you elaborate on this:
It may be because you have 10.1 installed...
Btw:
$ cat /usr/local/cuda/version.txt
CUDA Version 9.0.176
It's looking for libcu*.so.10.0, but (according to your nvidia-smi printout, at least) you have v10.1, which probably names the files libcu*.so.10.1.
Have a look at https://github.com/tensorflow/tensorflow/issues/26182
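To check which CUDA runtime the dynamic loader can actually resolve on your machine, here is a small standard-library sketch. It mirrors what TensorFlow's dso_loader attempts at startup; the two library names are taken from the warnings above (TF 1.15/2.0 builds look for .so.10.0, TF 2.1 for .so.10.1):

```python
import ctypes

# Try to dlopen the CUDA runtime versions TensorFlow links against.
# Whichever of these loads is the runtime TF could bind to; if neither
# loads, TF will skip GPU registration exactly as in the log above.
for name in ("libcudart.so.10.0", "libcudart.so.10.1"):
    try:
        ctypes.CDLL(name)
        print(name, "-> found")
    except OSError:
        print(name, "-> NOT found")
```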
Thanks!
FYI, it looks like TF 2.1.0 is compatible with CUDA 10.1 according to https://www.tensorflow.org/install/source#tested_build_configurations
FYI, it looks like TF 2.1.0 is compatible with CUDA 10.1 according to ...
This is a bit tricky because t5 has an explicit requirement on earlier tensorflow versions.
For those who have the same issue: I used a conda environment and installed the following packages:
conda install cudatoolkit
conda install cudnn
and
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/YOURUSERNAME/anaconda3/pkgs/cudatoolkit-10.X.Y-Z/lib/
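Rather than hard-coding the pkgs/cudatoolkit-10.X.Y-Z path, conda's cudatoolkit usually also places its shared libraries in the active environment's own lib/ directory (this layout can vary between conda versions, so treat it as an assumption). A small sketch that prints the candidate path to put on LD_LIBRARY_PATH:

```python
import os
import sys

# sys.prefix points at the active conda environment; cudatoolkit
# typically installs its shared libraries into <env>/lib.
lib_dir = os.path.join(sys.prefix, "lib")
print("candidate LD_LIBRARY_PATH entry:", lib_dir)

# List any CUDA runtime libraries already visible there.
if os.path.isdir(lib_dir):
    cudart = sorted(f for f in os.listdir(lib_dir) if f.startswith("libcudart"))
    print("libcudart files:", cudart or "none found")
```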
Now after starting the code I see:
2020-01-09 10:56:27.543353: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-01-09 10:56:27.544165: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-01-09 10:56:27.544862: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-01-09 10:56:27.545038: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-01-09 10:56:27.545970: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-01-09 10:56:27.546671: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-01-09 10:56:27.548886: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
I tried training the code on a GPU after including the changes made earlier today, and I am running into a memory issue. Just after the
2020-01-08 16:11:33.715292: I tensorflow/compiler/jit/xla_compilation_cache.cc:238] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
message, the program crashes with a Killed message. Here is an extended log for your attention:
My GPU specs:
Memory info:
and pip packages:
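On the Killed crash: a bare "Killed" with no Python traceback usually means the Linux OOM killer terminated the process for exhausting host RAM, not GPU memory (the top output earlier already showed ~22 GB resident). A quick Linux-only sketch to check available host memory before launching training:

```python
# Read host memory stats from /proc/meminfo (Linux only). If
# MemAvailable is far below what training needs, the OOM killer is
# the likely source of the bare "Killed" message.
def meminfo_mib(field):
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1]) // 1024  # kB -> MiB
    return None

print("MemTotal:    ", meminfo_mib("MemTotal"), "MiB")
print("MemAvailable:", meminfo_mib("MemAvailable"), "MiB")
```

If available memory is the bottleneck, lowering batch_size or the sequence lengths via gin params (as in the command quoted earlier) is the usual mitigation, though I have not verified which setting dominates here.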