google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0

DataLossError in Evaluate Step #252

Closed: masterzzzen closed this issue 4 years ago

masterzzzen commented 4 years ago

Hi, I'm fine-tuning the "small" model on a Cloud TPU for only 10 steps. When I got to the Evaluate step, I hit the error below.
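For context, the fine-tuning cell is the default one from the Colab, just run with a very small number of steps. Roughly (a sketch; PRETRAINED_DIR stands for the notebook's path to the pretrained "small" checkpoint, and the argument names follow t5.models.MtfModel.finetune as it appears in the traceback below):

FINETUNE_STEPS = 10

model.finetune(
    mixture_or_task_name="trivia_all",
    pretrained_model_dir=PRETRAINED_DIR,
    finetune_steps=FINETUNE_STEPS,
)

Here's the full log and traceback from the Evaluate step: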

cluster_def {
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.42.100.162:8470"
    }
  }
}
isolate_session_state: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({'worker': ['10.42.100.162:8470']}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.42.100.162:8470', '_evaluation_master': 'grpc://10.42.100.162:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': <tensorflow.python.distribute.cluster_resolver.tpu_cluster_resolver.TPUClusterResolver object at 0x7ff5b7ac5d30>}
INFO:tensorflow:_TPUContext: eval_on_tpu True
INFO:absl:Load dataset info from gs://pzeng-text-summarization/data/trivia_qa/unfiltered.nocontext/1.1.0
INFO:absl:Reusing dataset trivia_qa (gs://pzeng-text-summarization/data/trivia_qa/unfiltered.nocontext/1.1.0)
INFO:absl:Constructing tf.data.Dataset for split validation, from gs://pzeng-text-summarization/data/trivia_qa/unfiltered.nocontext/1.1.0
INFO:tensorflow:Checkpoint path gs://pzeng-text-summarization/models/small/model.ckpt-1000000
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py:1666: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Querying Tensorflow master (grpc://10.42.100.162:8470) for TPU system metadata.
INFO:tensorflow:Initializing TPU system (master: grpc://10.42.100.162:8470) to fetch topology for model parallelism. This might take a while.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, -7994874036451659721)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, -3523617659788618067)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, -4048149551514886005)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 6093926613079630450)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 7325952127528587010)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 3444661511630420339)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 9115199638201821400)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, -3608199526541325998)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, -8898913299861370745)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 8589934592, 8000259732499624943)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 453783514470826081)
INFO:tensorflow:Calling model_fn.
INFO:absl:Load dataset info from gs://pzeng-text-summarization/data/trivia_qa/unfiltered.nocontext/1.1.0
INFO:absl:Reusing dataset trivia_qa (gs://pzeng-text-summarization/data/trivia_qa/unfiltered.nocontext/1.1.0)
INFO:absl:Constructing tf.data.Dataset for split validation, from gs://pzeng-text-summarization/data/trivia_qa/unfiltered.nocontext/1.1.0
INFO:tensorflow:enable_2d_tiling: False
INFO:tensorflow:num_cores_per_replica: 1
INFO:tensorflow:computation_shape: [1, 1, 1]
INFO:tensorflow:num_replicas: 8
INFO:tensorflow:device_assignment.topology.device_coordinates: [[[0 0 0]
  [0 0 1]
  [1 0 0]
  [1 0 1]
  [0 1 0]
  [0 1 1]
  [1 1 0]
  [1 1 1]]]
INFO:tensorflow:device_assignment.core_assignment: [[[0 0 0]]

 [[0 0 1]]

 [[0 1 0]]

 [[0 1 1]]

 [[1 0 0]]

 [[1 0 1]]

 [[1 1 0]]

 [[1 1 1]]]
WARNING:tensorflow:SimdMeshImpl ignoring devices ['', '', '', '', '', '', '', '']
INFO:tensorflow:SimdMeshImpl init: Shape[batch=8] LayoutRules{('heads', 'model'), ('experts', 'batch'), ('d_ff', 'model'), ('vocab', 'model'), ('batch', 'batch'), ('ensemble', 'ensemble')}
INFO:tensorflow:Device Assignment: <tensorflow.python.tpu.device_assignment.DeviceAssignment object at 0x7ff5b63d77f0>
WARNING:tensorflow:Using default tf glorot_uniform_initializer for variable encoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions  based on dimension order.
WARNING:tensorflow:Using default tf glorot_uniform_initializer for variable decoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions  based on dimension order.
WARNING:tensorflow:Using default tf glorot_uniform_initializer for variable decoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions  based on dimension order.
INFO:tensorflow:Create pnum_tensor
INFO:tensorflow:Casting <dtype: 'int32'> to float32 for allreduce
INFO:tensorflow:Casting <dtype: 'int32'> to float32 for allreduce
INFO:tensorflow:Casting <dtype: 'int32'> to float32 for allreduce
INFO:tensorflow:Variable decoder/block_000/layer_000/SelfAttention/k                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_000/layer_000/SelfAttention/o                  size 262144       slice_size 262144       Shape[heads=512, d_model=512]                               
INFO:tensorflow:Variable decoder/block_000/layer_000/SelfAttention/q                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_000/layer_000/SelfAttention/v                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_000/layer_001/EncDecAttention/k                size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_000/layer_001/EncDecAttention/o                size 262144       slice_size 262144       Shape[heads=512, d_model=512]                               
INFO:tensorflow:Variable decoder/block_000/layer_001/EncDecAttention/q                size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_000/layer_001/EncDecAttention/v                size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_000/layer_002/DenseReluDense/wi/kernel         size 1048576      slice_size 1048576      Shape[d_model=512, d_ff=2048]                               
INFO:tensorflow:Variable decoder/block_000/layer_002/DenseReluDense/wo/kernel         size 1048576      slice_size 1048576      Shape[d_ff=2048, d_model=512]                               
INFO:tensorflow:Variable decoder/block_001/layer_000/SelfAttention/k                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_001/layer_000/SelfAttention/o                  size 262144       slice_size 262144       Shape[heads=512, d_model=512]                               
INFO:tensorflow:Variable decoder/block_001/layer_000/SelfAttention/q                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_001/layer_000/SelfAttention/v                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_001/layer_001/EncDecAttention/k                size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_001/layer_001/EncDecAttention/o                size 262144       slice_size 262144       Shape[heads=512, d_model=512]                               
INFO:tensorflow:Variable decoder/block_001/layer_001/EncDecAttention/q                size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_001/layer_001/EncDecAttention/v                size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_001/layer_002/DenseReluDense/wi/kernel         size 1048576      slice_size 1048576      Shape[d_model=512, d_ff=2048]                               
INFO:tensorflow:Variable decoder/block_001/layer_002/DenseReluDense/wo/kernel         size 1048576      slice_size 1048576      Shape[d_ff=2048, d_model=512]                               
INFO:tensorflow:Variable decoder/block_002/layer_000/SelfAttention/k                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_002/layer_000/SelfAttention/o                  size 262144       slice_size 262144       Shape[heads=512, d_model=512]                               
INFO:tensorflow:Variable decoder/block_002/layer_000/SelfAttention/q                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_002/layer_000/SelfAttention/v                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_002/layer_001/EncDecAttention/k                size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_002/layer_001/EncDecAttention/o                size 262144       slice_size 262144       Shape[heads=512, d_model=512]                               
INFO:tensorflow:Variable decoder/block_002/layer_001/EncDecAttention/q                size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_002/layer_001/EncDecAttention/v                size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_002/layer_002/DenseReluDense/wi/kernel         size 1048576      slice_size 1048576      Shape[d_model=512, d_ff=2048]                               
INFO:tensorflow:Variable decoder/block_002/layer_002/DenseReluDense/wo/kernel         size 1048576      slice_size 1048576      Shape[d_ff=2048, d_model=512]                               
INFO:tensorflow:Variable decoder/block_003/layer_000/SelfAttention/k                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_003/layer_000/SelfAttention/o                  size 262144       slice_size 262144       Shape[heads=512, d_model=512]                               
INFO:tensorflow:Variable decoder/block_003/layer_000/SelfAttention/q                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_003/layer_000/SelfAttention/v                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_003/layer_001/EncDecAttention/k                size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_003/layer_001/EncDecAttention/o                size 262144       slice_size 262144       Shape[heads=512, d_model=512]                               
INFO:tensorflow:Variable decoder/block_003/layer_001/EncDecAttention/q                size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_003/layer_001/EncDecAttention/v                size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_003/layer_002/DenseReluDense/wi/kernel         size 1048576      slice_size 1048576      Shape[d_model=512, d_ff=2048]                               
INFO:tensorflow:Variable decoder/block_003/layer_002/DenseReluDense/wo/kernel         size 1048576      slice_size 1048576      Shape[d_ff=2048, d_model=512]                               
INFO:tensorflow:Variable decoder/block_004/layer_000/SelfAttention/k                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_004/layer_000/SelfAttention/o                  size 262144       slice_size 262144       Shape[heads=512, d_model=512]                               
INFO:tensorflow:Variable decoder/block_004/layer_000/SelfAttention/q                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_004/layer_000/SelfAttention/v                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_004/layer_001/EncDecAttention/k                size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_004/layer_001/EncDecAttention/o                size 262144       slice_size 262144       Shape[heads=512, d_model=512]                               
INFO:tensorflow:Variable decoder/block_004/layer_001/EncDecAttention/q                size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_004/layer_001/EncDecAttention/v                size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_004/layer_002/DenseReluDense/wi/kernel         size 1048576      slice_size 1048576      Shape[d_model=512, d_ff=2048]                               
INFO:tensorflow:Variable decoder/block_004/layer_002/DenseReluDense/wo/kernel         size 1048576      slice_size 1048576      Shape[d_ff=2048, d_model=512]                               
INFO:tensorflow:Variable decoder/block_005/layer_000/SelfAttention/k                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_005/layer_000/SelfAttention/o                  size 262144       slice_size 262144       Shape[heads=512, d_model=512]                               
INFO:tensorflow:Variable decoder/block_005/layer_000/SelfAttention/q                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_005/layer_000/SelfAttention/v                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_005/layer_001/EncDecAttention/k                size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_005/layer_001/EncDecAttention/o                size 262144       slice_size 262144       Shape[heads=512, d_model=512]                               
INFO:tensorflow:Variable decoder/block_005/layer_001/EncDecAttention/q                size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_005/layer_001/EncDecAttention/v                size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable decoder/block_005/layer_002/DenseReluDense/wi/kernel         size 1048576      slice_size 1048576      Shape[d_model=512, d_ff=2048]                               
INFO:tensorflow:Variable decoder/block_005/layer_002/DenseReluDense/wo/kernel         size 1048576      slice_size 1048576      Shape[d_ff=2048, d_model=512]                               
INFO:tensorflow:Variable encoder/block_000/layer_000/SelfAttention/k                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable encoder/block_000/layer_000/SelfAttention/o                  size 262144       slice_size 262144       Shape[heads=512, d_model=512]                               
INFO:tensorflow:Variable encoder/block_000/layer_000/SelfAttention/q                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable encoder/block_000/layer_000/SelfAttention/v                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable encoder/block_000/layer_001/DenseReluDense/wi/kernel         size 1048576      slice_size 1048576      Shape[d_model=512, d_ff=2048]                               
INFO:tensorflow:Variable encoder/block_000/layer_001/DenseReluDense/wo/kernel         size 1048576      slice_size 1048576      Shape[d_ff=2048, d_model=512]                               
INFO:tensorflow:Variable encoder/block_001/layer_000/SelfAttention/k                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable encoder/block_001/layer_000/SelfAttention/o                  size 262144       slice_size 262144       Shape[heads=512, d_model=512]                               
INFO:tensorflow:Variable encoder/block_001/layer_000/SelfAttention/q                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable encoder/block_001/layer_000/SelfAttention/v                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable encoder/block_001/layer_001/DenseReluDense/wi/kernel         size 1048576      slice_size 1048576      Shape[d_model=512, d_ff=2048]                               
INFO:tensorflow:Variable encoder/block_001/layer_001/DenseReluDense/wo/kernel         size 1048576      slice_size 1048576      Shape[d_ff=2048, d_model=512]                               
INFO:tensorflow:Variable encoder/block_002/layer_000/SelfAttention/k                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable encoder/block_002/layer_000/SelfAttention/o                  size 262144       slice_size 262144       Shape[heads=512, d_model=512]                               
INFO:tensorflow:Variable encoder/block_002/layer_000/SelfAttention/q                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable encoder/block_002/layer_000/SelfAttention/v                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable encoder/block_002/layer_001/DenseReluDense/wi/kernel         size 1048576      slice_size 1048576      Shape[d_model=512, d_ff=2048]                               
INFO:tensorflow:Variable encoder/block_002/layer_001/DenseReluDense/wo/kernel         size 1048576      slice_size 1048576      Shape[d_ff=2048, d_model=512]                               
INFO:tensorflow:Variable encoder/block_003/layer_000/SelfAttention/k                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable encoder/block_003/layer_000/SelfAttention/o                  size 262144       slice_size 262144       Shape[heads=512, d_model=512]                               
INFO:tensorflow:Variable encoder/block_003/layer_000/SelfAttention/q                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable encoder/block_003/layer_000/SelfAttention/v                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable encoder/block_003/layer_001/DenseReluDense/wi/kernel         size 1048576      slice_size 1048576      Shape[d_model=512, d_ff=2048]                               
INFO:tensorflow:Variable encoder/block_003/layer_001/DenseReluDense/wo/kernel         size 1048576      slice_size 1048576      Shape[d_ff=2048, d_model=512]                               
INFO:tensorflow:Variable encoder/block_004/layer_000/SelfAttention/k                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable encoder/block_004/layer_000/SelfAttention/o                  size 262144       slice_size 262144       Shape[heads=512, d_model=512]                               
INFO:tensorflow:Variable encoder/block_004/layer_000/SelfAttention/q                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable encoder/block_004/layer_000/SelfAttention/v                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable encoder/block_004/layer_001/DenseReluDense/wi/kernel         size 1048576      slice_size 1048576      Shape[d_model=512, d_ff=2048]                               
INFO:tensorflow:Variable encoder/block_004/layer_001/DenseReluDense/wo/kernel         size 1048576      slice_size 1048576      Shape[d_ff=2048, d_model=512]                               
INFO:tensorflow:Variable encoder/block_005/layer_000/SelfAttention/k                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable encoder/block_005/layer_000/SelfAttention/o                  size 262144       slice_size 262144       Shape[heads=512, d_model=512]                               
INFO:tensorflow:Variable encoder/block_005/layer_000/SelfAttention/q                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable encoder/block_005/layer_000/SelfAttention/v                  size 262144       slice_size 262144       Shape[d_model=512, heads=512]                               
INFO:tensorflow:Variable encoder/block_005/layer_001/DenseReluDense/wi/kernel         size 1048576      slice_size 1048576      Shape[d_model=512, d_ff=2048]                               
INFO:tensorflow:Variable encoder/block_005/layer_001/DenseReluDense/wo/kernel         size 1048576      slice_size 1048576      Shape[d_ff=2048, d_model=512]                               
INFO:tensorflow:Variable shared/embedding                                             size 16449536     slice_size 16449536     Shape[vocab=32128, d_model=512]                             
INFO:tensorflow:Variable stacked/encoder/block_000/layer_000/SelfAttention/relative_attention_bias size 512          slice_size 512          Shape[stacked=2, heads=8, buckets=32]                       
INFO:tensorflow:    encoder/block_000/layer_000/SelfAttention/relative_attention_bias
INFO:tensorflow:    decoder/block_000/layer_000/SelfAttention/relative_attention_bias
INFO:tensorflow:Variable stacked/encoder/block_000/layer_000/layer_norm/scale         size 16384        slice_size 16384        Shape[stacked=32, d_model=512]                              
INFO:tensorflow:    encoder/block_000/layer_000/layer_norm/scale
INFO:tensorflow:    encoder/block_000/layer_001/layer_norm/scale
INFO:tensorflow:    encoder/block_001/layer_000/layer_norm/scale
INFO:tensorflow:    encoder/block_001/layer_001/layer_norm/scale
INFO:tensorflow:    encoder/block_002/layer_000/layer_norm/scale
INFO:tensorflow:    encoder/block_002/layer_001/layer_norm/scale
INFO:tensorflow:    encoder/block_003/layer_000/layer_norm/scale
INFO:tensorflow:    encoder/block_003/layer_001/layer_norm/scale
INFO:tensorflow:    encoder/block_004/layer_000/layer_norm/scale
INFO:tensorflow:    encoder/block_004/layer_001/layer_norm/scale
INFO:tensorflow:    encoder/block_005/layer_000/layer_norm/scale
INFO:tensorflow:    encoder/block_005/layer_001/layer_norm/scale
INFO:tensorflow:    encoder/final_layer_norm/scale
INFO:tensorflow:    decoder/block_000/layer_000/layer_norm/scale
INFO:tensorflow:    decoder/block_000/layer_001/layer_norm/scale
INFO:tensorflow:    decoder/block_000/layer_002/layer_norm/scale
INFO:tensorflow:    decoder/block_001/layer_000/layer_norm/scale
INFO:tensorflow:    decoder/block_001/layer_001/layer_norm/scale
INFO:tensorflow:    decoder/block_001/layer_002/layer_norm/scale
INFO:tensorflow:    decoder/block_002/layer_000/layer_norm/scale
INFO:tensorflow:    decoder/block_002/layer_001/layer_norm/scale
INFO:tensorflow:    decoder/block_002/layer_002/layer_norm/scale
INFO:tensorflow:    decoder/block_003/layer_000/layer_norm/scale
INFO:tensorflow:    decoder/block_003/layer_001/layer_norm/scale
INFO:tensorflow:    decoder/block_003/layer_002/layer_norm/scale
INFO:tensorflow:    decoder/block_004/layer_000/layer_norm/scale
INFO:tensorflow:    decoder/block_004/layer_001/layer_norm/scale
INFO:tensorflow:    decoder/block_004/layer_002/layer_norm/scale
INFO:tensorflow:    decoder/block_005/layer_000/layer_norm/scale
INFO:tensorflow:    decoder/block_005/layer_001/layer_norm/scale
INFO:tensorflow:    decoder/block_005/layer_002/layer_norm/scale
INFO:tensorflow:    decoder/final_layer_norm/scale
INFO:tensorflow:Trainable Variables            count: 99      Total size: 60506624         Total slice_size: 60506624       
INFO:tensorflow:All Variables                  count: 99      Total size: 60506624         Total slice_size: 60506624       
INFO:tensorflow:Counters:
allconcat: 2.36e+06
 allconcat/0: 2.36e+06
  allconcat/0/reshape_op: 2.36e+06
allreduce: 8
 allreduce/[0]: 8
  allreduce/[0]/reduce_op: 8
einsum: 1.26e+13
einsum_unique: 1.26e+13
output: 6.28e+10
 output/AddOperation: 1.01e+10
 output/BinaryOpWithBroadcasting: 1.05e+08
 output/Constant: 8.05e+08
 output/EinsumOperation: 2.32e+10
 output/ImportOperation: 1.31e+06
 output/MinMaxOperation: 2.49e+06
 output/OneHotOperation: 8.47e+09
 output/RangeOperation: 2.05e+03
 output/ReduceOperation: 4.19e+07
 output/ReshapeOperation: 5.71e+09
 output/ScalarAddOperation: 7.34e+06
 output/ScalarMultiplyOperation: 1.79e+08
 output/ShiftOperation: 1.31e+05
 output/SlicewiseOperation: 1.05e+10
 output/StackedVariable: 1.35e+05
 output/StopGradient: 2.42e+09
 output/UnstackOperation: 1.35e+05
 output/Variable: 4.84e+08
 output/WhileLoopOperation: 8.05e+08
output_unique: 6.23e+10
 output_unique/AddOperation: 1.01e+10
 output_unique/BinaryOpWithBroadcasting: 1.02e+08
 output_unique/Constant: 8.05e+08
 output_unique/EinsumOperation: 2.32e+10
 output_unique/ImportOperation: 1.64e+05
 output_unique/MinMaxOperation: 4.26e+05
 output_unique/OneHotOperation: 8.43e+09
 output_unique/RangeOperation: 256
 output_unique/ReduceOperation: 4.19e+07
 output_unique/ReshapeOperation: 5.71e+09
 output_unique/ScalarAddOperation: 4.59e+06
 output_unique/ScalarMultiplyOperation: 1.74e+08
 output_unique/ShiftOperation: 1.31e+05
 output_unique/SlicewiseOperation: 1.05e+10
 output_unique/StackedVariable: 1.69e+04
 output_unique/StopGradient: 2.42e+09
 output_unique/UnstackOperation: 1.69e+04
 output_unique/Variable: 6.05e+07
 output_unique/WhileLoopOperation: 8.05e+08
variables: 6.05e+07
 variables/trainable: 6.05e+07
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:TPU job name worker
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://pzeng-text-summarization/models/small/model.ckpt-1000000
INFO:tensorflow:prediction_loop marked as finished
WARNING:tensorflow:Reraising captured error
---------------------------------------------------------------------------
DataLossError                             Traceback (most recent call last)
<ipython-input-17-8af7c6ccbcf9> in <module>()
      3 model.eval(
      4     mixture_or_task_name="trivia_all",
----> 5     checkpoint_steps="all"
      6     # checkpoint_steps=1
      7 )

24 frames
/usr/local/lib/python3.6/dist-packages/t5/models/mtf_model.py in eval(self, mixture_or_task_name, checkpoint_steps, summary_dir, split)
    265     utils.eval_model(self.estimator(vocabulary), vocabulary,
    266                      self._sequence_length, self.batch_size, split,
--> 267                      self._model_dir, dataset_fn, summary_dir, checkpoint_steps)
    268 
    269   def finetune(self, mixture_or_task_name, finetune_steps, pretrained_model_dir,

/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/transformer/utils.py in eval_model(estimator, vocabulary, sequence_length, batch_size, dataset_split, model_dir, eval_dataset_fn, eval_summary_dir, eval_checkpoint_step)
   1300     tf.logging.info("Checkpoint path %s" % checkpoint_path)
   1301     global_step = int(get_step_from_checkpoint_path(checkpoint_path))
-> 1302     decodes = decode(estimator, input_fn, vocabulary, checkpoint_path)
   1303     for eval_dataset in eval_datasets:
   1304       # Extract the portion of decodes corresponding to this dataset

/usr/local/lib/python3.6/dist-packages/gin/config.py in gin_wrapper(*args, **kwargs)
   1076       scope_info = " in scope '{}'".format(scope_str) if scope_str else ''
   1077       err_str = err_str.format(name, fn_or_cls, scope_info)
-> 1078       utils.augment_exception_message_and_reraise(e, err_str)
   1079 
   1080   return gin_wrapper

/usr/local/lib/python3.6/dist-packages/gin/utils.py in augment_exception_message_and_reraise(exception, message)
     47   if six.PY3:
     48     ExceptionProxy.__qualname__ = type(exception).__qualname__
---> 49     six.raise_from(proxy.with_traceback(exception.__traceback__), None)
     50   else:
     51     six.reraise(proxy, None, sys.exc_info()[2])

/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)

/usr/local/lib/python3.6/dist-packages/gin/config.py in gin_wrapper(*args, **kwargs)
   1053 
   1054     try:
-> 1055       return fn(*new_args, **new_kwargs)
   1056     except Exception as e:  # pylint: disable=broad-except
   1057       err_str = ''

/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/transformer/utils.py in decode(estimator, input_fn, vocabulary, checkpoint_path)
    869 
    870   decodes = []
--> 871   for i, result in enumerate(result_iter):
    872     input_string = _maybe_detokenize(
    873         result["inputs"], inputs_vocabulary(vocabulary))

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in predict(self, input_fn, predict_keys, hooks, checkpoint_path, yield_single_examples)
   3124     finally:
   3125       rendezvous.record_done('prediction_loop')
-> 3126       rendezvous.raise_errors()
   3127 
   3128     rendezvous.record_done('prediction_loop')

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py in raise_errors(self, timeout_sec)
    148       else:
    149         tf.compat.v1.logging.warn('Reraising captured error')
--> 150         six.reraise(typ, value, traceback)
    151 
    152     for k, (typ, value, traceback) in kept_errors:

/usr/local/lib/python3.6/dist-packages/six.py in reraise(tp, value, tb)
    701             if value.__traceback__ is not tb:
    702                 raise value.with_traceback(tb)
--> 703             raise value
    704         finally:
    705             value = None

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in predict(self, input_fn, predict_keys, hooks, checkpoint_path, yield_single_examples)
   3118           hooks=hooks,
   3119           checkpoint_path=checkpoint_path,
-> 3120           yield_single_examples=yield_single_examples):
   3121         yield result
   3122     except Exception:  # pylint: disable=broad-except

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py in predict(self, input_fn, predict_keys, hooks, checkpoint_path, yield_single_examples)
    627                 scaffold=estimator_spec.scaffold,
    628                 config=self._session_config),
--> 629             hooks=all_hooks) as mon_sess:
    630           while not mon_sess.should_stop():
    631             preds_evaluated = mon_sess.run(predictions)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py in __init__(self, session_creator, hooks, stop_grace_period_secs)
   1036         hooks,
   1037         should_recover=True,
-> 1038         stop_grace_period_secs=stop_grace_period_secs)
   1039 
   1040 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py in __init__(self, session_creator, hooks, should_recover, stop_grace_period_secs)
    747         stop_grace_period_secs=stop_grace_period_secs)
    748     if should_recover:
--> 749       self._sess = _RecoverableSession(self._coordinated_creator)
    750     else:
    751       self._sess = self._coordinated_creator.create_session()

/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py in __init__(self, sess_creator)
   1229     """
   1230     self._sess_creator = sess_creator
-> 1231     _WrappedSession.__init__(self, self._create_session())
   1232 
   1233   def _create_session(self):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py in _create_session(self)
   1234     while True:
   1235       try:
-> 1236         return self._sess_creator.create_session()
   1237       except _PREEMPTION_ERRORS as e:
   1238         logging.info(

/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py in create_session(self)
    900       """Creates a coordinated session."""
    901       # Keep the tf_sess for unit testing.
--> 902       self.tf_sess = self._session_creator.create_session()
    903       # We don't want coordinator to suppress any exception.
    904       self.coord = coordinator.Coordinator(clean_stop_exception_types=[])

/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py in create_session(self)
    667         init_op=self._scaffold.init_op,
    668         init_feed_dict=self._scaffold.init_feed_dict,
--> 669         init_fn=self._scaffold.init_fn)
    670 
    671 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/session_manager.py in prepare_session(self, master, init_op, saver, checkpoint_dir, checkpoint_filename_with_path, wait_for_checkpoint, max_wait_secs, config, init_feed_dict, init_fn)
    293         wait_for_checkpoint=wait_for_checkpoint,
    294         max_wait_secs=max_wait_secs,
--> 295         config=config)
    296     if not is_loaded_from_checkpoint:
    297       if init_op is None and not init_fn and self._local_init_op is None:

/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/session_manager.py in _restore_checkpoint(self, master, saver, checkpoint_dir, checkpoint_filename_with_path, wait_for_checkpoint, max_wait_secs, config)
    207 
    208     if checkpoint_filename_with_path:
--> 209       saver.restore(sess, checkpoint_filename_with_path)
    210       return sess, True
    211 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py in restore(self, sess, save_path)
   1297       else:
   1298         sess.run(self.saver_def.restore_op_name,
-> 1299                  {self.saver_def.filename_tensor_name: save_path})
   1300     except errors.NotFoundError as err:
   1301       # There are three common conditions that might cause this error:

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    956     try:
    957       result = self._run(None, fetches, feed_dict, options_ptr,
--> 958                          run_metadata_ptr)
    959       if run_metadata:
    960         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1179     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1180       results = self._do_run(handle, final_targets, final_fetches,
-> 1181                              feed_dict_tensor, options, run_metadata)
   1182     else:
   1183       results = []

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1357     if handle is None:
   1358       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1359                            run_metadata)
   1360     else:
   1361       return self._do_call(_prun_fn, handle, feeds, fetches)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1382                     '\nsession_config.graph_options.rewrite_options.'
   1383                     'disable_meta_optimizer = True')
-> 1384       raise type(e)(node_def, op, message)
   1385 
   1386   def _extend_graph(self):

DataLossError: From /job:worker/replica:0/task:0:
not an sstable (bad magic number)
     [[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py:3806) ]]

Original stack trace for 'save/RestoreV2':
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 664, in launch_instance
    app.start()
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py", line 499, in start
    self.io_loop.start()
  File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 456, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 486, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 438, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2828, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-17-8af7c6ccbcf9>", line 5, in <module>
    checkpoint_steps="all"
  File "/usr/local/lib/python3.6/dist-packages/t5/models/mtf_model.py", line 267, in eval
    self._model_dir, dataset_fn, summary_dir, checkpoint_steps)
  File "/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/transformer/utils.py", line 1302, in eval_model
    decodes = decode(estimator, input_fn, vocabulary, checkpoint_path)
  File "/usr/local/lib/python3.6/dist-packages/gin/config.py", line 1055, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/transformer/utils.py", line 871, in decode
    for i, result in enumerate(result_iter):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3120, in predict
    yield_single_examples=yield_single_examples):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 629, in predict
    hooks=all_hooks) as mon_sess:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1038, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 749, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1231, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1236, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 902, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 660, in create_session
    self._scaffold.finalize()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3806, in _finalize
    wrapped_finalize()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 235, in finalize
    self._saver = training_saver._get_saver_or_default()  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 607, in _get_saver_or_default
    saver = Saver(sharded=True, allow_empty=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 836, in __init__
    self.build()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 848, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 886, in _build
    build_restore=build_restore)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 510, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 389, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 336, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 583, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1506, in restore_v2
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 744, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3327, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1791, in __init__
    self._traceback = tf_stack.extract_stack()

  In call to configurable 'decode' (<function decode at 0x7ff6a85bcbf8>)

I haven't changed the default Evaluate code that came with the Colab notebook:

# Use a larger batch size for evaluation, which requires less memory.
model.batch_size = train_batch_size * 4
model.eval(
    mixture_or_task_name="trivia_all",
    checkpoint_steps="all"
)

Here's the link to my notebook: https://colab.research.google.com/drive/1846Xp0UpEgdNTlmKcP0mcvOtsdeqLrxa?usp=sharing

And here's a screenshot of the objects inside my models/small bucket.

Could the problem be that I've fine-tuned the model for too few steps?
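For what it's worth, a quick way to sanity-check that the checkpoint files in the bucket are readable at all would be something like this (a sketch; the bucket path and the variable name "shared/embedding" are taken from the log above):

import tensorflow as tf

ckpt_dir = "gs://pzeng-text-summarization/models/small"

# Which checkpoint files exist, and which one eval will pick up.
print(tf.io.gfile.glob(ckpt_dir + "/model.ckpt-*"))
print(tf.train.latest_checkpoint(ckpt_dir))

# Try to actually read a tensor back; a corrupt or partially written
# checkpoint data file tends to fail here with a similar error.
reader = tf.train.load_checkpoint(ckpt_dir + "/model.ckpt-1000000")
print(reader.get_tensor("shared/embedding").shape)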

Thank you!

masterzzzen commented 4 years ago

I realized that in

model = t5.models.MtfModel(
    model_dir=MODEL_DIR,
    tpu=TPU_ADDRESS,
    tpu_topology=TPU_TOPOLOGY,
    model_parallelism=model_parallelism,
    batch_size=train_batch_size,
    sequence_length={"inputs": 128, "targets": 32},
    learning_rate_schedule=0.003,
    save_checkpoints_steps=200,
    keep_checkpoint_max=keep_checkpoint_max if ON_CLOUD else None,
    iterations_per_loop=100,
)

save_checkpoints_steps=200 must be smaller than FINETUNE_STEPS. Lowering it solved the problem; a sketch of the change is below.
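For anyone else who hits this, the adjustment looks roughly like the following (the value 5 is just an example; the point is that save_checkpoints_steps ends up smaller than FINETUNE_STEPS, which was 10 in my run):

model = t5.models.MtfModel(
    model_dir=MODEL_DIR,
    tpu=TPU_ADDRESS,
    tpu_topology=TPU_TOPOLOGY,
    model_parallelism=model_parallelism,
    batch_size=train_batch_size,
    sequence_length={"inputs": 128, "targets": 32},
    learning_rate_schedule=0.003,
    save_checkpoints_steps=5,  # was 200; keep this smaller than FINETUNE_STEPS
    keep_checkpoint_max=keep_checkpoint_max if ON_CLOUD else None,
    iterations_per_loop=100,
)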