google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0

TPU training on Colab: ValueError during pretraining #440

Closed git-jhyang closed 3 years ago

git-jhyang commented 3 years ago

I tried to train T5 from scratch on a Colab TPU and got the following error:

INFO:tensorflow:Using config: {'_model_dir': './output/', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.98.250.170:8470"
    }
  }
}
isolate_session_state: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({'worker': ['10.98.250.170:8470']}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.98.250.170:8470', '_evaluation_master': 'grpc://10.98.250.170:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': <tensorflow.python.distribute.cluster_resolver.tpu.tpu_cluster_resolver.TPUClusterResolver object at 0x7f9e7be64400>}
INFO:tensorflow:_TPUContext: eval_on_tpu True
INFO:tensorflow:Querying Tensorflow master (grpc://10.98.250.170:8470) for TPU system metadata.
INFO:tensorflow:Initializing TPU system (master: grpc://10.98.250.170:8470) to fetch topology for model parallelism. This might take a while.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, -672437715863466660)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 2589290303200140023)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, -5080775782802129919)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, -2766800321800475990)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, -1555876536208459395)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 830296862469565083)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, -5163406694298225017)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, -890418804849257372)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, 1262589871115899699)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 8589934592, 575588647642566338)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 5137178083114000107)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
/usr/local/lib/python3.6/dist-packages/t5/data/utils.py:273: UserWarning: Creating resources inside a function passed to Dataset.map() is not supported. Create each resource outside the function, and capture it inside the function to use it.
  return dataset.map(my_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
INFO:tensorflow:tokens_length=568 inputs_length=512 targets_length=114 noise_density=0.15 mean_noise_span_length=3.0 
INFO:tensorflow:num_cores_per_replica: 1
INFO:tensorflow:computation_shape: [1, 1, 1, 1]
INFO:tensorflow:num_replicas: 8
INFO:tensorflow:device_assignment.topology.device_coordinates: [[[0 0 0 0]
  [0 0 0 1]
  [1 0 0 0]
  [1 0 0 1]
  [0 1 0 0]
  [0 1 0 1]
  [1 1 0 0]
  [1 1 0 1]]]
INFO:tensorflow:device_assignment.core_assignment: [[[0 0 0 0]]

 [[0 0 0 1]]

 [[1 0 0 0]]

 [[1 0 0 1]]

 [[0 1 0 0]]

 [[0 1 0 1]]

 [[1 1 0 0]]

 [[1 1 0 1]]]
INFO:tensorflow:auto_logical_to_physical_tpu logical_shape=[4, 2] physical_shape=[2, 2, 2]
INFO:tensorflow:auto_logical_to_physical_tpu logical_shape=[2] physical_shape=[1, 1, 2]
INFO:tensorflow:auto_logical_to_physical_tpu logical_to_physical = [(0, 0, 0), (0, 0, 1)]
INFO:tensorflow:auto_logical_to_physical_tpu logical_to_physical = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 0), (1, 1, 1), (1, 0, 0), (1, 0, 1)]
WARNING:tensorflow:SimdMeshImpl ignoring devices ['', '', '', '', '', '', '', '']
INFO:tensorflow:SimdMeshImpl init: Shape[batch=4, model=2] LayoutRules{('batch', 'batch'), ('d_ff', 'model'), ('ensemble', 'ensemble'), ('experts', 'batch'), ('heads', 'model'), ('vocab', 'model')}
INFO:tensorflow:Device Assignment: <tensorflow.python.tpu.device_assignment.DeviceAssignment object at 0x7f9e7aa359e8>
INFO:tensorflow:serialize_num_microbatches: tokens_per_microbatch_per_replica=2048 batch_dim=Dimension(name='batch', size=128) sequence_length={'inputs': 512, 'targets': 512} batch_per_replica=32 num_microbatches=8
WARNING:tensorflow:Using default tf glorot_uniform_initializer for variable encoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions  based on dimension order.
WARNING:tensorflow:Using default tf glorot_uniform_initializer for variable decoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions  based on dimension order.
INFO:tensorflow:Create pnum_tensor
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-e2fe5da2e880> in <module>()
     31     )
     32 
---> 33 model.train(mixture_or_task_name='trivia_all', steps=1000)

/usr/local/lib/python3.6/dist-packages/t5/models/mtf_model.py in train(self, mixture_or_task_name, steps, init_checkpoint, split)
    235     utils.train_model(self.estimator(vocabulary, init_checkpoint), vocabulary,
    236                       self._sequence_length, self.batch_size, dataset_fn,
--> 237                       steps, self._ensemble_inputs, dataset_split=split)
    238 
    239   def eval(self, mixture_or_task_name, checkpoint_steps=None, summary_dir=None,

/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/transformer/utils.py in train_model(estimator, vocabulary, sequence_length, batch_size, train_dataset_fn, train_steps, ensemble_inputs, dataset_split)
   1496     return dataset
   1497 
-> 1498   estimator.train(input_fn=input_fn, max_steps=train_steps)
   1499 
   1500 

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
   3087     finally:
   3088       rendezvous.record_done('training_loop')
-> 3089       rendezvous.raise_errors()
   3090 
   3091   def evaluate(self,

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py in raise_errors(self, timeout_sec)
    148       else:
    149         tf.compat.v1.logging.warn('Reraising captured error')
--> 150         six.reraise(typ, value, traceback)
    151 
    152     for k, (typ, value, traceback) in kept_errors:

/usr/local/lib/python3.6/dist-packages/six.py in reraise(tp, value, tb)
    701             if value.__traceback__ is not tb:
    702                 raise value.with_traceback(tb)
--> 703             raise value
    704         finally:
    705             value = None

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
   3082           steps=steps,
   3083           max_steps=max_steps,
-> 3084           saving_listeners=saving_listeners)
   3085     except Exception:  # pylint: disable=broad-except
   3086       rendezvous.record_error('training_loop', sys.exc_info())

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
    347 
    348       saving_listeners = _check_listeners_type(saving_listeners)
--> 349       loss = self._train_model(input_fn, hooks, saving_listeners)
    350       logging.info('Loss for final step: %s.', loss)
    351       return self

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py in _train_model(self, input_fn, hooks, saving_listeners)
   1173       return self._train_model_distributed(input_fn, hooks, saving_listeners)
   1174     else:
-> 1175       return self._train_model_default(input_fn, hooks, saving_listeners)
   1176 
   1177   def _train_model_default(self, input_fn, hooks, saving_listeners):

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py in _train_model_default(self, input_fn, hooks, saving_listeners)
   1202       worker_hooks.extend(input_hooks)
   1203       estimator_spec = self._call_model_fn(features, labels, ModeKeys.TRAIN,
-> 1204                                            self.config)
   1205       global_step_tensor = tf.compat.v1.train.get_global_step(g)
   1206       return self._train_with_estimator_spec(estimator_spec, worker_hooks,

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _call_model_fn(self, features, labels, mode, config)
   2919     else:
   2920       return super(TPUEstimator, self)._call_model_fn(features, labels, mode,
-> 2921                                                       config)
   2922 
   2923   def _call_model_fn_for_inference(self, features, labels, mode, config):

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py in _call_model_fn(self, features, labels, mode, config)
   1161 
   1162     logging.info('Calling model_fn.')
-> 1163     model_fn_results = self._model_fn(features=features, **kwargs)
   1164     logging.info('Done calling model_fn.')
   1165 

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _model_fn(features, labels, mode, config, params)
   3210         if mode == model_fn_lib.ModeKeys.TRAIN:
   3211           compile_op, loss, host_call, scaffold_fn, training_hooks = (
-> 3212               _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn))
   3213           if ctx.embedding_config:
   3214             g = tf.compat.v1.get_default_graph()

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn)
   3652       num_shards=ctx.num_replicas,
   3653       outputs_from_all_shards=False,
-> 3654       device_assignment=ctx.device_assignment)
   3655 
   3656   loss = loss[0]

/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/tpu.py in split_compile_and_shard(computation, inputs, num_shards, input_shard_axes, outputs_from_all_shards, output_shard_axes, infeed_queue, device_assignment, name)
   1663       infeed_queue=infeed_queue,
   1664       device_assignment=device_assignment,
-> 1665       name=name)
   1666 
   1667   # There must be at least one shard since num_shards > 0.

/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/tpu.py in split_compile_and_replicate(***failed resolving arguments***)
   1378       vscope.set_custom_getter(custom_getter)
   1379 
-> 1380       outputs = computation(*computation_inputs)
   1381 
   1382       vscope.set_use_resource(saved_use_resource)

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in multi_tpu_train_steps_on_single_shard(replica_id)
   3638           lambda i, loss: i < iterations_per_loop_var,
   3639           lambda i, loss: [i + 1, single_tpu_train_step(i)],
-> 3640           inputs=[0, _INITIAL_LOSS])
   3641       return outputs[1:]
   3642 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/training_loop.py in while_loop(***failed resolving arguments***)
    176     inputs = [array_ops.constant(0)]
    177   return control_flow_ops.while_loop(
--> 178       condition_wrapper, body_wrapper, inputs, name="", parallel_iterations=1)
    179 
    180 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py in while_loop(cond, body, loop_vars, shape_invariants, parallel_iterations, back_prop, swap_memory, name, maximum_iterations, return_same_structure)
   2694         name=name,
   2695         return_same_structure=return_same_structure,
-> 2696         back_prop=back_prop)
   2697 
   2698   with ops.name_scope(name, "while", loop_vars):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/while_v2.py in while_loop(cond, body, loop_vars, shape_invariants, parallel_iterations, maximum_iterations, name, return_same_structure, back_prop)
    194         func_graph=util.WhileBodyFuncGraph(
    195             body_name, collections=ops.get_default_graph()._collections),  # pylint: disable=protected-access
--> 196         add_control_dependencies=add_control_dependencies)
    197     # Add external captures of body to the list of loop vars.
    198     # Note that external tensors will be treated as loop invariants, i.e.,

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py in func_graph_from_py_func(name, python_func, args, kwargs, signature, func_graph, autograph, autograph_options, add_control_dependencies, arg_names, op_return_value, collections, capture_by_value, override_flat_arg_shapes)
    984         _, original_func = tf_decorator.unwrap(python_func)
    985 
--> 986       func_outputs = python_func(*func_args, **func_kwargs)
    987 
    988       # invariant: `func_outputs` contains only Tensors, CompositeTensors,

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/while_v2.py in wrapped_body(loop_counter, maximum_iterations_arg, *args)
    172       # `orig_loop_vars` and `args`, converts flows in `args` to TensorArrays
    173       # and packs it into the structure of `orig_loop_vars`.
--> 174       outputs = body(*_pack_sequence_as(orig_loop_vars, args))
    175       if not nest.is_sequence_or_composite(outputs):
    176         outputs = [outputs]

/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/training_loop.py in body_wrapper(*inputs)
    119     else:
    120       dequeue_ops = []
--> 121     outputs = body(*(inputs + dequeue_ops))
    122 
    123     # If the computation only returned one value, make it a tuple.

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in <lambda>(i, loss)
   3637       outputs = training_loop.while_loop(
   3638           lambda i, loss: i < iterations_per_loop_var,
-> 3639           lambda i, loss: [i + 1, single_tpu_train_step(i)],
   3640           inputs=[0, _INITIAL_LOSS])
   3641       return outputs[1:]

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in train_step(step)
   1751 
   1752       estimator_spec = self._verify_estimator_spec(
-> 1753           self._call_model_fn(features, labels))
   1754       loss, train_op = estimator_spec.loss, estimator_spec.train_op
   1755 

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _call_model_fn(self, features, labels, is_export_mode)
   2029       _add_item_to_params(params, _CTX_KEY, user_context)
   2030 
-> 2031     estimator_spec = self._model_fn(features=features, **kwargs)
   2032     if (running_on_cpu and
   2033         isinstance(estimator_spec, model_fn_lib._TPUEstimatorSpec)):  # pylint: disable=protected-access

/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/transformer/utils.py in my_model_fn(***failed resolving arguments***)
    670           graph, {mesh: mesh_impl},
    671           autostack=autostack,
--> 672           log_file=model_info_file)
    673 
    674       tf_loss = lowering.export_to_tf_tensor(loss)

/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/ops.py in __init__(self, graph, mesh_to_impl, autostack, log_file)
    724       # tf.logging.info("Lowering operation %s" % op.to_string)
    725       with tf.name_scope(op.name):
--> 726         op.lower(self)
    727       for out in op.outputs:
    728         self.add_counter(

/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/ops.py in lower(self, lowering)
   4036     mesh_impl = lowering.mesh_impl(self)
   4037     with utils.outside_all_rewrites():
-> 4038       sv = mesh_impl.LaidOutVariable(self, mesh_impl)
   4039     lowering.variables[self] = sv
   4040     lowering.set_tensor_lowering(

/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/simd_mesh_impl.py in __init__(self, variable, mesh_impl)
    186               dtype=variable.slice_dtype,
    187               name=slice_var_name,
--> 188               expected_shape=slice_shape)
    189 
    190         slices.append(slice_var)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in __call__(cls, *args, **kwargs)
    258   def __call__(cls, *args, **kwargs):
    259     if cls is VariableV1:
--> 260       return cls._variable_v1_call(*args, **kwargs)
    261     elif cls is Variable:
    262       return cls._variable_v2_call(*args, **kwargs)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in _variable_v1_call(cls, initial_value, trainable, collections, validate_shape, caching_device, name, variable_def, dtype, expected_shape, import_scope, constraint, use_resource, synchronization, aggregation, shape)
    219         synchronization=synchronization,
    220         aggregation=aggregation,
--> 221         shape=shape)
    222 
    223   def _variable_v2_call(cls,

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in <lambda>(**kwargs)
    197                         shape=None):
    198     """Call on Variable class. Useful to force the signature."""
--> 199     previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
    200     for _, getter in ops.get_default_graph()._variable_creator_stack:  # pylint: disable=protected-access
    201       previous_getter = _make_getter(getter, previous_getter)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py in default_variable_creator(next_creator, **kwargs)
   2595         synchronization=synchronization,
   2596         aggregation=aggregation,
-> 2597         shape=shape)
   2598   else:
   2599     return variables.RefVariable(

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in __call__(cls, *args, **kwargs)
    262       return cls._variable_v2_call(*args, **kwargs)
    263     else:
--> 264       return super(VariableMetaclass, cls).__call__(*args, **kwargs)
    265 
    266 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py in __init__(self, initial_value, trainable, collections, validate_shape, caching_device, name, dtype, variable_def, import_scope, constraint, distribute_strategy, synchronization, aggregation, shape)
   1516           aggregation=aggregation,
   1517           shape=shape,
-> 1518           distribute_strategy=distribute_strategy)
   1519 
   1520   def _init_from_args(self,

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py in _init_from_args(self, initial_value, trainable, collections, caching_device, name, dtype, constraint, synchronization, aggregation, distribute_strategy, shape)
   1599     if isinstance(initial_value, ops.Tensor) and hasattr(
   1600         initial_value, "graph") and initial_value.graph.building_function:
-> 1601       raise ValueError("Tensor-typed variable initializers must either be "
   1602                        "wrapped in an init_scope or callable "
   1603                        "(e.g., `tf.Variable(lambda : "

ValueError: Tensor-typed variable initializers must either be wrapped in an init_scope or callable (e.g., `tf.Variable(lambda : tf.truncated_normal([10, 40]))`) when building functions. Please file a feature request if this restriction inconveniences you.

I also found that the variable shared/embedding_slice_0 was not properly initialized:

NAME: shared/embedding_slice_0
OBJECT: Tensor("shared/embedding/zeros:0", shape=(15104, 768), dtype=float32, device=/job:worker/task:0/device:CPU:0)

while the other variables were initialized correctly, for example:

NAME: decoder/final_layer_norm/scale_slot_v
OBJECT: <function _VariableStore._get_single_variable.<locals>.<lambda> at 0x7f0628468840>

Here is a link to the source notebook: https://github.com/Livenn/t5_test
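
For reference, the restriction described in the error can be reproduced outside of T5. A minimal sketch in plain TF2 (not specific to this codebase; the shape just mirrors the hint in the error message):

import tensorflow as tf

@tf.function
def build_bad():
    # A plain Tensor used as the initializer while a function graph is
    # being traced raises the ValueError quoted above:
    return tf.Variable(tf.zeros([10, 40]))

@tf.function
def build_ok():
    # Wrapping the initializer in a callable defers its evaluation until
    # the variable is actually initialized, which is allowed:
    return tf.Variable(lambda: tf.zeros([10, 40]))

build_ok()   # fine
build_bad()  # ValueError: Tensor-typed variable initializers must ...

This matches the dump above: shared/embedding_slice_0 received a plain zeros Tensor as its initializer, while the healthy variables received a callable (the <lambda> shown).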

craffel commented 3 years ago

You need to call tf.disable_v2_behavior() after importing TensorFlow in order to use Mesh TensorFlow. Please re-open this issue if that doesn't work.
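
For completeness, a minimal sketch of the suggested ordering at the top of the notebook; the compat.v1 alias is one common way to do this, and the model setup is as in the notebook above:

import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()  # must run before any t5 / Mesh TensorFlow code builds a graph

import t5

# ... register tasks/mixtures and build the model as before, e.g.:
# model = t5.models.MtfModel(...)
# model.train(mixture_or_task_name='trivia_all', steps=1000)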