Closed: git-jhyang closed this issue 3 years ago.
I tried to train T5 from scratch on a Colab TPU and got the following error:
```
INFO:tensorflow:Using config: {'_model_dir': './output/', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true cluster_def { job { name: "worker" tasks { key: 0 value: "10.98.250.170:8470" } } } isolate_session_state: true, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({'worker': ['10.98.250.170:8470']}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.98.250.170:8470', '_evaluation_master': 'grpc://10.98.250.170:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': <tensorflow.python.distribute.cluster_resolver.tpu.tpu_cluster_resolver.TPUClusterResolver object at 0x7f9e7be64400>}
INFO:tensorflow:_TPUContext: eval_on_tpu True
INFO:tensorflow:Querying Tensorflow master (grpc://10.98.250.170:8470) for TPU system metadata.
INFO:tensorflow:Initializing TPU system (master: grpc://10.98.250.170:8470) to fetch topology for model parallelism. This might take a while.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, -672437715863466660)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 2589290303200140023)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, -5080775782802129919)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, -2766800321800475990)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, -1555876536208459395)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 830296862469565083)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, -5163406694298225017)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, -890418804849257372)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, 1262589871115899699)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 8589934592, 575588647642566338)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 5137178083114000107)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
/usr/local/lib/python3.6/dist-packages/t5/data/utils.py:273: UserWarning: Creating resources inside a function passed to Dataset.map() is not supported. Create each resource outside the function, and capture it inside the function to use it.
  return dataset.map(my_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
INFO:tensorflow:tokens_length=568 inputs_length=512 targets_length=114 noise_density=0.15 mean_noise_span_length=3.0
INFO:tensorflow:num_cores_per_replica: 1
INFO:tensorflow:computation_shape: [1, 1, 1, 1]
INFO:tensorflow:num_replicas: 8
INFO:tensorflow:device_assignment.topology.device_coordinates: [[[0 0 0 0] [0 0 0 1] [1 0 0 0] [1 0 0 1] [0 1 0 0] [0 1 0 1] [1 1 0 0] [1 1 0 1]]]
INFO:tensorflow:device_assignment.core_assignment: [[[0 0 0 0]] [[0 0 0 1]] [[1 0 0 0]] [[1 0 0 1]] [[0 1 0 0]] [[0 1 0 1]] [[1 1 0 0]] [[1 1 0 1]]]
INFO:tensorflow:auto_logical_to_physical_tpu logical_shape=[4, 2] physical_shape=[2, 2, 2]
INFO:tensorflow:auto_logical_to_physical_tpu logical_shape=[2] physical_shape=[1, 1, 2]
INFO:tensorflow:auto_logical_to_physical_tpu logical_to_physical = [(0, 0, 0), (0, 0, 1)]
INFO:tensorflow:auto_logical_to_physical_tpu logical_to_physical = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 0), (1, 1, 1), (1, 0, 0), (1, 0, 1)]
WARNING:tensorflow:SimdMeshImpl ignoring devices ['', '', '', '', '', '', '', '']
INFO:tensorflow:SimdMeshImpl init: Shape[batch=4, model=2] LayoutRules{('batch', 'batch'), ('d_ff', 'model'), ('ensemble', 'ensemble'), ('experts', 'batch'), ('heads', 'model'), ('vocab', 'model')}
INFO:tensorflow:Device Assignment: <tensorflow.python.tpu.device_assignment.DeviceAssignment object at 0x7f9e7aa359e8>
INFO:tensorflow:serialize_num_microbatches: tokens_per_microbatch_per_replica=2048 batch_dim=Dimension(name='batch', size=128) sequence_length={'inputs': 512, 'targets': 512} batch_per_replica=32 num_microbatches=8
WARNING:tensorflow:Using default tf glorot_uniform_initializer for variable encoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions based on dimension order.
WARNING:tensorflow:Using default tf glorot_uniform_initializer for variable decoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions based on dimension order.
INFO:tensorflow:Create pnum_tensor
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-e2fe5da2e880> in <module>()
     31 )
     32
---> 33 model.train(mixture_or_task_name='trivia_all', steps=1000)

/usr/local/lib/python3.6/dist-packages/t5/models/mtf_model.py in train(self, mixture_or_task_name, steps, init_checkpoint, split)
--> 237     steps, self._ensemble_inputs, dataset_split=split)

/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/transformer/utils.py in train_model(estimator, vocabulary, sequence_length, batch_size, train_dataset_fn, train_steps, ensemble_inputs, dataset_split)
-> 1498   estimator.train(input_fn=input_fn, max_steps=train_steps)

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
-> 3089       rendezvous.raise_errors()

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py in raise_errors(self, timeout_sec)
--> 150       six.reraise(typ, value, traceback)

/usr/local/lib/python3.6/dist-packages/six.py in reraise(tp, value, tb)
--> 703             raise value

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
-> 3084           saving_listeners=saving_listeners)

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
--> 349     loss = self._train_model(input_fn, hooks, saving_listeners)

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py in _train_model(self, input_fn, hooks, saving_listeners)
-> 1175       return self._train_model_default(input_fn, hooks, saving_listeners)

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py in _train_model_default(self, input_fn, hooks, saving_listeners)
-> 1204                                            self.config)

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _call_model_fn(self, features, labels, mode, config)
-> 2921                                                          config)

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py in _call_model_fn(self, features, labels, mode, config)
-> 1163     model_fn_results = self._model_fn(features=features, **kwargs)

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _model_fn(features, labels, mode, config, params)
-> 3212             _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn))

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn)
-> 3654       device_assignment=ctx.device_assignment)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/tpu.py in split_compile_and_shard(computation, inputs, num_shards, input_shard_axes, outputs_from_all_shards, output_shard_axes, infeed_queue, device_assignment, name)
-> 1665       name=name)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/tpu.py in split_compile_and_replicate(***failed resolving arguments***)
-> 1380     outputs = computation(*computation_inputs)

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in multi_tpu_train_steps_on_single_shard(replica_id)
-> 3640         inputs=[0, _INITIAL_LOSS])

/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/training_loop.py in while_loop(***failed resolving arguments***)
--> 178       condition_wrapper, body_wrapper, inputs, name="", parallel_iterations=1)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py in while_loop(cond, body, loop_vars, shape_invariants, parallel_iterations, back_prop, swap_memory, name, maximum_iterations, return_same_structure)
-> 2696         back_prop=back_prop)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/while_v2.py in while_loop(cond, body, loop_vars, shape_invariants, parallel_iterations, maximum_iterations, name, return_same_structure, back_prop)
--> 196       add_control_dependencies=add_control_dependencies)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py in func_graph_from_py_func(name, python_func, args, kwargs, signature, func_graph, autograph, autograph_options, add_control_dependencies, arg_names, op_return_value, collections, capture_by_value, override_flat_arg_shapes)
--> 986     func_outputs = python_func(*func_args, **func_kwargs)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/while_v2.py in wrapped_body(loop_counter, maximum_iterations_arg, *args)
--> 174     outputs = body(*_pack_sequence_as(orig_loop_vars, args))

/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/training_loop.py in body_wrapper(*inputs)
--> 121     outputs = body(*(inputs + dequeue_ops))

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in <lambda>(i, loss)
-> 3639         lambda i, loss: [i + 1, single_tpu_train_step(i)],

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in train_step(step)
-> 1753           self._call_model_fn(features, labels))

/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _call_model_fn(self, features, labels, is_export_mode)
-> 2031     estimator_spec = self._model_fn(features=features, **kwargs)

/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/transformer/utils.py in my_model_fn(***failed resolving arguments***)
--> 672         log_file=model_info_file)

/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/ops.py in __init__(self, graph, mesh_to_impl, autostack, log_file)
--> 726         op.lower(self)

/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/ops.py in lower(self, lowering)
-> 4038       sv = mesh_impl.LaidOutVariable(self, mesh_impl)

/usr/local/lib/python3.6/dist-packages/mesh_tensorflow/simd_mesh_impl.py in __init__(self, variable, mesh_impl)
--> 188             expected_shape=slice_shape)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in __call__(cls, *args, **kwargs)
--> 260       return cls._variable_v1_call(*args, **kwargs)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in _variable_v1_call(cls, initial_value, trainable, collections, validate_shape, caching_device, name, variable_def, dtype, expected_shape, import_scope, constraint, use_resource, synchronization, aggregation, shape)
--> 221         shape=shape)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in <lambda>(**kwargs)
--> 199     previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py in default_variable_creator(next_creator, **kwargs)
-> 2597         shape=shape)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in __call__(cls, *args, **kwargs)
--> 264       return super(VariableMetaclass, cls).__call__(*args, **kwargs)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py in __init__(self, initial_value, trainable, collections, validate_shape, caching_device, name, dtype, variable_def, import_scope, constraint, distribute_strategy, synchronization, aggregation, shape)
-> 1518         distribute_strategy=distribute_strategy)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py in _init_from_args(self, initial_value, trainable, collections, caching_device, name, dtype, constraint, synchronization, aggregation, distribute_strategy, shape)
-> 1601           raise ValueError("Tensor-typed variable initializers must either be "

ValueError: Tensor-typed variable initializers must either be wrapped in an init_scope or callable (e.g., `tf.Variable(lambda : tf.truncated_normal([10, 40]))`) when building functions. Please file a feature request if this restriction inconveniences you.
```
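For context, the model construction and the failing call look roughly like this (a minimal sketch reconstructed from the log and traceback above; exact hyperparameters and the `trivia_all` task registration are in the notebook linked below):

```python
import t5

# Sketch only: the values below are read off the log above
# (batch_dim size=128, sequence_length 512/512, mesh Shape[batch=4, model=2]).
model = t5.models.MtfModel(
    model_dir='./output/',
    tpu='grpc://10.98.250.170:8470',  # TPU address from the log
    model_parallelism=2,
    batch_size=128,
    sequence_length={'inputs': 512, 'targets': 512},
)

# Cell line 33 in the traceback: this is the call that raises the ValueError.
model.train(mixture_or_task_name='trivia_all', steps=1000)
```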
I also found that the variable named `shared/embedding_slice_0` was not properly initialized:
```
NAME: shared/embedding_slice_0
OBJECT: Tensor("shared/embedding/zeros:0", shape=(15104, 768), dtype=float32, device=/job:worker/task:0/device:CPU:0)
```
while the other variables were initialized as follows:
```
NAME: decoder/final_layer_norm/scale_slot_v
OBJECT: <function _VariableStore._get_single_variable.<locals>.<lambda> at 0x7f0628468840>
```
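Output like the NAME/OBJECT lines above can be produced with a variable-creator scope; this is only an illustrative sketch, not the notebook's actual code (`build_model` is a hypothetical placeholder for the graph-building step):

```python
import tensorflow.compat.v1 as tf

def logging_creator(next_creator, **kwargs):
    # Print what each variable will be initialized with: a Tensor
    # (problematic inside a traced function, per the ValueError above)
    # or a callable (fine).
    print('NAME:', kwargs.get('name'), 'OBJECT:', kwargs.get('initial_value'))
    return next_creator(**kwargs)

with tf.variable_creator_scope(logging_creator):
    build_model()  # hypothetical placeholder for the notebook's model-building code
```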
Here is a link to the source notebook: https://github.com/Livenn/t5_test
You need to call `tf.disable_v2_behavior()` after importing TensorFlow in order to use Mesh TensorFlow:

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()
```

Please re-open if this doesn't work.
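A minimal sanity check, assuming a TF 2.x Colab runtime: after disabling v2 behavior, eager execution should report as off, which is the graph mode that Mesh TensorFlow's variable creation expects.

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

# Eager execution is what breaks graph-mode variable creation,
# so it should now be disabled.
assert not tf.executing_eagerly()
```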