google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

module 'tensorflow_estimator.python.estimator.api._v1.estimator.tpu' has no attribute 'CrossShardOptimizer' #1135

Closed liuyibox closed 4 years ago

liuyibox commented 4 years ago

I am trying to continue pretraining BERT from Google's pretrained checkpoint on a Colab TPU. Until yesterday everything worked fine, but all day today I have been hitting this 'CrossShardOptimizer' error. I am wondering whether it is caused by a code base change or a version migration.

tf version: 1.15.2
python: 3.6
bert-tensorflow: 1.0.3

INFO:tensorflow: Input Files (MSL-128)
INFO:tensorflow: gs://vbert/input/vmware-docs-2020-reddit_non-wwm_msl-128_vocab-vmware-unused.tfrecord
INFO:tensorflow: Input Files (MSL-512)
INFO:tensorflow: gs://vbert/input/vmware-docs-2020-reddit_non-wwm_msl-512_vocab-vmware-unused.tfrecord
WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7f5054197bf8>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Using config: {'_model_dir': 'gs://vbert/liuyi-vbert-docs-reddit/base/vocab-vmware-unused', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 10000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true cluster_def { job { name: "worker" tasks { key: 0 value: "10.47.24.194:8470" } } } isolate_session_state: true , '_keep_checkpoint_max': 10000, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f505413deb8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.47.24.194:8470', '_evaluation_master': 'grpc://10.47.24.194:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=10000, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': <tensorflow.python.distribute.cluster_resolver.tpu_cluster_resolver.TPUClusterResolver object at 0x7f505413dc50>}
INFO:tensorflow:_TPUContext: eval_on_tpu True
INFO:tensorflow: Running training
INFO:tensorflow: Batch size = 32
INFO:tensorflow:Querying Tensorflow master (grpc://10.47.24.194:8470) for TPU system metadata.
INFO:tensorflow:Found TPU system:
INFO:tensorflow: Num TPU Cores: 8
INFO:tensorflow: Num TPU Workers: 1
INFO:tensorflow: Num TPU Cores Per Worker: 8
INFO:tensorflow: Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, 18293633603678532293)
INFO:tensorflow: Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 16754746863277155707)
INFO:tensorflow: Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 12168993875110325416)
INFO:tensorflow: Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 5785133627713800739)
INFO:tensorflow: Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 531464872121750804)
INFO:tensorflow: Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 13610383926908237188)
INFO:tensorflow: Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 3588204162670013970)
INFO:tensorflow: Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 5523440629424163654)
INFO:tensorflow: Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, 9311023021754933234)
INFO:tensorflow: Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 8589934592, 17907827073552055203)
INFO:tensorflow: Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 5163179840106115260)
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:Entity <function input_fn_builder.<locals>.input_fn.<locals>.<lambda> at 0x7f50541971e0> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: module 'gast' has no attribute 'Str'
WARNING: Entity <function input_fn_builder.<locals>.input_fn.<locals>.<lambda> at 0x7f50541971e0> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: module 'gast' has no attribute 'Str'
INFO:tensorflow:Found small feature: next_sentence_labels [4, 1]
INFO:tensorflow:Found small feature: next_sentence_labels [4, 1]
INFO:tensorflow:Found small feature: next_sentence_labels [4, 1]
INFO:tensorflow:Found small feature: next_sentence_labels [4, 1]
INFO:tensorflow:Found small feature: next_sentence_labels [4, 1]
INFO:tensorflow:Found small feature: next_sentence_labels [4, 1]
INFO:tensorflow:Found small feature: next_sentence_labels [4, 1]
INFO:tensorflow:Found small feature: next_sentence_labels [4, 1]
INFO:tensorflow: Features
INFO:tensorflow: name = input_ids, shape = (4, 128)
INFO:tensorflow: name = input_mask, shape = (4, 128)
INFO:tensorflow: name = masked_lm_ids, shape = (4, 20)
INFO:tensorflow: name = masked_lm_positions, shape = (4, 20)
INFO:tensorflow: name = masked_lm_weights, shape = (4, 20)
INFO:tensorflow: name = next_sentence_labels, shape = (4, 1)
INFO:tensorflow: name = segment_ids, shape = (4, 128)
INFO:tensorflow: Trainable Variables
ERROR:tensorflow:Error recorded from training_loop: module 'tensorflow_estimator.python.estimator.api._v1.estimator.tpu' has no attribute 'CrossShardOptimizer'
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error


AttributeError Traceback (most recent call last)

in <module>()
      3 start_time = datetime.now()
      4 FLAGS.training_start_time = start_time
----> 5 main()
      6 print("Pretraining took", datetime.now() - start_time)

25 frames

in main()
     93 max_predictions_per_seq=FLAGS.max_predictions_per_seq,
     94 is_training=True)
---> 95 estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps, saving_listeners=[listener])
     96
     97 FLAGS.loop_times = loop_times

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
   3033 finally:
   3034 rendezvous.record_done('training_loop')
-> 3035 rendezvous.raise_errors()
   3036
   3037 def evaluate(self,

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/error_handling.py in raise_errors(self, timeout_sec)
    134 else:
    135 logging.warn('Reraising captured error')
--> 136 six.reraise(typ, value, traceback)
    137
    138 for k, (typ, value, traceback) in kept_errors:

/usr/local/lib/python3.6/dist-packages/six.py in reraise(tp, value, tb)
    701 if value.__traceback__ is not tb:
    702 raise value.with_traceback(tb)
--> 703 raise value
    704 finally:
    705 value = None

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
   3028 steps=steps,
   3029 max_steps=max_steps,
-> 3030 saving_listeners=saving_listeners)
   3031 except Exception: # pylint: disable=broad-except
   3032 rendezvous.record_error('training_loop', sys.exc_info())

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
    368
    369 saving_listeners = _check_listeners_type(saving_listeners)
--> 370 loss = self._train_model(input_fn, hooks, saving_listeners)
    371 logging.info('Loss for final step: %s.', loss)
    372 return self

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py in _train_model(self, input_fn, hooks, saving_listeners)
   1159 return self._train_model_distributed(input_fn, hooks, saving_listeners)
   1160 else:
-> 1161 return self._train_model_default(input_fn, hooks, saving_listeners)
   1162
   1163 def _train_model_default(self, input_fn, hooks, saving_listeners):

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py in _train_model_default(self, input_fn, hooks, saving_listeners)
   1189 worker_hooks.extend(input_hooks)
   1190 estimator_spec = self._call_model_fn(
-> 1191 features, labels, ModeKeys.TRAIN, self.config)
   1192 global_step_tensor = training_util.get_global_step(g)
   1193 return self._train_with_estimator_spec(estimator_spec, worker_hooks,

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _call_model_fn(self, features, labels, mode, config)
   2855 else:
   2856 return super(TPUEstimator, self)._call_model_fn(features, labels, mode,
-> 2857 config)
   2858 else:
   2859 if mode == _INFERENCE_ON_TPU_MODE:

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py in _call_model_fn(self, features, labels, mode, config)
   1147
   1148 logging.info('Calling model_fn.')
-> 1149 model_fn_results = self._model_fn(features=features, **kwargs)
   1150 logging.info('Done calling model_fn.')
   1151

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _model_fn(features, labels, mode, config, params)
   3157 if mode == model_fn_lib.ModeKeys.TRAIN:
   3158 compile_op, loss, host_call, scaffold_fn, training_hooks = (
-> 3159 _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn))
   3160 if ctx.embedding_config:
   3161 g = ops.get_default_graph()

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn)
   3602 num_shards=ctx.num_replicas,
   3603 outputs_from_all_shards=False,
-> 3604 device_assignment=ctx.device_assignment)
   3605
   3606 loss = loss[0]

/tensorflow-1.15.2/python3.6/tensorflow_core/python/tpu/tpu.py in split_compile_and_shard(computation, inputs, num_shards, input_shard_axes, outputs_from_all_shards, output_shard_axes, infeed_queue, device_assignment, name)
   1275 infeed_queue=infeed_queue,
   1276 device_assignment=device_assignment,
-> 1277 name=name)
   1278
   1279 # There must be at least one shard since num_shards > 0.

/tensorflow-1.15.2/python3.6/tensorflow_core/python/tpu/tpu.py in split_compile_and_replicate(***failed resolving arguments***)
    990 vscope.set_custom_getter(custom_getter)
    991
--> 992 outputs = computation(*computation_inputs)
    993
    994 vscope.set_use_resource(saved_use_resource)

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in multi_tpu_train_steps_on_single_shard(replica_id)
   3587 lambda i, loss: i < iterations_per_loop_var,
   3588 lambda i, loss: [i + 1, single_tpu_train_step(i)],
-> 3589 inputs=[0, _INITIAL_LOSS])
   3590 return outputs[1:]
   3591

/tensorflow-1.15.2/python3.6/tensorflow_core/python/tpu/training_loop.py in while_loop(***failed resolving arguments***)
    176 inputs = [array_ops.constant(0)]
    177 return control_flow_ops.while_loop(
--> 178 condition_wrapper, body_wrapper, inputs, name="", parallel_iterations=1)
    179
    180

/tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/control_flow_ops.py in while_loop(cond, body, loop_vars, shape_invariants, parallel_iterations, back_prop, swap_memory, name, maximum_iterations, return_same_structure)
   2751 ops.add_to_collection(ops.GraphKeys.WHILE_CONTEXT, loop_context)
   2752 result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants,
-> 2753 return_same_structure)
   2754 if maximum_iterations is not None:
   2755 return result[1]

/tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/control_flow_ops.py in BuildLoop(self, pred, body, loop_vars, shape_invariants, return_same_structure)
   2243 with ops.get_default_graph()._mutation_lock(): # pylint: disable=protected-access
   2244 original_body_result, exit_vars = self._BuildLoop(
-> 2245 pred, body, original_loop_vars, loop_vars, shape_invariants)
   2246 finally:
   2247 self.Exit()

/tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/control_flow_ops.py in _BuildLoop(self, pred, body, original_loop_vars, loop_vars, shape_invariants)
   2168 expand_composites=True)
   2169 pre_summaries = ops.get_collection(ops.GraphKeys._SUMMARY_COLLECTION) # pylint: disable=protected-access
-> 2170 body_result = body(*packed_vars_for_body)
   2171 post_summaries = ops.get_collection(ops.GraphKeys._SUMMARY_COLLECTION) # pylint: disable=protected-access
   2172 if not nest.is_sequence_or_composite(body_result):

/tensorflow-1.15.2/python3.6/tensorflow_core/python/tpu/training_loop.py in body_wrapper(*inputs)
    119 else:
    120 dequeue_ops = []
--> 121 outputs = body(*(inputs + dequeue_ops))
    122
    123 # If the computation only returned one value, make it a tuple.

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in <lambda>(i, loss)
   3586 outputs = training_loop.while_loop(
   3587 lambda i, loss: i < iterations_per_loop_var,
-> 3588 lambda i, loss: [i + 1, single_tpu_train_step(i)],
   3589 inputs=[0, _INITIAL_LOSS])
   3590 return outputs[1:]

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in train_step(step)
   1713
   1714 estimator_spec = self._verify_estimator_spec(
-> 1715 self._call_model_fn(features, labels))
   1716 loss, train_op = estimator_spec.loss, estimator_spec.train_op
   1717

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py in _call_model_fn(self, features, labels, is_export_mode)
   1992 _add_item_to_params(params, _CTX_KEY, user_context)
   1993
-> 1994 estimator_spec = self._model_fn(features=features, **kwargs)
   1995 if (running_on_cpu and
   1996 isinstance(estimator_spec, model_fn_lib._TPUEstimatorSpec)): # pylint: disable=protected-access

in model_fn(features, labels, mode, params)
     67 if mode == tf.estimator.ModeKeys.TRAIN:
     68 train_op = optimization.create_optimizer(
---> 69 total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
     70
     71 output_spec = tf.contrib.tpu.TPUEstimatorSpec(

/usr/local/lib/python3.6/dist-packages/bert/optimization.py in create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu)
     66
     67 if use_tpu:
---> 68 optimizer = tf.estimator.tpu.CrossShardOptimizer(optimizer)
     69
     70 tvars = tf.trainable_variables()

/tensorflow-1.15.2/python3.6/tensorflow_core/python/util/module_wrapper.py in __getattr__(self, name)
    191 def __getattr__(self, name):
    192 try:
--> 193 attr = getattr(self._tfmw_wrapped_module, name)
    194 except AttributeError:
    195 if not self._tfmw_public_apis:

AttributeError: module 'tensorflow_estimator.python.estimator.api._v1.estimator.tpu' has no attribute 'CrossShardOptimizer'

Any insights and discussions are appreciated. Thanks.

liuyibox commented 4 years ago

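This was caused by a bert-tensorflow version update (1.0.3 calls tf.estimator.tpu.CrossShardOptimizer, which TF 1.15.2 does not expose). It is resolved by pinning the older release with "pip install bert-tensorflow==1.0.1", as mentioned here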
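
If pinning the package is not an option, one possible workaround is to look the class up under more than one namespace instead of hard-coding tf.estimator.tpu. The sketch below is not taken from the BERT code base: resolve_attr is a hypothetical helper, and the two fallback paths (compat.v1.tpu and contrib.tpu) are an assumption about where TF 1.15 exposes CrossShardOptimizer. A stand-in namespace replaces the real tensorflow module so the snippet runs without TensorFlow installed.

```python
from types import SimpleNamespace


def resolve_attr(root, dotted_paths):
    """Return the first attribute reachable from `root` among `dotted_paths`.

    Hypothetical helper: each path is walked with getattr, and a path that
    raises AttributeError at any step is skipped.
    """
    for path in dotted_paths:
        obj = root
        try:
            for part in path.split("."):
                obj = getattr(obj, part)
        except AttributeError:
            continue
        return obj
    raise AttributeError("none of %r found" % (list(dotted_paths),))


# The first path is the one from the traceback; the other two are assumed
# locations of the same class in TF 1.15.
CANDIDATES = (
    "estimator.tpu.CrossShardOptimizer",
    "compat.v1.tpu.CrossShardOptimizer",
    "contrib.tpu.CrossShardOptimizer",
)

# Stand-in for the `tensorflow` module: estimator.tpu lacks the class,
# compat.v1.tpu has it (mirroring the situation in the error message).
fake_tf = SimpleNamespace(
    estimator=SimpleNamespace(tpu=SimpleNamespace()),
    compat=SimpleNamespace(
        v1=SimpleNamespace(
            tpu=SimpleNamespace(CrossShardOptimizer="CrossShardOptimizer stand-in")
        )
    ),
)

found = resolve_attr(fake_tf, CANDIDATES)
print(found)
```

With a real installation, the call would be resolve_attr(tf, CANDIDATES) after import tensorflow as tf. Patching the installed bert/optimization.py this way is more fragile than pinning bert-tensorflow==1.0.1, so the pin remains the simpler fix.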