mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

BERT training encounters error: ValueError: You must specify an aggregation method to update a MirroredVariable in Replica Context. #452

Closed: alphaRGB closed this issue 3 years ago

alphaRGB commented 3 years ago

I attempted to train language_model/tensorflow/bert on 4 NVIDIA GPUs using the provided Docker image tensorflow/tensorflow:2.2.0rc0-gpu-py3. When I run training, the following error occurs:

INFO:tensorflow:++++++ warmup starts at step 0, for 10000 steps ++++++
I0219 16:39:04.603151 140357434070784 api.py:348] ++++++ warmup starts at step 0, for 10000 steps ++++++
INFO:tensorflow:using adamw
I0219 16:39:04.610883 140357434070784 api.py:348] using adamw
INFO:tensorflow:Error reported to Coordinator: in user code:

    /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1152 _call_model_fn  *
        model_fn_results = self._model_fn(features=features, **kwargs)
    run_pretraining.py:180 model_fn  *
        train_op = optimization.create_optimizer(
    /home/test/FIM/MLPerf_training/language_model/tensorflow/bert/optimization.py:89 create_optimizer  *
        train_op = optimizer.apply_gradients(
    /home/test/FIM/MLPerf_training/language_model/tensorflow/bert/optimization.py:168 apply_gradients  *
        assignments.extend(
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/values.py:798 assign  **
        return self._mirrored_update(assign_fn, *args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/values.py:762 _mirrored_update
        _aggregation_error_msg.format(variable_type="MirroredVariable"))

    ValueError: You must specify an aggregation method to update a MirroredVariable in Replica Context. You can do so by passing an explicit value for argument `aggregation` to tf.Variable(..).e.g. `tf.Variable(..., aggregation=tf.VariableAggregation.SUM)``tf.VariableAggregation` lists the possible aggregation methods.This is required because MirroredVariable should always be kept in sync. When updating them or assigning to them in a replica context, we automatically try to aggregate the values before updating the variable. For this aggregation, we need to know the aggregation method. Another alternative is to not try to update such MirroredVariable in replica context, but in cross replica context. You can enter cross replica context by calling `tf.distribute.get_replica_context().merge_call(merge_fn, ..)`.Inside `merge_fn`, you can then update the MirroredVariable using `tf.distribute.StrategyExtended.update()`.

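For reference, the ValueError above offers two ways out: create the variable with an explicit aggregation, or move the update into cross-replica context via merge_call. A minimal sketch of the first option, assuming the custom AdamW/LAMB optimizer creates its own moment variables and assigns to them inside apply_gradients (the helper and names below are illustrative, not the actual optimization.py code):

# Sketch only: create optimizer slot variables with an explicit aggregation
# so they can be assigned from replica context under MirroredStrategy.
import tensorflow as tf

def make_slot_variable(param, suffix):
    """Create a non-trainable slot (e.g. an Adam/LAMB moment) for `param`."""
    return tf.compat.v1.get_variable(
        name=param.name.split(":")[0] + "/" + suffix,
        shape=param.shape.as_list(),
        dtype=tf.float32,
        trainable=False,
        initializer=tf.zeros_initializer(),
        # An explicit aggregation avoids "You must specify an aggregation
        # method to update a MirroredVariable in Replica Context."
        aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA)

ONLY_FIRST_REPLICA is a common choice here, since the computed updates are identical on every replica after the gradient all-reduce, so taking the first replica's value keeps the mirrored copies in sync.
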
Run script: I don't have a checkpoint file, so --init_checkpoint is not set (commented out below).

TF_XLA_FLAGS='--tf_xla_auto_jit=2' \
python run_pretraining.py \
  --bert_config_file=$MODEL_CONFIG_FILE \
  --output_dir=/tmp/output/ \
  --input_file=$INPUT_FILE \
  --nodo_eval \
  --do_train \
  --eval_batch_size=8 \
  --learning_rate=4e-05 \
  --iterations_per_loop=1000 \
  --max_predictions_per_seq=76 \
  --max_seq_length=512 \
  --num_train_steps=682666666 \
  --num_warmup_steps=1562 \
  --optimizer=lamb \
  --save_checkpoints_steps=20833 \
  --start_warmup_step=0 \
  --num_gpus=4 \
  --train_batch_size=12
  # --init_checkpoint=$MODEL_CHECKPOINT
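
For context, passing --num_gpus=4 presumably makes run_pretraining.py wrap training in a tf.distribute.MirroredStrategy, which is what turns the model and optimizer variables into MirroredVariables in the first place. A rough sketch of how an Estimator is typically configured that way (illustrative only; the repo's actual flag handling may differ):

import tensorflow as tf

# One replica per visible GPU, attached to the Estimator via RunConfig.
strategy = tf.distribute.MirroredStrategy()
run_config = tf.estimator.RunConfig(
    model_dir="/tmp/output/",
    train_distribute=strategy,
    save_checkpoints_steps=20833)
# estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)

With a single GPU (no distribution strategy) the assignments in optimization.py would not go through MirroredVariable updates, which is why the error only shows up in the multi-GPU run.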