11B model training on TPU V3-512 Crashes during training

agemagician commented 4 years ago

Hello,

We have started large scale training for the 11B model on TPU V3-512, but the model keeps crashing and trying to recover during training:

I0626 06:03:25.061087 139703188657984 basic_session_run_hooks.py:614] Calling checkpoint listeners before saving checkpoint 8300...
INFO:tensorflow:Before Save.
I0626 06:03:25.061593 139703188657984 ops.py:5742] Before Save.
INFO:tensorflow:About to write a checkpoint
I0626 06:03:29.075753 139703188657984 ops.py:5744] About to write a checkpoint
INFO:tensorflow:Saving checkpoints for 8300 into gs://xxxxxxxx/11b/model.ckpt.
I0626 06:03:29.076128 139703188657984 basic_session_run_hooks.py:618] Saving checkpoints for 8300 into gs://prot-transformers-eu/t5/models/uniref100/11b/model.ckpt.
INFO:tensorflow:An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: From /job:worker/replica:0/task:15:
All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: HTTP response code 503
         when resuming upload gs://xxx/11b/model.ckpt-8300_temp_289083b0ae5e4c4891b42a91ff3cf66f/
         [[node save/SaveV2_7 (defined at /site-packages/mesh_tensorflow/transformer/utils.py:720) ]]

Errors may have originated from an input operation.
Input Source operations connected to node save/SaveV2_7:
 encoder/block_000/layer_000/SelfAttention/relative_attention_bias/Read/ReadVariableOp (defined at /site-packages/mesh_tensorflow/ops.py:4020)

The error usually occurs when it tries to save a new checkpoint. When it happens, the checkpoint is not stored and training reloads the previous checkpoint.

I also noticed that the loss is heavily affected when this issue occurs. The blue line is the base version trained on Colab Pro, and the green line is the 11B version trained on the TPU pod.

Any ideas what could be causing this problem and how to overcome it?

@adarob @craffel @sharannarang @nshazeer, your feedback would be highly appreciated.

agemagician commented 4 years ago

I have tried both TensorFlow 2.2 and TensorFlow 1.5.13; the problem exists in both of them.

agemagician commented 4 years ago

I have tried mesh-tensorflow 0.1.13 and 0.1.16; the problem exists in both of them.

agemagician commented 4 years ago

This is my current running command:


python -m t5.models.mesh_transformer_main \
  --module_import="xxx_task" \
  --tpu="node-1" \
  --gcp_project="xxx" \
  --tpu_zone="europe-west4-a" \
  --model_dir="gs://xxx/11b/" \
  --gin_file="objectives/span_3_15_u_u.gin" \
  --gin_file="models/t5.1.0.11B.gin" \
  --gin_file="dataset.gin" \
  --gin_file="learning_rate_schedules/rsqrt_no_ramp_down.gin" \
  --gin_param="MIXTURE_NAME = 'task_xxx'" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology = '16x16'" \
  --gin_param="utils.tpu_mesh_shape.model_parallelism = 32" \
  --gin_param="utils.run.save_checkpoints_steps=2000" \
  --gin_param="utils.run.batch_size=('tokens_per_batch', 1048576)" \
  --gin_param="utils.run.train_steps=1000000" \
  --gin_param="utils.run.iterations_per_loop=100" \
  --gin_param="learning_rate_schedule_noam.warmup_steps=10000" \
  --gin_param="SentencePieceVocabulary.extra_ids=100" \
  --gin_param="run.perplexity_eval_steps=100"

agemagician commented 4 years ago

I have also tested it with the T5 Large version, and the same error occurs. The problem just occurs randomly when it tries to save the model.

INFO:tensorflow:An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: From /job:worker/replica:0/task:1:
All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: HTTP response code 503                                                                                      
         when resuming upload gs://xxxxx/large/model.ckpt-60_temp_838179b3efeb4ecb90a79af73c170014/                                                                    
         [[node save/SaveV2_1 (defined at /site-packages/mesh_tensorflow/transformer/utils.py:720) ]]                                                                                                     

Errors may have originated from an input operation.                                                                                                                                                       
Input Source operations connected to node save/SaveV2_1:                                                                                                                                                  
 encoder/block_019/layer_000/SelfAttention/o_slot_vr/Read/ReadVariableOp (defined at /site-packages/mesh_tensorflow/ops.py:4020)     

Original stack trace for 'save/SaveV2_1':                                                                                                                                                                 
  File "/runpy.py", line 193, in _run_module_as_main                                                                                                                                                      
    "__main__", mod_spec)                                                                                                                                                                                 
  File "/runpy.py", line 85, in _run_code                                                                                                                                                                 
    exec(code, run_globals)                                                                                                                                                                               
  File "/site-packages/t5/models/mesh_transformer_main.py", line 240, in <module>                                                                                                                         
    console_entry_point()                                                                                                                                                                                 
  File "/site-packages/t5/models/mesh_transformer_main.py", line 237, in console_entry_point                                                                                                              
    app.run(main)                                                                                                                                                                                         
  File "/site-packages/absl/app.py", line 299, in run                                                                                                                                                     
    _run_main(main, args)                                                                                                                                                                                 
  File "/site-packages/absl/app.py", line 250, in _run_main                                                                                                                                               
    sys.exit(main(argv))                                                                                                                                                                                  
  File "/site-packages/t5/models/mesh_transformer_main.py", line 231, in main
    model_dir=FLAGS.model_dir)
  File "/site-packages/gin/config.py", line 1055, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/site-packages/mesh_tensorflow/transformer/utils.py", line 2115, in run
    train_dataset_fn, train_steps, ensemble_inputs)
  File "/site-packages/mesh_tensorflow/transformer/utils.py", line 1498, in train_model
    estimator.train(input_fn=input_fn, max_steps=train_steps)
  File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3078, in train
    saving_listeners=saving_listeners)
  File "/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1182, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1211, in _train_model_default
    self.config)
  File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2915, in _call_model_fn
    config)
  File "/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1170, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3206, in _model_fn
    _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn))
  File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3648, in _train_on_tpu_system
    device_assignment=ctx.device_assignment)
  File "/site-packages/tensorflow/python/tpu/tpu.py", line 1565, in split_compile_and_shard
    name=name)
  File "/site-packages/tensorflow/python/tpu/tpu.py", line 1280, in split_compile_and_replicate
    outputs = computation(*computation_inputs)
  File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3634, in multi_tpu_train_steps_on_single_shard
    inputs=[0, _INITIAL_LOSS])
  File "/site-packages/tensorflow/python/tpu/training_loop.py", line 178, in while_loop
    condition_wrapper, body_wrapper, inputs, name="", parallel_iterations=1)
  File "/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2766, in while_loop
    return_same_structure)
  File "/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2248, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2173, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/site-packages/tensorflow/python/tpu/training_loop.py", line 121, in body_wrapper
    outputs = body(*(inputs + dequeue_ops))
  File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3633, in <lambda>
    lambda i, loss: [i + 1, single_tpu_train_step(i)],
  File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1753, in train_step
    self._call_model_fn(features, labels))
  File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2031, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "/site-packages/mesh_tensorflow/transformer/utils.py", line 720, in my_model_fn
    save_relative_paths=True)
  File "/site-packages/tensorflow/python/training/saver.py", line 836, in __init__
    self.build()
  File "/site-packages/tensorflow/python/training/saver.py", line 848, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/site-packages/tensorflow/python/training/saver.py", line 886, in _build
    build_restore=build_restore)
  File "/site-packages/tensorflow/python/training/saver.py", line 507, in _build_internal
    save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
  File "/site-packages/tensorflow/python/training/saver.py", line 299, in _AddShardedSaveOps
    return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
  File "/site-packages/tensorflow/python/training/saver.py", line 273, in _AddShardedSaveOpsForV2
    sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
  File "/site-packages/tensorflow/python/training/saver.py", line 206, in _AddSaveOps
    save = self.save_op(filename_tensor, saveables)
  File "/site-packages/tensorflow/python/training/saver.py", line 122, in save_op
    tensors)
  File "/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1717, in save_v2
    name=name)
  File "/site-packages/tensorflow/python/framework/op_def_library.py", line 744, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/site-packages/tensorflow/python/framework/ops.py", line 3327, in _create_op_internal
    op_def=op_def)
  File "/site-packages/tensorflow/python/framework/ops.py", line 1791, in __init__
    self._traceback = tf_stack.extract_stack()

agemagician commented 4 years ago

In case it's helpful:

pip list
Package                  Version
------------------------ ---------------------
absl-py                  0.9.0
astunparse               1.6.3
attrs                    19.3.0
Babel                    2.8.0
boto                     2.49.0
cachetools               4.1.0
certifi                  2020.6.20
chardet                  3.0.4
click                    7.1.2
dill                     0.3.2
distro                   1.5.0
filelock                 3.0.12
future                   0.18.2
gast                     0.3.3
gevent                   20.6.2
gin-config               0.3.0
google-api-core          1.21.0
google-api-python-client 1.9.3
google-auth              1.18.0
google-auth-httplib2     0.0.3
google-auth-oauthlib     0.4.1
google-cloud-core        1.3.0
google-cloud-storage     1.29.0
google-compute-engine    2.8.13
google-pasta             0.2.0
google-resumable-media   0.5.1
googleapis-common-protos 1.52.0
greenlet                 0.4.16
grpcio                   1.30.0
h5py                     2.10.0
httplib2                 0.18.1
idna                     2.9
importlib-metadata       1.6.1
joblib                   0.15.1
Keras-Preprocessing      1.1.2
Markdown                 3.2.2
mesh-tensorflow          0.1.16
nltk                     3.5
numpy                    1.19.0
oauth2client             4.1.3
oauthlib                 3.1.0
opt-einsum               3.2.1
packaging                20.4
pandas                   1.0.5
pip                      20.1.1
portalocker              1.7.0
promise                  2.3
protobuf                 3.12.2
pyasn1                   0.4.8
pyasn1-modules           0.2.8
pyparsing                2.4.7
python-dateutil          2.8.1
pytz                     2020.1
regex                    2020.6.8
requests                 2.24.0
requests-oauthlib        1.3.0
rouge-score              0.0.4
rsa                      4.6
sacrebleu                1.4.10
sacremoses               0.0.43
scikit-learn             0.23.1
scipy                    1.5.0
sentencepiece            0.1.91
setuptools               47.3.1.post20200622
six                      1.15.0
t5                       0.6.0
tensorboard              2.2.2
tensorboard-plugin-wit   1.6.0.post3
tensorflow               2.2.0
tensorflow-datasets      3.1.0
tensorflow-estimator     2.2.0
tensorflow-metadata      0.22.2
tensorflow-text          2.2.1
termcolor                1.1.0
tfds-nightly             3.1.0.dev202006250105
threadpoolctl            2.1.0
tokenizers               0.7.0
torch                    1.5.1
tqdm                     4.46.1
transformers             2.11.0
uritemplate              3.0.1
urllib3                  1.25.9
Werkzeug                 1.0.1
wheel                    0.34.2
wrapt                    1.12.1
zipp                     3.1.0
zope.event               4.4
zope.interface           5.1.0

agemagician commented 4 years ago

I think the error is located here:

All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: HTTP response code 503
         when resuming upload gs://xxxxx/large/model.ckpt-20_temp_42c0a8218a1547ba85e01e1f673e8502/
         [[node save/SaveV2_43 (defined at /site-packages/mesh_tensorflow/transformer/utils.py:720) ]]

How can I increase the number of retry attempts?

adarob commented 4 years ago

@craffel have you ever seen this error on cloud?

craffel commented 4 years ago

I've never seen this error. A 503 would be a "Service Unavailable" response from your GCS bucket. Assuming this happens intermittently, I think your only options are to increase the number of retries and/or contact Google Cloud for support.
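To illustrate what retrying a transient 503 involves, here is a minimal, generic sketch of retry with exponential backoff in plain Python. It is not TensorFlow's GCS client code and does not change TensorFlow's retry count; `TransientError`, `retry_with_backoff`, and the default values are hypothetical names and numbers chosen for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as an HTTP 503."""


def retry_with_backoff(fn, max_attempts=10, base_delay=1.0, max_delay=32.0,
                       sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is exhausted.

    The wait before retry k (0-based) is min(max_delay, base_delay * 2**k)
    seconds plus a random jitter of up to one second.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last failure
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0.0, 1.0))
```

Raising the attempt count only helps if the 503s are short-lived. If the bucket is being rate-limited, more retries from many writers at once may just prolong the failure.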

agemagician commented 4 years ago

Thanks a lot @adarob @craffel for your reply.

I don't know how to increase the number of retries. However, I have changed the mesh TensorFlow code, and so far I haven't gotten any errors: https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/utils.py#L714

I have enabled "restore_sequentially" and disabled "sharded". For some reason, when mesh TensorFlow tries to write many files, it fails. In Google Colab it only writes 2 files, but on the TPU Pod V3-512 it writes 64 different checkpoint files.
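For reference, the two settings above are constructor arguments of `tf.compat.v1.train.Saver`. Below is a minimal sketch of the change, assuming the saver is built roughly as in the linked `utils.py`. `saver_kwargs` is a hypothetical helper, not part of mesh TensorFlow, and the TensorFlow call is left as a comment so the snippet runs without TensorFlow installed.

```python
def saver_kwargs(single_file):
    """Build keyword arguments for tf.compat.v1.train.Saver (hypothetical helper).

    single_file=True applies the workaround: no per-device checkpoint shards
    (sharded=False) and variables restored one after another
    (restore_sequentially=True). single_file=False keeps sharded saving.
    """
    kwargs = {"save_relative_paths": True}  # appears in the traceback at utils.py:720
    if single_file:
        kwargs.update(sharded=False, restore_sequentially=True)
    else:
        kwargs.update(sharded=True, restore_sequentially=False)
    return kwargs


# With TensorFlow installed, the saver would then be created along these lines:
#   saver = tf.compat.v1.train.Saver(
#       tf.compat.v1.global_variables(), **saver_kwargs(single_file=True))
print(saver_kwargs(single_file=True))
```

The trade-off is that far fewer files are written to GCS at once, at the cost of a save and restore path that no longer runs in parallel across hosts.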

craffel commented 4 years ago

Hmm, maybe your account has some kind of limit on how quickly you can do write operations to GCS and you were going over the limit or something. Glad you found a workaround.

craffel commented 3 years ago

I've reproduced this. @agemagician's fix does seem to fix it, but it makes checkpoint reading/writing extremely slow. Not sure if there is a good workaround, since tf.Saver is probably very deprecated/not recommended to be used.

agemagician commented 3 years ago

Hi @craffel ,

I contacted the TPU team, and the problem was fixed by switching to TensorFlow 2.4. TensorFlow 2.4 fixed the checkpointing issue; there was a rate-limiting issue on the GCS integration side. Please upgrade TensorFlow to 2.4 or higher, and it should work.
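A quick way to check whether an environment meets that version is sketched below in plain Python. `version_tuple` and `has_checkpoint_fix` are hypothetical helper names, and the 2.4 threshold is simply the version reported above, not something verified independently.

```python
import re


def version_tuple(version):
    """Turn '2.2.0' into (2, 2, 0); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def has_checkpoint_fix(tf_version, minimum="2.4"):
    """Return True if tf_version is at least the minimum version."""
    return version_tuple(tf_version) >= version_tuple(minimum)


# In a real environment the argument would come from tf.__version__.
print(has_checkpoint_fix("2.2.0"))
```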

I assume one of the following:

You were using a special Google account without this restriction.
You were using internal Google infrastructure.
You were training on the TPUs using the attached servers, as with the new TPU-VM.

Let me know if this solved your problem.

craffel commented 3 years ago

Unfortunately, at some point after TF 2.2 there was some internal change that results in OOMs (probably due to rematerialization, or lack thereof) for the models I'm now training, so I need to use TF 2.2, unless you also encountered this and found a workaround...

agemagician commented 3 years ago

Hmm, no. The previous and current T5 models that I am training, even the large billion-parameter models, work fine with TF 2.4. Sorry that I could not help here.

google-research / text-to-text-transfer-transformer, issue #280