Closed: agemagician closed this issue 4 years ago.
Hello,

We have started a large-scale training run of the 11B model on a TPU v3-512 pod, but the model keeps crashing and trying to recover during training. The error usually occurs when it tries to save a new checkpoint; when this happens, the checkpoint is not stored and the previous checkpoint is reloaded. I also notice that the loss is heavily affected when this issue occurs: the blue line is the base version trained on Colab Pro and the green line is the 11B version trained on the TPU pod. Any ideas what could be causing this problem and how to overcome it? @adarob @craffel @sharannarang @nshazeer, your feedback is highly appreciated.

I have tried both TensorFlow 2.2 and TensorFlow 1.15.3; the problem exists in both of them. I have also tried mesh-tensorflow 0.1.13 and 0.1.16; the problem exists in both of them.

This is my current running command:
python -m t5.models.mesh_transformer_main \
--module_import="xxx_task" \
--tpu="node-1" \
--gcp_project="xxx" \
--tpu_zone="europe-west4-a" \
--model_dir="gs://xxx/11b/" \
--gin_file="objectives/span_3_15_u_u.gin" \
--gin_file="models/t5.1.0.11B.gin" \
--gin_file="dataset.gin" \
--gin_file="learning_rate_schedules/rsqrt_no_ramp_down.gin" \
--gin_param="MIXTURE_NAME = 'task_xxx'" \
--gin_param="utils.tpu_mesh_shape.tpu_topology = '16x16'" \
--gin_param="utils.tpu_mesh_shape.model_parallelism = 32" \
--gin_param="utils.run.save_checkpoints_steps=2000" \
--gin_param="utils.run.batch_size=('tokens_per_batch', 1048576)" \
--gin_param="utils.run.train_steps=1000000" \
--gin_param="utils.run.iterations_per_loop=100" \
--gin_param="learning_rate_schedule_noam.warmup_steps=10000" \
--gin_param="SentencePieceVocabulary.extra_ids=100" \
--gin_param="run.perplexity_eval_steps=100"
I have also tested the T5-Large version and the same error occurs; the problem happens randomly when the model tries to save a checkpoint:
INFO:tensorflow:An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job.
Error: From /job:worker/replica:0/task:1:
All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: HTTP response code 503
when resuming upload gs://xxxxx/large/model.ckpt-60_temp_838179b3efeb4ecb90a79af73c170014/
[[node save/SaveV2_1 (defined at /site-packages/mesh_tensorflow/transformer/utils.py:720) ]]
Errors may have originated from an input operation.
Input Source operations connected to node save/SaveV2_1:
encoder/block_019/layer_000/SelfAttention/o_slot_vr/Read/ReadVariableOp (defined at /site-packages/mesh_tensorflow/ops.py:4020)
Original stack trace for 'save/SaveV2_1':
File "/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/site-packages/t5/models/mesh_transformer_main.py", line 240, in <module>
console_entry_point()
File "/site-packages/t5/models/mesh_transformer_main.py", line 237, in console_entry_point
app.run(main)
File "/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/site-packages/t5/models/mesh_transformer_main.py", line 231, in main
model_dir=FLAGS.model_dir)
File "/site-packages/gin/config.py", line 1055, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/site-packages/mesh_tensorflow/transformer/utils.py", line 2115, in run
train_dataset_fn, train_steps, ensemble_inputs)
File "/site-packages/mesh_tensorflow/transformer/utils.py", line 1498, in train_model
estimator.train(input_fn=input_fn, max_steps=train_steps)
File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3078, in train
saving_listeners=saving_listeners)
File "/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1182, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1211, in _train_model_default
self.config)
File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2915, in _call_model_fn
config)
File "/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1170, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3206, in _model_fn
_train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn))
File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3648, in _train_on_tpu_system
device_assignment=ctx.device_assignment)
File "/site-packages/tensorflow/python/tpu/tpu.py", line 1565, in split_compile_and_shard
name=name)
File "/site-packages/tensorflow/python/tpu/tpu.py", line 1280, in split_compile_and_replicate
outputs = computation(*computation_inputs)
File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3634, in multi_tpu_train_steps_on_single_shard
inputs=[0, _INITIAL_LOSS])
File "/site-packages/tensorflow/python/tpu/training_loop.py", line 178, in while_loop
condition_wrapper, body_wrapper, inputs, name="", parallel_iterations=1)
File "/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2766, in while_loop
return_same_structure)
File "/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2248, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2173, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/site-packages/tensorflow/python/tpu/training_loop.py", line 121, in body_wrapper
outputs = body(*(inputs + dequeue_ops))
File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3633, in <lambda>
lambda i, loss: [i + 1, single_tpu_train_step(i)],
File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1753, in train_step
self._call_model_fn(features, labels))
File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2031, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "/site-packages/mesh_tensorflow/transformer/utils.py", line 720, in my_model_fn
save_relative_paths=True)
File "/site-packages/tensorflow/python/training/saver.py", line 836, in __init__
self.build()
File "/site-packages/tensorflow/python/training/saver.py", line 848, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/site-packages/tensorflow/python/training/saver.py", line 886, in _build
build_restore=build_restore)
File "/site-packages/tensorflow/python/training/saver.py", line 507, in _build_internal
save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
File "/site-packages/tensorflow/python/training/saver.py", line 299, in _AddShardedSaveOps
return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
File "/site-packages/tensorflow/python/training/saver.py", line 273, in _AddShardedSaveOpsForV2
sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
File "/site-packages/tensorflow/python/training/saver.py", line 206, in _AddSaveOps
save = self.save_op(filename_tensor, saveables)
File "/site-packages/tensorflow/python/training/saver.py", line 122, in save_op
tensors)
File "/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1717, in save_v2
name=name)
File "/site-packages/tensorflow/python/framework/op_def_library.py", line 744, in _apply_op_helper
attrs=attr_protos, op_def=op_def)
File "/site-packages/tensorflow/python/framework/ops.py", line 3327, in _create_op_internal
op_def=op_def)
File "/site-packages/tensorflow/python/framework/ops.py", line 1791, in __init__
self._traceback = tf_stack.extract_stack()
In case it's helpful:
pip list
Package Version
------------------------ ---------------------
absl-py 0.9.0
astunparse 1.6.3
attrs 19.3.0
Babel 2.8.0
boto 2.49.0
cachetools 4.1.0
certifi 2020.6.20
chardet 3.0.4
click 7.1.2
dill 0.3.2
distro 1.5.0
filelock 3.0.12
future 0.18.2
gast 0.3.3
gevent 20.6.2
gin-config 0.3.0
google-api-core 1.21.0
google-api-python-client 1.9.3
google-auth 1.18.0
google-auth-httplib2 0.0.3
google-auth-oauthlib 0.4.1
google-cloud-core 1.3.0
google-cloud-storage 1.29.0
google-compute-engine 2.8.13
google-pasta 0.2.0
google-resumable-media 0.5.1
googleapis-common-protos 1.52.0
greenlet 0.4.16
grpcio 1.30.0
h5py 2.10.0
httplib2 0.18.1
idna 2.9
importlib-metadata 1.6.1
joblib 0.15.1
Keras-Preprocessing 1.1.2
Markdown 3.2.2
mesh-tensorflow 0.1.16
nltk 3.5
numpy 1.19.0
oauth2client 4.1.3
oauthlib 3.1.0
opt-einsum 3.2.1
packaging 20.4
pandas 1.0.5
pip 20.1.1
portalocker 1.7.0
promise 2.3
protobuf 3.12.2
pyasn1 0.4.8
pyasn1-modules 0.2.8
pyparsing 2.4.7
python-dateutil 2.8.1
pytz 2020.1
regex 2020.6.8
requests 2.24.0
requests-oauthlib 1.3.0
rouge-score 0.0.4
rsa 4.6
sacrebleu 1.4.10
sacremoses 0.0.43
scikit-learn 0.23.1
scipy 1.5.0
sentencepiece 0.1.91
setuptools 47.3.1.post20200622
six 1.15.0
t5 0.6.0
tensorboard 2.2.2
tensorboard-plugin-wit 1.6.0.post3
tensorflow 2.2.0
tensorflow-datasets 3.1.0
tensorflow-estimator 2.2.0
tensorflow-metadata 0.22.2
tensorflow-text 2.2.1
termcolor 1.1.0
tfds-nightly 3.1.0.dev202006250105
threadpoolctl 2.1.0
tokenizers 0.7.0
torch 1.5.1
tqdm 4.46.1
transformers 2.11.0
uritemplate 3.0.1
urllib3 1.25.9
Werkzeug 1.0.1
wheel 0.34.2
wrapt 1.12.1
zipp 3.1.0
zope.event 4.4
zope.interface 5.1.0
I think the error originates here:
All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: HTTP response code 503
when resuming upload gs://xxxxx/large/model.ckpt-20_temp_42c0a8218a1547ba85e01e1f673e8502/
[[node save/SaveV2_43 (defined at /site-packages/mesh_tensorflow/transformer/utils.py:720) ]]
How can I increase the number of retry attempts?
@craffel have you ever seen this error on cloud?
I've never seen this error. 503 would be a service unavailable on your GCS bucket. Assuming this happens intermittently, I think your only options are to increase the number of retries and/or contact Google Cloud for support.
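For illustration only: if you end up copying checkpoints around yourself, you could wrap the copy in an application-level retry with backoff. This is a sketch, not a knob inside TensorFlow's Saver, and retry_gfile_copy is a made-up helper name:

import time
import tensorflow as tf

# Sketch: retry a flaky GCS copy with exponential backoff.
def retry_gfile_copy(src, dst, attempts=10, base_delay=1.0):
    for i in range(attempts):
        try:
            tf.io.gfile.copy(src, dst, overwrite=True)
            return
        except tf.errors.UnavailableError:  # e.g. HTTP 503 from GCS
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)  # back off before retrying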
Thanks a lot @adarob @craffel for your reply.
I don't know how to increase the number of retries, but I have changed the Mesh TensorFlow code and so far I haven't gotten any errors: https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/utils.py#L714
I enabled "restore_sequentially" and disabled "sharded". For some reason, Mesh TensorFlow fails when it tries to write many files at once: on Google Colab it only writes 2 files, but on a TPU v3-512 pod it writes 64 different checkpoint shards. A sketch of the change is below.
Hmm, maybe your account has some kind of limit on how quickly you can do write operations to GCS and you were going over the limit or something. Glad you found a workaround.
I've reproduced this. @agemagician's fix does seem to work, but it makes checkpoint reading/writing extremely slow. I'm not sure there is a good workaround, since tf.train.Saver is deprecated and its use is discouraged.
Hi @craffel,
I contacted the TPU team, and the problem was fixed by switching to TensorFlow 2.4: there was a rate-limiting issue on the GCS integration side, and TensorFlow 2.4 fixes the checkpointing problem. Please upgrade to TensorFlow 2.4 or higher, and it should work.
Let me know if this solved your problem.
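As a quick sanity check before launching, something like this at the top of the training script would catch an old TensorFlow; purely illustrative:

import tensorflow as tf
# The GCS rate-limiting fix is in TF >= 2.4, per the TPU team.
major, minor = (int(x) for x in tf.__version__.split(".")[:2])
assert (major, minor) >= (2, 4), "Upgrade TensorFlow (found %s)" % tf.__version__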
Unfortunately, at some point after TF 2.2 there was an internal change that results in OOMs (probably due to rematerialization, or the lack thereof) for the models I'm now training, so I need to stay on TF 2.2, unless you also encountered this and found a workaround...
Hmm, no; the previous and current T5 models I am training, even the large billion-parameter ones, work fine with TF 2.4. Sorry that I could not help here.