google-research / t5x

Checkpointing got stuck on Google Cloud TPU #1439

Open XueFuzhao opened 1 year ago

XueFuzhao commented 1 year ago

When training large models (> 10B parameters or so), we found that checkpointing sometimes gets stuck while saving. For instance, it may save checkpoints smoothly a few times and then suddenly hang during one save. When that happens, the checkpoint in the /tmp/ dir is incomplete. It looks like a memory leak to me. May I know how to solve it? Thanks!

I'm using Google Cloud Platform with 1024 TPU v3 cores (512 TPU v3 chips).
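
One way to check the memory-leak guess would be to log the host process's peak memory around each checkpoint save and see whether it keeps growing until the save that hangs. Below is a minimal sketch using only the Python standard library. It is not part of T5X: `peak_rss_mib`, `log_memory_around`, and `save_fn` are hypothetical names, and `save_fn` stands for whatever callable triggers the save in your training loop. It assumes a Linux host, where `ru_maxrss` is reported in KiB.

```python
import resource
import time


def peak_rss_mib():
    """Return this process's peak resident set size in MiB.

    Assumes Linux, where ru_maxrss is reported in KiB.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def log_memory_around(save_fn, step, log=print):
    """Call save_fn() and log peak host memory before and after it.

    save_fn is a hypothetical zero-argument callable that performs the
    checkpoint save; this wrapper does not know anything about T5X.
    """
    before = peak_rss_mib()
    start = time.monotonic()
    result = save_fn()
    elapsed = time.monotonic() - start
    after = peak_rss_mib()
    log(
        f"step={step} peak_rss_before={before:.1f}MiB "
        f"peak_rss_after={after:.1f}MiB save_time={elapsed:.1f}s"
    )
    return result
```

If the logged peak rises with every save, that would support the leak guess. If it stays flat and the save still hangs, the cause is probably elsewhere. Note that peak RSS never decreases, so this only shows growth, not memory being released.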

lintangsutawika commented 11 months ago

I have a similar issue. This seems to be quite recent.