When training large models (> 10B parameters or so), we found that checkpointing sometimes gets stuck while saving.
For instance, it may save checkpoints successfully a few times and then suddenly hang during one checkpointing step. I found that the checkpoints in the /tmp/ dir are incomplete. It looks like a memory-leak issue.
May I know how to solve it? Thanks!
I'm using Google Cloud Platform with 1024 TPU v3 cores (512 TPU v3 chips).
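
To illustrate what I mean by "incomplete": a check along the lines of the sketch below shows whether the partially written checkpoint files stop growing on the host during a hung save. This is only a sketch; the `CKPT_DIR` path, the 60-second wait, and the `snapshot_sizes` helper are placeholders and not part of our actual training code.

```python
import os
import time

CKPT_DIR = "/tmp"  # placeholder: point this at the actual checkpoint directory
WAIT_SECONDS = 60  # placeholder: long enough for a healthy save to make visible progress

def snapshot_sizes(ckpt_dir):
    """Record the size of every file under ckpt_dir so two snapshots can be compared."""
    sizes = {}
    for root, _, files in os.walk(ckpt_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                sizes[path] = os.path.getsize(path)
            except OSError:
                pass  # a file may be renamed or removed between listing and stat
    return sizes

before = snapshot_sizes(CKPT_DIR)
time.sleep(WAIT_SECONDS)
after = snapshot_sizes(CKPT_DIR)

# Files whose size did not change over the wait interval suggest a stalled save.
stalled = [path for path, size in after.items() if before.get(path) == size]
print(f"{len(stalled)} of {len(after)} checkpoint files did not grow in {WAIT_SECONDS}s")
```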