When training large models (> 10B parameters or so), we found that checkpointing sometimes gets stuck while saving.
For instance, it may save checkpoints successfully a few times and then suddenly hang during one checkpointing step. I found that the checkpoints in the /tmp/ dir are incomplete. It looks like a memory-leak issue.
May I know how to solve it? Thanks!
I'm using Google Cloud Platform with 1024 TPU v3 cores (512 TPU v3 chips).
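
To illustrate what I mean by "incomplete": a check along the lines of the sketch below shows whether the partially written checkpoint files stop growing on the host during a hung save. This is only a sketch; the `CKPT_DIR` path, the 60-second wait, and the `snapshot_sizes` helper are placeholders and not part of our actual training code.

```python
import os
import time

CKPT_DIR = "/tmp"  # placeholder: point this at the actual checkpoint directory
WAIT_SECONDS = 60  # placeholder: long enough for a healthy save to make visible progress

def snapshot_sizes(ckpt_dir):
    """Record the size of every file under ckpt_dir so two snapshots can be compared."""
    sizes = {}
    for root, _, files in os.walk(ckpt_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                sizes[path] = os.path.getsize(path)
            except OSError:
                pass  # a file may be renamed or removed between listing and stat
    return sizes

before = snapshot_sizes(CKPT_DIR)
time.sleep(WAIT_SECONDS)
after = snapshot_sizes(CKPT_DIR)

# Files whose size did not change over the wait interval suggest a stalled save.
stalled = [path for path, size in after.items() if before.get(path) == size]
print(f"{len(stalled)} of {len(after)} checkpoint files did not grow in {WAIT_SECONDS}s")
```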