pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License

Implement async_checkpoint #302

Closed fegin closed 4 months ago

fegin commented 5 months ago

Stack from ghstack (oldest at bottom):

Summary: This PR implements two different async checkpointing approaches. The first uses DCP.async_save. The second uses pinned memory plus a separate process to avoid GIL contention.
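For readers unfamiliar with the second approach, the rough shape is: take a snapshot of the state, hand it to another process, and let that process do the slow write while training continues. The sketch below is stdlib-only and is not the code in this PR. `async_checkpoint` and `_WRITER` are hypothetical names, and `pickle.dumps` stands in for the copy into pinned memory.

```python
import os
import pickle
import subprocess
import sys
import tempfile

# Child program (hypothetical): read the payload from stdin and write it
# to a temp file, then rename it so a partial checkpoint is never visible.
_WRITER = (
    "import os, sys\n"
    "path = sys.argv[1]\n"
    "data = sys.stdin.buffer.read()\n"
    "tmp = path + '.tmp'\n"
    "with open(tmp, 'wb') as f:\n"
    "    f.write(data)\n"
    "os.replace(tmp, path)\n"
)


def async_checkpoint(state, path):
    """Snapshot `state` and persist it from a separate process.

    Returns the Popen handle; call .wait() before starting the next save.
    """
    # Staging step: serialize a snapshot so later mutations of `state`
    # do not affect the checkpoint. Assumption: this stands in for the
    # pinned-memory copy described in the summary.
    payload = pickle.dumps(state)
    proc = subprocess.Popen(
        [sys.executable, "-c", _WRITER, path],
        stdin=subprocess.PIPE,
    )
    # Hand the snapshot to the writer process; the disk write itself
    # happens in the child, outside this interpreter's GIL.
    proc.stdin.write(payload)
    proc.stdin.close()
    return proc


if __name__ == "__main__":
    state = {"step": 10, "weights": [0.1, 0.2]}
    path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    proc = async_checkpoint(state, path)
    state["step"] = 11  # training continues while the child writes
    proc.wait()
    with open(path, "rb") as f:
        print(pickle.load(f)["step"])
```

The write runs in a different interpreter, so it does not compete with the training loop for the GIL. The caller still has to wait on the previous save before issuing the next one.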