pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License
1.29k stars 115 forks

Implement async_checkpoint #313

Closed. fegin closed this 1 month ago

fegin commented 1 month ago

Stack from ghstack (oldest at bottom):

Summary: This PR implements two different approaches to async checkpointing. The first uses DCP.async_save. The second uses pinned memory plus a separate process to avoid GIL contention.
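For readers unfamiliar with the second approach, here is a minimal, stdlib-only sketch of the general pattern: stage a snapshot of the state, then hand it to a separate process that does the write. It is not the code in this PR. The names `async_save`, `load`, and `WRITER_SOURCE` are hypothetical, and `pickle.dumps` stands in for the copy into pinned CPU memory that the real implementation would perform on tensors.

```python
import pickle
import subprocess
import sys
import tempfile
from pathlib import Path

# Source for a hypothetical writer process. It runs in its own interpreter,
# so its disk I/O does not compete for the training process's GIL.
WRITER_SOURCE = """
import os, sys
path = sys.argv[1]
payload = sys.stdin.buffer.read()   # staged bytes sent by the trainer
tmp = path + ".tmp"
with open(tmp, "wb") as f:
    f.write(payload)
os.replace(tmp, path)               # publish the checkpoint atomically
"""


def async_save(state, path):
    """Stage `state` and hand it to a separate writer process.

    Returns the process handle; call .wait() before starting the next save.
    """
    # Staging step: snapshot the state so training can keep mutating it.
    # (Assumption: this stands in for the copy into pinned CPU memory.)
    payload = pickle.dumps(state)
    proc = subprocess.Popen(
        [sys.executable, "-c", WRITER_SOURCE, str(path)],
        stdin=subprocess.PIPE,
    )
    # A pipe is fine for a small demo; large payloads would use shared memory.
    proc.stdin.write(payload)
    proc.stdin.close()  # EOF tells the writer the payload is complete
    return proc


def load(path):
    """Read back a checkpoint written by the writer process."""
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        ckpt = Path(tmpdir) / "step_10.ckpt"
        state = {"step": 10, "weights": [0.1, 0.2]}
        writer = async_save(state, ckpt)
        state["step"] = 11          # training continues while the write runs
        writer.wait()               # block only when the result is needed
        print(load(ckpt)["step"])
```

The caller blocks only for the staging copy. Serialization to disk happens in the other process, which is why a pending save should be waited on before the next one starts.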

gnadathur commented 1 month ago

It would be good to add an integration test for async checkpointing. cc: @fegin