huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

[Feature] Asynchronous Serialization #87

Open xrsrke opened 7 months ago

xrsrke commented 7 months ago

Move checkpoints from device memory to host memory asynchronously, then write them to disk in the background, so that checkpointing does not block training.

In the second stage, a background process takes over, asynchronously transferring the state from the host memory to a distributed file system (HDFS in our deployment) for centralized maintenance. This decoupling of operations into two stages allows the GPU workers to resume training almost immediately after dumping their state, while the more time-consuming process of writing to HDFS is offloaded to a separate, non-blocking process.

Reference: MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, page 7
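A minimal PyTorch sketch of the two-stage idea, assuming a plain `state_dict` and `torch.save` as the persistence backend; the function names (`snapshot_to_host`, `save_in_background`) and the threading-based writer are illustrative only, not nanotron's API or the exact MegaScale implementation (which offloads to a separate process and HDFS):

```python
# Sketch: two-stage asynchronous checkpointing.
# Stage 1: copy GPU tensors into pinned host memory on a side CUDA stream.
# Stage 2: serialize the host-side snapshot to disk from a background thread,
# so the training loop is not blocked by the slow write.
import threading
import torch


def snapshot_to_host(state_dict: dict, copy_stream: torch.cuda.Stream) -> dict:
    """Stage 1: asynchronously copy GPU tensors into pinned host buffers."""
    host_state = {}
    # Make the side stream wait for work already enqueued on the current
    # stream, so we snapshot a consistent set of parameters.
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        for name, tensor in state_dict.items():
            if tensor.is_cuda:
                # Pinned (page-locked) buffers allow truly asynchronous
                # device-to-host copies.
                host_buf = torch.empty(
                    tensor.shape, dtype=tensor.dtype, device="cpu", pin_memory=True
                )
                host_buf.copy_(tensor, non_blocking=True)
                host_state[name] = host_buf
            else:
                host_state[name] = tensor.clone()
    return host_state


def save_in_background(
    host_state: dict, path: str, copy_stream: torch.cuda.Stream
) -> threading.Thread:
    """Stage 2: persist the host-side snapshot without blocking training."""

    def _write():
        # Wait for the D2H copies issued on the side stream to finish,
        # then write to local disk (or a distributed file system).
        copy_stream.synchronize()
        torch.save(host_state, path)

    writer = threading.Thread(target=_write, daemon=True)
    writer.start()
    return writer


# Illustrative usage inside a training loop:
# copy_stream = torch.cuda.Stream()
# host_state = snapshot_to_host(model.state_dict(), copy_stream)
# writer = save_in_background(host_state, "checkpoint_step_1000.pt", copy_stream)
# ... training continues; join the writer before starting the next checkpoint.
```

In a real setup, the write stage would typically run in a separate process (as in the MegaScale description) rather than a thread, to avoid GIL contention and to survive trainer restarts, but the decoupling of the fast device-to-host dump from the slow persistent write is the same.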