volcengine / veScale

A PyTorch Native LLM Training Framework
http://vescale.xyz
Apache License 2.0
553 stars, 26 forks

[checkpoint] feat: open source fast checkpoint system #38

Closed MingjiHan99 closed 2 months ago

MingjiHan99 commented 2 months ago

Summary

We improved vescale.checkpoint with the following new features for fast checkpointing (the first three are built in and require no manual activation):

Acknowledgement

We sincerely appreciate all contributors, including but not limited to @shanesyy-1992, @raywan-110, @lazychao, @AHEADer, and @MingjiHan99.

shanesyy-1992 commented 2 months ago

From my understanding, Checkpoint Broadcasting might only be beneficial when storage throughput is the limiting factor. It may be better to add more guidance on when to use this feature.
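
To make the trade-off concrete, here is a rough back-of-envelope sketch. It is not veScale code: the function names, the cost model, and the numbers are all assumptions for illustration. It assumes every data-parallel replica would otherwise read the same checkpoint from shared storage whose aggregate throughput is split among readers, and that a broadcast costs roughly one pass of the checkpoint over the training network.

```python
def load_time_direct(ckpt_gb: float, dp_replicas: int, storage_gbps: float) -> float:
    """Every replica reads the full checkpoint from shared storage (hypothetical model)."""
    # Total bytes read scale with the replica count; storage throughput is shared.
    return ckpt_gb * dp_replicas / storage_gbps


def load_time_broadcast(ckpt_gb: float, dp_replicas: int,
                        storage_gbps: float, network_gbps: float) -> float:
    """One replica reads from storage, then broadcasts to the others (hypothetical model)."""
    read = ckpt_gb / storage_gbps
    # Assumption: a broadcast costs about one pass of the data over the network.
    bcast = ckpt_gb / network_gbps if dp_replicas > 1 else 0.0
    return read + bcast


def should_broadcast(ckpt_gb: float, dp_replicas: int,
                     storage_gbps: float, network_gbps: float) -> bool:
    """Return True when the broadcast path is estimated to be faster."""
    direct = load_time_direct(ckpt_gb, dp_replicas, storage_gbps)
    bcast = load_time_broadcast(ckpt_gb, dp_replicas, storage_gbps, network_gbps)
    return bcast < direct


if __name__ == "__main__":
    # Illustrative numbers only: 100 GB checkpoint, 8 replicas, 50 GB/s network.
    print(should_broadcast(100, 8, storage_gbps=10, network_gbps=50))    # slow storage
    print(should_broadcast(100, 8, storage_gbps=1000, network_gbps=50))  # fast storage
```

Under this model, broadcasting only pays off when storage throughput is low relative to the network, which is the situation the guidance should call out. Real systems also have per-client read limits and overlapping I/O that this sketch ignores.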

raywan-110 commented 2 months ago

Let's keep pushing forward 💪!