volcengine / veScale

A PyTorch Native LLM Training Framework
http://vescale.xyz
Apache License 2.0

[checkpoint] Open Source #27

Closed MingjiHan99 closed 4 months ago

MingjiHan99 commented 4 months ago

In this PR, we open-source vescale.checkpoint.

vescale.checkpoint is a distributed LLM checkpointing system.

vescale.checkpoint offers simple, straightforward APIs for loading and saving a distributed model (DModule) and optimizer (DistributedOptimizer), hiding underlying details such as process rank and device mesh.

vescale.checkpoint supports load-time checkpoint resharding when the degree of data, tensor, or pipeline (TODO) parallelism changes, for both the veScale distributed model (DModule) and optimizer (DistributedOptimizer).
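To make the idea of load-time resharding concrete, here is a minimal, framework-free sketch. It is not veScale's implementation and does not use the vescale.checkpoint API; the helper names (`shard_bounds`, `save_sharded`, `load_resharded`) are hypothetical, and a plain Python list stands in for a 1-D tensor. It assumes each saved shard is keyed by its global offsets, so a rank under a new parallel degree can assemble its slice from whichever saved shards overlap it.

```python
from typing import Dict, List, Tuple

Shard = Tuple[int, int]  # [start, end) offsets into the full (unsharded) tensor


def shard_bounds(total: int, world_size: int, rank: int) -> Shard:
    """Return the [start, end) slice of a length-`total` tensor owned by `rank`."""
    base, extra = divmod(total, world_size)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


def save_sharded(values: List[float], world_size: int) -> Dict[Shard, List[float]]:
    """Simulate every rank writing its own shard, keyed by global offsets."""
    checkpoint: Dict[Shard, List[float]] = {}
    for rank in range(world_size):
        start, end = shard_bounds(len(values), world_size, rank)
        checkpoint[(start, end)] = values[start:end]
    return checkpoint


def load_resharded(
    checkpoint: Dict[Shard, List[float]], total: int, world_size: int, rank: int
) -> List[float]:
    """Build the slice `rank` needs under a new world size from overlapping saved shards."""
    want_start, want_end = shard_bounds(total, world_size, rank)
    out: List[float] = []
    for (start, end), chunk in sorted(checkpoint.items()):
        lo, hi = max(start, want_start), min(end, want_end)
        if lo < hi:  # this saved shard overlaps the requested slice
            out.extend(chunk[lo - start:hi - start])
    return out
```

For example, a 10-element tensor saved with `save_sharded(values, 4)` can be read back with `load_resharded(checkpoint, 10, 2, rank)` for `rank` 0 and 1, and each rank receives its half without any rank materializing the shards it does not need.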

vescale.checkpoint incorporates fast checkpointing and various I/O optimization techniques, improving I/O efficiency during large language model training.

vescale.checkpoint will be part of the OmniStore project, a new open-source project coming soon.

Credit to veScale Checkpoint Team

This endeavor would not have been possible without the contributions of the veScale Checkpoint team, which includes but is not limited to: @shanesyy-1992 @MingjiHan99 @AHEADer @raywan-110 @michael4RD @lazychao @leochen-ai

Thanks also for the guidance and leadership of: @pengyanghua @eric-haibin-lin @liwenchangbdbz @Meteorix

Credit to veScale Team

We would like to sincerely acknowledge the assistance of and collaboration with the veScale team, which includes but is not limited to: @leonardo0lyj @JsBlueCat @MackZackA @Vremold @jc-bytedance @lichen225

Credit to PyTorch Distributed Checkpoint (DCP) Team

We would like to sincerely acknowledge the assistance of and collaboration with the PyTorch Distributed Checkpoint (DCP) team, which includes but is not limited to: @wz337 @kumpera @fegin @LucasLLC

MingjiHan99 commented 4 months ago

I have a question about protobuf: do we need to push the generated (codegen) file?

Based on our discussion, we will keep the generated protobuf files for now. Otherwise, users would have to generate the code themselves.

shanesyy-1992 commented 4 months ago

Could you help clean up the fast checkpoint code? There seems to be some unused code.

MingjiHan99 commented 4 months ago

Could you help clean up the fast checkpoint code? There seems to be some unused code.

Sure. I will remove DistributedTorchLoader and RemappingTorchLoader.