instructlab / training

InstructLab Training Library
Apache License 2.0

Support ZeRO Stage 1 & 3 #26

Open RobotSail opened 1 month ago

RobotSail commented 1 month ago

Today we hardcode options specific to ZeRO stage 2. We should update our implementation to support ZeRO stages 1 and 3 as well.
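
As a rough illustration of what removing the hardcoding could look like, here is a minimal sketch that builds the `zero_optimization` section of a DeepSpeed config from a stage argument. The helper name `make_zero_config` and its default are assumptions made for this example, not code from this repository; the config keys themselves (`stage`, `stage3_gather_16bit_weights_on_model_save`) are standard DeepSpeed options.

```python
from typing import Any, Dict

VALID_ZERO_STAGES = (1, 2, 3)


def make_zero_config(stage: int = 2) -> Dict[str, Any]:
    """Hypothetical helper: build a DeepSpeed config fragment for a ZeRO stage.

    Illustrative only; it does not reflect how this library builds its config.
    """
    if stage not in VALID_ZERO_STAGES:
        raise ValueError(
            f"ZeRO stage must be one of {VALID_ZERO_STAGES}, got {stage!r}"
        )

    zero: Dict[str, Any] = {"stage": stage}

    if stage == 3:
        # Stage 3 shards the parameters themselves, so full 16-bit weights
        # have to be gathered when the model is saved.
        zero["stage3_gather_16bit_weights_on_model_save"] = True

    return {"zero_optimization": zero}
```

For example, `make_zero_config(3)` returns a fragment with `"stage": 3` plus the stage 3 save option, while `make_zero_config(1)` returns only `{"zero_optimization": {"stage": 1}}`. The default of 2 is chosen here only to mirror the behavior described above.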

fabianlim commented 1 month ago

If we do this, then all the checkpointing flows will need to be retested. I'm also not sure what the impact on #25 would be.

RobotSail commented 1 month ago

Yes, we shouldn't do this until after the 15th. But it's probably something we should eventually support if we'd like NVMe offloading.
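
For context on the NVMe point: DeepSpeed documents NVMe offloading as available only with ZeRO stage 3, which is why stage 3 support is a prerequisite. Below is a minimal sketch of such a config fragment. The helper name `make_nvme_offload_config` and the example path are assumptions for illustration; `offload_optimizer`, `offload_param`, `device`, and `nvme_path` are existing DeepSpeed config keys.

```python
from typing import Any, Dict


def make_nvme_offload_config(nvme_path: str) -> Dict[str, Any]:
    """Hypothetical helper: ZeRO stage 3 config fragment with NVMe offloading.

    Illustrative only; a real config would also need tuning (for example the
    "aio" section) for the target hardware.
    """
    return {
        "zero_optimization": {
            # NVMe offloading is documented as a stage 3 feature.
            "stage": 3,
            # Offload optimizer state to the NVMe device.
            "offload_optimizer": {"device": "nvme", "nvme_path": nvme_path},
            # Offload model parameters to the NVMe device.
            "offload_param": {"device": "nvme", "nvme_path": nvme_path},
        }
    }
```

Calling `make_nvme_offload_config("/local_nvme")` (a placeholder mount point) yields a fragment that could be merged into a larger DeepSpeed config.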