instructlab / training

InstructLab Training Library - Efficient Fine-Tuning with Message-Format Data
https://pypi.org/project/instructlab-training/
Apache License 2.0

Support ZeRO Stage 1 & 3 #26

Open RobotSail opened 5 months ago

RobotSail commented 5 months ago

Today we hardcode options specific to ZeRO stage 2. We should update our implementation to support ZeRO stages 1 and 3 as well.
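To make the request concrete, here is a minimal sketch of what a stage-parameterized config could look like. `build_zero_config` is a hypothetical helper, not something that exists in this repo; the keys (`zero_optimization`, `stage`, `stage3_gather_16bit_weights_on_model_save`) are standard DeepSpeed config keys, but which other options the library currently hardcodes is not shown here.

```python
from typing import Any, Dict


def build_zero_config(stage: int = 2) -> Dict[str, Any]:
    """Return a DeepSpeed-style config dict for the requested ZeRO stage.

    Hypothetical helper for illustration; not part of instructlab-training.
    """
    if stage not in (1, 2, 3):
        raise ValueError(f"unsupported ZeRO stage: {stage}")

    zero: Dict[str, Any] = {"stage": stage}

    if stage == 3:
        # Stage 3 partitions the parameters themselves, so saving a regular
        # 16-bit checkpoint needs the shards gathered first.
        zero["stage3_gather_16bit_weights_on_model_save"] = True

    return {"zero_optimization": zero}


if __name__ == "__main__":
    for s in (1, 2, 3):
        print(s, build_zero_config(s))
```

The stage 3 branch is the part that touches model saving, since the weights are sharded across ranks at that stage.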

fabianlim commented 5 months ago

If we do this, all the checkpointing flows will need to be retested. I'm also not sure what the impact on #25 would be.

RobotSail commented 5 months ago

Yes, we shouldn't do this until after the 15th. But it's probably something we should eventually support if we'd like NVMe offloading.
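For reference, a rough sketch of why NVMe offloading depends on this: as far as I know, DeepSpeed only offers the `nvme` offload device with ZeRO stage 3. `with_nvme_offload` and the `/local_nvme` path are made up for illustration; `offload_optimizer`, `offload_param`, `device`, and `nvme_path` are DeepSpeed config keys.

```python
import copy
from typing import Any, Dict


def with_nvme_offload(config: Dict[str, Any], nvme_path: str) -> Dict[str, Any]:
    """Return a copy of `config` with NVMe offload enabled.

    Hypothetical helper; assumes NVMe offload is only valid at ZeRO stage 3.
    """
    stage = config.get("zero_optimization", {}).get("stage")
    if stage != 3:
        raise ValueError(f"NVMe offloading needs ZeRO stage 3, got stage {stage}")

    new_config = copy.deepcopy(config)  # leave the caller's dict untouched
    zero = new_config["zero_optimization"]
    zero["offload_optimizer"] = {"device": "nvme", "nvme_path": nvme_path}
    zero["offload_param"] = {"device": "nvme", "nvme_path": nvme_path}
    return new_config


if __name__ == "__main__":
    base = {"zero_optimization": {"stage": 3}}
    print(with_nvme_offload(base, "/local_nvme"))  # path is a placeholder
```

With stage 2 hardcoded, a helper like this would always hit the `ValueError` branch.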

github-actions[bot] commented 5 days ago

This issue has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.