sadamov closed 1 month ago
I generally think this is a good change, but have not had the opportunity to test it.
I implemented the review and made sure that the code works again with the latest main branch. This small change in how checkpoints are loaded made it much more robust for multi-node training for me. I think we can merge this PR if your test runs are successful as well @joeloskarsson
Looking good now! I tested it and it seems to work without any issues, loading the correct state. Add an entry in the changelog for this and we can merge
ok, changelog is updated. ready for merge @joeloskarsson
Summary
This pull request changes how model checkpoints are loaded and how optimizer/scheduler state is restored, improving flexibility and compatibility with multi-GPU setups.
Detailed Changes
Impact
These changes provide users with greater control over how training states are restored and improve the script's functionality in distributed training environments.
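As a rough illustration of what "control over how training states are restored" can look like, here is a minimal sketch, not the code from this PR. It assumes the checkpoint is a dictionary and that the model, optimizer and scheduler each expose a `load_state_dict` method (as PyTorch objects do). The function name `restore_training_state`, the `restore_opt` flag and the keys `state_dict`, `optimizer_state` and `scheduler_state` are hypothetical names chosen for the example.

```python
from typing import Any, List, Mapping, Optional


def restore_training_state(
    checkpoint: Mapping[str, Any],
    model: Any,
    optimizer: Optional[Any] = None,
    scheduler: Optional[Any] = None,
    restore_opt: bool = False,  # hypothetical flag name
) -> List[str]:
    """Restore model weights and, only on request, optimizer/scheduler state.

    Returns the names of the components that were actually restored.
    """
    restored = []

    # Model weights are always restored from the checkpoint.
    model.load_state_dict(checkpoint["state_dict"])  # assumed key
    restored.append("model")

    if not restore_opt:
        # Caller wants a fresh optimizer and scheduler (e.g. for fine-tuning).
        return restored

    # Optimizer and scheduler state are restored only if the object was
    # passed in and the checkpoint actually contains the matching entry.
    if optimizer is not None and "optimizer_state" in checkpoint:
        optimizer.load_state_dict(checkpoint["optimizer_state"])
        restored.append("optimizer")
    if scheduler is not None and "scheduler_state" in checkpoint:
        scheduler.load_state_dict(checkpoint["scheduler_state"])
        restored.append("scheduler")

    return restored
```

In a PyTorch setting, the checkpoint dictionary would typically come from `torch.load(path, map_location="cpu")`, so that every process first loads tensors onto the CPU instead of onto the device that saved them. Whether this PR takes that approach is not stated in the description above.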
Testing
[x] Changes have been tested in both single and multi-GPU setups
Notes
Further integration testing with different types of training configurations is recommended to fully validate the new functionalities.