Feature: Robust restoration of optimizer and scheduler

sadamov commented 1 month ago

Summary

This pull request makes model loading and the restoration of optimizer and learning rate scheduler state from checkpoints more robust, improving flexibility and compatibility with multi-GPU setups.

Detailed Changes

Enhanced Model Loading for Multi-GPU: Modified the model loading logic to better support multi-GPU environments by ensuring that optimizer states are only loaded when necessary and appropriate.
Checkpoint Adjustments: Adjusted how learning rate schedulers are restored from checkpoints to ensure they align correctly with the current training state.
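
A minimal sketch of the idea behind the two changes above, not the code in this PR. It assumes a checkpoint is a plain dictionary that stores optimizer and scheduler state under the keys `optimizer_states` and `lr_schedulers` (the layout used by PyTorch Lightning checkpoints). The function name `select_training_state` and the `restore_opt` flag are hypothetical and used only for illustration.

```python
from copy import deepcopy


def select_training_state(checkpoint, restore_opt,
                          fresh_optimizer_states, fresh_lr_schedulers):
    """Return a copy of `checkpoint` with the training state chosen by `restore_opt`.

    Hypothetical helper: the key names are an assumption about the
    checkpoint layout, not taken from this PR.
    """
    result = deepcopy(checkpoint)  # never mutate the caller's checkpoint
    if not restore_opt:
        # Keep the model weights, but let the optimizer and scheduler
        # start from freshly initialised state.
        result["optimizer_states"] = deepcopy(fresh_optimizer_states)
        result["lr_schedulers"] = deepcopy(fresh_lr_schedulers)
    return result


# Toy checkpoint with made-up values, for illustration only.
checkpoint = {
    "state_dict": {"layer.weight": 0.5},
    "optimizer_states": [{"step": 500}],
    "lr_schedulers": [{"last_epoch": 10}],
}
fresh_opt = [{"step": 0}]
fresh_sched = [{"last_epoch": 0}]

resumed = select_training_state(checkpoint, True, fresh_opt, fresh_sched)
restarted = select_training_state(checkpoint, False, fresh_opt, fresh_sched)
```

With `restore_opt=True` the saved optimizer and scheduler state is carried over unchanged. With `restore_opt=False` only the model weights come from the checkpoint, so the scheduler starts in step with the new training run.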

Impact

These changes provide users with greater control over how training states are restored and improve the script's functionality in distributed training environments.

Testing

[x] Changes have been tested in both single and multi-GPU setups

Notes

Further integration testing with different training configurations is recommended to fully validate the new functionality.

joeloskarsson commented 1 month ago

I generally think this is a good change, but I have not had the opportunity to test it.

sadamov commented 1 month ago

I implemented the review comments and made sure that the code works again with the latest main branch. This small change in how checkpoints are loaded made it much more robust for multi-node training for me. I think we can merge this PR if your test runs are successful as well @joeloskarsson

joeloskarsson commented 1 month ago

Looking good now! I tested it and it seems to work without any issues, loading the correct state. Add an entry in the changelog for this and we can merge.

sadamov commented 1 month ago

OK, the changelog is updated. Ready for merge @joeloskarsson

mllam / neural-lam