mllam / neural-lam

Neural Weather Prediction for Limited Area Modeling
MIT License

Feature: Robust restoration of optimizer and scheduler #17

Closed: sadamov closed this 1 month ago

sadamov commented 1 month ago

Summary

This pull request improves model loading and the restoration of optimizer and scheduler state, making both more flexible and more compatible with multi-GPU setups.

Detailed Changes

Impact

These changes provide users with greater control over how training states are restored and improve the script's functionality in distributed training environments.
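To make the idea of controlled state restoration concrete, here is a minimal sketch, not the code from this PR. It assumes a PyTorch Lightning style checkpoint dictionary with the keys `state_dict`, `optimizer_states` and `lr_schedulers`. The function name `restore_training_state`, the `restore_opt` flag and the `_Stateful` stand-in class are hypothetical and exist only for illustration. In a real run the model, optimizer and scheduler would be the actual training objects, which expose the same `load_state_dict` method.

```python
from typing import Any, Dict, List, Optional


class _Stateful:
    """Hypothetical stand-in for a model, optimizer or scheduler."""

    def __init__(self) -> None:
        self.state: Optional[Dict[str, Any]] = None

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self.state = dict(state)


def restore_training_state(
    checkpoint: Dict[str, Any],
    model: Any,
    optimizer: Any = None,
    scheduler: Any = None,
    restore_opt: bool = True,
) -> List[str]:
    """Load model weights and, optionally, optimizer/scheduler state.

    Returns the names of the components that were actually restored.
    """
    # Model weights are always restored.
    model.load_state_dict(checkpoint["state_dict"])
    restored = ["model"]

    # Assumed flag: skip optimizer and scheduler state entirely if unset.
    if not restore_opt:
        return restored

    # Restore only what is present in the checkpoint; missing or empty
    # entries are skipped instead of raising an error.
    opt_states = checkpoint.get("optimizer_states") or []
    if optimizer is not None and opt_states:
        optimizer.load_state_dict(opt_states[0])
        restored.append("optimizer")

    sched_states = checkpoint.get("lr_schedulers") or []
    if scheduler is not None and sched_states:
        scheduler.load_state_dict(sched_states[0])
        restored.append("scheduler")

    return restored


# Usage: a checkpoint that carries optimizer state but no scheduler state.
ckpt = {
    "state_dict": {"w": 1.0},
    "optimizer_states": [{"lr": 0.001}],
    "lr_schedulers": [],
}
model, opt, sched = _Stateful(), _Stateful(), _Stateful()
result = restore_training_state(ckpt, model, opt, sched)
weights_only = restore_training_state(ckpt, _Stateful(), opt, sched, restore_opt=False)
```

Returning the list of restored components is a design choice in this sketch: it lets the caller log exactly which parts of the training state came from the checkpoint, which helps when comparing behaviour across ranks in a distributed run.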

Testing

[x] Changes have been tested in both single and multi-GPU setups

Notes

Further integration testing with different types of training configurations is recommended to fully validate the new functionalities.

joeloskarsson commented 1 month ago

I generally think this is a good change, but I have not had the opportunity to test it yet.

sadamov commented 1 month ago

I implemented the review comments and made sure that the code works again with the latest main branch. This small change in how checkpoints are loaded made it much more robust for multi-node training for me. I think we can merge this PR if your test runs are successful as well @joeloskarsson

joeloskarsson commented 1 month ago

Looking good now! I tested it and it seems to work without any issues, loading the correct state. Add an entry in the changelog for this and we can merge.

sadamov commented 1 month ago

OK, the changelog is updated. Ready for merge @joeloskarsson