rutkovskii commented 2 weeks ago

Pull Request

Description

This pull request introduces several enhancements and fixes to the DGMR project, focusing on optimization, logging, and data processing. The key updates include:

Memory Optimization:
- Replaced direct calls to self.forward with torch.utils.checkpoint.checkpoint to enable gradient checkpointing and reduce memory consumption during training. Contributed by my colleague @xuzhe951024.
Improved Logging:
- Removed deprecated logger checks in run.py and simplified logger initialization so that multiple loggers are not initialized in a multi-GPU environment.
Data Loading Enhancements:
- Updated TFDataset initialization to include trust_remote_code for compatibility with remote dataset loading.
- Added configurable batch size, enabling dynamic adjustments during training.
Code Cleanup:
- Consolidated the __main__ block for better readability and modularity.
- Added default values for batch size and streamlined DataLoader creation.
Dependencies:
- Added wandb, datasets, and tensorflow to requirements.txt to support new functionalities.
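
For reviewers unfamiliar with gradient checkpointing, here is a minimal sketch of the idea. It is not the code from this PR: `TinyBlock` and `run_block` are hypothetical names used only for illustration, and the `use_reentrant=False` argument assumes a reasonably recent PyTorch release. The wrapped module's intermediate activations are discarded in the forward pass and recomputed during backward, trading extra compute for lower memory use.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class TinyBlock(nn.Module):
    """Hypothetical stand-in for a memory-heavy sub-module (not part of DGMR)."""

    def __init__(self, width: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(width, width),
            nn.ReLU(),
            nn.Linear(width, width),
        )

    def forward(self, x):
        return self.net(x)


def run_block(block: nn.Module, x: torch.Tensor, use_checkpoint: bool = True):
    """Call `block` either directly or through gradient checkpointing."""
    if use_checkpoint:
        # Activations inside `block` are not kept; they are recomputed in backward.
        return checkpoint(block, x, use_reentrant=False)
    return block(x)


torch.manual_seed(0)
block = TinyBlock()
x = torch.randn(4, 32, requires_grad=True)

# Backward pass through the checkpointed call.
run_block(block, x, use_checkpoint=True).sum().backward()
grad_checkpointed = x.grad.clone()

# Same computation without checkpointing, for comparison.
x.grad = None
block.zero_grad()
run_block(block, x, use_checkpoint=False).sum().backward()
grad_plain = x.grad.clone()
```

Both paths should produce the same gradients; only the peak activation memory and the amount of recomputation differ.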

Fixes # (Include the relevant issue ID if applicable)

How Has This Been Tested?

The changes were tested using the following methods:

Training Runs:
- Ran fast_dev_run successfully on MacBook Pro M2.
- Ran full training for 4 days on 2 nodes with 4 GPUs each (8 NVIDIA A100-80GB GPUs in total) using the DDP strategy, with generation step = 6, batch size = 16, and precision = 32.
Data Pipeline:
- Confirmed that the modified DataLoader processes batches correctly and handles remote datasets without errors.

Steps to reproduce:

Set up the environment using the updated requirements.txt.
Run the run.py script with the default configuration.
Monitor Wandb logs for training metrics and validate output consistency.
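
As a rough illustration of what a default configuration with a single logger could look like, here is a sketch of an entry point. It is not the actual run.py: the function names, the `--batch-size` flag, the default of 16, and the logger name are all assumptions made for this example, and it assumes the launcher exports a `RANK` environment variable the way torchrun does.

```python
import argparse
import logging
import os


def parse_args(argv=None):
    """Parse command-line options; the flag name and default are illustrative."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    parser.add_argument("--batch-size", type=int, default=16,
                        help="samples per batch on each process")
    return parser.parse_args(argv)


def is_global_zero(environ=None):
    """Return True on the rank-0 process (or when no RANK is set)."""
    environ = os.environ if environ is None else environ
    return int(environ.get("RANK", "0")) == 0


def build_logger(environ=None):
    """Create a logger on rank 0 only, so multi-GPU runs get a single logger."""
    if not is_global_zero(environ):
        return None
    logger = logging.getLogger("dgmr.train")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        # Guard against attaching duplicate handlers on repeated calls.
        logger.addHandler(logging.StreamHandler())
    return logger


def main(argv=None, environ=None):
    """Single entry point: parse options, then set up logging once."""
    args = parse_args(argv)
    logger = build_logger(environ)
    if logger is not None:
        logger.info("batch size: %d", args.batch_size)
    return args, logger
```

Calling `main()` from an `if __name__ == "__main__":` guard would use the real command line and environment; passing `argv` and `environ` explicitly keeps the function easy to exercise in isolation.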

Have you plotted any changes?

[ ] Yes

Checklist:

[x] My code follows OCF's coding style guidelines
[x] I have performed a self-review of my own code
[x] I have made corresponding changes to the documentation
[ ] I have added tests that prove my fix is effective or that my feature works
[x] I have checked my code and corrected any misspellings

rutkovskii commented 2 weeks ago

@jacobbieker Hi Jacob, here is the comment on this PR: https://github.com/openclimatefix/skillful_nowcasting/issues/59#issuecomment-2486632896

rutkovskii commented 2 weeks ago

@jacobbieker Glad to help! I believe only you can merge it into the main branch from here.

rutkovskii commented 1 week ago

@jacobbieker would it be possible to add me to the list of contributors?

I am also looking to cite this repository in my thesis, and adding a CITATION.cff file could be useful for others who cite your work in the future. https://citation-file-format.github.io/ https://citation-file-format.github.io/cff-initializer-javascript/#/

jacobbieker commented 1 week ago

Yes, of course! The comment above should trigger the bot. I've also added a CITATION.cff file now too, so hopefully that helps!

jacobbieker commented 1 week ago

@all-contributors please add @rutkovskii for code

allcontributors[bot] commented 1 week ago

@jacobbieker

I've put up a pull request to add @rutkovskii! :tada:

rutkovskii commented 1 week ago

Thank you very much!

openclimatefix / skillful_nowcasting

Gradient Checkpointing, Improved Logging, and Data Pipeline Updates #77
