Closed JackKelly closed 2 years ago
https://github.com/openclimatefix/nowcasting_dataset/issues/213#issuecomment-939807558 from big issue
Idea is to use optional requirements for pytorch
If it's OK, I'll keep this issue open until we've removed the pytorch dataloader and pytorch lightning from the "batch pre-processing" code :)
sure thing, where is that?
The specific places where pytorch / pytorch lightning are still used are:

- NowcastingDataModule inherits from pl.LightningDataModule. I think we can remove the dependency on pl.LightningDataModule and have NowcastingDataModule inherit from nothing.
- NowcastingDataModule.train_dataloader(), val_dataloader(), and test_dataloader() all return torch.utils.data.DataLoader objects.
- NowcastingDataset inherits from torch.utils.data.IterableDataset.

I might strip out these PyTorch things in one of the sub-steps of #202 (but I haven't fully thought this through!)
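To make the proposed change concrete, here's a minimal hypothetical sketch (not the actual nowcasting_dataset code) of what the classes might look like once the PyTorch base classes and DataLoader return types are removed: the data module inherits from nothing, and the dataset is a plain Python iterable.

```python
# Hypothetical sketch: NowcastingDataModule without pl.LightningDataModule,
# and NowcastingDataset as a plain iterable instead of
# torch.utils.data.IterableDataset. Names mirror the issue, bodies are stand-ins.
class NowcastingDataset:
    """A plain Python iterable of batches (no PyTorch base class)."""

    def __init__(self, n_batches: int):
        self.n_batches = n_batches

    def __iter__(self):
        for i in range(self.n_batches):
            # Stand-in for a real batch of satellite / NWP / PV data.
            yield {"batch_index": i}


class NowcastingDataModule:  # inherits from nothing
    def __init__(self, n_train: int = 2, n_val: int = 1):
        self.n_train = n_train
        self.n_val = n_val

    def train_dataloader(self):
        # Returns a plain iterable rather than torch.utils.data.DataLoader.
        return NowcastingDataset(self.n_train)

    def val_dataloader(self):
        return NowcastingDataset(self.n_val)
```

With this shape, downstream code can still write `for batch in dm.train_dataloader(): ...` unchanged, but nothing in the batch-preparation path imports torch.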
Maybe it's not quite closed; it would be good to remove it from the requirements too. I'll have a go at this.
Oops, you're exactly right, sorry - this issue should still be open!
FWIW, these are the lines where "torch" is still mentioned in our code:
When I started nowcasting_dataset, the intention was to use nowcasting_dataset to generate batches on-the-fly during ML training from separate Zarr stores for the satellite data, NWPs, and PV. But that turned out to be too slow and fragile :) So, we swapped to using nowcasting_dataset to pre-prepare batches ahead-of-time, and save them to disk. During ML training, we just need to load the batches from disk, and we're good to go. (Pre-preparing batches has a number of other advantages, too.)

But this development history means that nowcasting_dataset still uses PyTorch (e.g. using the PyTorch DataLoader to run multiple processes). The code may become cleaner and faster and more flexible if we strip out PyTorch, and instead (maybe) use concurrent.futures.ProcessPoolExecutor to use multiple processes.

TODO:
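The ProcessPoolExecutor idea above could look something like this. This is a hedged sketch with hypothetical function names (`prepare_batch`, `prepare_all_batches` are illustrations, not existing nowcasting_dataset functions): each worker process prepares one batch and writes it to disk, with no torch import needed.

```python
# Sketch: pre-preparing batches across processes with
# concurrent.futures.ProcessPoolExecutor instead of a PyTorch DataLoader.
from concurrent.futures import ProcessPoolExecutor


def prepare_batch(batch_idx: int) -> str:
    """Hypothetical stand-in: load source data, assemble one batch,
    save it to disk, and return the output filename."""
    return f"batch_{batch_idx:06d}.nc"


def prepare_all_batches(n_batches: int, max_workers: int = 4) -> list:
    """Prepare batches in parallel worker processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # executor.map() preserves input order, so filenames come back
        # in batch-index order even though workers run concurrently.
        return list(executor.map(prepare_batch, range(n_batches)))
```

The real `prepare_batch` would do the heavy I/O (reading Zarr stores, writing NetCDF/Zarr batches), which is exactly the work that benefits from multiple processes.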