Closed JackKelly closed 2 years ago
https://github.com/openclimatefix/nowcasting_dataset/issues/213#issuecomment-939807558 from big issue
Idea is to use optional requirements for pytorch
If it's OK, I'll keep this issue open until we've removed the pytorch dataloader and pytorch lightning from the "batch pre-processing" code :)
sure thing, where is that?
The specific places where pytorch / pytorch lightning are still used are:

- NowcastingDataModule inherits from pl.LightningDataModule. I think we can remove the dependency on pl.LightningDataModule and have NowcastingDataModule inherit from nothing.
- NowcastingDataModule.train_dataloader(), val_dataloader(), and test_dataloader() all return torch.utils.data.DataLoader objects.
- NowcastingDataset inherits from torch.utils.data.IterableDataset.

I might strip out these PyTorch things in one of the sub-steps of #202 (but I haven't fully thought this through!)
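To make the proposed change concrete, here's a minimal hypothetical sketch (not the actual nowcasting_dataset code) of what the classes might look like once the PyTorch base classes and DataLoader return types are removed: the data module inherits from nothing, and the dataset is a plain Python iterable.

```python
# Hypothetical sketch: NowcastingDataModule without pl.LightningDataModule,
# and NowcastingDataset as a plain iterable instead of
# torch.utils.data.IterableDataset. Names mirror the issue, bodies are stand-ins.
class NowcastingDataset:
    """A plain Python iterable of batches (no PyTorch base class)."""

    def __init__(self, n_batches: int):
        self.n_batches = n_batches

    def __iter__(self):
        for i in range(self.n_batches):
            # Stand-in for a real batch of satellite / NWP / PV data.
            yield {"batch_index": i}


class NowcastingDataModule:  # inherits from nothing
    def __init__(self, n_train: int = 2, n_val: int = 1):
        self.n_train = n_train
        self.n_val = n_val

    def train_dataloader(self):
        # Returns a plain iterable rather than torch.utils.data.DataLoader.
        return NowcastingDataset(self.n_train)

    def val_dataloader(self):
        return NowcastingDataset(self.n_val)
```

With this shape, downstream code can still write `for batch in dm.train_dataloader(): ...` unchanged, but nothing in the batch-preparation path imports torch.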
Maybe it's not quite closed; it would be good to remove it from the requirements too. I'll have a go at this.
Oops, you're exactly right, sorry - this issue should still be open!
FWIW, these are the lines where "torch" is still mentioned in our code:
When I started nowcasting_dataset, the intention was to use nowcasting_dataset to generate batches on-the-fly during ML training from separate Zarr stores for the satellite data, NWPs, and PV. But that turned out to be too slow and fragile :) So, we swapped to using nowcasting_dataset to pre-prepare batches ahead-of-time, and save them to disk. During ML training, we just need to load the batches from disk, and we're good to go. (Pre-preparing batches has a number of other advantages, too.)

But this development history means that nowcasting_dataset still uses PyTorch (e.g. using the PyTorch DataLoader to run multiple processes). The code may become cleaner and faster and more flexible if we strip out PyTorch, and instead (maybe) use concurrent.futures.ProcessPoolExecutor to use multiple processes.

TODO:
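The ProcessPoolExecutor idea above could look something like this. This is a hedged sketch with hypothetical function names (`prepare_batch`, `prepare_all_batches` are illustrations, not existing nowcasting_dataset functions): each worker process prepares one batch and writes it to disk, with no torch import needed.

```python
# Sketch: pre-preparing batches across processes with
# concurrent.futures.ProcessPoolExecutor instead of a PyTorch DataLoader.
from concurrent.futures import ProcessPoolExecutor


def prepare_batch(batch_idx: int) -> str:
    """Hypothetical stand-in: load source data, assemble one batch,
    save it to disk, and return the output filename."""
    return f"batch_{batch_idx:06d}.nc"


def prepare_all_batches(n_batches: int, max_workers: int = 4) -> list:
    """Prepare batches in parallel worker processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # executor.map() preserves input order, so filenames come back
        # in batch-index order even though workers run concurrently.
        return list(executor.map(prepare_batch, range(n_batches)))
```

The real `prepare_batch` would do the heavy I/O (reading Zarr stores, writing NetCDF/Zarr batches), which is exactly the work that benefits from multiple processes.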