openclimatefix / ocf_datapipes

OCF's DataPipe based dataloader for training and inference
MIT License

Use Disk Cache Datapipe to save examples/batches to disk #178

Closed jacobbieker closed 9 months ago

jacobbieker commented 1 year ago

To speed up training, it could be useful to write some examples or batches to disk so they can be loaded back quickly. There are built-in datapipes that cache the outputs of other datapipes, so we could cache only certain parts of the pipeline, or only the final torch Tensors.

Detailed Description

The relevant built-in datapipes are torchdata's OnDiskCacheHolder and EndOnDiskCacheHolder.
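To make the idea concrete, here is a minimal sketch of caching examples to disk using only the Python standard library. It is not the torchdata API and not how ocf_datapipes does it. The names write_examples, read_examples, cached_examples and the "_complete" marker file are all hypothetical, and pickle is assumed to be an acceptable format for the examples.

```python
import pickle
from pathlib import Path
from typing import Any, Callable, Iterable, Iterator

DONE_MARKER = "_complete"  # hypothetical marker file: the cache was fully written


def write_examples(examples: Iterable[Any], cache_dir: Path) -> int:
    """Pickle each example to its own file and return how many were written."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for count, example in enumerate(examples, start=1):
        # Zero-padded names keep lexicographic order equal to write order
        # (assumes fewer than one million examples).
        with (cache_dir / f"example_{count - 1:06d}.pkl").open("wb") as f:
            pickle.dump(example, f)
    (cache_dir / DONE_MARKER).touch()
    return count


def read_examples(cache_dir: Path) -> Iterator[Any]:
    """Yield cached examples in the order they were written."""
    for path in sorted(cache_dir.glob("example_*.pkl")):
        with path.open("rb") as f:
            yield pickle.load(f)


def cached_examples(
    make_examples: Callable[[], Iterable[Any]], cache_dir: Path
) -> Iterator[Any]:
    """Run the slow pipeline only if no complete cache exists, then read from disk."""
    if not (cache_dir / DONE_MARKER).exists():
        write_examples(make_examples(), cache_dir)
    yield from read_examples(cache_dir)
```

Here make_examples stands in for whatever builds the slow Zarr-backed pipeline. The first pass pays the full loading cost and later passes read only the pickled files.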

Context

We still have an issue with the loading speed of Zarr-based data. Caching could be used in a few ways. One would be to generate a dataset to initially train models on, and then switch to training from the full pipeline for fine-tuning, even though it's slower. Another would be to mix examples loaded off disk with examples loaded from the raw data.
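The mixing option could look roughly like the sketch below. It assumes two iterables, one of cached examples and one of freshly loaded ones. The function name mix_examples and its cached_fraction and seed parameters are hypothetical and do not come from ocf_datapipes or torchdata.

```python
import random
from typing import Any, Iterable, Iterator, Optional


def mix_examples(
    cached: Iterable[Any],
    fresh: Iterable[Any],
    cached_fraction: float = 0.5,
    seed: Optional[int] = None,
) -> Iterator[Any]:
    """Randomly interleave cached and fresh examples.

    Each step draws from `cached` with probability `cached_fraction`,
    otherwise from `fresh`. Iteration stops as soon as the chosen
    source is exhausted.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    rng = random.Random(seed)
    cached_it, fresh_it = iter(cached), iter(fresh)
    while True:
        source = cached_it if rng.random() < cached_fraction else fresh_it
        try:
            yield next(source)
        except StopIteration:
            return
```

A high cached_fraction keeps most of the speed benefit of the disk cache while still exposing the model to some examples drawn from the full pipeline.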

Possible Implementation

jacobbieker commented 9 months ago

We have different ways of doing this now, primarily in windnet_datapipe.