openclimatefix / power_perceiver

Machine learning experiments using the Perceiver IO model to forecast the electricity system (starting with solar)
MIT License

Load multiple prepared datasets #6

Open JackKelly opened 2 years ago

JackKelly commented 2 years ago

Detailed Description

A large part of my hope for the ML research we're doing in 2022 is to train across multiple "types" of prepared dataset. For example:

Context

To train our models to predict future satellite imagery, we probably want to use the entire geographical extent of the satellite imagery.

But we also want to predict PV in the UK, Italy and Malta.

So we might want each batch to contain a mix of examples: some from the UK (as is the case now), and some from anywhere in the geographical extent of the satellite imagery (including over oceans) without any PV.

At the moment, nowcasting_dataset can't do this "mixture".

The simplest way to do this might actually be to leave nowcasting_dataset mostly alone, and produce multiple different sets of batches (one set over the UK; the other set without PV data, and from the entire geo extent of the imagery). Then power_perceiver will load multiple batches at once. This has the advantage that we can quickly experiment with dynamically changing the ratio of "UK" to "non-UK" imagery as training progresses.

But this simpler approach still requires that we update nowcasting_dataset a bit (e.g. to randomly sample locations from the entire geo extent of the satellite imagery).
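To make the "dynamically changing the ratio" idea concrete, here is a minimal sketch of a linear schedule for how many "UK" examples go into each batch as training progresses. The function name, the default fractions and the batch size of 32 are all assumptions for illustration; nothing like this exists in power_perceiver yet.

```python
def n_uk_examples(
    step: int,
    total_steps: int,
    batch_size: int = 32,      # assumed batch size
    start_frac: float = 0.25,  # assumed fraction of "UK" examples at step 0
    end_frac: float = 0.75,    # assumed fraction of "UK" examples at the end
) -> int:
    """Hypothetical schedule: number of "UK" examples in the batch at `step`."""
    # Clamp progress to [0, 1] so steps beyond total_steps keep the final ratio.
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Linearly interpolate the "UK" fraction between start and end.
    frac = start_frac + (end_frac - start_frac) * progress
    return round(frac * batch_size)
```

The remaining `batch_size - n_uk_examples(...)` examples would come from the "non-UK" set of batches. Because the two sets are prepared separately, changing the schedule would not need any data to be regenerated.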

Possible Implementation

Maybe implement a thin adaptor which holds multiple power_perceiver.NowcastingDataset instances, and itself inherits from torch.utils.data.Dataset. This thin adaptor would sample randomly from the upstream power_perceiver.NowcastingDataset instances and stack the Tensors. So for example, if we're combining "just satellite" data and "satellite + PV + GSP + NWP" then, say, the first 16 examples in each batch would be "just satellite", and the first 16 examples for PV, GSP, and NWP would be zeros (and would be masked out before the batch goes into the Perceiver).
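A rough sketch of what that adaptor could look like. This is only an illustration under assumptions: it assumes each upstream dataset returns one whole batch as a dict of Tensors with the example dimension first, and the class name, the dict keys (`satellite`, `pv`, `gsp`, `nwp`) and the `has_power_data` mask are all made up here, not the real power_perceiver.NowcastingDataset interface.

```python
import random

import torch
from torch.utils.data import Dataset


class MixedBatchDataset(Dataset):
    """Hypothetical thin adaptor mixing "just satellite" and "full" batches."""

    def __init__(self, sat_only, full, n_sat_only=16,
                 masked_keys=("pv", "gsp", "nwp")):
        self.sat_only = sat_only        # upstream dataset: satellite only
        self.full = full                # upstream dataset: satellite + PV + GSP + NWP
        self.n_sat_only = n_sat_only    # leading examples taken from sat_only
        self.masked_keys = masked_keys  # assumed names of the non-satellite keys

    def __len__(self):
        return min(len(self.sat_only), len(self.full))

    def __getitem__(self, idx):
        # Sample randomly from each upstream dataset (idx is ignored here).
        sat_batch = self.sat_only[random.randrange(len(self.sat_only))]
        full_batch = self.full[random.randrange(len(self.full))]
        n = self.n_sat_only

        out = {}
        # The first n examples are "just satellite"; the rest come from `full`.
        out["satellite"] = torch.cat(
            [sat_batch["satellite"][:n], full_batch["satellite"][n:]]
        )
        for key in self.masked_keys:
            data = full_batch[key].clone()
            data[:n] = 0  # zeros for the "just satellite" examples
            out[key] = data

        # Boolean mask so the model can ignore the zeroed examples.
        mask = torch.zeros(out["satellite"].shape[0], dtype=torch.bool)
        mask[n:] = True
        out["has_power_data"] = mask
        return out
```

Since each item is already a full batch, this would be wrapped in a `torch.utils.data.DataLoader` with `batch_size=None`. Making `n_sat_only` a plain attribute means the ratio could be changed between epochs without touching the prepared data.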

peterdudfield commented 2 years ago

Sounds really great, and good to get it planned out.

I would love to get https://github.com/openclimatefix/nowcasting_dataset/pull/562 into the new datasets, but unfortunately I've run out of time before my holiday.