Load all of GSP and PV (for the appropriate time period) into RAM at init (I think they're small enough to fit in RAM).
Then I can train the U-Net on images 256 wide x 128 high,
and try examples where half have PV data, half don't. Or maybe a third have no PV, a third have just PV (centred on one PV system), and a third have PV & GSP (although maybe don't start with this, as it sounds complicated!).
12 history steps (or more!).
Compute the loss on the whole 256x128 image, and add the loss on the central 64x64 crop.
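A minimal sketch of that combined loss, assuming predictions and targets shaped `(batch, time, height=128, width=256)` (the function name and MSE choice are illustrative, not from the codebase):

```python
import torch
import torch.nn.functional as F

def combined_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """MSE over the full 256x128 image, plus MSE over the central 64x64 crop."""
    full_loss = F.mse_loss(pred, target)
    h, w = pred.shape[-2], pred.shape[-1]
    top, left = (h - 64) // 2, (w - 64) // 2
    ys, xs = slice(top, top + 64), slice(left, left + 64)
    crop_loss = F.mse_loss(pred[..., ys, xs], target[..., ys, xs])
    return full_loss + crop_loss
```

Adding the crop term simply up-weights the centre of the image relative to the edges.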
Could compute an intermediate NWP Zarr with downsampled data.
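One way to build that intermediate Zarr, sketched with xarray's `coarsen` (the coarsening factor, variable names, and paths are assumptions):

```python
import xarray as xr

def downsample_nwp(nwp: xr.Dataset, factor: int = 4) -> xr.Dataset:
    """Block-average the NWP grid by `factor` along x and y."""
    return nwp.coarsen(x=factor, y=factor, boundary="trim").mean()

# The downsampled dataset could then be written once, e.g.:
# downsample_nwp(xr.open_zarr(nwp_path)).to_zarr(downsampled_path)
```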
Sketch of an implementation idea:
Implement this in a new sub-module, e.g. `power_perceiver.load_on_the_fly`.
Have a main `torch.utils.data.Dataset`. We tell it which combinations of data we want, and the probability of loading each, e.g. something like:
```python
data_loaders = dict(
    just_hrv=(HRVSatellite(),),
    hrv_and_pv=(PV(transforms=[Downsample()]), HRVSatellite()),
    hrv_and_pv_and_gsp=(GSP(), PV(transforms=[Downsample()]), HRVSatellite()),
)
probabilities = dict(just_hrv=0.2, hrv_and_pv=0.2, hrv_and_pv_and_gsp=0.6)
```

(where the first in each tuple is the one we use to select the locations)
At the start of each epoch, the Dataset asks each DataLoader for a list of available time periods. At init, the Dataset computes the intersection of those time periods for each combination. e.g., when using hrv_and_pv, it will not consider the time periods for gsp. We can re-use code from nowcasting_dataloader.
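The intersection step could look something like this pure-Python stand-in (the real code would re-use `nowcasting_dataloader`; this just illustrates the interval logic, and works for datetimes or anything comparable):

```python
def intersect_periods(a, b):
    """Intersect two sorted lists of (start, end) periods."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever period ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out
```

For a combo with more than two sources, fold this pairwise across all of them.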
For DataLoaders that only load a subset of days per epoch (such as satellite) we need to only use that subset of days when generating examples.
For each example, it randomly picks which type of example it will produce. It then asks the first DataLoader for that example type to select a random location (in OSGB coords). The Dataset picks a random t0 (from the appropriate intersection of time periods). Each DataLoader receives the t0 and the location. We can re-use code from nowcasting_dataloader.
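The combo-and-t0 sampling step could be sketched like this (function and argument names are hypothetical; with datetime starts/ends, `start + (end - start) * r` works because `timedelta * float` is defined):

```python
import random

def sample_combo_and_t0(probabilities, periods_per_combo, rng: random.Random):
    # Pick which combo of data sources this example uses,
    # weighted by the configured probabilities.
    names = list(probabilities)
    combo = rng.choices(names, weights=[probabilities[n] for n in names])[0]
    # Pick a t0 uniformly from that combo's intersected time periods.
    start, end = rng.choice(periods_per_combo[combo])
    t0 = start + (end - start) * rng.random()
    return combo, t0
```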
Each DataLoader has its own history_duration and forecast_duration (e.g. for the satellite we might want 1 hr of history and 4 hours of forecast. But for NWP we might want 1 day of hist, and 2 days of forecast).
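Those per-source durations might be captured in a small config object like this (a hypothetical sketch, not the actual class):

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class DataSourceDurations:
    """Each data source carries its own history and forecast windows."""
    history_duration: timedelta
    forecast_duration: timedelta

    def total_duration(self) -> timedelta:
        return self.history_duration + self.forecast_duration

# e.g. satellite: 1 hr of history, 4 hrs of forecast;
# NWP: 1 day of history, 2 days of forecast.
sat = DataSourceDurations(timedelta(hours=1), timedelta(hours=4))
nwp = DataSourceDurations(timedelta(days=1), timedelta(days=2))
```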
Need to be able to use the existing transforms and xr_batch_processors and np_batch_processors.
Need to be able to use the pre-prepared validation batches.
Re-use the existing power_perceiver.DataLoader.to_numpy code if possible
Need to tell the ML model which examples are which, so we can mask the queries & loss.
Also need to fill in "dummy data" when a data source is missing.
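The dummy-data fill might be as simple as a NaN array of the right shape (a stand-in sketch for what `get_empty_example()` could return; the name comes from the TODO list below, the signature is assumed):

```python
import numpy as np

def get_empty_example(shape: tuple, dtype=np.float32) -> np.ndarray:
    # Fill a missing data source with NaNs so the model and loss
    # can mask it out downstream.
    return np.full(shape, np.nan, dtype=dtype)
```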
TODO:
[x] Copy GSP and PV Zarrs from leonardo to my desktop so I can test against the data
[x] Copy GSP and PV Zarrs and NWP Zarr from leonardo to GCP (first check if GCP already has what's needed)
[x] Start sketching out the `torch.utils.data.Dataset` and the superclass for the `DataLoader`. Use the same method names and signatures as `nowcasting_dataset` whenever possible, to make it easier to merge this new code into `nowcasting_dataset` if need be.
[x] Think through the business logic before actually writing any code!
[x] move satellite_zarr_dataset.py to load_on_the_fly/
[x] #66
[x] Rename `data_loader` to `data_sources`; and `DataLoader` to `DataSource` (to be more consistent with `nowcasting_dataset`).
[ ] Do we want a main DataSource superclass, which PreparedDataSource and RawDataSource inherit from? If nothing else, this would allow us to define XarrayBatch properly!
[x] #67
[x] #68
In RawDataset._get_example:
[x] In _get_xarray_example: Loop round the other data sources calling get_empty_example().
[ ] Tell the ML model which type of "combo" this is. (actually, I don't think that's necessary? In the ML model objective function, we can just see which examples are NaN?)
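If we go the NaN route, the objective function could mask on NaN targets directly, roughly like this (a sketch, assuming the dummy fill is NaN and the loss is MSE):

```python
import torch

def masked_mse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Examples filled with dummy NaNs contribute nothing to the loss,
    # so we never need an explicit "combo type" flag.
    mask = ~torch.isnan(target)
    return torch.mean((pred[mask] - target[mask]) ** 2)
```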
[ ] #78
[ ] (or, if I go off doing #78 for some reason:) Now that we're always using different `RawDataSource` instances in `RawDataset`, we can probably simplify some of the code in `RawDataset`. We can probably get rid of `_unique_data_sources`. We can just use the data sources from each combo, without having to check if each is unique, because we now guarantee that each combo will have its own instances. Although maybe we should assert that in `_sanity_check_args`.
[x] Write a unit test which builds a RawDataset with sat_only and gsp_sat_pv combos, and the batch processors I'm planning to use.