Open JackKelly opened 2 years ago
Hey Jack, I've been looking over this as part of my project. It's interesting to see this discussion while I'm trying to do something similar. Earlier you said:
I'm perhaps becoming sceptical of the value of predicting full satellite images
Can I ask what the motivation was for forecasting full satellite images in the first place, assuming the eventual goal is always to forecast PV?
Hi @bndxn! The main motivation for predicting satellite imagery was to allow us to pre-train the "satellite predictor" on any rectangle of satellite imagery, even locations with no PV systems (eg over the ocean!). This way, we hoped to get the model to learn cloud dynamics from as much data as possible. The idea is that predicting the movement of clouds is hard, so we want to train the model on as much data as possible.
I still very much believe in the general principle that we want to pre-train part of the ML model on as much satellite imagery as possible! But I now suspect that predicting pixel-by-pixel images is just too onerous. So it might be better to learn an encoder whose latent representation is maximally informative of the latent representation of future satellite imagery.
Thanks for getting back to me. Cool, that makes a lot of sense. I'm guessing that when there's no PV data, the label/target is the future imagery of the same rectangle, is that right?
I can see how this approach would help generate good predictions of cloud patterns, and I can see how big cloudy patches would have a direct link to PV. Are there some nuances in clouds that this approach would help for, but that don't matter very much from a PV perspective? For example, maybe the satellite imagery gets very good at predicting lots of fine thin cloud vs thin ripply cloud (I am not a cloud expert!) but actually from a PV yield view these two are the same - does that make sense?
-- Ben
That's a good example!
My hunch is that it's really hard to predict clouds on a pixel-by-pixel basis, so we're perhaps hurting the model by asking it to do something that's (almost) impossible! We might be better off predicting high-level, abstract features of the satellite imagery, so we can still learn how clouds evolve from all the available satellite data.
Cool, thanks, this is really helpful! Is this something that can be captured well with a probabilistic forecast? In this case, does the chain go something like this:
Yes, that looks right!
Although I'm proposing that we don't do Step 3. i.e. the model would never create an explicit pixel-wise prediction of satellite imagery. Instead the "satellite encoder" would be trained to produce latent representations that are maximally informative for future PV (and the future latent state of the satellite encoder) :slightly_smiling_face:
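To make the "no Step 3" idea concrete, here's a minimal shape-level sketch. All names and dimensions are hypothetical, and the random linear maps are stand-ins for real learned networks; the point is only that nothing in the pipeline ever decodes back to pixels:

```python
import numpy as np

rng = np.random.default_rng(0)

DIM_LATENT = 32
N_PIXELS = 64 * 64  # a flattened satellite rectangle (hypothetical size)

# Stand-in "networks": random linear maps (a real model would be learned).
W_encoder = rng.normal(size=(N_PIXELS, DIM_LATENT)) / np.sqrt(N_PIXELS)
W_future = rng.normal(size=(DIM_LATENT, DIM_LATENT)) / np.sqrt(DIM_LATENT)
w_pv = rng.normal(size=DIM_LATENT) / np.sqrt(DIM_LATENT)

def encode(image_flat):
    """Satellite encoder: pixels -> latent. No pixel decoder exists anywhere."""
    return image_flat @ W_encoder

def predict_future_latent(z_now):
    """Predict the encoder's latent at t+1 directly in latent space (skips Step 3)."""
    return z_now @ W_future

def predict_pv(z):
    """PV head: latent -> scalar PV yield estimate."""
    return z @ w_pv

image_t = rng.normal(size=N_PIXELS)
z_t = encode(image_t)
z_t1_hat = predict_future_latent(z_t)
pv_t1_hat = predict_pv(z_t1_hat)
```

The encoder is trained so that `z_t1_hat` matches the encoder's actual latent at t+1 (and is useful for PV), rather than reconstructing imagery.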
Please do shout if you're interested in trying any (or all!) of this!
PyTorch implementation of Contrastive Predictive Coding: https://github.com/rschwarz15/CPCV2-PyTorch
(Hat tip to @jacobbieker! Thanks Jacob!)
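For reference, the heart of CPC is the InfoNCE objective: score each predicted future latent against the true one, using the other samples in the batch as negatives. A toy numpy version (batch size, dimensions and temperature chosen arbitrarily for illustration):

```python
import numpy as np

def info_nce_loss(z_pred, z_true, temperature=0.1):
    """InfoNCE: each row's positive is its own true future latent;
    every other row in the batch serves as a negative."""
    # L2-normalise so the scores are cosine similarities.
    z_pred = z_pred / np.linalg.norm(z_pred, axis=1, keepdims=True)
    z_true = z_true / np.linalg.norm(z_true, axis=1, keepdims=True)
    logits = z_pred @ z_true.T / temperature        # (batch, batch)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))           # positives on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32))
loss_perfect = info_nce_loss(z, z)                  # predictions match exactly
loss_random = info_nce_loss(z, rng.normal(size=(8, 32)))
```

A perfect predictor drives the loss towards zero; a random one sits near log(batch size).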
(Actually, my new favourite idea is described in openclimatefix/psuedo-pv-labeller#1! I plan to try openclimatefix/psuedo-pv-labeller#1 before I try contrastive learning. openclimatefix/psuedo-pv-labeller#1 feels like it has many of the same advantages, without some of the complexity of contrastive learning)
Cool, that sounds really interesting. At the moment I'm working with a subset of the data and doing a ConvLSTM to predict PV directly going forwards.
My plan is to compare this with a simplified version of OCF's CNN in production (maybe only taking sat images and the solar altitude as inputs). The idea is to measure how important the explicit representation of the time series in ConvLSTMs is, compared to a 3D CNN.
If I can wrangle that in the next few weeks, then maybe I'll include a comparison with an attention model. If I finish all of that before September, then yeah sure! This will all provide helpful context anyway.
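To make that comparison concrete: the two architectures mainly differ in how they treat the time axis. A ConvLSTM rolls over frames one step at a time, carrying a hidden state, while a 3D CNN convolves over time like any other spatial axis. A toy numpy illustration of the 3D-convolution view (all shapes are made up):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)

# A short sequence of single-channel satellite frames: (time, height, width).
frames = rng.normal(size=(6, 16, 16))

# One 3x3x3 kernel convolved over (time, H, W) jointly. A ConvLSTM would
# instead consume the 6 frames sequentially, updating a hidden state.
kernel = rng.normal(size=(3, 3, 3))
windows = sliding_window_view(frames, (3, 3, 3))     # (4, 14, 14, 3, 3, 3)
feature_map = np.einsum('thwijk,ijk->thw', windows, kernel)
```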
Sounds awesome! Please do let us know how it goes! Very exciting stuff!
https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model
Motivations:
Implementation:
Pre-training the video encoder:
Use the video encoder to predict PV:
To predict national PV (and demand?) use several encoders in parallel, in non-overlapping patches, to "see" the entire country? Feed these all into a single "future decoder". Perhaps need to use a hierarchical Perceiver.
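A shape-level sketch of the patching idea, assuming a hypothetical national image split into non-overlapping tiles, each run through the same encoder (a dummy summary function here) before everything feeds one shared decoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical national satellite image and patch size.
national = rng.normal(size=(128, 128))
PATCH = 32

def encode_patch(patch):
    """Dummy per-patch encoder: a 4-number summary stands in for a latent."""
    return np.array([patch.mean(), patch.std(), patch.min(), patch.max()])

# Split into non-overlapping patches and encode each one.
n = national.shape[0] // PATCH                                # 4 tiles per side
tiles = national.reshape(n, PATCH, n, PATCH).swapaxes(1, 2)   # (4, 4, 32, 32)
latents = np.stack([encode_patch(t) for t in tiles.reshape(-1, PATCH, PATCH)])

# All patch latents feed a single shared "future decoder" (a dot product here).
w_decoder = rng.normal(size=latents.size)
national_pv_hat = latents.ravel() @ w_decoder
```

A hierarchical Perceiver would replace the final dot product with cross-attention over the patch latents, which scales better as the number of patches grows.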
Why not just directly predict future satellite imagery? Because I think it's too onerous to predict individual pixels, and we don't need individual pixels. We just need to extract a representation from the satellite that is informative of future PV.