Pseudo-label satellite imagery using an ML model. Then predict future PV using those labels

JackKelly commented 2 years ago

We've known for a little while that an ML model can do surprisingly well at inferring PV power at time t from satellite imagery at time t (and even better from a handful of satellite images, centred on time t.)

OpenAI's new paper

In OpenAI's new paper on "learning to play minecraft with video pre-training", the authors train a model to infer actions from minecraft videos. (Yannic Kilcher has a great video on the paper.) This model is a 3D CNN -> ResNet -> transformer. Crucially, this model sees the future. Which makes it much more reliable at inferring actions. The authors use this model to pseudo-label tens of thousands of hours of minecraft videos from YouTube. The authors use these pseudo-labels to pre-train another model to actually play minecraft, and finally this model is fine-tuned on human-labelled data.

Pseudo-labelling satellite images with a "latent embedding of irradiance".

Maybe we could do something similar. We train a model (perhaps a 3D CNN -> ResNet -> transformer) to infer PV power at time t from satellite images (multiple channels) centred on time t (and NWPs?). We could use this model to pseudo-label satellite imagery for which we have no PV power data. The end result could be a gridded dataset of PV power, across the entire temporal and spatial extent of the satellite imagery. This would give us many orders of magnitude more data to train a "PV predictor" on!

(Google Draw link to this diagram)

One detail not shown above is that perhaps we could include HRV in the same CNN stack as the other satellite channels by first downsampling HRV with a 2D CNN?

Another thought not shown in the diagram is that perhaps we should provide the model with PV data from neighbouring PV systems (see the comment below).

A proposal for how to handle the extreme variability between different PV systems

The power output can vary dramatically between different, geographically close PV systems. PV systems are very heterogeneous. Differences include tilt, angle, age, shading, efficiency, soiling, whether the PV panels are oversized compared to the inverter, whether the "PV system" is made up of PV panels in more than one plane, etc.

This makes life a little tricky for the labeller. How do we create a set of labels that are representative of all PV systems?

My original idea was to cluster PV systems into, say, 16 different PV "types". And then create 16 different gridded pseudo-labelled datasets, one for each PV "type".

But I now think a much more elegant solution is to build a model architecture that reflects what happens in the physical reality: The atmosphere affects the amount of irradiance that arrives at the PV system; and the PV system transforms that irradiance into electrical power. It's a two-step, strictly serial process. We could ask our labeller to predict the irradiance just above the PV system (independent of the specifics of each PV system). And then we could have a separate little ML module which maps from irradiance to PV power, given each PV system's properties.

The problem, of course, is that we don't have measurements of the irradiance just above each PV system! Irradiance datasets do exist, but they have several orders of magnitude fewer sensors than there are PV systems, so we're ignoring irradiance datasets for now.

Instead, we can try to infer a "latent proxy for irradiance" by deliberately not telling the majority of the model anything about the specifics of each PV system (other than the PV location). Hopefully this will force the model to learn to infer something like the irradiance. It won't actually be the irradiance. But it should be a concept that is vaguely similar :slightly_smiling_face:.

During training and validation, we'll use the little "'latent proxy for irradiance' to 'PV power'" ML module to predict PV power. But, when we come to create our gridded "pseudo-labelled" dataset, we'd skip that step and save the "latent proxy for irradiance" to disk. And our "PV predictor" would directly predict this "latent proxy for irradiance", and use the pre-trained "'latent proxy for irradiance' to 'PV power'" module to get PV predictions.

Another advantage of this approach is that, in production, when new users turn up with data for a new PV system, we can start creating predictions immediately (by feeding their PV system's azimuth & tilt into the "'latent proxy for irradiance' to 'PV power'" module). And then, perhaps just an hour or so later, we can give them even better predictions just by re-training the "'latent proxy for irradiance' to 'PV power'" module using their data. We wouldn't have to retrain the entire system (although we'd probably want to retrain the entire system every week, say).
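
As a rough illustration of how cheap that per-system re-training could be, here is a minimal sketch that fits only the "'latent proxy for irradiance' to 'PV power'" module for one new PV system while everything upstream stays frozen. It assumes the latent proxy is a single scalar per timestep and that the module is just a gain and an offset fitted by ordinary least squares. Both are simplifications for illustration: the real module would presumably be a small neural network that also takes azimuth and tilt. The names `fit_pv_head` and `predict_pv`, and all the numbers, are hypothetical.

```python
from statistics import fmean


def fit_pv_head(latent, pv_power):
    """Fit pv_power ~= gain * latent + offset by ordinary least squares.

    `latent` is the frozen labeller's output per timestep (assumed scalar).
    `pv_power` is the measured power of one PV system at the same timesteps.
    """
    if len(latent) != len(pv_power) or len(latent) < 2:
        raise ValueError("need two or more paired samples")
    mean_z = fmean(latent)
    mean_p = fmean(pv_power)
    var_z = sum((z - mean_z) ** 2 for z in latent)
    if var_z == 0:
        raise ValueError("latent values are constant; cannot fit a gain")
    cov = sum((z - mean_z) * (p - mean_p) for z, p in zip(latent, pv_power))
    gain = cov / var_z
    offset = mean_p - gain * mean_z
    return gain, offset


def predict_pv(latent, gain, offset):
    """Apply the fitted head; clip at zero because PV power cannot be negative."""
    return [max(0.0, gain * z + offset) for z in latent]


# Made-up data: a system whose output is exactly 2.5x the latent proxy.
latent_history = [0.0, 0.2, 0.4, 0.8]
pv_history = [0.0, 0.5, 1.0, 2.0]
gain, offset = fit_pv_head(latent_history, pv_history)
forecast = predict_pv([0.1, 0.6], gain, offset)
```

The point of the sketch is only that the per-system part has very few parameters, so it can be refitted from an hour or so of data without touching the large shared model.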

Questions / experiments for the "labeller".

- Following the OpenAI paper, maybe the "labeller" should actually predict short sequences of PV power in each forward pass, because satellite imagery might be most useful for helping predict the shape of PV power, rather than the absolute values. It's clear that a big dark cloud will reduce PV power, but it might not be obvious from the satellite image exactly what the absolute values will be. And I guess we'd need a loss function that cares about both the MSE of each prediction in isolation, and the correlation of the predicted and actual timeseries.
- NWPs: We know that PV efficiency changes by up to 10% depending on panel temperature. So it's likely that the labeller would benefit from getting air temperature (at the Earth's surface) and wind speed. And maybe visibility, precipitation, snow depth, and other parameters. The full gridded NWPs could be fed into the model via a separate stack of 3D CNN -> ResNet. Or, alternatively, try including the NWP variables at each PV query location with each PV query.
- Can we test on the training set?! When we create the gridded pseudo-labels, can we train on the entire dataset (except a little bit held out to ensure we're not overfitting), and then run inference over the entire dataset? Normally that would be a terrible sin. But in this context, I have a hunch that it's actually a good idea to test on the training set??
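
The loss mentioned in the first question above (MSE plus a correlation term) could look something like the sketch below. This is only one possible formulation: the choice of Pearson correlation, the `1 - correlation` penalty, and the `corr_weight` parameter are all assumptions, and in practice this would be written with differentiable tensor ops rather than plain Python.

```python
import math


def pearson(pred, actual):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 if either sequence is constant (correlation is undefined).
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(pred, actual))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        return 0.0
    return cov / math.sqrt(var_p * var_a)


def shape_aware_loss(pred, actual, corr_weight=0.5):
    """MSE of each timestep plus a penalty for getting the shape wrong."""
    mse = sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred)
    return mse + corr_weight * (1.0 - pearson(pred, actual))


# Made-up PV sequence and two made-up predictions.
actual = [0.0, 1.0, 2.0, 1.0]
right_shape_wrong_level = [0.5, 1.5, 2.5, 1.5]
flat = [1.0, 1.0, 1.0, 1.0]

loss_shifted = shape_aware_loss(right_shape_wrong_level, actual)
loss_flat = shape_aware_loss(flat, actual)
```

With these toy numbers, the prediction that has the right shape but a constant offset is penalised only through the MSE term, while the flat prediction also pays the full correlation penalty.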

Predicting PV for individual sites (pre-training on the pseudo-labelled data)

(Google Draw link)

We then train a similar-architecture model to predict future PV (not future imagery) using these pseudo-labels, in a supervised fashion.

Specifically, the solar PV forecasting model would, at a minimum, receive as inputs:

History:
- Satellite images for the recent history (fed through a 3D CNN -> ResNet). Some questions:
  - Can we include high res visible imagery in the same CNN as the rest of the (lower res) satellite channels by down-sampling the HRV with one or two layers of CNN?
  - Can we use different sample periods? e.g. We'd use 15-minutely images for the last hour; and then hourly images back to 24-hours ago. That might confuse the temporal convolutions in the 3D CNN. But hopefully it'll be fine, and the 3D CNN may just learn different filters for different sample periods.
- PV power from all PV systems in the region of interest (either the "pseudo-label latent proxy for irradiance" or real PV power). Maybe use a 1D CNN to convolve over time. There is some evidence that using an RNN to pre-process the PV timeseries is helpful. Although a CNN may not handle missing PV data well. Zeroing out missing data, and then masking in the Perceiver should work. Perhaps with multiple timesteps of PV data in each query. If there is too much historical PV data to feed into the main transformer then maybe use a "hierarchical Perceiver" type approach (openclimatefix/power_perceiver#14), where a "mini Perceiver" would pre-process the PV data.
History & Future:
- NWPs of the past and future (maybe pre-processed with a separate 3D CNN -> ResNet. A ResNet CNN might be useful to pick out interesting features in the NWP, such as a weather front.). If we're only predicting for one timestep at a time, then just include future NWPs around that target timestep.
Queries:
- Query for the PV power at, say, 64 random locations in the central crop. Or maybe create "dense, gridded" PV power predictions: Predict for every satellite patch? As that might be what we need to do to forecast PV for whole regions (e.g. GSP regions)?
Encode everything with:
- Sun's azimuth & elevation
- Spatial location (just relative location? just absolute? Or both?)
- Time (just relative? just absolute? or both?)
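
To make the "different sample periods" question in the list above concrete, here is a small sketch that builds the mixed-resolution history (15-minutely for the last hour, then hourly back to 24 hours ago) and the matching relative-time values that could be used for the time encoding. The function name `history_timestamps`, its parameters, and the example date are hypothetical, and the sketch assumes the dense window is a whole number of hours.

```python
from datetime import datetime, timedelta


def history_timestamps(t0, dense_minutes=60, dense_step=15, total_hours=24):
    """Return satellite timestamps up to and including `t0`, oldest first.

    The most recent `dense_minutes` are sampled every `dense_step` minutes;
    older history is sampled hourly back to `total_hours` before `t0`.
    """
    dense = [
        t0 - timedelta(minutes=m)
        for m in range(0, dense_minutes + 1, dense_step)
    ]
    first_sparse_hour = dense_minutes // 60 + 1
    sparse = [
        t0 - timedelta(hours=h)
        for h in range(first_sparse_hour, total_hours + 1)
    ]
    return sorted(dense + sparse)


t0 = datetime(2022, 6, 1, 12, 0)  # arbitrary example forecast time
stamps = history_timestamps(t0)

# Relative time in minutes (negative = in the past). Feeding this to the model
# lets it tell a 15-minute gap from a 60-minute gap between adjacent frames.
relative_minutes = [int((t - t0).total_seconds() // 60) for t in stamps]
```

Passing the relative time alongside each frame is one way to avoid relying on the 3D CNN to work out the sample period by itself.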

The training target is just to predict PV power for a bunch of PV systems in the region of interest. (Well, actually, when pre-training on areas without any PV systems, the training target would be the "latent proxy for irradiance", not PV power. When fine-tuning on actual PV power, we'd bolt on the "'latent proxy for irradiance' to 'PV power'" ML model to predict PV power).

In the past, I've been sceptical that training a model just on predicting PV power would be sufficient. But, hopefully, using pseudo-labels will give us enough training examples.

The "transformer" in the "PV predictor" would probably have to be a Perceiver (with tweaks from the Flamingo paper. See openclimatefix/power_perceiver#154) in order to handle the number of input elements.

Predicting PV power for all GSPs & national

(Google Draw link)

To predict PV power for all GSP regions and national, we could query the "PV Predictor" for PV power at many locations across the country. (Maybe these locations could be sampled from the PV capacity map of the country? Although that changes over time. But that's fine: each example could have a different sample of locations. If nothing else, that would help increase diversity during training). If (as is likely) the entire country doesn't fit into the PV Predictor's central crop, then divide the country up into a set of non-overlapping rectangles, and pass through the PV Predictor once for each rectangle.
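
The "divide the country up into a set of non-overlapping rectangles" step could be sketched as below. It works in pixel coordinates of the satellite image; `tile_region` and the example sizes are hypothetical, and each rectangle is assumed to correspond to the PV Predictor's central crop (the model would presumably still be fed a larger context area around each rectangle).

```python
def tile_region(width, height, tile_w, tile_h):
    """Split a width x height pixel region into non-overlapping rectangles.

    Returns (x0, y0, x1, y1) tuples with exclusive upper bounds. Tiles on the
    right and bottom edges are smaller when the sizes do not divide evenly.
    """
    if min(width, height, tile_w, tile_h) <= 0:
        raise ValueError("all sizes must be positive")
    tiles = []
    for y0 in range(0, height, tile_h):
        for x0 in range(0, width, tile_w):
            x1 = min(x0 + tile_w, width)
            y1 = min(y0 + tile_h, height)
            tiles.append((x0, y0, x1, y1))
    return tiles


# Made-up sizes: a 100 x 50 pixel country and a 40 x 40 pixel central crop.
tiles = tile_region(width=100, height=50, tile_w=40, tile_h=40)
covered_area = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in tiles)
```

Each tuple would then drive one forward pass of the PV Predictor, with the query locations for that pass restricted to the rectangle.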

Then we could have a separate model which combines those individual predictions into a prediction for each GSP region.

And maybe the 3D CNN -> ResNet would spatially down-sample the satellite image quite a lot, so the model can see a large area in one go. Which is useful to see clouds coming a long way off. And useful for predicting the entire country's PV power output (and GSP power) in a single forward pass.

To increase the number of training examples seen by the "GSP & National Predictor", we could pretrain the "GSP & National Predictor" using the pseudo-labels. i.e. Instead of using the "PV Predictor", we'd sample locations from the pseudo-labels (off disk). Those locations could be sampled from the PV capacity map.

Note that it's likely that we don't need to convert from "latent proxy for irradiance" to actual PV before going into the "GSP & National Predictor" because the "GSP & National Predictor" actually wants un-biased estimates of the "PV irradiance" (independent of the specifics of the PV systems that happen to be in the Passiv dataset).

Advantages:

- Allows us to make use of the whole extent of the satellite imagery without forcing the model to do the (probably overly onerous) task of predicting future satellite images pixel-wise.
- Actually a fairly simple approach (in principle!)
- Pseudo-labels should be useful for all OCF's PV prediction models.
- Pseudo-labels should be useful for people outside of OCF :slightly_smiling_face:
- At inference time, we only need the "PV predictor" model, not the "labeller" model. So the model that's used in production will be simpler (unless we want to use the "PV inference at the same timestep" model on historical satellite data, to calibrate the predictions?).

Questions:

- Would we run the "infer PV power at the same timestep" model on-the-fly during training of the "PV prediction" model? Or would we pre-compute a whole bunch of "pseudo-labels" and save them to disk? I guess the pseudo-labels are small, so maybe we'd save them to disk ahead-of-time? [Update: We're pretty confident we want to write the "latent proxy for irradiance" to disk]
- At inference time, when we're predicting for "real" PV systems, is it useful to feed the model "pseudo-labels" of estimated PV power at regular locations in the satellite image?

Some notes from the paper

The 3D CNN is really important. They show its importance in the paper.

jacobbieker commented 2 years ago

Really interesting paper! For predicting the PV for whole region, I think the downsampling is a pretty good approach, that's how MetNet models incorporate huge context areas as well.

I do really like the idea of making pseudo-labels for the entire satellite area and all the extra training data that could give us. Maybe we could also get some PV data from Germany or Italy, so that we have extra "real" PV in more geographic areas for the labelling model to learn better? Although I guess it should be able to learn enough from just UK systems.

Do the Met Office NWPs extend across Europe? Or would we need to get other NWPs for those areas?

I would think directly predicting the real PV power would be more informative for the labeller than using processed data, if we want the labeller to match the real data as closely as possible.

For the Questions: I'd think we would want to save the pseudo-labels to disk as they should be small like the current PV power data, and to ensure the labels are all the same for experiments. For inference, I think if the labelling model is good enough to provide realistic labels, then it makes sense that they would help at inference time as well, especially for GSPs with sparse PV data, or if PV data goes offline.

JackKelly commented 2 years ago

Awesome!

> Maybe we could also get some PV data from Germany, or Italy so have extra "real" PV in more geographic areas for the labelling model to learn better?

Definitely! Which would also help us satisfy our commitments for the Google.org funding!

As we get more PV data, I guess it'll become important to automatically filter out bad PV data (#180).

A few random thoughts, in no particular order:

Publish the pseudo-labels and the labelling model

Now that I think about it, several people over the years have expressed an interest in having a gridded PV power / irradiance dataset. So, if the "labelling model" works well, then we could definitely consider packaging it up as a neat stand-alone thing, and releasing the model weights & dataset of pseudo-labels. Although I'd guess most people would want a dataset of irradiance rather than PV power. So maybe we should train the "labeller" on actual measured irradiance (e.g. from MIDAS) as well as PV power? And try to train the labeller to produce irradiance and/or PV power, depending on the query. And then we could release both a gridded irradiance dataset, and a PV power dataset. (How would we handle the fact that different PV systems behave differently? Maybe release a set of PV power datasets for the most representative "types" of PV system behaviour? Or release a simple model to fine-tune the PV power "labels" to specific PV systems? Or do both?)

Although it might be quite a lot of work gathering irradiance data from around the world. (I don't think there's a nice, neat, international database of irradiance data like there is for PV power (PVOutput.org)). So maybe we should encourage users to fine-tune the model on irradiance data for their country.

@dantravers and @JamieTaylor-TUOS, do you guys have any thoughts about whether it'd be useful to release a "gridded irradiance / solar PV power" dataset across the whole geographical and temporal extent of the EUMETSAT SEVIRI RSS data? (Basically: all of Europe and North Africa). Any requests for that dataset? :slightly_smiling_face:

Train a relatively simple model on huge amounts of data

Power Perceiver, in its current guise, is actually quite complicated (a few thousand lines of PyTorch code). That's because it has three stages, which each require the data to be in a different shape.

The OpenAI minecraft paper talks about the recent success of relatively simple models (usually a transformer) trained on vast amounts of data. It's a very attractive idea. Hopefully we can do the same trick: Pre-train a relatively simple model (built using off-the-shelf components) on huge amounts of pseudo-labelled data.

jacobbieker commented 2 years ago

These people are releasing a dataset on global GHI, DNI, and something else, with ground truth from around the world: https://twitter.com/IEA_SolarPACES/status/1531884649013813249 I'm not sure of the timeline for it, but it seemed to be soonish.

I would think, just to start, creating a labeller for PV output might be simpler, because of the ground truth we have. Although both types would probably be quite useful! And yeah, I think having a set of datasets for different types of PV systems would be good, so that people can see how different types of PV systems would work in a location.

akanshasingh803 commented 2 years ago

Have a look at our paper titled, "A Moment in the Sun: Solar Nowcasting from Multispectral Satellite Data using Self-Supervised Learning" where we have trained a global model using self-supervised learning to predict future satellite observations at t+1 using abundantly available unlabeled satellite data and further used them to nowcast solar 15 minutes into the future using another local solar model that takes into account historical solar generation as well as temperature values. Here are the links- https://dl.acm.org/doi/10.1145/3538637.3538854; https://arxiv.org/abs/2112.13974

JackKelly commented 2 years ago

Awesome, thank you!

JamieTaylor-TUOS commented 2 years ago

@JackKelly I would think a gridded PV yield (generation per kWp DC capacity) dataset would be very valuable, particularly if it was international, sub-hourly resolution and included a way to account for factors like orientation, tilt and "installation quality". Irradiance would be even more valuable but much harder to validate (MIDAS pyranometers are great, but not that many locations).

JackKelly commented 2 years ago

That's really interesting to know, thank you @JamieTaylor-TUOS!

JackKelly commented 2 years ago

@jacobbieker while I've been off sick, I've been thinking a bit more about this approach, and I've updated the post at the top (including a new diagram!) The approach is still the same... hopefully I've answered a problem that was bugging me (about how to handle the difference between different PV systems!)

JackKelly commented 2 years ago

Use "leave-one-out cross validation" when evaluating our pseudo-labels. See paper "Nearest neighbour distance matching Leave-One-Out Cross-Validation for map validation". See slide 12 from Hanna Meyer's presentation.

JackKelly commented 2 years ago

For the pseudo-labelling model, maybe also give it ground-truth PV data from neighbouring PV systems. But also train with lots of examples with no PV data (because, for example, there's no PV data over the ocean). So, maybe, for a given ROI, if there's more than 1 PV system then use a random proportion (but always at least 1 PV system) as the target, and the rest as inputs. But frequently drop out all the PV inputs.
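
A minimal sketch of that sampling scheme for a single ROI is below. The function name `split_targets_and_inputs` and the 50% probability of dropping all PV inputs are assumptions (the comment above only says "frequently"). In this sketch, systems that would have been inputs are simply left unused when the inputs are dropped.

```python
import random


def split_targets_and_inputs(system_ids, rng, drop_all_inputs_prob=0.5):
    """Randomly split the PV systems in one ROI into targets and inputs.

    Always returns one or more targets. With probability
    `drop_all_inputs_prob` (an assumed value) the inputs are dropped entirely,
    to mimic regions with no PV data, such as over the ocean.
    """
    if not system_ids:
        raise ValueError("ROI contains no PV systems")
    shuffled = list(system_ids)
    rng.shuffle(shuffled)
    n_targets = rng.randint(1, len(shuffled))  # inclusive on both ends
    targets = shuffled[:n_targets]
    inputs = shuffled[n_targets:]
    if rng.random() < drop_all_inputs_prob:
        inputs = []
    return targets, inputs


rng = random.Random(0)  # seeded only to make the example repeatable
roi_systems = ["pv_a", "pv_b", "pv_c", "pv_d", "pv_e"]  # made-up IDs
targets, inputs = split_targets_and_inputs(roi_systems, rng)
```

An ROI with a single PV system always ends up with that system as the target and no PV inputs, which matches the "always at least 1 PV system as the target" rule.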

jacobbieker commented 1 year ago

@simlmx here are some of the original thoughts on doing the PV labelling/labeller

openclimatefix / pseudo-labeller