openclimatefix / pseudo-labeller

Pseudo Labeller for generating training labels for other PV generation forecasting
MIT License

Pseudo-label satellite imagery using an ML model. Then predict future PV using those labels #1

Open JackKelly opened 2 years ago

JackKelly commented 2 years ago

We've known for a little while that an ML model can do surprisingly well at inferring PV power at time t from satellite imagery at time t (and even better from a handful of satellite images, centred on time t.)

OpenAI's new paper

In OpenAI's new paper on "learning to play Minecraft with video pre-training", the authors train a model to infer actions from Minecraft videos. (Yannic Kilcher has a great video on the paper.) This model is a 3D CNN -> ResNet -> transformer. Crucially, this model sees the future, which makes it much more reliable at inferring actions. The authors use this model to pseudo-label tens of thousands of hours of Minecraft videos from YouTube. They then use these pseudo-labels to pre-train another model to actually play Minecraft, and finally this model is fine-tuned on human-labelled data.

Pseudo-labelling satellite images with a "latent embedding of irradiance".

Maybe we could do something similar. We train a model (perhaps a 3D CNN -> ResNet -> transformer) to infer PV power at time t from satellite images (multiple channels) centred on time t (and NWPs?). We could use this model to pseudo-label satellite imagery for which we have no PV power data. The end result could be a gridded dataset of PV power, across the entire temporal and spatial extent of the satellite imagery. This would give us many orders of magnitude more data to train a "PV predictor" on!

image

(Google Draw link to this diagram)

One detail not shown above is that perhaps we could include HRV in the same CNN stack as the other satellite channels by first downsampling HRV with a 2D CNN?

Another thought not shown in the diagram is that perhaps we should provide the model with PV data from neighbouring PV systems (see the comment below).

A proposal for how to handle the extreme variability between different PV systems

The power output can vary dramatically between different, geographically close PV systems. PV systems are very heterogeneous. Differences include tilt, angle, age, shading, efficiency, soiling, whether the PV panels are oversized compared to the inverter, whether the "PV system" is made up of PV panels in more than one plane, etc.

This makes life a little tricky for the labeller. How do we create a set of labels that are representative of all PV systems?

My original idea was to cluster PV systems into, say, 16 different PV "types". And then create 16 different gridded pseudo-labelled datasets, one for each PV "type".

But I now think a much more elegant solution is to build a model architecture that reflects what happens in the physical reality: The atmosphere affects the amount of irradiance that arrives at the PV system; and the PV system transforms that irradiance into electrical power. It's a two-step, strictly serial process. We could ask our labeller to predict the irradiance just above the PV system (independent of the specifics of each PV system). And then we could have a separate little ML module which maps from irradiance to PV power, given each PV system's properties.
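A minimal sketch of that two-step structure, in plain Python. Everything here is a stand-in: `PVSystem`, its fields, `labeller` and `irradiance_to_power` are hypothetical names, the "irradiance proxy" is a single number rather than a learned embedding, and the real modules would be neural networks.

```python
from dataclasses import dataclass


@dataclass
class PVSystem:
    # Hypothetical per-system properties; the real feature set is undecided.
    capacity_kw: float
    efficiency_factor: float  # lumps together tilt, shading, soiling, etc.
    inverter_limit_kw: float


def labeller(satellite_features: list[float]) -> float:
    """Stand-in for the big model: a system-independent irradiance proxy in [0, 1]."""
    mean = sum(satellite_features) / len(satellite_features)
    return min(max(mean, 0.0), 1.0)


def irradiance_to_power(irradiance_proxy: float, system: PVSystem) -> float:
    """Stand-in for the small module that knows about one specific PV system."""
    raw_kw = irradiance_proxy * system.capacity_kw * system.efficiency_factor
    # Clipping mimics PV panels that are oversized compared to the inverter.
    return min(raw_kw, system.inverter_limit_kw)


def predict_pv_power(satellite_features: list[float], system: PVSystem) -> float:
    """Strictly serial: atmosphere -> irradiance proxy -> PV power."""
    return irradiance_to_power(labeller(satellite_features), system)
```

The point of the split is that `labeller` never sees a `PVSystem`, so whatever it outputs has to be useful for every system at that location.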

The problem, of course, is we don't have measurements of the irradiance just above each PV system! Irradiance datasets do exist, but they have several orders of magnitude fewer sensors than there are PV systems, so we're ignoring irradiance datasets for now.

Instead, we can try to infer a "latent proxy for irradiance" by deliberately not telling the majority of the model anything about the specifics of each PV system (other than the PV location). Hopefully this will force the model to learn to infer something like the irradiance. It won't actually be the irradiance. But it should be a concept that is vaguely similar :slightly_smiling_face:.

During training and validation, we'll use the little "'latent proxy for irradiance' -> 'PV power'" ML module to predict PV power. But, when we come to create our gridded "pseudo-labelled" dataset, we'd skip that step and save the "latent proxy for irradiance" to disk. And our "PV predictor" would directly predict this "latent proxy for irradiance", and use the pre-trained "'latent proxy for irradiance' to 'PV power'" module to get PV predictions.

Another advantage of this approach is that, in production, when new users turn up with data for a new PV system, we can start creating predictions immediately (by feeding their PV system's azimuth & tilt into the "'latent proxy for irradiance' to 'PV power'" module). And then, perhaps just an hour or so later, we can give them even better predictions just by re-training the "'latent proxy for irradiance' to 'PV power'" module using their data. We wouldn't have to retrain the entire system (although we'd probably want to retrain the entire system every week, say).
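To illustrate how cheap re-training just the small module could be, here is a deliberately oversimplified sketch. It assumes the latent is a single number and that the module is only a per-system scale factor, fitted by least squares through the origin. The real module would take azimuth, tilt and a latent vector; `fit_scale` is a hypothetical name.

```python
def fit_scale(latents: list[float], powers: list[float]) -> float:
    """Least-squares slope through the origin, so that power ~= scale * latent."""
    numerator = sum(z * p for z, p in zip(latents, powers))
    denominator = sum(z * z for z in latents)
    if denominator == 0.0:
        raise ValueError("need at least one non-zero latent value")
    return numerator / denominator


# An hour or so of (latent, measured PV power in kW) pairs from a new user's system.
latents = [0.2, 0.4, 0.8]
powers_kw = [0.5, 1.0, 2.0]
scale = fit_scale(latents, powers_kw)

# Later predictions reuse the frozen labeller output and only the refitted scale.
predicted_kw = scale * 0.6
```

Nothing upstream of the scale factor is touched, which is why the refit can happen long before the weekly full retrain.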

Questions / experiments for the "labeller".

Predicting PV for individual sites (pre-training on the pseudo-labelled data)

image

(Google Draw link)

We then train a similar-architecture model to predict future PV (not future imagery) using these pseudo-labels, in a supervised fashion.

Specifically, the solar PV forecasting model would, at a minimum, receive as inputs:

The training target is just to predict PV power for a bunch of PV systems in the region of interest. (Well, actually, when pre-training on areas without any PV systems, the training target would be the "latent proxy for irradiance", not PV power. When fine-tuning on actual PV power, we'd bolt on the "'latent proxy for irradiance' to 'PV power'" ML model to predict PV power).

In the past, I've been sceptical that training a model just on predicting PV power would be sufficient. But, hopefully, using pseudo-labels will give us enough training examples.

The "transformer" in the "PV predictor" would probably have to be a Perceiver (with tweaks from the Flamingo paper. See openclimatefix/power_perceiver#154) in order to handle the number of input elements.

Predicting PV power for all GSPs & national

image

(Google Draw link)

To predict PV power for all GSP regions and national, we could query the "PV Predictor" for PV power at many locations across the country. (Maybe these locations could be sampled from the PV capacity map of the country? Although that changes over time. But that's fine: each example could have a different sample of locations. If nothing else, that would help increase diversity during training). If (as is likely) the entire country doesn't fit into the PV Predictor's central crop, then divide the country up into a set of non-overlapping rectangles, and pass through the PV Predictor once for each rectangle.
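A rough sketch of the "divide the country into non-overlapping rectangles" step, assuming the country is described by a pixel grid of `width` x `height` and the PV Predictor's central crop is `tile_width` x `tile_height` pixels. The function name and the pixel-grid framing are illustrative assumptions.

```python
def tile_region(
    width: int, height: int, tile_width: int, tile_height: int
) -> list[tuple[int, int, int, int]]:
    """Return non-overlapping (x0, y0, x1, y1) rectangles that cover the region.

    Tiles on the right and bottom edges are smaller if the region is not an
    exact multiple of the tile size.
    """
    if tile_width <= 0 or tile_height <= 0:
        raise ValueError("tile dimensions must be positive")
    tiles = []
    for y0 in range(0, height, tile_height):
        for x0 in range(0, width, tile_width):
            x1 = min(x0 + tile_width, width)
            y1 = min(y0 + tile_height, height)
            tiles.append((x0, y0, x1, y1))
    return tiles


# One forward pass of the PV Predictor per rectangle:
for x0, y0, x1, y1 in tile_region(width=10, height=6, tile_width=4, tile_height=4):
    pass  # crop the inputs to [y0:y1, x0:x1] and query the model here
```

With these numbers the region splits into 6 rectangles, the last of which is the 2 x 2 corner `(8, 4, 10, 6)`.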

Then we could have a separate model which combines those individual predictions into a prediction for each GSP region.

And maybe the 3D CNN -> ResNet would spatially down-sample the satellite image quite a lot, so the model can see a large area in one go. Which is useful to see clouds coming a long way off. And useful for predicting the entire country's PV power output (and GSP power) in a single forward pass.

To increase the number of training examples seen by the "GSP & National Predictor", we could pretrain the "GSP & National Predictor" using the pseudo-labels. i.e. Instead of using the "PV Predictor", we'd sample locations from the pseudo-labels (off disk). Those locations could be sampled from the PV capacity map.

Note that it's likely that we don't need to convert from "latent proxy for irradiance" to actual PV before going into the "GSP & National Predictor" because the "GSP & National Predictor" actually wants un-biased estimates of the "PV irradiance" (independent of the specifics of the PV systems that happen to be in the Passiv dataset).

Advantages:

Questions:

Some notes from the paper

jacobbieker commented 2 years ago

Really interesting paper! For predicting the PV for a whole region, I think the downsampling is a pretty good approach, that's how MetNet models incorporate huge context areas as well.

I do really like the idea of making pseudo labels for the entire satellite area and all the extra training data that could give us. Maybe we could also get some PV data from Germany, or Italy so we have extra "real" PV in more geographic areas for the labelling model to learn better? Although I guess it should be able to learn enough from just UK systems.

Do the MetOffice NWPs extend across Europe? Or would we need to get other NWPs for those areas?

I would think directly predicting the real PV power would be more informative for the labeller than using processed data, if we want the labeller to match the real data as closely as possible.

For the Questions: I'd think we would want to save the pseudo labels to disk as they should be small like the current PV power data, and to ensure the labels are all the same for experiments. For inference, I think if the labelling model is good enough to provide realistic labels, then it makes sense that that would help at inference time as well, especially for GSPs with sparse PV data, or if PV data goes offline.

JackKelly commented 2 years ago

Awesome!

> Maybe we could also get some PV data from Germany, or Italy so have extra "real" PV in more geographic areas for the labelling model to learn better?

Definitely! Which would also help us satisfy our commitments for the Google.org funding!

As we get more PV data, I guess it'll become important to automatically filter out bad PV data (#180).

A few random thoughts, in no particular order:

Publish the pseudo-labels and the labelling model

Now that I think about it, several people over the years have expressed an interest in having a gridded PV power / irradiance dataset. So, if the "labelling model" works well, then we could definitely consider packaging it up as a neat stand-alone thing, and releasing the model weights & dataset of pseudo-labels. Although I'd guess most people would want a dataset of irradiance rather than PV power. So maybe we should train the "labeller" on actual measured irradiance (e.g. from MIDAS) as well as PV power? And try to train the labeller to produce irradiance and/or PV power, depending on the query. And then we could release both a gridded irradiance dataset, and a PV power dataset. (How would we handle the fact that different PV systems behave differently? Maybe release a set of PV power datasets for the most representative "types" of PV system behaviour? Or release a simple model to fine-tune the PV power "labels" to specific PV systems? Or do both?)

Although it might be quite a lot of work gathering irradiance data from around the world. (I don't think there's a nice, neat, international database of irradiance data like there is for PV power (PVOutput.org)). So maybe we should encourage users to fine-tune the model on irradiance data for their country.

@dantravers and @JamieTaylor-TUOS, do you guys have any thoughts about whether it'd be useful to release a "gridded irradiance / solar PV power" dataset across the whole geographical and temporal extent of the EUMETSAT SEVIRI RSS data? (Basically: all of Europe and North Africa). Any requests for that dataset? :slightly_smiling_face:

Train a relatively simple model on huge amounts of data

Power Perceiver, in its current guise, is actually quite complicated (a few thousand lines of PyTorch code). That's because it has three stages, which each require the data to be in a different shape.

The OpenAI Minecraft paper talks about the recent success of relatively simple models (usually a transformer) trained on vast amounts of data. It's a very attractive idea. Hopefully we can do the same trick: Pre-train a relatively simple model (built using off-the-shelf components) on huge amounts of pseudo-labelled data.

jacobbieker commented 2 years ago

These people are releasing a dataset on global GHI, DNI, and something else with ground truth set around the world: https://twitter.com/IEA_SolarPACES/status/1531884649013813249 not sure the timeline for it, but seemed to be soonish.

I would think, just to start, creating a labeller for PV output might be simpler, because of the ground truth we have. Although both types would probably be quite useful! And yeah, I think having a set of datasets for different types of PV systems would be good, so that people can see how different types of PV systems would work in a location.

akanshasingh803 commented 2 years ago

Have a look at our paper, "A Moment in the Sun: Solar Nowcasting from Multispectral Satellite Data using Self-Supervised Learning". We trained a global model using self-supervised learning to predict future satellite observations at t+1 from abundantly available unlabelled satellite data. We then used those predictions to nowcast solar generation 15 minutes into the future with a separate local solar model that takes into account historical solar generation as well as temperature values. Here are the links: https://dl.acm.org/doi/10.1145/3538637.3538854; https://arxiv.org/abs/2112.13974

JackKelly commented 2 years ago

Awesome, thank you!

JamieTaylor-TUOS commented 1 year ago

@JackKelly I would think a gridded PV yield (generation per kWp DC capacity) dataset would be very valuable, particularly if it was international, sub-hourly resolution and included a way to account for factors like orientation, tilt and "installation quality". Irradiance would be even more valuable but much harder to validate (MIDAS pyranometers are great, but not that many locations).
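For reference, "PV yield" as defined here is just generation normalised by DC capacity, which makes systems of different sizes comparable. A tiny sketch (the function name and units are illustrative):

```python
def pv_yield(generation_kwh: float, capacity_kwp: float) -> float:
    """Generation per kWp of DC capacity, in kWh/kWp."""
    if capacity_kwp <= 0:
        raise ValueError("capacity must be positive")
    return generation_kwh / capacity_kwp


# A 4 kWp system and an 8 kWp system under the same conditions give the same yield.
small = pv_yield(generation_kwh=3.0, capacity_kwp=4.0)
large = pv_yield(generation_kwh=6.0, capacity_kwp=8.0)
```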

JackKelly commented 1 year ago

That's really interesting to know, thank you @JamieTaylor-TUOS!

JackKelly commented 1 year ago

@jacobbieker while I've been off sick, I've been thinking a bit more about this approach, and I've updated the post at the top (including a new diagram!). The approach is still the same, but hopefully I've solved a problem that was bugging me: how to handle the differences between PV systems.

JackKelly commented 1 year ago

Use "leave-one-out cross validation" when evaluating our pseudo-labels. See paper "Nearest neighbour distance matching Leave-One-Out Cross-Validation for map validation". See slide 12 from Hanna Meyer's presentation.
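For orientation, here is what plain leave-one-out evaluation over PV sites looks like. This is the ordinary version, not the nearest-neighbour-distance-matching variant from the paper, and the "predictor" is a nearest-neighbour stand-in rather than the labeller. The site tuples and function name are made up for illustration.

```python
import math


def leave_one_out_mae(sites: list[tuple[float, float, float]]) -> float:
    """Mean absolute error when each site is held out and predicted from the rest.

    Each site is (x, y, value). The held-out value is predicted as the value
    of the nearest remaining site.
    """
    if len(sites) < 2:
        raise ValueError("need at least two sites")
    errors = []
    for i, (x, y, value) in enumerate(sites):
        others = sites[:i] + sites[i + 1:]
        nearest = min(others, key=lambda s: math.hypot(s[0] - x, s[1] - y))
        errors.append(abs(nearest[2] - value))
    return sum(errors) / len(errors)


sites = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (5.0, 0.0, 4.0)]
mae = leave_one_out_mae(sites)
```

Plain leave-one-out can look optimistic when sites are spatially clustered, because each held-out site has a close neighbour still in the training set. That is the problem the paper's distance-matching variant is meant to address.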

JackKelly commented 1 year ago

For the pseudo-labelling model, maybe also give it ground-truth PV data from neighbouring PV systems. But also train with lots of examples with no PV data (because, for example, there's no PV data over the ocean). So, maybe, for a given ROI, if there's more than 1 PV system then use a random proportion (but always at least 1 PV system) as the target, and the rest as inputs. But frequently drop out all the PV inputs.

jacobbieker commented 1 year ago

@simlmx here are some of the original thoughts for doing the PV labelling/labeller