openclimatefix / predict_pv_yield

Using optical flow & machine learning to predict PV yield
MIT License

Try predicting *just* clouds (not background) by subtracting 'non-cloudy' image from actual image #17

Open JackKelly opened 3 years ago

JackKelly commented 3 years ago

Quentin Paletta et al.'s 2021 "ECLIPSE" paper shows that it's a good idea to predict cloud masks, because these are concise representations of the sky (with the caveat that I'm afraid I've only skim-read the paper so far, so I may have misunderstood!).

Also, ideally, we'd probably prefer satellite image sequences of just the clouds, rather than the clouds plus the land. Sometimes optical flow messes up by moving the land around (!), so optical flow would probably perform better on 'pure cloud' images (with the land removed). And, for ML approaches for forecasting future satellite images, it's unfair to expect the ML model to reconstruct images of land as the cloud moves away from land, when that information may be entirely absent from the input image sequence (because the cloud has completely covered the land in the input image sequence).

On the other hand, I'm a little nervous about using binary cloud masks because clouds are so varied, and thin wispy clouds which might not be classified as 'cloud' by a binary mask can have a significant effect on irradiance.

Also, reliable cloud mask labels are hard to come by, I think? (Sure, there are algorithms for segmenting clouds, but none are perfect, right?)

So maybe we can automatically separate 'cloud' from 'land', something like this: if we had a perfect 'cloud-free' satellite image for every time of day, then we could do a pixel-wise subtraction: just_cloud_image[t] = cloud_free_image[t] - actual_image[t]. Then we could use the just_cloud_image for our downstream models.

The question then becomes: How to generate the set of cloud_free_image[t] for every time of day, and time of year? Could it be as simple as taking the median pixel value, per pixel, at a given time of day, over the last month or so of imagery? e.g. to get the cloud_free_image for 12:00, look at all the images taken at 12:00 over the last month, and take the median pixel value?
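
A minimal sketch of both ideas in plain NumPy (my own illustration, not code from this repo; images_at_noon is a hypothetical array holding the last month of 12:00 images):

```python
import numpy as np

# Hypothetical input: ~30 days of satellite images all taken at 12:00,
# shaped (n_days, height, width).
images_at_noon = np.random.rand(30, 256, 256)  # placeholder data

# Per-pixel median over the month. Clouds are (hopefully) transient at any
# given pixel, so the median should approximate the cloud-free background.
cloud_free_image = np.median(images_at_noon, axis=0)

# Pixel-wise subtraction for one timestep, using the sign convention above.
actual_image = images_at_noon[-1]
just_cloud_image = cloud_free_image - actual_image
```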

Or maybe the median is the wrong statistic: Maybe instead we can assume the histogram of pixel values at a given time of day, over the last month, would have two peaks: one corresponding to 'cloud free', and the other corresponding to 'cloudy'. And we want the mode of the 'cloud free' peak, which I guess will always be the less-bright peak?
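
If the median turns out to be the wrong statistic, the two-peak idea could be sketched per pixel roughly like this (again just an illustration; the bin count and the "darker half of the range" heuristic are my assumptions):

```python
import numpy as np

def cloud_free_value(pixel_history, n_bins=32):
    """Estimate the cloud-free value of one pixel from ~a month of samples,
    assuming the histogram has a darker 'cloud free' peak and a brighter
    'cloudy' peak. Returns the centre of the most-populated bin in the
    darker half of the observed range."""
    counts, edges = np.histogram(pixel_history, bins=n_bins)
    centres = (edges[:-1] + edges[1:]) / 2
    midpoint = (pixel_history.min() + pixel_history.max()) / 2
    counts = np.where(centres < midpoint, counts, 0)  # keep the less-bright peak
    return centres[np.argmax(counts)]

# Applied to a (n_days, height, width) stack, pixel by pixel (slow but simple):
# cloud_free_image = np.apply_along_axis(cloud_free_value, 0, images_at_noon)
```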

(Also see the Twitter discussion about this issue.)

TODO:

danstowell commented 3 years ago

Using masks (whether binary or soft masks) has a strong recent history, e.g. in audio source separation by deep learning, because they're stable targets for prediction (no need for the dataset to cover all the exact pixel values, etc.). I'd suggest that using cloud masks is likely to be a reliable route. If you don't have labelled data for that, you could consider synthetically generating data.

I don't have the experience to know whether the errors induced by existing cloud-segmentation algorithms are anything to worry about.

Your just_cloud_image approach is interesting, but raises lots of questions. Clouds don't really affect pixels in an additive way, more a masking way, so modelling it as a delta seems unlikely to create a stable target for inference. (parkinglot - heavycloud is very different from footballpitch - heavycloud, even if the observed heavycloud is identical in both cases!) Worth a pilot test of some sort, sure.
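
To make that concrete, here's a toy numerical illustration (made-up numbers and a deliberately crude "blending" cloud model): the additive delta depends on what is underneath, while the parameters describing the cloud itself do not.

```python
# Toy numbers, purely illustrative.
cloud_reflectance = 0.9                  # a bright, heavy cloud
transmittance = 0.1                      # how much of the surface still shows through
parkinglot, footballpitch = 0.6, 0.2     # clear-sky surface brightness

def observed(surface):
    # Crude masking/blending model: the cloud partly replaces the surface signal.
    return transmittance * surface + (1 - transmittance) * cloud_reflectance

# The additive 'delta' target changes with the background...
delta_parkinglot = observed(parkinglot) - parkinglot            # ~0.27
delta_footballpitch = observed(footballpitch) - footballpitch   # ~0.63

# ...even though (transmittance, cloud_reflectance) - i.e. the cloud itself -
# is identical in both cases.
```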

JackKelly commented 3 years ago

Hey @danstowell , thanks loads for the reply! That's an excellent point that footballpitch - heavycloud will give a different result to parkinglot - heavycloud; I must admit I hadn't thought about that!

Inspired by your point...

I guess the perfect 'cloud segmentation' would segment & classify cloud based on how much sunlight it lets through (its 'optical depth'). There are algorithms for this (e.g. EUMETSAT / CMSAF's cloud optical thickness product). But maybe we could improve on these using ML segmentation algorithms, and calibrate the segmentation classification at inference-time using realtime PV data from neighbouring PV systems ("we know exactly how much sunlight that cloud is letting through because we have realtime PV data from a PV system under that cloud right now!"). So we'd end up with a model that segments out the clouds in satellite data, and maybe estimates the optical depth of each segment (or of each pixel?). Then, we could 'just' use optical flow to move those segments around (although optical flow doesn't understand that clouds change size over time!)... hmmm...
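
The calibration step I have in mind could be as simple as the following (hypothetical function and variable names, and it ignores PV system losses, panel orientation, etc.):

```python
def estimate_cloud_transmittance(pv_actual_kw, pv_clear_sky_kw):
    """If a PV system sits under a cloud segment right now, the ratio of its
    actual output to its modelled clear-sky output is a direct estimate of
    how much sunlight that cloud lets through."""
    return min(pv_actual_kw / pv_clear_sky_kw, 1.0)

# e.g. a system expected to produce 4 kW under clear sky, producing 1 kW now:
# the transmittance of the cloud segment over it is roughly 0.25.
```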

leonoverweel commented 3 years ago

Thanks for starting this thread @JackKelly!

> On the other hand, I'm a little nervous about using binary cloud masks because clouds are so varied, and thin wispy clouds which might not be classified as 'cloud' by a binary mask can have a significant effect on irradiance.

> There are algorithms for this (e.g. EUMETSAT / CMSAF's cloud optical thickness product).

Another similar one here, FYI: http://climexp.knmi.nl/select.cgi?id=someone@somewhere&field=cru4_cld_10_old. We've used this before but didn't find much additional signal in it over other raw data in solar forecasting.

> But maybe we could improve on these using ML segmentation algorithms, and calibrate the segmentation classification at inference-time using realtime PV data from neighbouring PV systems ("we know exactly how much sunlight that cloud is letting through because we have realtime PV data from a PV system under that cloud right now!").

My intuition would be to just use this data directly in forecasts instead of using it to forecast the intermediate/latent variable of cloud coverage. "We know park A is at 50% of expected capacity and park B is at 70% [both probably because of clouds], which is a signal that we may want to decrease park C [for which we don't have real-time data and which is between A and B] to 60% as well." I could be wrong though!
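
For what it's worth, a minimal version of that "use the neighbours directly" idea might be inverse-distance weighting of the neighbours' clear-sky indices (all names and coordinates below are made up for illustration):

```python
import numpy as np

def estimate_clear_sky_index(target_xy, neighbour_xy, neighbour_csi):
    """Nowcast for a park without realtime data: inverse-distance-weighted
    average of neighbouring parks' clear-sky indices (actual output as a
    fraction of expected clear-sky output)."""
    neighbour_xy = np.asarray(neighbour_xy, dtype=float)
    dists = np.linalg.norm(neighbour_xy - np.asarray(target_xy, dtype=float), axis=1)
    weights = 1.0 / np.maximum(dists, 1e-6)
    return float(np.average(neighbour_csi, weights=weights))

# Park A at 50%, park B at 70%, park C halfway between them -> 60%:
print(estimate_clear_sky_index((0.5, 0.0), [(0.0, 0.0), (1.0, 0.0)], [0.5, 0.7]))
```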

JackKelly commented 3 years ago

> We've used this before but didn't find much additional signal in it over other raw data in solar forecasting.

That's super-interesting!

(BTW, @leonoverweel & @danstowell, I should intro you to each other, especially as you're both in the Netherlands :) )

On Dan's point:

> Clouds don't really affect pixels in an additive way, more a masking way

We might be saved by the very low spatial resolution of the EUMETSAT SEVIRI data (pixels of roughly 5 km × 5 km), so the land does look reasonably samey.

leonoverweel commented 3 years ago

@danstowell and I have Twitter-met actually! When it's possible again though, we should organize a meetup sometime - Dexter would be happy to host.

maxaragon commented 3 years ago

Hi Jack, your idea is basically to develop a clear sky dictionary (CSD); for an automated CSD, check out this paper: https://www.sciencedirect.com/science/article/pii/S0038092X20311117. However, this approach will become challenging due to the diversity of land cover.

On the other hand, CNNs have been shown to outperform CSDs for cloud segmentation; check out this paper: https://www.sciencedirect.com/science/article/pii/S0038092X2030147X

My suggestion is to create a crowdsourced dataset of cloud masks using a semi-automatic image segmentation tool, and then train a CNN on it for prediction.

Last year I saw a citizen science project for cloud masks: https://www.rmets.org/metmatters/cloudcatcher-citizen-science-project

I am happy to collaborate.

tcapelle commented 3 years ago

Man, it is time to start annotating. I am currently using a Unet to segment clouds from our sky imager, and using https://github.com/Britefury/django-labeller to make the segmentation masks. With only 400 images the Unet outperforms anything else we have tried. You could use your technique and then refine the masks with the tool.

JackKelly commented 3 years ago

Oooh, this is all super-helpful!

I definitely like the idea of a citizen science project to label cloud masks; and that sounds like something Open Climate Fix could / should help with :) Let me talk to the rest of the Open Climate Fix folks about the idea....

JackKelly commented 3 years ago

@tcapelle that's really interesting to know that Unet is performing well for cloud masks. Just curious: Have you tried the Axial DeepLab approach for segmenting clouds? (I know I'm a bit of a fanboy for self-attention!)

Also, thanks loads for letting me know about django-labeller - looks extremely useful!

simonpf commented 3 years ago

If you want to cook your own satellite-based cloud detection it may be worth considering the DARDAR products which are based on CloudSat (radar) and Calipso (lidar) and probably are the best source for cloud detection (and other cloud properties) from space. These obs can be co-located with MODIS (easy because they fly in constellation) or geostationary. DARDAR then gives you a ground truth along a line in your input image which can be used to train a conv net with a masked loss. A Master's student used this to predict cloud ice content over Africa and it seems to work quite well.

There are a few caveats but if you are interested in cloud detection from satellites this should be an easy way to get a very large training data set.
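
A masked loss of the kind described here is straightforward in e.g. PyTorch; the sketch below assumes a binary cloud/no-cloud target, with valid_mask set to 1 only along the DARDAR ground track (shapes and names are mine, not from an existing codebase):

```python
import torch
import torch.nn.functional as F

def masked_bce_loss(pred_logits, target, valid_mask):
    """Per-pixel binary cross-entropy, evaluated only where labels exist.

    pred_logits, target, valid_mask: tensors of shape (batch, 1, H, W).
    valid_mask is 1 along the CloudSat/Calipso (DARDAR) ground track and 0
    elsewhere, so unlabelled pixels contribute nothing to the loss."""
    per_pixel = F.binary_cross_entropy_with_logits(pred_logits, target, reduction="none")
    per_pixel = per_pixel * valid_mask
    return per_pixel.sum() / valid_mask.sum().clamp(min=1)
```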

quentinpaletta commented 3 years ago

@tcapelle I also had a go with a Unet to segment sky images from just 20 labelled images, and the results were quite impressive. If you know more about clouds, you might be able to distinguish cloud types too.

@JackKelly Regarding predicting future satellite images, and more specifically a land area covered by clouds throughout the past sequence: I can imagine that optical flow gives mixed results, but an ML model would have seen the area free of clouds in quite a lot of training samples, so it might be able to reconstruct it? (At least if the images cover the same area.)

tcapelle commented 3 years ago

> @tcapelle that's really interesting to know that Unet is performing well for cloud masks. Just curious: Have you tried the Axial DeepLab approach for segmenting clouds? (I know I'm a bit of a fanboy for self-attention!)
>
> Also, thanks loads for letting me know about django-labeller - looks extremely useful!

I am currently using the torchvision deeplabv3 model with great success.
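
For reference, a minimal torchvision deeplabv3 setup for a binary cloud mask might look like this (a sketch under my own assumptions about channels and image size, not the actual training code mentioned above):

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

# One output channel for a binary cloud / no-cloud mask.
model = deeplabv3_resnet50(weights=None, num_classes=1)
model.eval()

images = torch.randn(4, 3, 512, 512)        # a batch of sky-imager frames
with torch.no_grad():
    logits = model(images)["out"]           # shape (4, 1, 512, 512)
cloud_mask = torch.sigmoid(logits) > 0.5    # thresholded binary mask
```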