Replicate Cloud Probability in Python

mingardiluca commented 1 year ago

I am trying to compute the cloud probability for an area in Ivory Coast in Python using s2cloudless, and the result differs from the one I get with the snippet of code found here. The only modifications that I made to that script are:

    var START_DATE = ee.Date('2022-09-21');
    var END_DATE = ee.Date('2022-09-22');  // the date of interest is September 21st, 2022
    var MAX_CLOUD_PROBABILITY = 101;  // shows the entirety of the original image (clouds included)
    var region = ee.Geometry.Rectangle({coords: [-4.310415, 6.037586, -4.211883, 6.136218], geodesic: false});  // area of interest
    Map.setCenter(-4.25, 6.1, 12);

In Python, almost the entire image is predicted to have a cloud probability of about 100%; in GEE, the probability is much lower for some areas of the image. My understanding is that GEE uses the s2cloudless algorithm for the probabilities in 'COPERNICUS/S2_CLOUD_PROBABILITY', so I don't understand why the result would be different when computed in Python.

Thank you

Files in the zip file:

tiffs is the folder containing the tiffs that I downloaded in Python from the Sentinel API
code.py is a snippet of code to replicate my results
python.png is the image resulting from the tiffs, collected in python
GEE_thresh_101.png is a screenshot of GEE when MAX_CLOUD_PROBABILITY is set to 101
GEE_thresh_30.png is a screenshot of GEE when MAX_CLOUD_PROBABILITY is set to 30 (you can see that there are some parts of the image that haven't been removed, meaning that the cloud probability is less than 30%)
image.pdf has on the left the python image and on the right the cloud probability, computed in python

experiment.zip

batic commented 1 year ago

Hi @mingardiluca

Your question would probably best be asked at the GEE forums, but I'll have a look and try to answer in the next few days.

batic commented 1 year ago

I've tried rerunning this using sh-py (using the bbox from your tiffs and requesting dates 2022-09-21..2022-09-22), and I get this:

Comparing your tiff data with data downloaded from SH, I see your data is skewed (see below). Did you perhaps download data using some scaling factor?

I'm attaching a notebook I've used; hope it helps. s2cloudless_gee.ipynb.zip

batic commented 1 year ago

Just to add: if the bands that are input to s2cloudless are scaled, then your results are expected. The input to s2cloudless should be raw data from Sentinel-2 L1C.

mingardiluca commented 1 year ago

Hi @batic! First of all, thank you very much for your quick answer.

I've been investigating what you highlighted by inspecting how I get the data at the beginning of my pipeline. I don't get the data from SentinelHubInputTask; I retrieve it from Google Cloud, in this case here. I found online that after January 25th, 2022, the bands have been shifted by 1000 (hence your plot of distributions), according to this and this: "After 2022-01-25, Sentinel-2 scenes with PROCESSING_BASELINE '04.00' or above have their DN (value) range shifted by 1000. The HARMONIZED collection shifts data in newer scenes to be in the same range as in older scenes." Apparently this shift is handled directly by SentinelHub and GEE, but I have to correct it manually in order to get the same results as you (image below). This is my output when subtracting 1000 from each band:

This is a screenshot from the notebook you sent over

The results are very similar, and in line with what I saw in GEE. Does it make sense to proceed as I explained (for data acquired after 2022-01-25, subtract 1000 from each band), or is something additional needed?
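The rule quoted above keys the shift on the processing baseline rather than on the date alone. A minimal sketch of that decision follows, assuming the baseline is available as a string such as '04.00' taken from the product metadata; the helper name `baseline_offset` is hypothetical and not part of s2cloudless or any Sentinel API.

```python
def baseline_offset(processing_baseline: str) -> int:
    """Return the DN offset to subtract for a PROCESSING_BASELINE string.

    Hypothetical helper: follows the quoted rule that scenes with
    baseline '04.00' or above have their DN range shifted by 1000.
    """
    # Compare as a (major, minor) tuple so that '04.00' < '05.09' < '05.10'.
    major, minor = (int(part) for part in processing_baseline.split("."))
    return 1000 if (major, minor) >= (4, 0) else 0


print(baseline_offset("03.01"))  # 0
print(baseline_offset("04.00"))  # 1000
```

Keying on the baseline instead of the acquisition date also covers older scenes that were reprocessed with a newer baseline, if any such scenes are in the bucket.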

Thank you very much

batic commented 1 year ago

The s2cloudless model was trained on data from before ESA's change in processing baseline, so the correct approach is to make the input to the model look exactly as it did before. The Sentinel Hub service does that for you (see also this post). I am not sure how you need to apply this in GEE, but your thinking is correct.

In addition (again, see the post above), you should clamp the negative values to 0, as the model never saw negative values as input.
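The two steps discussed here (subtract the offset, then clamp negatives to 0) can be sketched with NumPy as below. This is only an illustration under stated assumptions: the function name `harmonize_dn` is hypothetical, the band array is assumed to hold raw uint16 digital numbers, and the final division by 10000 assumes the usual Sentinel-2 DN-to-reflectance scaling, which should be checked against the s2cloudless documentation before feeding the result to the detector.

```python
import numpy as np

DN_OFFSET = 1000        # shift described in this thread for baseline 04.00+
QUANTIFICATION = 10000  # assumption: usual Sentinel-2 DN-to-reflectance divisor


def harmonize_dn(dn: np.ndarray, offset: int = DN_OFFSET) -> np.ndarray:
    """Subtract the baseline offset and clamp negative values to 0."""
    # Cast to a signed type first; subtracting on uint16 would wrap around.
    shifted = dn.astype(np.int32) - offset
    return np.clip(shifted, 0, None).astype(np.uint16)


dn = np.array([[900, 1000], [1500, 11000]], dtype=np.uint16)
harmonized = harmonize_dn(dn)
print(harmonized.tolist())  # [[0, 0], [500, 10000]]

# Assumed scaling to reflectance before passing the bands to the detector.
reflectance = harmonized / QUANTIFICATION
```

The cast to a signed integer type matters: with uint16 input, a value below the offset would otherwise wrap to a large positive number instead of being clamped to 0.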

If that answers your questions, please close the issue.

sentinel-hub / sentinel2-cloud-detector

Replicate Cloud Probability in Python #44