sentinel-hub / sentinel2-cloud-detector

Sentinel Hub Cloud Detector for Sentinel-2 images in Python
Creative Commons Attribution Share Alike 4.0 International

Replicate Cloud Probability in Python #44

Closed mingardiluca closed 1 year ago

mingardiluca commented 1 year ago

I am trying to compute the cloud probability of a territory in Ivory Coast in Python using s2cloudless, and the result I get differs from the one produced by the snippet of code found here. The only modifications that I made to the above script are:

    var START_DATE = ee.Date('2022-09-21');
    var END_DATE = ee.Date('2022-09-22'); // The date of interest is September 21st 2022
    var MAX_CLOUD_PROBABILITY = 101; // this is to show the entirety of the original image (clouds included)
    var region = ee.Geometry.Rectangle({coords: [-4.310415, 6.037586, -4.211883, 6.136218], geodesic: false}); // Area of interest
    Map.setCenter(-4.25, 6.1, 12);

In Python, basically the entire image is predicted to have a cloud probability of about 100%; in GEE, the probability is much lower for some areas of the image. My understanding is that GEE uses the s2cloudless algorithm to compute the probabilities in 'COPERNICUS/S2_CLOUD_PROBABILITY', so I don't understand why the result would be different when computed in Python.

Thank you

Files in the zip file:

experiment.zip

batic commented 1 year ago

Hi @mingardiluca

Your question would probably best be asked at the GEE forums, but I'll have a look and try to answer in the next few days.

batic commented 1 year ago

I've tried rerunning this using sh-py (using the bbox from your tiffs and requesting dates 2022-09-21..2022-09-22), and I get this:

image

Comparing your tiff data with data downloaded from SH, I see your data is skewed (see below). Did you perhaps download data using some scaling factor?

image

I'm attaching a notebook I've used; hope it helps. s2cloudless_gee.ipynb.zip

batic commented 1 year ago

Just to add: if the bands that are input to s2cloudless are scaled, then your results are expected. The input to s2cloudless should be raw data from Sentinel-2 L1C.

mingardiluca commented 1 year ago

Hi @batic! First of all, thank you very much for your quick answer.

I've been investigating what you highlighted by inspecting how I get the data at the beginning of my pipeline. I don't get the data from SentinelHubInputTask; I retrieve it from Google Cloud (in this case here). I found online that after January 25th 2022 the band values have been shifted by 1000 (hence your plot of distributions), according to this and this: "After 2022-01-25, Sentinel-2 scenes with PROCESSING_BASELINE '04.00' or above have their DN (value) range shifted by 1000. The HARMONIZED collection shifts data in newer scenes to be in the same range as in older scenes.". Apparently this shift is taken care of directly by SentinelHub and GEE, but I have to correct it manually in order to get the same results as you (image below). This is my output when subtracting 1000 from each band:

mine

This is a screenshot from the notebook you sent over

yours

The results are very similar, and in line with what I saw in GEE. Does it make sense to proceed as I explained (for data acquired after 2022-01-25, subtract 1000 from each band), or is something additional needed?

Thank you very much
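The note quoted above ties the shift to PROCESSING_BASELINE '04.00' or above, not only to the acquisition date, so one way to decide whether a scene needs the correction is to check the baseline string from the scene metadata. A minimal sketch, assuming the baseline is available as a string such as '04.00'; the function name `needs_offset_removal` is hypothetical and not part of s2cloudless or any other library:

```python
def needs_offset_removal(processing_baseline: str) -> bool:
    """Return True if a scene's DN values are expected to carry the +1000 shift.

    Assumes `processing_baseline` looks like '04.00' (major.minor), as in the
    note quoted in this thread. Hypothetical helper, for illustration only.
    """
    major, minor = (int(part) for part in processing_baseline.split("."))
    # Compare numerically so that e.g. '10.00' would not sort before '04.00'.
    return (major, minor) >= (4, 0)


print(needs_offset_removal("03.01"))
print(needs_offset_removal("04.00"))
```

How the baseline string is read from the product metadata depends on where the data comes from and is not shown here.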

batic commented 1 year ago

The s2cloudless model was trained on data from before ESA's change in processing baseline, so the correct approach is to make the input to the model exactly as it was before. The Sentinel Hub service does that for you (see also this post). I am not sure how you need to apply this in GEE, but your thinking is correct.

In addition (again, see the post above), you should clamp the negative values to 0, as the model never saw negative values as input.
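The two steps discussed in this thread (subtract the offset, then clamp negatives to 0) can be sketched with NumPy as follows. This is an illustration under assumptions, not the Sentinel Hub implementation: the function name `remove_dn_offset` and the constant `DN_OFFSET` are made up for the example, the bands are assumed to be stored as unsigned 16-bit digital numbers, and any further scaling your pipeline applies before calling s2cloudless is not shown.

```python
import numpy as np

DN_OFFSET = 1000  # shift described in this thread for baseline 04.00 and above


def remove_dn_offset(dn: np.ndarray, offset: int = DN_OFFSET) -> np.ndarray:
    """Subtract the DN offset and clamp negative results to 0.

    Hypothetical helper. The cast to a signed type matters: subtracting
    directly from uint16 data would wrap around instead of going negative.
    """
    shifted = dn.astype(np.int32) - offset
    return np.clip(shifted, 0, None).astype(np.uint16)


# Toy 2x2 band with values below, at, and above the offset.
raw = np.array([[900, 1000], [1500, 11000]], dtype=np.uint16)
harmonized = remove_dn_offset(raw)
print(harmonized)
```

Values below the offset (900 here) end up at 0 rather than wrapping to a large positive number, which is the clamping behaviour described above.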

If that answers your questions, please close the issue.