Intercomparison duplicates fix

nasaharvest / crop-mask

End-to-end workflow for generating high resolution cropland maps

Apache License 2.0

Intercomparison duplicates fix #364

Closed ivanzvonkov closed 5 months ago

ivanzvonkov commented 5 months ago

The duplicates issue comes up in the following datasets: both GFSAD datasets, Digital Earth Africa, Harvest Maps, and ESRI LULC. I have only investigated GFSAD, but I suspect the same issue affects the other datasets.

During intercomparison, points were sampled from each image in the relevant ImageCollection within the specified boundary. However, several ImageCollections contain overlapping images, so points in the overlap were sampled twice. For example, GFSAD has predictions for the same tile in both the Africa 30 m collection and the Europe, Central Asia, Russia, Middle East 30 m collection.

The current solution is to sample from a mosaic of the ImageCollection, which avoids the double sampling.
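As a rough illustration of the difference, here is a toy sketch in plain Python. It does not use Earth Engine and is not the notebook code: the dictionaries, the tile labels other than N10E30, the label values, and both helper functions are hypothetical stand-ins for images in a collection.

```python
from collections import Counter

# Hypothetical toy data: each "image" maps a tile ID to a crop (1) / non-crop (0) label.
# Tile N10E30 is present in both collections, mirroring the overlap described above.
africa = {"N00E30": 1, "N10E30": 1}
europe_middle_east = {"N10E30": 0, "N20E30": 0}
collection = [africa, europe_middle_east]


def sample_per_image(images, tiles):
    """Sample every image separately; a tile covered by two images yields two rows."""
    return [(tile, img[tile]) for img in images for tile in tiles if tile in img]


def sample_from_mosaic(images, tiles):
    """Flatten the collection first (later images overwrite earlier ones), then sample once."""
    mosaic = {}
    for img in images:
        mosaic.update(img)
    return [(tile, mosaic[tile]) for tile in tiles if tile in mosaic]


tiles = ["N00E30", "N10E30", "N20E30"]
per_image = sample_per_image(collection, tiles)
mosaicked = sample_from_mosaic(collection, tiles)

# Tiles that were sampled more than once when iterating image by image.
duplicates = [t for t, n in Counter(t for t, _ in per_image).items() if n > 1]
```

In this toy setup, per-image sampling returns the overlapping tile twice, while sampling the flattened mosaic returns each tile once. Which value survives in the overlap depends on the order of the images, which is the open question in the to-do list.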

Additional to do:
[ ] Figure out how the value at overlapping images is calculated when using the mosaic function
[ ] Rerun all intercomparison notebooks

hannah-rae commented 5 months ago

Thanks for looking into this @ivanzvonkov!

> [ ] Figure out how the value at overlapping images is calculated when using the mosaic function

The mosaic() function uses "last on top", but if the images all cover the same time period, I'm not sure what is considered last. You could alternatively use mode() or something more transparent.
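To make the two options concrete, here is a small plain-Python sketch of the compositing rules for a single pixel. It is only an analogy for Earth Engine's mosaic() and mode(), not a call into the API; the helper names and the sample values are hypothetical.

```python
from statistics import multimode


def last_on_top(values):
    """Return the last non-missing value in the stack, as a mosaic-style composite would."""
    valid = [v for v in values if v is not None]
    return valid[-1] if valid else None


def pixel_modes(values):
    """Return every most-common value; more than one entry means the mode is tied."""
    valid = [v for v in values if v is not None]
    return sorted(multimode(valid))


# Hypothetical labels for one pixel covered by two overlapping images.
stack = [1, 0]
reordered = [0, 1]

top_original = last_on_top(stack)
top_reordered = last_on_top(reordered)
tied = pixel_modes(stack)
```

With two images that disagree, "last on top" flips when the order of the stack changes, and the mode is tied between both values, so neither rule is a deliberate choice of layer. How Earth Engine itself orders the images or resolves such a tie is not covered by this sketch.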

ivanzvonkov commented 5 months ago

Rerunning intercomparison:

Zambia, ensemble still does best, F1 73% -> 76%
Senegal, ensemble still does best, F1 67% -> 66%
Togo, changed to test set only to have an objective measure of harvest-dev
Tigray and Rwanda, unable to rerun because of access issues; will push all notebooks when access is granted

ivanzvonkov commented 5 months ago

Rwanda intercomparison update: digital-earth-africa now moves up to first.

ivanzvonkov commented 5 months ago

Tigray 2020, fewer points but same order for the first three maps:

ivanzvonkov commented 5 months ago

Tigray 2021 same order for top 3

ivanzvonkov commented 5 months ago

> The mosaic() function uses "last on top" but if they are all the same time period, I'm not sure what is considered last. You could alternatively use mode() or something more transparent.

I asked the GFSAD team why they have duplicate tiles, and they responded with:

Our global cropland extent product was mapped by seven different researchers that used slightly different methods to account for differences in agriculture and satellite imagery available. Since the products are split into 10x10 tiles, there are some tiles that contain multiple products/continents.

For the example tile given, N10E30: when you are interested in Middle East use GFSAD30EUCEARUMECE; when you are interested in Africa use GFSAD30AFCE.

So I think the right way to evaluate these is to select the layer deliberately. mode() does not do that, and it introduces the complexity of dealing with 0.5 values, so for now I am going to keep mosaic(). @hannah-rae

hannah-rae commented 5 months ago

@ivanzvonkov Good to know about the different layers. I think we can eventually update this to choose the layer based on an argument for each dataset (maybe by adding a continent attribute to the eval datasets).
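A minimal sketch of what that could look like, assuming a continent string is available for each eval dataset. The two layer names come from the GFSAD team's reply above; the mapping keys, the helper function, and its error handling are hypothetical and not part of the repository.

```python
# Layer names are taken from the GFSAD reply; the continent keys are assumed for illustration.
GFSAD_LAYER_BY_CONTINENT = {
    "Africa": "GFSAD30AFCE",
    "Middle East": "GFSAD30EUCEARUMECE",
}


def select_gfsad_layer(continent: str) -> str:
    """Return the GFSAD layer to evaluate for a given continent (hypothetical helper)."""
    try:
        return GFSAD_LAYER_BY_CONTINENT[continent]
    except KeyError:
        # Fail loudly rather than silently falling back to an arbitrary layer.
        raise ValueError(f"No GFSAD layer configured for continent {continent!r}") from None


layer_for_africa = select_gfsad_layer("Africa")
```

Raising on an unknown continent keeps the layer choice explicit, which is the point of moving away from an order-dependent composite.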