Notebook to format Senegal and Tigray sets for dataperf

nasaharvest / crop-mask

End-to-end workflow for generating high resolution cropland maps

Apache License 2.0


Notebook to format Senegal and Tigray sets for dataperf #365

Closed hannah-rae closed 5 months ago

hannah-rae commented 5 months ago

This notebook prepares the Senegal and Tigray datasets for the DataPerf challenge. Namely, we want them to:

be 12-month time series, since that's what we have in CropHarvest, and
have unanimous agreement between labelers, so the labels are high confidence

Questions to discuss @gabrieltseng:

Currently I am planning to write the resulting dataframe to a CSV file for each dataset, with columns id, label, and eo_data (numpy array as a string). Do you think there is any other info we need to include?
The Senegal set is extremely imbalanced (1235 noncrop, 105 crop). It might make sense to balance this a bit more... thoughts?
Same question for Tigray 2020, though the imbalance is not quite as bad (736 noncrop, 291 crop).
@ivanzvonkov I noticed a lot of points with nan data instead of the eo_data array (158 for Senegal, 168 for Tigray 2021, 173 for Tigray 2020). Is this typical?
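As a minimal sketch of the CSV format described above, the snippet below writes a toy dataframe with id, label, and eo_data columns and reads it back. The toy data, the array shape (12 timesteps by 2 bands), and the choice of JSON for the string encoding are assumptions for illustration, not what the notebook necessarily does. JSON is used here because `str()` on a large numpy array abbreviates it with `...`, which cannot be parsed back.

```python
import io
import json

import numpy as np
import pandas as pd

# Hypothetical toy data: 3 points, each a (12 timesteps x 2 bands) array.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": [0, 1, 2],
    "label": [1, 0, 0],
    "eo_data": [rng.random((12, 2)) for _ in range(3)],
})

# Serialize each array as a JSON string so it survives the CSV round trip.
out = df.copy()
out["eo_data"] = out["eo_data"].apply(lambda a: json.dumps(a.tolist()))

# An in-memory buffer stands in for the per-dataset CSV file.
buffer = io.StringIO()
out.to_csv(buffer, index=False)

# Reading back: parse the string column into numpy arrays again.
buffer.seek(0)
loaded = pd.read_csv(buffer)
loaded["eo_data"] = loaded["eo_data"].apply(lambda s: np.array(json.loads(s)))

# Per-class counts make any imbalance explicit.
counts = loaded["label"].value_counts().to_dict()
```

Any consumer of the CSV would need the same decoding step, so it may be worth documenting the encoding alongside the files.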


ivanzvonkov commented 5 months ago

I noticed a lot of points with nan data instead of the eo_data array (158 for Senegal, 168 for Tigray 2021, 173 for Tigray 2020). Is this typical?

When there is disagreement between the two CEO sets the eo_data is never fetched. I checked Senegal and Tigray 2020 and it's all just those disagreement points:
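One way to run that kind of check, as a minimal sketch: compare the rows where the two label sets disagree against the rows where eo_data is missing. The column names `label_set_1` and `label_set_2` and the toy values are hypothetical, not the actual crop-mask schema.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: one label per CEO set; eo_data is NaN where they disagree.
df = pd.DataFrame({
    "label_set_1": [1, 0, 1, 0],
    "label_set_2": [1, 0, 0, 1],
    "eo_data": ["[[0.1, 0.2]]", "[[0.3, 0.4]]", np.nan, np.nan],
})

disagree = df["label_set_1"] != df["label_set_2"]
missing = df["eo_data"].isna()

# True when the rows with missing eo_data are exactly the disagreement rows.
all_explained = bool((disagree == missing).all())

# Keeping only unanimous points also drops every row without eo_data.
unanimous = df[~disagree].reset_index(drop=True)
```

If `all_explained` were False, the mismatched rows would be the ones worth inspecting by hand.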

gabrieltseng commented 5 months ago

CropHarvest subsets from February to February (specifically from 6 February to 1 February).

I think it makes sense to be consistent with CropHarvest in the subsetting here?
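To make the window concrete, here is a small sketch of the date arithmetic. It assumes 12 timesteps of 30 days each, which is consistent with the 6 February to 1 February window in a non-leap year but is an assumption not stated in this thread; the function name `feb_to_feb_window` is hypothetical.

```python
from datetime import date, timedelta
from typing import Tuple

# Assumption: 12 timesteps of 30 days each (360 days in total).
NUM_TIMESTEPS = 12
DAYS_PER_TIMESTEP = 30


def feb_to_feb_window(start_year: int) -> Tuple[date, date]:
    """Return (start, end) dates for a window beginning on 6 February."""
    start = date(start_year, 2, 6)
    end = start + timedelta(days=NUM_TIMESTEPS * DAYS_PER_TIMESTEP)
    return start, end


start, end = feb_to_feb_window(2019)
```

With `start_year=2019` the window ends on 1 February 2020. Starting in a leap year such as 2020, the same arithmetic lands on 31 January 2021, so the exact end date depends on how the window is defined.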

Otherwise looks good to me!

In terms of the specific questions:

We probably want location data too? Since this might affect how we define the task bounding box.
Agreed about the balancing, although with sufficient positives this might be less of an issue? What has your experience been, @ivanzvonkov?

ivanzvonkov commented 5 months ago

Agreed about the balancing, although with sufficient positives this might be less of an issue? What has your experience been, @ivanzvonkov?

I think balancing makes sense. Depends how the set is intended to be used. Adding crop points by sampling from existing maps could also be an option.

hannah-rae commented 5 months ago

Thanks for your feedback @gabrieltseng and @ivanzvonkov. In the new commit:

Fixed the dates to be Feb-Feb to match CropHarvest
Added lat/lon columns to the output dataframe
Decided not to balance, because there are still a lot of crop points and we can use balanced metrics in our evaluation
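As an illustration of what a balanced metric buys on an imbalanced set, the sketch below computes balanced accuracy (the mean of per-class recall) for a trivial classifier that always predicts noncrop, using the Senegal class counts quoted earlier in this thread. The choice of balanced accuracy is an assumption; the thread does not say which balanced metrics will be used.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (1 = crop, 0 = noncrop).

    Assumes both classes appear in y_true; otherwise a recall is undefined.
    """
    recalls = []
    for cls in (0, 1):
        total = sum(1 for t in y_true if t == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


# Toy labels using the Senegal counts: 1235 noncrop, 105 crop.
y_true = [0] * 1235 + [1] * 105
always_noncrop = [0] * len(y_true)

plain_acc = sum(t == p for t, p in zip(y_true, always_noncrop)) / len(y_true)
bal_acc = balanced_accuracy(y_true, always_noncrop)
```

Plain accuracy rewards the trivial classifier (above 0.92 here), while balanced accuracy stays at 0.5, which is why an unbalanced test set can still give a fair comparison.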

ivanzvonkov commented 5 months ago

FYI Senegal has now been updated with points from CSE which eliminates disagreement: https://github.com/nasaharvest/crop-mask/pull/369/files

Also here's a faster way to format the datasets for future use:

Otherwise looks good