ivanzvonkov opened this issue 1 year ago
@cnakalembe based on our discussion, let me know if you have any suggestions / modifications.
Sounds like a good study! Not sure if you already discussed who would work on this, but it could be a good-first-issue.
I think it would be good to circle back on the work that has been done on generating dataset reports and how we can make these easily accessible/usable (including the intercomparison reports).
We have not discussed who would work on this yet.
> I think it would be good to circle back on the work that has been done on generating dataset reports and how we can make these easily accessible/usable (including the intercomparison reports).
Agreed, I think it would be helpful to narrow down the target audience for this. In my view, the purpose of this issue is to help us (cropland map producers) make better decisions about gathering future evaluation data, doing corrective labeling, and writing disclaimers for published cropland maps. The final deliverable of the potential solution is therefore a wandb metric associated with each model and accessible to us.
Who would you say is the target audience for dataset reports? @hannah-rae
Context
To create a cropland map: 1) a model is trained on a labeled training dataset and evaluated on a labeled evaluation set, and 2) the trained model then makes predictions across all the data for an area of interest (AOI).
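The two-step workflow above can be sketched as follows. This is a minimal illustration with synthetic features and a generic classifier, not the actual crop-mask pipeline (which uses pytorch-lightning models on satellite time series):

```python
# Minimal sketch of the train -> evaluate -> predict-over-AOI workflow.
# Data and model are placeholders, not the crop-mask pipeline itself.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# 1) Train on a labeled training set A, evaluate on a labeled evaluation set B
X_train, y_train = rng.normal(size=(200, 12)), rng.integers(0, 2, 200)
X_eval, y_eval = rng.normal(size=(50, 12)), rng.integers(0, 2, 50)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
eval_accuracy = model.score(X_eval, y_eval)

# 2) Predict across every coordinate in the area of interest (AOI) C
X_aoi = rng.normal(size=(1000, 12))
cropland_map = model.predict(X_aoi)  # one crop/non-crop prediction per point
```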
Is the cropland map good?
Issue 1: Understanding performance on the evaluation dataset
Currently we evaluate a trained model by measuring the F1 score over an evaluation dataset B. This metric helps us understand how well the model predicts crops overall; however, it does not tell us much about what sorts of errors the model may be making.
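To illustrate why a single F1 score is not enough, the small example below (with illustrative labels, 1 = crop, 0 = non-crop) shows how a confusion matrix separates commission errors from omission errors that the F1 score alone collapses into one number:

```python
# A single F1 score hides the *kind* of errors the model makes;
# the confusion matrix separates false positives from false negatives.
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]  # one omission (fn) and one commission (fp)

f1 = f1_score(y_true, y_pred)  # 0.75: summarizes crop-class skill in one number
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# fp counts non-crop points mapped as crop (commission error);
# fn counts crop points mapped as non-crop (omission error).
```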
Issue 2: Evaluation performance translating to map quality
The score on the evaluation dataset B only matters if the distribution of B is similar to that of the area of interest C. Currently we:
These points help shed some light on the similarity between B and C, and thereby on how well the metric translates to map quality. However, is it possible to have more confidence that a good metric translates to a high-quality map?
Potential Solution:
We can use agro-ecological zones to 1) better understand performance on the evaluation dataset, and 2) better understand how performance translates to map quality, by measuring model performance within each agro-ecological zone represented in the evaluation dataset.
From FAO:
Understanding performance in each zone will be especially relevant for areas of interest that span many agro-ecological zones, such as Uganda (#254).
This additional understanding will help inform how we gather future evaluation data, corrective labeling, and disclaimers that we can add to published cropland maps.
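The proposed per-zone breakdown could look like the sketch below: group evaluation points by their agro-ecological zone and compute the F1 score within each zone. Column names and zone labels are hypothetical:

```python
# Sketch: per-agro-ecological-zone F1 scores over the evaluation set.
# Column names and zone labels are hypothetical placeholders.
import pandas as pd
from sklearn.metrics import f1_score

eval_df = pd.DataFrame({
    "aez": ["humid", "humid", "humid", "arid", "arid", "arid"],
    "y_true": [1, 1, 0, 1, 0, 0],
    "y_pred": [1, 1, 0, 0, 1, 0],
})

per_zone_f1 = {
    zone: f1_score(group["y_true"], group["y_pred"], zero_division=0)
    for zone, group in eval_df.groupby("aez")
}
# A strong overall score can mask a zone where the model fails entirely.
```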
Potential implementation
1. Record the agro-ecological zone of each evaluation point when available. This can be implemented by adding a dataset of agro-ecological zones for a particular region and using that dataset to determine the agro-ecological zone of every coordinate in a `LabeledDataset`, generating an additional `agro-ecological` column: https://github.com/nasaharvest/crop-mask/blob/0cf29ff00eeecfa3385eab826fb9d2ca7654c822/datasets.py#L65
2. Use the newly generated `agro-ecological` column to record the agro-ecological distribution of each dataset in `data/reports.txt`. This can be implemented by adding an additional line that computes `value_counts()` for the `agro-ecological` column here: https://github.com/nasaharvest/crop-mask/blob/0cf29ff00eeecfa3385eab826fb9d2ca7654c822/src/labeled_dataset_custom.py#L105
3. Log a new per-class agro-ecological accuracy to wandb as a confusion matrix to better understand how well each model does in each zone. This requires a little more nuance because `pytorch-lightning` takes responsibility for some of the metric recording, but the relevant lines of code are here: https://github.com/nasaharvest/crop-mask/blob/0cf29ff00eeecfa3385eab826fb9d2ca7654c822/src/pipeline_funcs.py#L100
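The three steps above can be sketched end-to-end with plain pandas. The real implementation would go through the `LabeledDataset` machinery and a spatial join against an actual AEZ dataset; `lookup_aez()` below is a hypothetical stand-in for that query, and the wandb call is shown only as a comment:

```python
# Hedged sketch of the three implementation steps. lookup_aez() is a
# hypothetical stand-in for a point-in-polygon query against an AEZ dataset.
import pandas as pd
from sklearn.metrics import confusion_matrix

def lookup_aez(lat: float, lon: float) -> str:
    # Stand-in for a spatial join against an agro-ecological zones layer
    return "humid" if lat < 1.0 else "arid"

df = pd.DataFrame({
    "lat": [0.5, 0.7, 1.5, 1.8],
    "lon": [32.1, 32.4, 33.0, 33.2],
    "y_true": [1, 0, 1, 0],
    "y_pred": [1, 0, 0, 0],
})

# Step 1: add an agro-ecological column for every coordinate
df["agro-ecological"] = [
    lookup_aez(lat, lon) for lat, lon in zip(df["lat"], df["lon"])
]

# Step 2: record the AEZ distribution for the dataset report
aez_distribution = df["agro-ecological"].value_counts().to_dict()

# Step 3: build a per-zone confusion matrix, which could then be logged, e.g.
# wandb.log({f"confusion_{zone}": ...}) inside the pytorch-lightning hooks
per_zone_cm = {
    zone: confusion_matrix(g["y_true"], g["y_pred"], labels=[0, 1])
    for zone, g in df.groupby("agro-ecological")
}
```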