reagro / carob

Aggregating agronomic data
GNU General Public License v3.0

Quality Score for each Processed Data Set #472

Closed: smkuhlani closed this issue 1 month ago

smkuhlani commented 1 month ago

Hello Carobiners,

Maybe we have discussed this before. What are your thoughts on making a quality score mandatory for each data set that is processed? The score could be based on, e.g., clarity and confidence in (1) the treatments, (2) the management and phenology time stamps, (3) the fertilizer information, and (4) the data itself. The score could be between 1 and 10, with 10 being the best and 1 the worst. This could be useful in cases where the data is used in sensitive applications, e.g. AgWise fertilizer recommendations, where one could select only data with grades of, say, 7-10. For non-sensitive applications, e.g. ML, a user could select all data. @rhijmans @egbendito @asila @cedricngakou

egbendito commented 1 month ago

This could be a good idea. For example, I would place higher trust in datasets with higher variability in the dates (planting, harvesting, etc.) than in those where all planting_date values fall on the same date across multiple locations...

For certain variables the ranges are there, but those are more for extremes. Perhaps we could calculate the distance from observations already in the database? Although I am not sure whether that is a very good "inward-looking" assessment.

For the treatments, this could be a subjective assessment during the review process (maybe)?

I guess this quality score would be part of the metadata...

rhijmans commented 1 month ago

One of the things on the to-do list I discussed with Eduardo is to generate an automatic score. That covers some of what you mention, such as the absence of required variables and their quality (e.g., is planting_date an actual date, or only a year?).

I do not think we can ask script authors to score between 1 and 10. First, it would not be easy to get a consistent score across contributors. Second, the quality of a dataset can depend on its use.

It is easy to state that a variable is missing. But what if the quality is bad in the sense that the reported values themselves are doubtful? You should generally not use bad data for ML or anything else. If we have data where the values are unreliable (and you might give a score < 3), we should probably flag these, and perhaps not even include them in the aggregate datasets. I can imagine using three categories: "reliable", "not sure", "unreliable". But we would need some concrete examples to discuss further. These are hard calls to make. Instead of worrying about all datasets, we could deal with it when we come across datasets that are clearly unreliable.
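The three categories could sit on top of a numeric score, if one is computed. A minimal sketch, assuming a 1-10 scale and hypothetical cut-offs (the function name and thresholds are illustrations, not part of Carob):

```r
# Map a numeric 1-10 quality score to the three proposed categories.
# The cut-offs (<= 3 unreliable, <= 7 not sure) are hypothetical.
reliability <- function(score) {
  cut(score, breaks = c(-Inf, 3, 7, Inf),
      labels = c("unreliable", "not sure", "reliable"))
}

reliability(c(2, 5, 9))
```

With these cut-offs, scores 2, 5, and 9 come back as "unreliable", "not sure", and "reliable" respectively.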

I have added a "notes" variable to the metadata. There you can record any concern about the data. This replaces the "NOTES:" in the files, which are otherwise not easily accessible. A Carob data user could review the notes of the datasets they want to use, in addition to the automatic quality indicators (there could be a single average score, as well as scores for the individual components that are evaluated).
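As a rough illustration of what the automatic component scores and their average could look like (the function, the required-variable list, and the checks are all hypothetical sketches, not the actual Carob implementation):

```r
# Hypothetical sketch: score a processed data set `d` (a data.frame)
# on a 0-1 scale per component, plus a single average.
quality_score <- function(d, required = c("country", "longitude", "latitude",
                                          "planting_date", "yield")) {
  # completeness: fraction of required variables that are present
  completeness <- mean(required %in% names(d))
  # date detail: fraction of planting dates that are full dates
  # ("YYYY-MM-DD") rather than only a year
  date_detail <- 0
  if ("planting_date" %in% names(d)) {
    date_detail <- mean(grepl("^\\d{4}-\\d{2}-\\d{2}$",
                              as.character(d$planting_date)))
  }
  comp <- c(completeness = completeness, date_detail = date_detail)
  c(overall = mean(comp), comp)
}
```

A data set with all required variables and full planting dates would score 1 on both components; one that only reports the planting year would drop to an overall score of 0.5 with this (illustrative) weighting.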

asila commented 1 month ago

We have datasets where the location coordinates are geocoded. While it is a good idea to enforce data completeness, geocoding may not give the exact location where the data was collected. Could this be an example of cases we would score as "not sure" or "unreliable"?

rhijmans commented 1 month ago

We should do better at documenting the uncertainty and the source of the georeferences. The uncertainty can be stored in the "geo_uncertainty" variable. There are established methods to compute this uncertainty, but we have not asked anyone to use them. The uncertainty varies by record; in some cases coordinates can be assigned with high precision (e.g. a research station). I would also note that the georeferences that come with a data set are not necessarily accurate (sometimes on purpose), and it would be useful to provide an accuracy estimate where possible.

I have now also added a logical field "geo_from_source" that indicates whether the georeference comes from the data source (TRUE) or was added by the Carob script author (FALSE). That should become a required variable.

Thank you for bringing up this variable. It is the only one where we invest a lot in enhancing the datasets --- because it is fundamental to our interpretation of agricultural data but often not included. Our enhancements should be tracked better. It also illustrates why it may not be easy to assign a score to a dataset.

rhijmans commented 1 month ago

I just came across a case where the paper states that the research location was

Borlaug Institute for South Asia (BISA)-CIMMYT, Ladhowal (30.99 °N latitude, 75.44 °E longitude, 229 m amsl), Punjab, India.

But the BISA is at 30.99, 75.735 (and that is near Ladhowal), so I did:

d$country <- "India"
d$adm1 <- "Punjab"
d$location <- "Ladhowal"
d$site <- "Borlaug Institute for South Asia"
d$latitude <- 30.992
d$longitude <- 75.735
d$geo_uncertainty <- 1000
d$elevation <- 229
d$geo_from_source <- FALSE

With geo_from_source set to FALSE because the coordinates are not what the source provided (they are better). I set the uncertainty to 1000 m: I do not know the exact location within BISA, but it is not likely to be more than 1 km from these coordinates.
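A quick sanity check of how far off the published coordinates were, using a hand-rolled haversine great-circle distance (written out here to avoid a package dependency; the 1000 m uncertainty above is a judgment call, not something computed):

```r
# Great-circle distance in meters between two lat/lon points (haversine).
haversine <- function(lat1, lon1, lat2, lon2, r = 6371000) {
  rad <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# coordinates reported in the paper vs. the actual BISA location
haversine(30.99, 75.44, 30.992, 75.735)  # about 28 km
```

So the longitude reported in the paper places the site roughly 28 km west of where BISA actually is, which is why correcting it (and recording geo_from_source = FALSE) matters.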

rhijmans commented 1 month ago

In summary: