magda-io / magda

A federated, open-source data catalog for all your big data and small data
https://magda.io
Apache License 2.0

Future Data Quality Metric #559

Open AlexGilleran opened 6 years ago

AlexGilleran commented 6 years ago

Split out from #518.

Right now we're using the Tim Berners-Lee 5-star data rating, which is very simple but also not very comprehensive. We could do something much more sophisticated, but what? This issue is for tracking the thinking around that.
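For reference, the current approach amounts to roughly the sketch below (illustrative only, not Magda's actual rating code; licence and format detection is simplified to string matching):

```typescript
// Illustrative sketch of a TBL 5-star rating (http://5stardata.info/en/)
// derived from a distribution's licence and format strings. Not Magda's
// actual implementation; 4 stars (using URIs to denote things) is hard
// to detect from the format string alone, so this sketch skips it.
interface Distribution {
  license?: string;
  format?: string;
}

function tblStars(dist: Distribution): number {
  const openLicence = /creative commons|cc.by|public domain|odc/i.test(dist.license ?? "");
  if (!openLicence) return 0;                                 // not open data at all
  const format = (dist.format ?? "").toLowerCase();
  if (/rdf|turtle|ttl|json.ld|sparql/.test(format)) return 5; // linked, open format
  if (/csv|json|xml|geojson|kml/.test(format)) return 3;      // structured, non-proprietary
  if (/xls|mdb|shp/.test(format)) return 2;                   // structured but proprietary
  return 1;                                                   // open licence, opaque format (e.g. PDF)
}
```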

Previous discussion:

Stephen-Gates commented 9 days ago

I wonder if the term "data quality" is misleading.

Your CSV can be a mess and you still get 3 stars http://5stardata.info/en/

Perhaps some level of validation should also contribute to the star rating https://goodtables.io

and then there is https://certificates.theodi.org/en/
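A rough sketch of that suggestion, assuming a hypothetical validationErrorRate input (e.g. errors per row from a goodtables-style check):

```typescript
// Rough sketch only: discount the openness stars when validation finds
// problems, so a messy CSV can't keep its full 3 stars.
// validationErrorRate is a hypothetical input (errors per row).
function moderatedRating(opennessStars: number, validationErrorRate: number): number {
  const penalty = Math.min(2, Math.ceil(validationErrorRate * 2)); // cap the discount at 2 stars
  return Math.max(0, opennessStars - penalty);
}
```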

tkeuneman (Member) commented a day ago

@Stephen-Gates I struggled with this wording as well. Goodtables looks like it would be great to roll into the admin/maintenance/upload processes. Will look into the certificates from the ODI in more detail.

This 5-star data rating is a good half step, though. Would you have a recommendation for a vague word?


Stephen-Gates commented 18 hours ago

@tkeuneman on data.gov.uk they call it an "openness rating", e.g. https://data.gov.uk/dataset/land-registry-monthly-price-paid-data

tobybellwood (Contributor) commented 18 hours ago

One of the original concepts for this was to create a real-world usability rating. In theory, MAGDA can tell how good a dataset is from the level of processing or visualisation it can perform on it by default. It would acknowledge that not all CSV files are usable, etc.

If a chart can be drawn, a map can be made, metadata can be attached (or inferred...), temporal or spatial extents can be derived, or good (or bad) values can be detected in column ranges, these are all de facto indicators of quality. You just need to work out how to collect these in a usable manner in the registry (one possible shape is sketched below).

...my 2c
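One way to collect those indicators might be a dedicated registry aspect, populated by whichever services attempt each operation. A hedged sketch, with field names invented for illustration (not an actual Magda aspect schema):

```typescript
// Hypothetical "usability-indicators" aspect that could be attached to a
// registry record; every field name here is illustrative only.
interface UsabilityIndicators {
  chartable: boolean;             // a chart could be drawn automatically
  mappable: boolean;              // a map could be rendered
  metadataPresent: boolean;       // metadata attached or inferred
  temporalExtentDerived: boolean; // a time range could be derived
  spatialExtentDerived: boolean;  // a bounding box could be derived
  columnRangesPlausible: boolean; // column value ranges pass sanity checks
}

// Naive aggregate: the fraction of indicators that hold, 0..1.
function usabilityScore(ind: UsabilityIndicators): number {
  const flags = Object.values(ind);
  return flags.filter(Boolean).length / flags.length;
}
```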


Stephen-Gates commented 18 hours ago

Hi @tobybellwood, what if a schema were inferred and the data then validated against it? Some stars for a schema, more stars for valid data, even more stars if a schema is contributed and the data is valid.

Not a fan of inferring a spatial or temporal extent. Happy to say there are min/max values, but they may not represent the true extent of the data.

You could profile the data and report descriptive statistics - I played with this idea at https://github.com/Stephen-Gates/data-quality-pattern/blob/master/data-quality-pattern.md
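A toy sketch of that tiered idea, inferring a minimal schema from a CSV header and validating rows against it (illustrative only; a real implementation would use Table Schema inference and a goodtables-style validator):

```typescript
// Toy schema inference and validation for the star tiers above.
interface ColumnSchema { name: string; type: "number" | "string"; }

// Infer column types from the data: "number" if every cell parses as one.
function inferSchema(rows: string[][]): ColumnSchema[] {
  if (rows.length === 0) return [];
  const [header, ...data] = rows;
  return header.map((name, i): ColumnSchema => ({
    name,
    type: data.every(r => r[i] !== undefined && !isNaN(Number(r[i])))
      ? "number" : "string",
  }));
}

// Every data row must match the schema's width and column types.
function isValid(rows: string[][], schema: ColumnSchema[]): boolean {
  return rows.slice(1).every(r =>
    r.length === schema.length &&
    schema.every((col, i) => col.type === "string" || !isNaN(Number(r[i]))));
}

// 1 star for an inferable schema, +1 if the data validates against it,
// +1 more if the schema was contributed by the publisher and the data is valid.
function schemaStars(rows: string[][], contributed?: ColumnSchema[]): number {
  const schema = contributed ?? inferSchema(rows);
  if (schema.length === 0) return 0;
  let stars = 1;
  if (isValid(rows, schema)) stars += contributed ? 2 : 1;
  return stars;
}
```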

jyucsiro commented 6 years ago

Hi, data quality ratings are quite a challenge, as there are some generic criteria and then some domain-specific criteria. The other challenge is that you want a measure/rating that is easy for people to understand (e.g. 1 star or 5 stars), but inevitably there are several concerns/criteria to consider for it to be meaningful for data providers/users.

Our team have tried to capture a range of criteria on this wiki page, based on FAIR and TBL 5-star open data, with an accompanying self-assessment implementation tool, which could probably be semi- or fully automated with a bit of work.
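A self-assessment like that could be encoded as a weighted checklist, which makes the semi-automated path concrete. A hedged sketch, with criteria and weights invented for illustration (not taken from the wiki page above):

```typescript
// Hedged sketch: a FAIR-style self-assessment as a weighted checklist.
interface Criterion {
  id: string;         // e.g. "F1: data has a persistent identifier"
  weight: number;     // relative importance of this criterion
  automated: boolean; // can this be checked without a human?
  passed?: boolean;   // filled in by an automated check or an assessor
}

// Score over the criteria that have actually been assessed, 0..1.
function assessmentScore(criteria: Criterion[]): number {
  const assessed = criteria.filter(c => c.passed !== undefined);
  const total = assessed.reduce((sum, c) => sum + c.weight, 0);
  const earned = assessed.reduce((sum, c) => sum + (c.passed ? c.weight : 0), 0);
  return total === 0 ? 0 : earned / total;
}
```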

dr-shorthair commented 6 years ago

FAIR mixes content and access concerns. FAIR is definitely not the answer to everything - the headline criteria were partly selected in order to make a cute acronym. And FAIR is primarily aimed at research data - here I summarize an initiative to make research data routinely FAIR. But most of the FAIR principles are worth considering for other applications, including government data. Note that the FAIR criteria can be applied to either or (preferably) both metadata and data.

"quality" has many facets. A useful taxonomy is found in ISO 19157:2013 Geographic Information -- Data quality. An overview is pasted below. Not that this is mostly about content, rather than access concerns. TBL 5-star barely touches any of these.

[Image: overview of the data quality elements defined in ISO 19157:2013]
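For readers without the image, the top-level data quality element categories in ISO 19157 can be enumerated roughly as follows (paraphrased from the standard, not from the missing image itself):

```typescript
// Rough paraphrase of the top-level data quality elements in ISO 19157:2013.
type Iso19157Element =
  | "completeness"        // commission / omission of features
  | "logicalConsistency"  // conceptual, domain, format, topological consistency
  | "positionalAccuracy"  // absolute, relative, gridded-data accuracy
  | "thematicAccuracy"    // classification and attribute correctness/accuracy
  | "temporalQuality"     // temporal measurement accuracy, consistency, validity
  | "usability";          // fitness for a stated purpose
```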