Stephen-Gates opened this issue 8 years ago
Perhaps a web tool where you drop your file and it tells you what is needed to comply with standards?
@RMHogervorst that is essentially what GoodTables does:
It is also available as a CLI or a python lib:
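For a sense of what such a validator checks, here is a minimal sketch in stdlib Python of a few structural checks (blank headers, duplicate headers, blank rows, ragged rows). This is illustrative only and is not the GoodTables API; a validator like GoodTables automates these and many more checks.

```python
import csv
import io

def validate_table(csv_text):
    """Run a few basic structural checks on CSV text.

    Illustrative only: this mimics the kind of checks a validator
    like GoodTables performs; it is not the GoodTables API.
    """
    errors = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["source is empty"]
    header = rows[0]
    if len(set(header)) != len(header):
        errors.append("duplicate header names")
    if any(not h.strip() for h in header):
        errors.append("blank header name")
    for i, row in enumerate(rows[1:], start=2):
        if not any(cell.strip() for cell in row):
            errors.append(f"row {i} is blank")
        elif len(row) != len(header):
            errors.append(f"row {i} has {len(row)} cells, expected {len(header)}")
    return errors

# A ragged row is reported; a clean file returns no errors.
print(validate_table("id,name\n1,Alice\n2\n"))
```

A real validator would also check types, constraints, and (for Frictionless Data packages) conformance to a Table Schema.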
And, we are currently finishing off our Data Quality Dashboards, which could be used (they pretty much meet the challenge already :)):
Example data for quality assessment:
We are currently working on the feature/refactor branch of all these data-quality-* codebases, and will be happy for contributions and questions in around a week.
Oh great! That is very useful
Thanks @pwalsh, great to see you here. I'll check out the data quality dashboard. Hi @RMHogervorst, my motivation is to return some stats to a portal owner, and the data publishers that use it, to illustrate quality. My experience is that some data is poorly published and not reliably refreshed. I'd like to quantify that, show the publishers and encourage some corrective actions.
Thanks for this suggestion @Stephen-Gates. Despite work already done in this area, I think there is still some scope for R tools. Ideas that come to mind:
Speaking as someone who has both a) worked at a data portal and b) published my own data, I agree with your aims to ensure quality. However, I hope this will be a conscientiously constructive and collegial process, rather than one that could quite easily (i.e. without meaning to) become a bit embarrassing for people whose publishing systems/publications are shown to be considered 'poor quality'. We don't want to provide disincentives and shame those who are essentially altruistically publishing data at a time when there is no real incentive to do so.
It will also be important to define what is considered 'good quality'. E.g. some non-tidy data are well suited to their purpose, as pointed out by Jeff Leek here: http://simplystatistics.org/2016/02/17/non-tidy-data/
@MilesMcBain I looked at visdat and was totally excited to see it was inspired by CSV Fingerprints. I think your suggestion would be a wicked combo.
@ivanhanigan Totally agree. This is not a name-and-shame exercise. I have spoken with some portal owners and data publishers and they're keen to understand how to improve and to demonstrate that they are improving over time. So perhaps a tool that graphs progress over time would be useful?
Re: what is good quality data?
My simple approach is "is it published as promised". E.g.
I'm sure there are more scientific definitions of data quality... feel free to use those also.
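As a sketch of the "is it published as promised" idea, the check below compares a dataset's last-modified date against its stated update frequency. The frequency labels and the day thresholds are assumptions for illustration, not any portal's actual schema.

```python
from datetime import date

# Hypothetical mapping from a stated update frequency to a maximum
# acceptable age in days; real portals use varied vocabularies.
FREQUENCY_DAYS = {"daily": 1, "weekly": 7, "monthly": 31, "annually": 366}

def is_published_as_promised(last_modified, frequency, today=None):
    """True if the dataset was refreshed within its promised interval,
    False if overdue, None if the frequency is unrecognised."""
    today = today or date.today()
    max_age = FREQUENCY_DAYS.get(frequency)
    if max_age is None:
        return None  # unknown frequency: can't assess
    return (today - last_modified).days <= max_age

# A "monthly" dataset last touched 42 days ago is overdue.
print(is_published_as_promised(date(2016, 2, 1), "monthly",
                               today=date(2016, 3, 14)))
```

Run over a whole catalogue, the share of datasets returning True would be one simple, explainable quality statistic to show publishers.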
So these are actually two different use cases: checking the metadata, and checking the data itself.
@RMHogervorst that's correct but the challenge is totally flexible. Focus on what helps you, the community, and data publishers - or something else entirely ;-)
@Stephen-Gates I suggest these use cases have many dimensions. I'd like to specify the aim more before exploring the possibilities. In particular I note the different 'quality benchmarks' applicable to data portals run for government depts vs portals run for scientists. The former might be replete with administrators with data curation high in their work priorities, while the latter may be cobbled together by scientists eschewing the compulsion to compete and instead opting for open science, or alternately reacting to funders/journals requirements to publish supporting information and data with papers. The expectations you might have for quality metadata/data in the former might well be a lot higher than for the latter (and this would be justifiable given the lack of resourcing funders/universities give scientists to engage in data publishing activities).
Another dimension that is not clear in this thread is the spectrum between open data and mediated data. Often mediated data is easily available, with portals simply requiring user registration so they can collect download statistics and analyse usage by demographic groups, or to meet data depositors' requests to be made aware of proposed re-use so that they can keep in contact and provide collegial support for downstream users of their data. These data are not technically open, but in practice they are essentially open. I suspect quality may differ between purely open and mediated-but-easy-to-get-at data portals, and this might be worth thinking about too.
My 2cents.
@ivanhanigan Great points. I think governments are equally resource-constrained when it comes to publishing open data, and the variation in quality will be equally diverse. I understand that many research data portals are not technically open and may not present an API to the catalogue. So if anyone is considering the challenge, I'd suggest using a government CKAN portal that presents an open API. You could explore data.gov.au or data.qld.gov.au (see http://docs.ckan.org/en/latest/api/index.html).
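To illustrate, the sketch below parses a response in the shape CKAN's package_search action returns and flags datasets whose metadata hasn't changed since a cutoff. The sample response here is fabricated for the example; a real audit would fetch it from a portal endpoint such as data.gov.au's API (documented at the CKAN link above).

```python
import json
from datetime import datetime

# A trimmed, fabricated response in the shape CKAN's package_search
# action returns; a real audit would GET e.g.
# https://data.gov.au/api/3/action/package_search
sample = json.loads("""
{"success": true,
 "result": {"count": 2,
   "results": [
     {"name": "dataset-a", "metadata_modified": "2016-03-01T10:00:00"},
     {"name": "dataset-b", "metadata_modified": "2015-06-30T09:30:00"}]}}
""")

def stale_datasets(response, cutoff):
    """Names of datasets not modified since the cutoff datetime."""
    return [d["name"] for d in response["result"]["results"]
            if datetime.strptime(d["metadata_modified"],
                                 "%Y-%m-%dT%H:%M:%S") < cutoff]

# Only dataset-b predates the start of 2016.
print(stale_datasets(sample, datetime(2016, 1, 1)))
```

Because every CKAN portal exposes the same action API, the same check could be pointed at data.gov.au, data.qld.gov.au, or any other CKAN instance.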
More food for thought:
To me, thinking about data science goes together with assessing quality. I think this collection of data science links and this list of public datasets are relevant to this topic.
Create a tool to assess the quality of open data in an open data portal: a challenge by ODI Queensland.
Build on prior R work:
Leverage existing validation tools:
Apply standards, best practices or quality measures:
Assess an open data portal or two:
Use any or none of these suggestions to provide insights about the quality of open data and how it is published.
Help open data publishers improve so the data they publish can be used to deliver ongoing value.
Thinking about taking the challenge? Got questions? Reply below and we'll do our best to answer.