theodi / open-data-certificate

The mark of quality and trust for open data
https://certificates.theodi.org/
MIT License

Frequency: add lag and volatility of thing measured? #12

Closed · der closed this issue 10 years ago

der commented 11 years ago

Some suggestions for the frequency question ...

The question talks about how frequently the captured data changes, whereas what we need to know is how that relates to the underlying phenomenon. If you are publishing share price data but only measure the share price once a week, then even if you publish all those measurements instantly it's not of much value :)

Secondly, in some cases the lag is as important as the frequency. An annual budget need only be published annually, but if there is a lag of 11 months between the budget being set and its publication then it's not much use. Share price is another obvious area where value depends strongly on lag.

JeniT commented 11 years ago

@johnlsheridan and I had a bit of a back-and-forth about whether the frequency of the real-world changes or the frequency of changes in the captured data should be the thing on which the quality of the open data provision is judged (which is part of the point of the certificate).

As I recall, the argument was that most real things in the world change constantly, but we, by necessity, measure them at intervals. For example, we can't measure water quality in a particular location more frequently than a couple of times a week, or the shifting demographics of our nation more than once every ten years. Even if the underlying phenomenon doesn't change constantly, a reuser is likely to already understand how frequently it changes. So you don't learn anything new by asking the publisher to provide that information. And it doesn't feel right to penalise publishers who don't capture hard-to-measure phenomena immediately when what we want to reward them for is making the data open in a really good way.

On your second point, agree lag is important. What question(s) would you suggest to measure it, and what requirements should there be for different levels of the certificate?

der commented 11 years ago

I take your point that if something can't reasonably be measured faster you shouldn't get penalized for it; I wasn't proposing that. But in order to judge whether the frequency of collection and of publication is appropriate, so you can award your stars, you'll need to know about the domain and will need to gather that information somehow. How will you do that other than by asking about it? I guess I don't understand how the questionnaire will be processed: will you contact a set of experts in each domain to assess the data and the questionnaire answers? How will you know that weekly measurements are good for bathing water, or that hourly updates on air pollution are a bit slow (Bristol does 15-minute intervals)?

For lag I would simply ask: "What is the expected delay between data being collected and it being published?" Of course that skates over whether that's the max, the median, or the mean, and over how to treat delays due to validation and necessary processing as compared to the publication process itself. But people will just have to use judgement.

In terms of levels, you can't mechanically judge that; it depends so much on the data, and needs judgement and an understanding of the domain. Which is also true of frequency. I don't think you should score above bronze unless you have "appropriately" low lag, whatever that means in your domain.
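The max/median/mean ambiguity in the lag question above can be made concrete: given timestamps for when each release was collected and when it was published, all three summaries are cheap to compute. A minimal sketch in Ruby (the record shape and dates are hypothetical, purely for illustration):

```ruby
require 'date'

# Hypothetical records pairing a collection date with a publication date.
releases = [
  { collected: Date.new(2013, 1, 7), published: Date.new(2013, 1, 14) },
  { collected: Date.new(2013, 2, 4), published: Date.new(2013, 2, 25) },
  { collected: Date.new(2013, 3, 4), published: Date.new(2013, 3, 11) },
]

# Lag in whole days for each release, sorted for the median.
lags = releases.map { |r| (r[:published] - r[:collected]).to_i }.sort

max_lag    = lags.max
mean_lag   = lags.sum.to_f / lags.size
median_lag = lags[lags.size / 2] # odd-length list; an even one would average the middle pair

puts "max: #{max_lag}, mean: #{mean_lag.round(1)}, median: #{median_lag}"
```

For the sample dates this reports a max of 21 days, a mean of about 11.7, and a median of 7 — three quite different answers to the same question, which is why the questionnaire has to pick one (or accept a rough self-reported figure and rely on judgement).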

JeniT commented 11 years ago

The basic/bronze/silver/gold levels of the certificate will be self-certified and automatic. Eventually (when there's a back end), the certificates will be made available for the community to comment on them and flag any issues they discover with the answers. If there's demand, we might at some point offer a service that provides an extra level of human auditing on a paid basis, but for now assume that nothing is routinely checked by hand.

That all limits what we can ask for and how we judge responses to questions, particularly when the issues are subtle such as what an appropriate frequency of measurement is for a particular dataset. But the certificate is a crude measure, hence a medal table rather than a score out of 100.

So this is the challenge: what question(s) can we ask about frequency and lag that would both reward people who have put effort into opening up the data that they have (whatever its quality and utility) in a good, reusable way, and provide developers with the information that they need in order to judge the quality and utility of the data for their purposes?

der commented 11 years ago

Very difficult. That probably invalidates several of my issues; I had assumed that some judgement would be involved. I'll think about it when I have a chance.

JeniT commented 11 years ago

Please take a look at the current version (v0.3). Under the 'Practical' tab you're asked about the type of the data release, and if you say that it's a series of releases then you're asked both how frequently the data changes and what the lag is between a dataset being created and it being published. The lag is judged relative to the gap between releases.
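The rule described above — lag judged relative to the gap between releases — could be sketched as a simple comparison. This is not the certificate's actual scoring code, just an illustrative Ruby fragment; the frequency labels, day counts, and the "lag no longer than the release gap" threshold are all assumptions:

```ruby
# Approximate release gap in days for some hypothetical frequency labels.
GAP_IN_DAYS = { 'daily' => 1, 'weekly' => 7, 'monthly' => 30, 'annually' => 365 }

# Treat a lag as acceptable if it is no longer than the gap between releases,
# so e.g. a monthly series published within a month of collection passes.
def lag_appropriate?(release_frequency, lag_days)
  gap = GAP_IN_DAYS.fetch(release_frequency)
  lag_days <= gap
end

lag_appropriate?('monthly', 10) # => true
lag_appropriate?('weekly', 30)  # => false
```

The appeal of judging lag this way is that it needs no domain knowledge: both numbers come from the publisher's own answers, so it can be checked automatically in a self-certified questionnaire.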