openaq / openaq-data-format

A description of the data format provided by the OpenAQ platform.
MIT License

Data format to differentiate real-time captured-data and QA/QC'ed. #16

Open RocketD0g opened 7 years ago

RocketD0g commented 7 years ago

Below, @masalmon describes differences she noticed between the real-time US Embassy data from Delhi captured by OpenAQ and the data reported later on the embassy site. OpenAQ currently captures only real-time data from government sources and does not allow the insertion of a QA/QC layer.

From @masalmon:

Sarath asked me to do a graph comparing Beijing and Delhi, and for that I decided to use data from the embassy that is not in OpenAQ, in order to have a whole year for Delhi. I use my usaqmindia GitHub repo for that. It was a great occasion to compare both datasets for the common times: OpenAQ vs. embassy data. It's interesting to notice that there are differences. They seem to be: sometimes OpenAQ has -999 while the dataset from the embassy website has more credible values.

[Image: rplot]

Here you see the negative measures, I'll make a summary of some sort for the repeated measures.

So, a small summary of the issue. I'm looking at 5463 non-missing measures between 2015-12-12 03:00:00 and 2016-08-01 00:00:00, of which 145 differ between the embassy website and OpenAQ. So not a lot, but still interesting. Do you want to add the "right" value with a corresponding flag, as suggested by @RocketD0g (real-time vs. not)?

I've just looked at the dates on which non-concordant values appear, and they are not confined to a given period; the last one was in July this year.
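The comparison summarized above can be sketched roughly as follows. This is only an illustration in Python, not the code actually used (the original analysis relied on the usaqmindia repo). The function name `compare_sources`, the dictionary layout, and the sample values are all hypothetical, and "non-missing" is assumed here to mean that the embassy-site value is not -999.

```python
from typing import Dict, List, Tuple

MISSING = -999  # sentinel mentioned in this thread for "no data"


def compare_sources(
    realtime: Dict[str, float], reference: Dict[str, float]
) -> Tuple[int, List[Tuple[str, float, float]]]:
    """Compare two series keyed by ISO timestamp.

    Returns the number of common timestamps whose reference value is
    not missing, and the list of (timestamp, realtime, reference)
    triples where the two values differ.
    """
    # Only timestamps present in both series can be compared.
    common = sorted(set(realtime) & set(reference))
    # Assumption: a measure counts as "non-missing" if the reference has a value.
    compared = [t for t in common if reference[t] != MISSING]
    differing = [
        (t, realtime[t], reference[t])
        for t in compared
        if realtime[t] != reference[t]
    ]
    return len(compared), differing


# Hypothetical sample values, for illustration only.
openaq = {
    "2016-07-01T00:00": 80.0,
    "2016-07-01T01:00": -999,
    "2016-07-01T02:00": 95.0,
}
embassy = {
    "2016-07-01T00:00": 80.0,
    "2016-07-01T01:00": 72.0,
    "2016-07-01T02:00": 95.0,
    "2016-07-01T03:00": 60.0,
}

n_compared, mismatches = compare_sources(openaq, embassy)
```

Keeping the mismatching timestamps, rather than only a count, makes it easy to check whether the differences cluster in a given period.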

RocketD0g commented 7 years ago

This is something we would like to do in the future: have a QA/QC'ed layer in addition to the real-time gov layer. Current constraints: (1) Time: it would take a while to re-format the existing data points in the system, and (2) Financial: this could, in theory, greatly expand our database with another layer of data (from sources/locations) already in our system. There would be significant costs associated with this if we were to duplicate existing databases that have this information. One question might be: in the meantime, rather than duplicating these databases (again due to time + financial constraints), how do we systematically point people to these databases, in addition to tagging the source?

On the specific differences @masalmon noticed: it's odd to me that '-999' got taken out and replaced with other numbers. '-999' is the symbol for 'my instrument isn't reporting data.' Measurements may drift systematically, meaning QA/QC'ed numbers may be adjusted, but I don't see how '-999' numbers are adjusted. Perhaps the instrument itself was truly measuring numbers but the measurements weren't making it out or something?
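One way to picture the flag idea discussed in this thread is the minimal Python sketch below. It is not the OpenAQ data format: the `Measurement` class, its field names, the layer labels `"real-time"` and `"qa-qc"`, and the helper functions are all hypothetical. It assumes that -999 is mapped to a missing value on ingestion and that a QA/QC'ed value, when one exists, is preferred over the real-time one.

```python
from dataclasses import dataclass
from typing import Optional

MISSING_SENTINEL = -999  # "my instrument isn't reporting data"


@dataclass(frozen=True)
class Measurement:
    timestamp: str
    value: Optional[float]  # None when the source reported the sentinel
    layer: str              # hypothetical flag: "real-time" or "qa-qc"


def from_raw(timestamp: str, raw_value: float, layer: str) -> Measurement:
    """Build a Measurement, turning the -999 sentinel into None."""
    value = None if raw_value == MISSING_SENTINEL else float(raw_value)
    return Measurement(timestamp, value, layer)


def preferred(
    realtime: Measurement, reviewed: Optional[Measurement]
) -> Measurement:
    """Return the QA/QC'ed measurement if it has a value, else the real-time one."""
    if reviewed is not None and reviewed.value is not None:
        return reviewed
    return realtime


# Hypothetical values, for illustration only.
rt = from_raw("2016-07-01T01:00", -999, "real-time")
qc = from_raw("2016-07-01T01:00", 72, "qa-qc")
best = preferred(rt, qc)
```

Storing the layer on each record keeps both values available, so a later correction never overwrites what was captured in real time.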

cc: @jflasher @masalmon

maelle commented 7 years ago

When I thought of this I realized this would mean having a more crowd-sourced aspect to the data. ;-)

maelle commented 7 years ago

I also assumed that the instrument was working but not the display, maybe we could ask directly? I think some of us have some contacts :grin: