openaq / openaq-data-format

A description of the data format provided by the OpenAQ platform.
MIT License

Adding in field(s) that reflect RT vs historical/backfilled + QA/QC (for non-RT data) #2

Open RocketD0g opened 8 years ago

RocketD0g commented 8 years ago

Suggest a 'Data Type' Field with four categories, such as:

  1. Real-Time: Any data we currently ingest into the system, which by definition has not gone through QA/QC
  2. Historical/QA-QC: Backfilled historical data that has gone through QA/QC (e.g. EEA or EPA non-RT data, possibly from researchers)
  3. Historical/No QA-QC: 'Raw data'
  4. Historical/Unknown

Or perhaps this is too complicated and it should be broken down into two fields: one for RT vs. Historical and the other for QA/QC: Yes/No/Unknown?
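To make the two options concrete, here is a minimal Python sketch. The names `DataType`, `Timeliness`, `QaQcStatus`, and `split_data_type` are hypothetical and not part of the OpenAQ format; the sketch only illustrates that the four proposed categories map cleanly onto two separate fields.

```python
from enum import Enum


class DataType(Enum):
    """Option A: a single field with the four proposed categories."""
    REAL_TIME = "real-time"
    HISTORICAL_QAQC = "historical/qa-qc"
    HISTORICAL_NO_QAQC = "historical/no-qa-qc"
    HISTORICAL_UNKNOWN = "historical/unknown"


class Timeliness(Enum):
    """Option B, field 1: real-time vs. historical."""
    REAL_TIME = "real-time"
    HISTORICAL = "historical"


class QaQcStatus(Enum):
    """Option B, field 2: has the data gone through QA/QC?"""
    YES = "yes"
    NO = "no"
    UNKNOWN = "unknown"


# Mapping from the single-field categories to the two-field form.
# Real-time data is treated as not QA/QC'ed, per the proposal above.
_SPLIT = {
    DataType.REAL_TIME: (Timeliness.REAL_TIME, QaQcStatus.NO),
    DataType.HISTORICAL_QAQC: (Timeliness.HISTORICAL, QaQcStatus.YES),
    DataType.HISTORICAL_NO_QAQC: (Timeliness.HISTORICAL, QaQcStatus.NO),
    DataType.HISTORICAL_UNKNOWN: (Timeliness.HISTORICAL, QaQcStatus.UNKNOWN),
}


def split_data_type(data_type):
    """Convert a single 'Data Type' value into (timeliness, qa/qc status)."""
    return _SPLIT[data_type]
```

The two-field form allows six combinations rather than four, so it can also describe cases the single field cannot, such as real-time data with unknown QA/QC status.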

The motivation for this suggested change: Eventually, we will want to be able to backfill data from sources, fill in holes, or take data from sources (e.g. gov't agencies, researchers) that would rather only share QA/QC'ed data. Data that is not real-time, especially from gov't sources, will have been QA/QC'ed, unlike the real-time data we are collecting. For this reason, it would be good to have a field that reflects these differences in known data quality. We have gotten requests for this feature.

We have also gotten a related request to provide info on the exact QA/QC procedures of a given place. That'd be awesome, but I think it will be difficult to parse precisely the QA/QC controls used by each place, and unreasonable for us to do that now or in the near future. Plus, a user can look up the data source agency and contact them for more information.

cc: @olafveerman @dolugen @jflasher - I'll be making a series of these for discussion (and using a new label, dark blue 'v2'.) Will be interested in your thoughts on these and other possible changes to the format for v2.

olafveerman commented 8 years ago

I definitely see the value of having a verified: true / false field.

Other questions/comments:

"bulk_upload": {
  "note": "info about the source",
  "date": "2015-02-12"
}

olafveerman commented 8 years ago

Apart from the data standard side, it will be interesting to see how we can reconcile this data. Should we toss out the unverified measurements for the same timestamp? How do we do that when timestamps may not be the same?
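One possible answer, sketched in Python under stated assumptions: treat an unverified measurement as superseded when a verified one exists for the same location and parameter within some time tolerance. The function `reconcile`, the record keys (`location`, `parameter`, `time`, `verified`), and the 5-minute default tolerance are all hypothetical choices for illustration, not part of the current format.

```python
from datetime import datetime, timedelta


def reconcile(measurements, tolerance=timedelta(minutes=5)):
    """Drop unverified measurements that have a verified one close in time.

    Each measurement is a dict with the hypothetical keys 'location',
    'parameter', 'time' (a datetime) and 'verified' (a bool).
    """
    verified = [m for m in measurements if m["verified"]]
    kept = list(verified)
    for m in measurements:
        if m["verified"]:
            continue
        # An unverified value is superseded if a verified value exists for
        # the same location and parameter within the tolerance window.
        superseded = any(
            v["location"] == m["location"]
            and v["parameter"] == m["parameter"]
            and abs(v["time"] - m["time"]) <= tolerance
            for v in verified
        )
        if not superseded:
            kept.append(m)
    return sorted(kept, key=lambda m: m["time"])
```

This compares every unverified record against every verified one, which is fine for a sketch but would need an index on location, parameter, and time for real data volumes. The right tolerance likely depends on each source's averaging period.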