openaq / openaq-data-format

A description of the data format provided by the OpenAQ platform.
MIT License

Feedback from Multitude #7

Open RocketD0g opened 8 years ago

RocketD0g commented 8 years ago

Nick Masson of Multitude gave us this feedback and questions on our data format + API. Putting this up as one big issue for now; will revisit in a week or so. Thought it might be of interest to others.

@jflasher - please give your feedback on these answers before I email Nick back.

Answering inline:

1) Do you have a list showing, for example, all of the "attribution" sources, the "adapter" associated with that source, and a layman description of the source?

We have a short description by name (e.g. 'US EPA/AirNow') and also the URL for the originating source - all under the 'attribution' field. At this stage, all sources on our platform originate from governmental bodies. You can find more about that here: https://docs.openaq.org/#api-Sources

If you mean more info on the instrument types, calibration procedures etc.: We know this would be valuable information for many folks, but we have no way to systematically or reliably get that from nearly any available governmental source (whether it is currently in our system or not).

2) Correct me if I'm wrong, but I assume the attribution field describes where the data is coming from? What is the "sourceName"?

Yes, that's correct. The sourceName is just a way to refer to the specific source files here: https://github.com/openaq/openaq-fetch/tree/develop/sources

3) Our immediate use case would be to be able to quickly pull data from all of the US regulatory stations, for any or all of the parameters they measure. It is important that we can identify if the data is from a US regulatory station, or other source. I assume that the "attribution" field and "url" is consistent, so, for example, might be "AirNow" and "www.airnow.gov" for all of the US regulatory data that is acquired real-time?

Yes, the attribution field is consistent, but our system is not designed to let you search by attribution. Two thoughts on this, though:

(a) Currently, all real-time data aggregated to our system is from AirNow. (And it should be noted that all data currently on our platform is from governmental sources.) The exception to this is a few months' worth of data that we aggregated from Houston, TX before adding in the AirNow sources (this data was collected by the local EPA, but it may or may not have been used for regulation). We no longer aggregate from this source, though. So currently, if you used the API to filter by the country field in real time (or historically, with the exception of Houston, TX), you would only get AirNow data.
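To sketch what that country-based filtering looks like, here is a minimal example of building a measurements query. The endpoint path and parameter names are assumptions based on the API docs at docs.openaq.org as they stood around this time, not something confirmed in this thread:

```python
from urllib.parse import urlencode

# Hypothetical sketch: pull only US (i.e., currently AirNow-sourced) measurements
# by filtering on the country field. Endpoint and parameter names are assumed
# from the v1 API docs; adjust to whatever the current docs specify.
BASE = "https://api.openaq.org/v1/measurements"
params = {"country": "US", "parameter": "pm25", "limit": 100}
url = f"{BASE}?{urlencode(params)}"
print(url)
```

The same pattern extends to any of the documented query parameters (e.g. `city`, `location`, `date_from`).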

(b) But of course we plan to add in other data source types (e.g. research-grade and low-cost sensors), and our system will need to indicate this. Currently, we're sketching out a very simple system to differentiate these types. See Issue #8. This will help show whether a data source is governmental versus from a researcher, but it won't tell you whether the source is specifically used for regulatory purposes. We would have trouble distinguishing that for many, if not most, countries.

Similarly, if you were to pull in data from the same regulatory stations, but have it be post-QA/QC, we could discern between the two data-sets by sorting on the "attribution" field?

At this stage, frankly, we don't have plans to pull in post-QA/QC data. Our main goal is to capture data that would otherwise be lost for the record. This is not the case for US EPA data, obviously, but it is a useful data set for people to build complementary tools on and to compare with. That said, if the community says this is a 'must' we'll see if we can make it happen, and we would need a tag on our data format that indicated whether data was pre or post QA/QC.

4) Can the date_from and date_to accept datetime formatted to the second (e.g. "2016-05-07T12:44:22.556Z")? I am aware that most of the data is fairly low frequency (hourly averages), but our system would be windowing on time intervals to make sure we get the entire time-series (we do batch processing on consecutive time-intervals of data, and can be sure not to miss anything if we bracket down to accurate time-intervals).

Yup, it accepts timestamps down to the second; anything in the ISO 8601 standard should work.
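The windowing approach Nick describes can be sketched as a generator of consecutive, non-overlapping ISO 8601 intervals. This is an illustrative helper, not part of the OpenAQ API itself:

```python
from datetime import datetime, timedelta, timezone

def time_windows(start, end, step):
    """Yield consecutive (date_from, date_to) ISO 8601 strings covering [start, end).

    Adjacent windows share a boundary, so batch processing over them in order
    cannot skip any part of the time-series.
    """
    t = start
    while t < end:
        t_next = min(t + step, end)
        yield (t.isoformat().replace("+00:00", "Z"),
               t_next.isoformat().replace("+00:00", "Z"))
        t = t_next

start = datetime(2016, 5, 7, 12, 0, 0, tzinfo=timezone.utc)
end = datetime(2016, 5, 7, 15, 0, 0, tzinfo=timezone.utc)
windows = list(time_windows(start, end, timedelta(hours=1)))
# windows[0] == ("2016-05-07T12:00:00Z", "2016-05-07T13:00:00Z")
```

Each tuple can be passed directly as the date_from/date_to query parameters.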

5) It would be very useful if you include a field that tells how the time is averaged. Different organizations average differently. For example, is the one hour average centered on the timestamp, or forward or backward looking (e.g. forward looking would have 12:00 represent data averaged between 12:00 and 12:59). Not sure if you have this info, or would be willing to go through and contact your various sources to find out. For us, it's crucial when cross-comparing different data.

Hear you on this. We need to add, at minimum, another field that differentiates reporting frequency and averaging. I think we will have trouble - from a sheer communication standpoint with governmental agencies - getting down to forward- versus backward-looking for many sources, but we could do it at least for the larger sources. I've created a separate issue on it. Also, do you know which way it is done for the US EPA AirNow data? I believe it is timestamped with the ending time (e.g. data taken between 3pm and 4pm is marked 4pm).

Addendum: As I apparently forgot, we do have a protocol in place that defines the timestamp for an average: a measurement is stamped with the ending time of its averaging window. For example, a measurement taken between 3pm and 4pm will be given a timestamp of 4pm. That said, it is probably the case that we access data from sites that only provide a single timestamp, where it is not readily apparent whether this is a beginning, middle, or ending timestamp. More here on our format: https://github.com/openaq/openaq-api/wiki/4.-Writing-an-adapter#dealing-with-dates-and-date-ranges
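Given that end-of-window convention, recovering the averaging interval from a timestamp is a one-liner. A minimal sketch (the function name and signature are illustrative, not part of any OpenAQ library):

```python
from datetime import datetime, timedelta

def averaging_window(end_timestamp, period):
    """Return (start, end) of the averaging interval, given that OpenAQ
    stamps each average with the *ending* time of its window."""
    return end_timestamp - period, end_timestamp

# A 1-hour average stamped 4pm covers 3pm-4pm.
start, end = averaging_window(datetime(2016, 6, 25, 16, 0), timedelta(hours=1))
```

This is only safe for sources known to follow the end-stamp convention; as noted above, some upstream sites do not make their convention apparent.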

6) Can you include a "totalPages" field in the return JSON? This would help with the logic in pulling data down on our side -- we'll know a priori how many times we need to loop over the pagination. Otherwise we would have to do some more ad hoc coding to infer it ourselves.

I think you should be able to get this by dividing 'found' by 'limit' in the metadata returned (see the screenshot of the response's meta block attached in the original issue).
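Concretely, the page count is just the ceiling of that division. A small sketch using illustrative numbers (2052 results at 100 per page):

```python
import math

def total_pages(found, limit):
    """Number of pages needed to fetch `found` results at `limit` results per page,
    as derived from the `found` and `limit` fields in the response's meta block."""
    return math.ceil(found / limit)

pages = total_pages(2052, 100)  # 21 pages: 20 full pages plus a partial one
```

Looping `page` from 1 to this value covers the whole result set without ad hoc inference.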

If I am misunderstanding that q, let me know.

7) It would be useful to be able to query on the "attribution" or "url" for attribution. I can definitely see us cross-referencing a list as in "1)" and wanting to query data for just one source.

This is good feedback; I've made an issue on that. https://github.com/openaq/openaq-api/issues/256

I also do wonder if this will, in some sense, be solved by this issue: https://github.com/openaq/openaq-data-format/issues/8

Perhaps not completely, though.

8) do you also have a list of all possible pollutants/parameters that are in your schema?

These are currently the ones we capture: PM2.5, PM10, CO, O3, NO2, SO2 and BC (though BC data is the rarest we find).

We don't have immediate plans to expand or truncate this list, but the most current listing should always be here: https://docs.openaq.org/#api-Measurements

If you have feedback on any pollutants you would find useful to be included, let us know. We tend to default to the ones most commonly collected globally rather than the rarer types some places measure, like benzene, etc.

9) It would be great if all the units were standardized -- we do this in our system, and it's fairly painful, but worth doing at the base level of data ingestion. I.e., only deal in ppb or ppm for certain pollutants, etc. Otherwise we need to write logic on our side that cross-references each measurement with its units and, if the units aren't our standard unit, converts them against a mapping of unit conversions.

So, one thing we stick to very closely in our system is the precise way the data is shared on the originating site. We think it's important to always have the 'raw' data saved to our system as it appears at the source. We do make conversions within volume concentrations, e.g. ppb to ppm (so measurements reported in ppb are shared as ppm - you can see the preferred units here: https://github.com/openaq/openaq-data-format).

BUT, as you notice, we don't convert when an ozone measurement, for example, is made in ug/m^3. We don't do this because of the assumptions of P and T we would have to make at each location globally. We find this to be a bit of a pain in the butt, too. :) But again, we prioritize having the dataset shared openly and transparently from its originating sources with no assumptions applied on our part.
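To make the P/T dependence concrete, here is a sketch of the mass-to-volume conversion a downstream consumer would have to apply. The molar volume of 24.45 L/mol assumes 25 degC and 1 atm - exactly the kind of assumption OpenAQ declines to bake in - and the molar masses are standard values:

```python
# Molar masses in g/mol for the gaseous parameters OpenAQ carries.
MOLAR_MASS = {"o3": 48.0, "no2": 46.01, "so2": 64.07, "co": 28.01}

def ugm3_to_ppb(value, parameter, molar_volume=24.45):
    """Convert a ug/m^3 concentration to ppb.

    molar_volume is the volume (L) of one mole of ideal gas at the assumed
    temperature and pressure; 24.45 L/mol corresponds to 25 degC and 1 atm.
    Different assumed conditions give different results, which is why OpenAQ
    leaves this conversion to the data consumer.
    """
    return value * molar_volume / MOLAR_MASS[parameter]

o3_ppb = ugm3_to_ppb(100, "o3")  # 100 ug/m^3 of ozone -> 50.9375 ppb at 25 degC, 1 atm
```

At 0 degC the molar volume drops to about 22.41 L/mol, shifting the same reading by roughly 8% - which is why per-location P and T matter.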

However, this is something that can be added on top of our system, and it would clearly be useful, even if done for a region rather than globally. I'm making an issue on it. If you/your team were to dig into that piece more with some open-source code, we'd advertise the tool you generate widely through our network, write a blog post about our work together, and do anything we could to call out such awesomeness.