RocketD0g opened 4 years ago
Here's the query I used to generate the yearly average for every city in the system. I'm not a SQL wizard, so this could potentially (probably?) be optimized. There are a couple of nested queries that can be pulled out to modify this to different levels of aggregation. I ran this on the Parquet data referenced in https://github.com/openaq/openaq-fetch/issues/521 and it took ~30 seconds and scanned 220 MB, which costs around $0.001!
```sql
SELECT SUBSTRING(date, 1, 4) AS year,
       country, city,
       sum(count) AS count,
       avg(average) AS average
FROM
  (SELECT date,
          sum(count) AS count,
          avg(average) AS average,
          city,
          country
   FROM
     (SELECT date_format(from_iso8601_timestamp(date_local), '%Y-%m-%d') AS date,
             location, count(*) AS count, avg(value) AS average, city, country
      FROM fetches_realtime_parquet
      WHERE parameter = 'pm25'
        AND value >= 0
        AND unit = 'µg/m³'
      GROUP BY date_format(from_iso8601_timestamp(date_local), '%Y-%m-%d'), location, city, country)
   GROUP BY date, city, country)
GROUP BY country, city, SUBSTRING(date, 1, 4)
ORDER BY SUBSTRING(date, 1, 4) ASC, country ASC, city ASC;
```
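For instance (untested, but just a subset of the query above), the innermost SELECT on its own gives per-location daily averages, which the outer layers then roll up to city and year:

```sql
-- Sketch: daily averages per location, i.e. the innermost level of the query above.
SELECT date_format(from_iso8601_timestamp(date_local), '%Y-%m-%d') AS date,
       location, city, country,
       count(*) AS count,
       avg(value) AS average
FROM fetches_realtime_parquet
WHERE parameter = 'pm25'
  AND value >= 0
  AND unit = 'µg/m³'
GROUP BY date_format(from_iso8601_timestamp(date_local), '%Y-%m-%d'),
         location, city, country;
```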
Thinking about what the /averages endpoint could look like (building off of what @abarciauskas-bgse proposed in issue #2), I drafted request parameters and a response format.
GET Request parameters:
- country= (required)
- date_from= (required)
- date_to= (required)
- spatial_resolution= location, city, country (required)
- temporal_resolution= daily, monthly, yearly (required)
- city= (required if spatial_resolution=city, otherwise optional to narrow results)
- location= (optional, for a specific location; can be location name or location_id)
- parameter= (optional, default: all parameters)
Response:

results: [
  {
    parameter: ,          (pm25, co, ...)
    location: ,           (if applicable, based on request)
    date: ,               (2019 or 2019-12 or 2019-12-10)
    value: ,
    unit: ,
    location_count: ,     (# of stations used in calculation)
    measurement_count:    (# of measurements used in calculation)
  }
]
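As a rough illustration (path and values hypothetical, just to show how the parameters above could combine), a monthly, city-level request might look like:

```
GET /averages?country=IN&city=Delhi&spatial_resolution=city&temporal_resolution=monthly&date_from=2019-01-01&date_to=2019-12-31&parameter=pm25
```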
Anything missing? Thoughts on what parameters should be required or how the response should be formatted? Better names for parameters (location = station, but is it confusing in this context)?
Thanks for this post @sruti!
Some thoughts:
For the inputs:
- Would it be difficult to include a global average for a given pollutant? If it's relatively easy, it'd be nice to include this. Like the country average, the user has to recognize major deficiencies in such an average, but it is still interesting in its own right. For instance, it would be interesting to know if the global average of all types of pollutants this year takes a steep dive relative to the past five years.
Displayed results:
- Include output that shows the selected averaging period (e.g. yearly, monthly, daily).
- Include output that shows the selected averaging region (e.g. location, city, country), and then replace the optional location result above.
- Include a list of locations used for the calculation, e.g. locations_used.
- date can be converted to a date range, listing the start and end date of the average, since one date alone would be unclear.
- Put the parameter result directly above the value and unit.
- It'd be great to include a value such as temporal coverage in the results. This value would calculate the % of measurements that go into a calculation, based on the actual measurement count and the total number of counts possible (which could be calculated from the locations (e.g. stations), the time period, and the reported averaging period of the selected location(s), e.g. whether the data are hourly or not). I know not all locations have averagingPeriods associated with them, and any averages that incorporate such locations would just have to not report a number.
So all together, something like:
GET Request parameters:
- country= (required)
- date_from= (required)
- date_to= (required)
- spatial_resolution= location, city, country, global (required)
- temporal_resolution= daily, monthly, yearly (required)
- city= (required if spatial_resolution=city, otherwise optional to narrow results)
- location= (optional, for a specific location; can be location name or location_id)
- parameter= (optional, default: all parameters)
results: [
  {
    date_range: ,         (01-01-2019 to 12-31-2019)
    avg_temporal_res: ,   (yearly, daily, monthly)
    avg_spatial_res: ,    (location, city, country)
    parameter: ,          (pm25, co, ...)
    value: ,
    unit: ,
    locations_used: ,     (list of stations used for calculation)
    location_count: ,     (# of stations used in calculation)
    measurement_count:    (# of measurements used in calculation)
  }
]
Thoughts on this, @sruti? @jflasher - you've been thinking about this for awhile. Any thoughts/reactions?
Looks like a great start!
location= (optional, for specific location, can be location name or location_id)
How would you know the difference between the two? If we call them two different things in the existing API, I'd just be in favor of keeping it that way. So you can pass location OR location_id. And actually, if we want people to start moving to location_id, maybe that's the only one we support?
I think there will be some fun logic around how the dates get handled across the different files, but it seems doable.
@RocketD0g can you explain a bit more about what you're looking for with the global average? Is this already covered by a yearly country average for PM25? Or is it something else?
@jflasher - It'd be a true global average for all stations reporting over the course of the year, from all countries where data are reported into the OpenAQ system.
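In the Athena query above, one (untested) way to sketch that would be to drop country and city from the grouping entirely, so the yearly figure is an average over daily station averages from everywhere:

```sql
-- Sketch: a global yearly PM2.5 average, taken over each station's daily averages.
SELECT SUBSTRING(date, 1, 4) AS year,
       sum(count) AS count,
       avg(average) AS average
FROM
  (SELECT date_format(from_iso8601_timestamp(date_local), '%Y-%m-%d') AS date,
          location,
          count(*) AS count,
          avg(value) AS average
   FROM fetches_realtime_parquet
   WHERE parameter = 'pm25'
     AND value >= 0
     AND unit = 'µg/m³'
   GROUP BY date_format(from_iso8601_timestamp(date_local), '%Y-%m-%d'), location)
GROUP BY SUBSTRING(date, 1, 4)
ORDER BY year ASC;
```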
You'll need to incorporate a data completeness check otherwise you could have a single hour representing an entire year. Typically we use a threshold of 75%, but this could be an input that defaults to 75% or something similar.
@lstanton428 Yeah, that's a really good idea.
We've been thinking of including something that would indicate the % completeness of measurements, but not necessarily removing the reported value if it doesn't meet some threshold - i.e. the user would have to apply their own discretion.
But maybe the better way would be to apply some default threshold - like 75% - and then let users alter the query to set the threshold higher or lower if they preferred.
You all know your users better than I do, but if there's a chance that they won't filter out data with low completeness, I think it would be best to default to a certain completeness threshold. The API could still return completeness statistics, so if someone wanted all of the data they could just set completeness at 0%.
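As a rough sketch of how such a filter could look in the Athena query (untested, and assuming hourly reporting, i.e. 24 expected measurements per station-day; a real implementation would need each location's averagingPeriod to get the expected count right):

```sql
-- Sketch: keep only station-days with >= 75% of the expected hourly measurements
-- (24 per day assumed) before rolling them up into longer averages.
SELECT date_format(from_iso8601_timestamp(date_local), '%Y-%m-%d') AS date,
       location, city, country,
       count(*) AS count,
       avg(value) AS average,
       count(*) / 24.0 AS completeness
FROM fetches_realtime_parquet
WHERE parameter = 'pm25'
  AND value >= 0
  AND unit = 'µg/m³'
GROUP BY date_format(from_iso8601_timestamp(date_local), '%Y-%m-%d'),
         location, city, country
HAVING count(*) >= 0.75 * 24;
```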
Comments from a Community Member on how they process Indian AQ data before averaging:
Values to avoid when averaging India AQ data: all negatives and zeroes, plus integer values equal to 999, 1985, 985, 915, 515 -- these are some of the anomalies we found that appear regularly. The variation is among the suppliers.
I also avoid pm2.5 and pm10 < 4 (detection limits), so2 and no2 < 2, and o3 < 0.5.
For co -- very strangely, some stations report the data in mg/m3 and some in ug/m3, so any co value under 10 I multiply by 1000 to bring it up to ug/m3 (under the assumption that co will never be more than 10 mg/m3, though it possibly is sometimes).
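A possible (untested) way to fold that CO unit harmonization into the Athena query, under the commenter's heuristic that CO values under 10 are mg/m3:

```sql
-- Sketch: normalize CO to ug/m3 before averaging, treating values under 10 as mg/m3
-- per the heuristic described above.
SELECT location, city, country,
       count(*) AS count,
       avg(CASE WHEN value < 10 THEN value * 1000 ELSE value END) AS average_ugm3
FROM fetches_realtime_parquet
WHERE parameter = 'co'
  AND value >= 0
GROUP BY location, city, country;
```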
This all looks like good and useful detail for the API spec and implementation. It definitely sounds non-trivial, but having the details about how to clean the data and generate useful averages using data completeness is really helpful. I don't have anything to add at this time, but have the following questions:
About averages implementation:
About API implementation:
About technical execution:
In response to @abarciauskas-bgse's comments/q's above:
About averages implementation:
- @RocketD0g regarding your comment on "check units match" - do you think having different units is only a problem with CO or should this be done for all pollutants?
The units check will be needed for any of the gas phase pollutants in the system, so: O3, NO2, CO, SO2
- Regarding data cleaning, it seems clear to avoid negative values and zeroes, should we also use the other thresholds in @RocketD0g most recent comment?
Will be interested in others' comments, but from working on #4, these seem like more universal guidelines:
Take out:
While a suggestion was made to remove measurements < 4 ug/m3 for PM in India, this seems to be too stringent a cut for places where AQ is much cleaner and values regularly are 4 ug/m3 or lower (e.g. Australia, Canada, etc.). Suggest we just remove physically unrealistic values (e.g. negative values), say what we're doing, and leave it to users with more context and working at more country-specific levels to further remove data, if they like.
It's not uncommon for an instrument that is operating within specification in a clean environment to report slightly negative numbers because of instrument noise. Removing these is statistically invalid, because it may bias measurements high in clean environments.
U.S. EPA has a decent amount on this, though you have to dig around a bit to find it.
We have recently received several inquiries about the handling of small negative values for gaseous parameters (particularly 5-minute SO2). Previously, the practice had been to replace the negative value with a zero, and to flag the value with the qualifier code '9'. The OAQPS monitoring group has determined that this introduces statistical bias into the computations. Effective immediately, AQS will accept negative values down to the negative of the Method Detection Limit (MDL). AQS will reject values lower than the negative of the MDL. Data submitters are requested to report negative values (down to the -MDL) and not do a zero substitution. Users are advised to replace more negative values with a null data code (e.g. 'DA'). AQS Addendum to Negative values
The most recent technical note on gaseous pollutants that I could find sets the thresholds as such:
AQS will use the following limits until further notice.
- SO2: -4.0 ppb
- O3: -4.0 ppb
- NO2: -5.0 ppb
- CO: -0.4 ppm
Technical Note - Reporting Negative Values for Criteria Pollutant Gaseous Monitors to AQS
This presentation discusses negative PM measurements. CARB also has a bulletin on this that points to this table, which has MDL and minimum reported values for just about every FEM/FRM.
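If we went with AQS-style floors rather than dropping all negatives, a rough (untested) sketch of the filter might look like the following. Note the AQS thresholds are given in ppb/ppm and would need converting to whatever unit each measurement reports in; PM floors could come from the FEM/FRM MDL table linked above.

```sql
-- Sketch: reject only values below a per-pollutant negative floor (in the spirit
-- of the AQS technical note), keeping slightly negative readings from instrument noise.
-- Thresholds shown are the AQS ppb/ppm values and are illustrative only.
SELECT *
FROM fetches_realtime_parquet
WHERE NOT (
     (parameter = 'so2' AND value < -4.0)
  OR (parameter = 'o3'  AND value < -4.0)
  OR (parameter = 'no2' AND value < -5.0)
  OR (parameter = 'co'  AND value < -0.4)
);
```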
Below is a quick diagram of what a simple averaging tool, accessible via an API endpoint, could look like, in terms of:
@jflasher - maybe you could share the SQL query you used to play around with this?
cc: @sruti