RocketD0g opened 4 years ago
Here's the query I used to generate the yearly average for every city in the system. I'm not a SQL wizard, so this could potentially (probably?) be optimized. There are a couple of nested queries that can be pulled out to modify this to different levels of aggregation. I ran this on the Parquet data referenced in https://github.com/openaq/openaq-fetch/issues/521 and it took ~30 seconds and scanned 220 MB, which costs around $0.001!
```sql
SELECT SUBSTRING(date, 1, 4) AS year,
       country, city,
       sum(count) AS count,
       avg(average) AS average
FROM
  (SELECT date,
          sum(count) AS count,
          avg(average) AS average,
          city,
          country
   FROM
     (SELECT date_format(from_iso8601_timestamp(date_local), '%Y-%m-%d') AS date,
             location, count(*) AS count, avg(value) AS average, city, country
      FROM fetches_realtime_parquet
      WHERE parameter = 'pm25'
        AND value >= 0
        AND unit = 'µg/m³'
      GROUP BY date_format(from_iso8601_timestamp(date_local), '%Y-%m-%d'), location, city, country)
   GROUP BY date, city, country)
GROUP BY country, city, SUBSTRING(date, 1, 4)
ORDER BY SUBSTRING(date, 1, 4) ASC, country ASC, city ASC;
```
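For instance (untested, but just a subset of the query above), the innermost SELECT on its own gives per-location daily averages, which the outer layers then roll up to city and year:

```sql
-- Sketch: daily averages per location, i.e. the innermost level of the query above.
SELECT date_format(from_iso8601_timestamp(date_local), '%Y-%m-%d') AS date,
       location, city, country,
       count(*) AS count,
       avg(value) AS average
FROM fetches_realtime_parquet
WHERE parameter = 'pm25'
  AND value >= 0
  AND unit = 'µg/m³'
GROUP BY date_format(from_iso8601_timestamp(date_local), '%Y-%m-%d'),
         location, city, country;
```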
Thinking about what the /averages endpoint could look like (building off of what @abarciauskas-bgse proposed in issue #2), I drafted request parameters and a response format.
GET Request parameters:
- country= (required)
- date_from= (required)
- date_to= (required)
- spatial_resolution= location, city, country (required)
- temporal_resolution= daily, monthly, yearly (required)
- city= (required if spatial_resolution=city, otherwise optional to narrow results)
- location= (optional, for a specific location; can be location name or location_id)
- parameter= (optional, default: all parameters)
Response:

results: [
  {
    parameter: ,          (pm25, co, ...)
    location: ,           (if applicable, based on request)
    date: ,               (2019 or 2019-12 or 2019-12-10)
    value: ,
    unit: ,
    location_count: ,     (# of stations used in calculation)
    measurement_count:    (# of measurements used in calculation)
  }
]
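As a rough illustration (path and values hypothetical, just to show how the parameters above could combine), a monthly, city-level request might look like:

```
GET /averages?country=IN&city=Delhi&spatial_resolution=city&temporal_resolution=monthly&date_from=2019-01-01&date_to=2019-12-31&parameter=pm25
```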
Anything missing? Thoughts on what parameters should be required or how the response should be formatted? Better names for parameters (location = station, but is it confusing in this context)?
Thanks for this post @sruti!
Some thoughts:
For the inputs:
- Would it be difficult to include a global average for a given pollutant? If it's relatively easy, it'd be nice to include this. Like the country average, the user has to recognize major deficiencies in such an average, but it is still interesting in its own right. For instance, it would be interesting to know if the global average of all types of pollutants this year takes a steep dive relative to the past five years.
Displayed results:
- Include output that shows the selected averaging period (e.g. yearly, monthly, daily).
- Include output that shows the selected averaging region (e.g. location, city, country), and then replace the optional location result above.
- Include a list of locations used for the calculation, e.g. locations_used.
- date can be converted to a date range, listing the start and end date of the average, since one date alone would be unclear.
- Put the parameter result directly above the value and unit.
- It'd be great to include a value such as temporal coverage in the results. This value would calculate the % of measurements that go into a calculation, based on the actual measurement count and the total number of counts possible (which could be calculated from the locations (e.g. stations), the time period, and the reported averaging period of the selected location(s), e.g. whether the data are hourly or not). I know not all locations have averagingPeriods associated with them, and any averages that incorporate such locations would just have to not report a number.
So all together, something like:
GET Request parameters:
- country= (required)
- date_from= (required)
- date_to= (required)
- spatial_resolution= location, city, country, global (required)
- temporal_resolution= daily, monthly, yearly (required)
- city= (required if spatial_resolution=city, otherwise optional to narrow results)
- location= (optional, for a specific location; can be location name or location_id)
- parameter= (optional, default: all parameters)
results: [
  {
    date_range: ,         (01-01-2019 to 12-31-2019)
    avg_temporal_res: ,   (yearly, daily, monthly)
    avg_spatial_res: ,    (location, city, country)
    parameter: ,          (pm25, co, ...)
    value: ,
    unit: ,
    locations_used: ,     (list of stations used for calculation)
    location_count: ,     (# of stations used in calculation)
    measurement_count:    (# of measurements used in calculation)
  }
]
Thoughts on this, @sruti? @jflasher - you've been thinking about this for awhile. Any thoughts/reactions?
Looks like a great start!
location= (optional, for specific location, can be location name or location_id)
How would you know the difference between the two? If we call them two different things in the existing API, I'd just be in favor of keeping it that way. So you can pass location OR location_id. And actually, if we want people to start moving to location_id, maybe that's the only one we support?
I think there will be some fun logic around how the dates get handled across the different files, but it seems doable.
@RocketD0g can you explain a bit more about what you're looking for with the global average? Is this already covered by a yearly country average for PM25? Or is it something else?
@jflasher - It'd be a true global average for all stations reporting over the course of the year, from all countries where data are reported into the OpenAQ system.
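In the Athena query above, one (untested) way to sketch that would be to drop country and city from the grouping entirely, so the yearly figure is an average over daily station averages from everywhere:

```sql
-- Sketch: a global yearly PM2.5 average, taken over each station's daily averages.
SELECT SUBSTRING(date, 1, 4) AS year,
       sum(count) AS count,
       avg(average) AS average
FROM
  (SELECT date_format(from_iso8601_timestamp(date_local), '%Y-%m-%d') AS date,
          location,
          count(*) AS count,
          avg(value) AS average
   FROM fetches_realtime_parquet
   WHERE parameter = 'pm25'
     AND value >= 0
     AND unit = 'µg/m³'
   GROUP BY date_format(from_iso8601_timestamp(date_local), '%Y-%m-%d'), location)
GROUP BY SUBSTRING(date, 1, 4)
ORDER BY year ASC;
```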
You'll need to incorporate a data completeness check otherwise you could have a single hour representing an entire year. Typically we use a threshold of 75%, but this could be an input that defaults to 75% or something similar.
@lstanton428 Yeah, that's a really good idea.
We've been thinking of including something that would indicate the % completeness of measurements, but not necessarily removing the reported value if it doesn't meet some threshold - i.e. the user would have to apply their own discretion.
But maybe the better way would be to apply some default threshold - like 75% - and then let users alter the query to set the threshold higher or lower if they preferred.
You all know your users better than I do, but if there's a chance that they won't filter out data with low completeness, I think it would be best to default to a certain completeness threshold. The API could still return completeness statistics, so if someone wanted all of the data they could just set completeness at 0%.
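As a rough sketch of how such a filter could look in the Athena query (untested, and assuming hourly reporting, i.e. 24 expected measurements per station-day; a real implementation would need each location's averagingPeriod to get the expected count right):

```sql
-- Sketch: keep only station-days with >= 75% of the expected hourly measurements
-- (24 per day assumed) before rolling them up into longer averages.
SELECT date_format(from_iso8601_timestamp(date_local), '%Y-%m-%d') AS date,
       location, city, country,
       count(*) AS count,
       avg(value) AS average,
       count(*) / 24.0 AS completeness
FROM fetches_realtime_parquet
WHERE parameter = 'pm25'
  AND value >= 0
  AND unit = 'µg/m³'
GROUP BY date_format(from_iso8601_timestamp(date_local), '%Y-%m-%d'),
         location, city, country
HAVING count(*) >= 0.75 * 24;
```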
Comments from a Community Member on how they process Indian AQ data before averaging:
Values to avoid when averaging India AQ data: all negatives and zeroes, plus integer values equal to 999, 1985, 985, 915, 515 -- these are some of the anomalies we found that appear regularly. The variation is among the suppliers.
I also avoid pm2.5 and pm10 < 4 (detection limits), so2 and no2 < 2, and o3 < 0.5.
For co -- very strangely, some stations report the data in mg/m3 and some in ug/m3, so any co value under 10 I multiply by 1000 to bring it up to ug/m3 (under the assumption that co will never be more than 10 mg/m3, though it possibly is sometimes).
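A possible (untested) way to fold that CO unit harmonization into the Athena query, under the commenter's heuristic that CO values under 10 are mg/m3:

```sql
-- Sketch: normalize CO to ug/m3 before averaging, treating values under 10 as mg/m3
-- per the heuristic described above.
SELECT location, city, country,
       count(*) AS count,
       avg(CASE WHEN value < 10 THEN value * 1000 ELSE value END) AS average_ugm3
FROM fetches_realtime_parquet
WHERE parameter = 'co'
  AND value >= 0
GROUP BY location, city, country;
```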
This all looks like good and useful detail for the API spec and implementation. It definitely sounds non-trivial, but having the details about how to clean the data and generate useful averages using data completeness is really helpful. I don't have anything to add at this time, but have the following questions:
About averages implementation:
About API implementation:
About technical execution:
In response to @abarciauskas-bgse's comments/q's above:
About averages implementation:
- @RocketD0g regarding your comment on "check units match" - do you think having different units is only a problem with CO or should this be done for all pollutants?
The units check will be needed for any of the gas phase pollutants in the system, so: O3, NO2, CO, SO2
- Regarding data cleaning, it seems clear to avoid negative values and zeroes, should we also use the other thresholds in @RocketD0g most recent comment?
Will be interested in others' comments, but from working on #4, these seem like more universal guidelines:
Take out:
While a suggestion was made to remove measurements < 4 ug/m3 for PM in India, this seems to be too stringent a cut for places where AQ is much cleaner and values regularly are 4 ug/m3 or lower (e.g. Australia, Canada, etc.). Suggest we just remove physically unrealistic values (e.g. negative values), say what we're doing, and leave it to users with more context and working at more country-specific levels to further remove data, if they like.
It's not uncommon for an instrument that is operating within specification in a clean environment to report slightly negative numbers because of instrument noise. Removing these is statistically invalid, because it may bias measurements high in clean environments.
U.S. EPA has a decent amount on this, though you have to dig around a bit to find it.
We have recently received several inquiries about the handling of small negative values for gaseous parameters (particularly 5-minute SO2). Previously, the practice had been to replace the negative value with a zero, and to flag the value with the qualifier code '9'. The OAQPS monitoring group has determined that this introduces statistical bias into the computations. Effective immediately, AQS will accept negative values down to the negative of the Method Detection Limit (MDL). AQS will reject values lower than the negative of the MDL. Data submitters are requested to report negative values (down to the -MDL) and not do a zero substitution. Users are advised to replace more negative values with a null data code (e.g. 'DA'). AQS Addendum to Negative values
The most recent technical note on gaseous pollutants that I could find sets the thresholds as such:
AQS will use the following limits until further notice.
- SO2: -4.0 ppb
- O3: -4.0 ppb
- NO2: -5.0 ppb
- CO: -0.4 ppm
Technical Note - Reporting Negative Values for Criteria Pollutant Gaseous Monitors to AQS
This presentation discusses negative PM measurements. CARB also has a bulletin on this that points to this table, which has MDL and minimum reported values for just about every FEM/FRM.
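If we went with AQS-style floors rather than dropping all negatives, a rough (untested) sketch of the filter might look like the following. Note the AQS thresholds are given in ppb/ppm and would need converting to whatever unit each measurement reports in; PM floors could come from the FEM/FRM MDL table linked above.

```sql
-- Sketch: reject only values below a per-pollutant negative floor (in the spirit
-- of the AQS technical note), keeping slightly negative readings from instrument noise.
-- Thresholds shown are the AQS ppb/ppm values and are illustrative only.
SELECT *
FROM fetches_realtime_parquet
WHERE NOT (
     (parameter = 'so2' AND value < -4.0)
  OR (parameter = 'o3'  AND value < -4.0)
  OR (parameter = 'no2' AND value < -5.0)
  OR (parameter = 'co'  AND value < -0.4)
);
```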
Below is a quick diagram of what a simple averaging tool, accessible via an API endpoint, could look like, in terms of:
@jflasher - maybe you could share the SQL query you used to play around with this?
cc: @sruti