Creating an /averages endpoint

To make it easier to generate averages and other statistics at various temporal and spatial resolutions, OpenAQ could add a new API endpoint that returns air quality measurement averages for a given set of parameters. With such an endpoint, users would no longer have to query or download the data and then parse and clean it to get the values of interest for reporting.

This issue proposes to prototype such an endpoint using a separate AWS account running Athena against the OpenAQ S3 bucket fetches_realtime_gzipped, and then to proceed as follows:

High level steps

Clone or fork and run the existing node.js OpenAQ API (for local development)
Add an endpoint, /stats or /averages, which queries Athena using a set of parameters provided by the user and returns an S3 location (following Athena's asynchronous request/response cycle)
(V2) Add an option to clean the data: if users provide parameters for cleaning the data (e.g. removing negative or repeating values), the endpoint would either generate an Athena query for the averages that excludes the unwanted values or, if the cleaning operation is complex (such as removing repeating values), start a longer workflow. In that case the endpoint would still return the S3 location that the workflow eventually writes to. It would query Athena for raw values (as opposed to aggregations) and direct the query output to an intermediate S3 location, which would kick off a job to clean the data, generate statistics from it, and write the results to the final S3 location for the user.
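The cleaning step in the V2 item could be sketched roughly as below. This is only an illustration: the function names and the `dropNegatives` / `dropRepeats` options are hypothetical, and "repeating values" is assumed here to mean a reading identical to the one immediately before it. Negatives are removed before repeats, which is also an assumption.

```javascript
// Hypothetical helpers; names and cleaning rules are illustrative assumptions,
// not part of the existing OpenAQ API.

// Drop measurements with negative values.
function removeNegatives(values) {
  return values.filter((v) => v >= 0);
}

// Drop a value when it equals the value immediately before it,
// keeping the first reading of each run of repeats.
function removeRepeats(values) {
  return values.filter((v, i) => i === 0 || v !== values[i - 1]);
}

// Arithmetic mean; returns null when there is nothing to average.
function mean(values) {
  if (values.length === 0) return null;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Apply the requested cleaning steps (negatives first, then repeats)
// and average whatever remains.
function cleanedAverage(values, { dropNegatives = false, dropRepeats = false } = {}) {
  let cleaned = values;
  if (dropNegatives) cleaned = removeNegatives(cleaned);
  if (dropRepeats) cleaned = removeRepeats(cleaned);
  return mean(cleaned);
}
```

Dropping negatives is a simple filter that could likely be pushed into the Athena query itself as a WHERE clause, while dropping repeats depends on the order of readings, which is why it is the example of a "complex" operation handled by a separate job.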

The API endpoint will produce one or more averages for a variety of parameters:

spatial resolution: user can pass location=, city=, or country= (V2: coordinates= and radius=)
temporal resolution: user can set temporal_resolution= to daily, weekly, monthly, or yearly
time span: user provides a start_date= and an end_date=
parameter: user can pass one or more parameters[]= to return averages for (see https://docs.openaq.org/#api-Parameters)
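To make the parameter list above concrete, here is a minimal sketch of how the endpoint might turn a request into an Athena query. The table name `openaq_measurements` and the column names (`date_utc`, `parameter`, `value`, `location`, `city`, `country`) are assumptions for illustration, as are the function names; the real schema over the S3 data may differ. Only the V1 spatial filters are handled.

```javascript
// Hypothetical table and column names; the real Athena schema may differ.
const TABLE = "openaq_measurements";
const RESOLUTIONS = { daily: "day", weekly: "week", monthly: "month", yearly: "year" };
const SPATIAL_KEYS = ["location", "city", "country"];
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

// Escape a value for use inside a single-quoted SQL string literal.
function quote(value) {
  return "'" + String(value).replace(/'/g, "''") + "'";
}

// Validate the request parameters and build the aggregation query.
function buildAveragesQuery(params) {
  const unit = RESOLUTIONS[params.temporal_resolution];
  if (!unit) {
    throw new Error("temporal_resolution must be daily, weekly, monthly, or yearly");
  }
  const spatialKey = SPATIAL_KEYS.find((key) => params[key] !== undefined);
  if (!spatialKey) {
    throw new Error("one of location, city, or country is required");
  }
  if (!ISO_DATE.test(params.start_date) || !ISO_DATE.test(params.end_date)) {
    throw new Error("start_date and end_date must be formatted as YYYY-MM-DD");
  }
  // Accept a single parameter or an array of them.
  const parameters = [].concat(params.parameters || []);
  if (parameters.length === 0) {
    throw new Error("at least one parameter is required");
  }

  const timestamp = "from_iso8601_timestamp(date_utc)";
  return [
    `SELECT date_trunc('${unit}', ${timestamp}) AS period, parameter, avg(value) AS average`,
    `FROM ${TABLE}`,
    `WHERE ${spatialKey} = ${quote(params[spatialKey])}`,
    `  AND parameter IN (${parameters.map(quote).join(", ")})`,
    `  AND CAST(${timestamp} AS date) BETWEEN DATE '${params.start_date}' AND DATE '${params.end_date}'`,
    "GROUP BY 1, 2",
    "ORDER BY 1, 2",
  ].join("\n");
}
```

A request such as `city=Delhi&temporal_resolution=monthly&start_date=2019-01-01&end_date=2019-06-30&parameters[]=pm25` would produce a query that groups by month and parameter. String escaping is used here only to keep the sketch short; a real implementation should validate inputs more strictly.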

Additional work:

Beyond the core functionality of the /averages endpoint, the following additional work could be done:

returning other types of stats (histograms, standard deviations, etc.)
data could be transformed to parquet for faster response times
visualizations could be built on top of the averages data to showcase the endpoint

I'm attaching some whiteboarding from the last DataKind DC datajam in case it is helpful.

IMG_20190613_210406 (2) IMG_20190613_210342 (1) IMG_20190613_210335 (1)

cc @RocketD0g @olafveerman @jflasher @dominicwhite @minh5
