openaq / openaq-averaging

A repo focused on determining longer-term averages at varying geospatial scales from data accessed from the OpenAQ Platform.

Creating an /averages endpoint #2

Open abarciauskas-bgse opened 5 years ago

abarciauskas-bgse commented 5 years ago

To make it easier to generate averages and other statistics at various temporal and spatial resolutions, OpenAQ could add a new API endpoint that returns air quality measurement averages for a given set of parameters. With such an endpoint, users would no longer have to query or download the data and then parse and clean it to get the values of interest for reporting.

This issue proposes to prototype such an endpoint using a separate AWS account running Athena against the OpenAQ S3 bucket fetches_realtime_gzipped, and then to proceed as follows:

High level steps

  1. Clone or fork and run the existing node.js OpenAQ API (for local development)
  2. Add an endpoint /stats or /averages which queries Athena using a set of parameters provided by the user, and returns an S3 location (follows Athena asynchronous request / response cycle)
  3. (V2) Add an option to clean the data. If users provide parameters for cleaning the data (e.g. removing negative or repeating values), the endpoint would either generate an Athena query for the averages with the excluded values filtered out or, if the cleaning operation is complex (such as removing repeating values), start a more involved workflow. In that case the endpoint would still return an S3 location that the workflow eventually writes to. It would query Athena for raw values (as opposed to aggregations), and the output of that query would go to an S3 location, which would kick off a job to clean the data, generate statistics from it, and write the results to the final S3 location for the user.
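
To make steps 2 and 3 a bit more concrete, here is a rough Python sketch of the two paths (the real endpoint would live in the Node.js API, so this is illustration only). `build_averages_query` turns user parameters into an Athena (Presto SQL) query string, with the simple "drop negative values" cleaning folded into the `WHERE` clause. `clean_and_average` shows what the separate cleaning job might do with raw values for the more complex "repeating values" case. The table name `openaq`, the column names (`parameter`, `country`, `value`, `date.utc`), the parameter allowlist, and the rule that a run of more than `max_repeats` identical readings is dropped entirely are all assumptions for the sake of the example, not decisions made in this issue. Submitting the query to Athena and polling for the result is left out.

```python
import re
from datetime import date
from itertools import groupby
from statistics import mean

# Hypothetical Athena table registered over the OpenAQ S3 data.
TABLE = "openaq"
# Assumed allowlists; validating against them avoids building SQL from raw input.
ALLOWED_PARAMETERS = {"pm25", "pm10", "so2", "no2", "o3", "co", "bc"}
ALLOWED_RESOLUTIONS = {"hour", "day", "month", "year"}


def build_averages_query(parameter, country, start, end,
                         resolution="day", drop_negative=False):
    """Return an Athena query string for averages over [start, end)."""
    if parameter not in ALLOWED_PARAMETERS:
        raise ValueError(f"unsupported parameter: {parameter}")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution}")
    if not re.fullmatch(r"[A-Z]{2}", country):
        raise ValueError("country must be a two-letter code")
    # Parsing rejects anything that is not a YYYY-MM-DD date.
    start_day = date.fromisoformat(start)
    end_day = date.fromisoformat(end)
    if end_day <= start_day:
        raise ValueError("end must be after start")

    # Assumes date.utc holds ISO 8601 timestamp strings.
    period = f"date_trunc('{resolution}', from_iso8601_timestamp(date.utc))"
    conditions = [
        f"parameter = '{parameter}'",
        f"country = '{country}'",
        f"date.utc >= '{start_day.isoformat()}'",
        f"date.utc < '{end_day.isoformat()}'",
    ]
    if drop_negative:
        # Simple cleaning can be pushed down into the aggregation query.
        conditions.append("value >= 0")
    return (
        f"SELECT {period} AS period, avg(value) AS average, count(*) AS n "
        f"FROM {TABLE} WHERE {' AND '.join(conditions)} "
        "GROUP BY 1 ORDER BY 1"
    )


def clean_and_average(values, max_repeats=3):
    """Average time-ordered raw values after dropping negatives and long runs.

    Assumption: more than max_repeats identical consecutive readings
    indicate a stuck sensor, so the whole run is discarded.
    """
    kept = []
    for value, run in groupby(values):
        run_length = len(list(run))
        if value >= 0 and run_length <= max_repeats:
            kept.extend([value] * run_length)
    return mean(kept) if kept else None
```

For example, `build_averages_query("pm25", "US", "2019-06-01", "2019-07-01", drop_negative=True)` yields a daily-average query for June 2019, and `clean_and_average([10.0, 12.0, -1.0, 5.0, 5.0, 5.0, 5.0, 14.0])` ignores the negative reading and the run of four identical values before averaging.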

The API endpoint will produce one or more averages for a variety of parameters:

Additional work:

In addition to the functionality of the /averages endpoint, the following additional work could be done:

I'm attaching some whiteboarding from the last DataKind DC datajam in case it is helpful.

IMG_20190613_210406 (2) IMG_20190613_210342 (1) IMG_20190613_210335 (1)

cc @RocketD0g @olafveerman @jflasher @dominicwhite @minh5