mi3nts / AirQualityAnalysisWorkflows

Repo for the Fall 2021 Senior Design Group.
MIT License

Ingestion pipeline for OpenAQ data #12

Closed john-waczak closed 1 year ago

john-waczak commented 1 year ago

To make more data sources available, we should develop an ingestion pipeline for data from the OpenAQ API. Specifically, we will need to create a new Node-RED flow that queries data from OpenAQ using their API and then injects that data into InfluxDB. Additionally, we will need to provision a new bucket in InfluxDB so that there is a clear separation between our sensor data and the OpenAQ data. This bucket should be initialized when the containers are created, with additional environment variables added to the .env file as needed.
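For reference, here is a rough sketch of what the write step of that flow might look like outside of Node-RED. The OpenAQ endpoint, query parameters, and response field names are my assumptions about the v2 API, and the org/bucket/token variables are placeholders for whatever we end up putting in the .env file:

```js
// Sketch of the OpenAQ -> InfluxDB ingestion step (Node.js 18+, @influxdata/influxdb-client).
// Endpoint, query parameters, and response field names are assumptions based on the OpenAQ v2 API;
// the org/bucket/token values are placeholders for whatever ends up in the .env file.
const { InfluxDB, Point } = require('@influxdata/influxdb-client')

const influx = new InfluxDB({
  url: process.env.INFLUXDB_URL || 'http://localhost:8086',
  token: process.env.INFLUXDB_TOKEN,
})

async function ingestOpenAQ() {
  // Pull recent measurements for a single location (placeholder parameters)
  const params = new URLSearchParams({
    location_id: '1234', // placeholder OpenAQ location id
    date_from: new Date(Date.now() - 60 * 60 * 1000).toISOString(),
    limit: '1000',
  })
  const res = await fetch(`https://api.openaq.org/v2/measurements?${params}`)
  if (!res.ok) throw new Error(`OpenAQ request failed: ${res.status}`)
  const { results } = await res.json()

  // Write each measurement as a point, timestamped to the second
  const writeApi = influx.getWriteApi(process.env.INFLUXDB_ORG, process.env.INFLUXDB_OPENAQ_BUCKET, 's')
  for (const m of results) {
    writeApi.writePoint(
      new Point('openaq')
        .tag('location', m.location)
        .tag('parameter', m.parameter)
        .tag('unit', m.unit)
        .floatField('value', m.value)
        .timestamp(new Date(m.date.utc))
    )
  }
  await writeApi.close() // flushes any buffered points
}

ingestOpenAQ().catch((err) => { console.error(err); process.exit(1) })
```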

john-waczak commented 1 year ago

Suggestion for provisioning new bucket:

Update the Makefile's InfluxDB build step to include a call that uses docker exec to run influx bucket create inside the container. See this link
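Something along these lines, as a sketch only; the target name, container name, and variable names are placeholders and would need to match our compose setup:

```make
# Hypothetical Makefile target; container name, org, token, and variable names are placeholders.
provision-openaq-bucket:
	docker exec influxdb influx bucket create \
		--name $(INFLUXDB_OPENAQ_BUCKET) \
		--org $(INFLUXDB_ORG) \
		--token $(INFLUXDB_ADMIN_TOKEN) \
		--retention 0
```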

HudsonApel commented 1 year ago

The OpenAQ bucket still needs to be added. Also note that with the free plan you can only have two buckets (excluding the _monitoring and _tasks buckets).

john-waczak commented 1 year ago

I think that limit only exists for InfluxDB Cloud, i.e. their hosted platform. We shouldn't have an issue with the OSS version: https://community.influxdata.com/t/buckets-and-measurements-limit-number/25483

HudsonApel commented 1 year ago

That's good to know. I'm guessing that means the limit of 5 MB per 5 minutes doesn't apply either. I've made sure to stay well under that, but I still see occasional timed-out messages when writing new data, so that's still something to figure out.

Also, on the timestamps: we do parse the time and pass it in, and the parsed value is accurate, but on my machine InfluxDB usually seems to round it down to the nearest hour.

davidlary commented 1 year ago

Thanks for paying attention to the details. Usually best to capture to the second.

HudsonApel commented 1 year ago

Correction on the time capture: we figured out that InfluxDB is getting the right time info and showing it properly. What had confused me is that when looking at the data I had always set the Time Range in InfluxDB to the past 30 days; if you set the time range farther out, InfluxDB starts rounding the timestamps when it displays them in tables.
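For anyone else who runs into this: the Data Explorer's default query aggregates points into time windows, and the window grows with the selected range, so the table shows window boundaries rather than the raw timestamps we wrote. The UI effectively runs something like this (bucket and measurement names are placeholders):

```flux
// Default Data Explorer query shape: values get averaged into windows sized by the selected range
from(bucket: "openaq")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r._measurement == "openaq")
    |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
    |> yield(name: "mean")

// Dropping aggregateWindow (or toggling "View Raw Data") shows the exact timestamps that were written
```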

john-waczak commented 1 year ago

Gotcha, are y'all nearly ready to submit a PR then?

HudsonApel commented 1 year ago

I think by Friday or the end of the weekend we should be finished. We decided to implement filtering out data that hasn't been updated so that we can write to InfluxDB more efficiently, and we may also improve the queueing system a bit. We are also still working on adding the bucket by default.
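Rough shape of the filtering idea, just to show what we mean by skipping non-updated data; this is not the implementation going into the PR, and the field names are assumptions based on the OpenAQ v2 response:

```js
// Sketch: skip measurements that were already written by tracking the newest
// UTC timestamp seen per location. Field names (locationId, date.utc) are assumptions.
const lastSeen = new Map() // location id -> newest ISO timestamp already written

function filterNewMeasurements(results) {
  // Compare against what was already written before this batch
  const cutoff = new Map(lastSeen)
  const fresh = results.filter((m) => {
    const prev = cutoff.get(m.locationId)
    return !prev || new Date(m.date.utc) > new Date(prev)
  })

  // Advance the per-location high-water mark for the next poll
  for (const m of fresh) {
    const prev = lastSeen.get(m.locationId)
    if (!prev || new Date(m.date.utc) > new Date(prev)) {
      lastSeen.set(m.locationId, m.date.utc)
    }
  }
  return fresh // only these get written to InfluxDB
}
```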

john-waczak commented 1 year ago

Excellent progress!