openaq / openaq-upload

Batch uploader for OpenAQ
MIT License

Historical air quality data collecting, processing and hosting #21

Closed: ReaRuiRay closed this issue 4 years ago

ReaRuiRay commented 4 years ago

Hey team, thanks for your awesome work on OpenAQ.

This is Ruixiong (Ray) Zhang. As an active scientist working on atmospheric chemistry and air pollution, I would like to explore the opportunity and feasibility of collecting historical air quality data from around the globe from different research groups and agencies.

The idea is to initiate an openaq-historical project/repo under OpenAQ. I would like to share my thoughts on this idea and discuss them with you.

Data collecting: Most research groups have been scraping governmental data for a long time. I have good academic relationships with different research groups across the globe. I can take the lead in contacting different groups and asking whether they can share their data. In return, they would be compensated with acknowledgement, coauthorship, or other academic credit when users of the historical data publish research.

Data processing: Different data come in different flavors, and we already see this problem when processing realtime fetches. The data need to be processed to academic conventions before publishing. We can reach a preliminary consensus on the data format (supporting both the realtime fetching format and academically preferred formats). I can use my own PC to process all the data; it won't be a problem. Algorithms and raw data (if contributors grant permission) would be published.

Data storage: The historical data don't need to be served via an API, query interface, or database, since it is usually researchers who care about them and they would want the data in CSV or other file formats. The data may therefore only need to be hosted in an AWS S3 bucket. I roughly estimate the total processed air quality data around the globe at ~100 GB in raw CSV format and ~20 GB after compression, so the cost of hosting such data in an S3 bucket would be minimal. If the storage budget is a problem, I can organize with some research groups to fund it.
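
For a rough sense of scale, the storage estimate above can be turned into a back-of-the-envelope calculation. This is only a sketch: the per-GB price (`ASSUMED_USD_PER_GB_MONTH`) is an assumed placeholder rather than a quoted AWS rate, the function name is hypothetical, and request and data-transfer charges are ignored.

```python
# Back-of-the-envelope monthly storage cost for the archive sizes quoted above.
# ASSUMPTION: the price below is a placeholder, not an official AWS rate;
# substitute the current price for the chosen region and storage class.
ASSUMED_USD_PER_GB_MONTH = 0.023


def monthly_storage_cost(size_gb: float,
                         usd_per_gb_month: float = ASSUMED_USD_PER_GB_MONTH) -> float:
    """Return the storage-only cost per month (no request or transfer fees)."""
    return size_gb * usd_per_gb_month


raw_gb = 100.0         # estimated size as raw CSV (from the proposal)
compressed_gb = 20.0   # estimated size after compression (from the proposal)

compression_ratio = raw_gb / compressed_gb
raw_cost = monthly_storage_cost(raw_gb)
compressed_cost = monthly_storage_cost(compressed_gb)

print(f"compression ratio: {compression_ratio:.1f}x")
print(f"raw CSV:    ${raw_cost:.2f} per month")
print(f"compressed: ${compressed_cost:.2f} per month")
```

Under that assumed price, storage alone comes to a few US dollars per month at most; serving the data to many users could cost more than storing it.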

I believe the historical archive will attract more scientists' interest and lead to more fruitful research, which would bring greater public awareness to air pollution and the OpenAQ platform.

This issue is more like a proposal to initiate the project. Please share your thoughts/comments and let me know if this is an existing project. Tagging @RocketD0g @jflasher for transparency & more discussions.

RocketD0g commented 4 years ago

In general, building the tool @ReaRuiRay describes - and this repo more broadly - is very much in line with the OpenAQ goals we set out in 2019.

Giving researchers the ability to upload data is a goal, no question; the question is how to go about it sustainably, in line with our existing mission, and in a manner consistent with how we have been sharing data to date. I've put a couple of points out here for discussion. For the sake of discussion, I've treated the "historical" data @ReaRuiRay describes as more broadly fitting into the research data in our system, according to our existing data format.

Short version: I suggest keeping the same format already used on OpenAQ for research data and not hedging the research data off into S3 buckets exclusively so that the data are more fully accessible and findable alongside other sources, which is our main goal. This does mean there are some cost considerations, but perhaps not prohibitive. I'd be hesitant to post data on OpenAQ that is scraped from third parties from years prior for transparency and source attribution reasons (and possibly legal reasons).

Longer version

  1. Is it appropriate for OpenAQ to host data scraped by other parties where the code they used cannot be verified? I would lean toward no, for consistency with how we have approached other data sources in our system and for proper attribution and transparency. One of the reasons our platform is open-source is that a lot of the data-gathering is scraping, which is very, very error-prone - in terms of breaking but also capturing incorrect data. So if the data were scraped by a third party years ago from a government source in a way that can no longer be verified or checked (and possibly transformed from AQI, to boot) - or at least wasn't open-source in the first place - we'd want to think about whether this is appropriate. Taking an extreme case: someone could make up data that they claim they scraped from Country X. How would we be able to verify this? Who would have the bandwidth and purview to verify this? If someone fakes their own data and shares it with us directly, at least the data are attributed to them solely, but if someone scrapes data that is attributed to both them and, say, another country's government and shares it on the platform, that's a different story. There are probably also legal considerations and liabilities we'd be taking on as an organization by doing this, which we'd need to explore.

  2. On research data (see sourceType in data format here) only being accessible via S3 buckets: I think this could create some weirdness between the research data and the government data (and eventually the other data). For instance, if you upload some research data from the past year, it wouldn't show up in the API, which is a little weird. A lot of the utility of data in the OpenAQ system comes from having it all in a harmonized format and comparable to other sources in the same place. So when you query the system for data from a given time period, you get it all back for that time period, rather than finding it in different parts of the system depending on where you look. I'd be inclined to treat data that, say, a researcher collected and wanted to upload as research data as part of our general system, so that when we open up data access beyond 90 days (or even if we don't), these data are included. This does mean we'd have to factor in the cost of serving the data.

  3. On the data format for data tagged as research, I'd be mightily inclined at this moment to keep the same format as the other government data in the system (and the very tiny amount of research data already in there). If we have different formats in the system depending on whether the data are 'historically gathered' research data or data we accessed and ingested ourselves, this dilutes a lot of the utility of OpenAQ in the first place: having a harmonized data format. We have found that folks from different fields - even subfields - have vastly different and often very, very specific requirements or desires for the data and metadata. Adopting any one group's detailed data requirements often means cutting out a lot of sources that don't provide this data (or, conversely, ending up with a database with a lot of blank fields). If researchers have much more detailed data/metadata they'd like to share, they can always use the attribution url field in the metadata format to convey this, or provide additional information/links to data in our metadata editor.
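
To make points 2 and 3 concrete, here is a minimal sketch of what a single harmonized format buys you: one query over a time period returns government and research measurements together, and richer source-specific metadata is reached through a link instead of extra columns. The `Measurement` class, its field names, and `query_by_period` are hypothetical simplifications for illustration, not the actual OpenAQ schema or API, and the sample values are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Measurement:
    """Simplified, hypothetical record; NOT the actual OpenAQ data format."""
    location: str
    parameter: str
    value: float
    unit: str
    date_utc: datetime
    source_type: str            # e.g. "government" or "research"
    attribution_url: str = ""   # pointer to richer, source-specific metadata


def query_by_period(records, start, end):
    """Return every record with start <= date_utc < end, whatever its source type."""
    return [r for r in records if start <= r.date_utc < end]


# Made-up sample data: two government readings and one research reading.
records = [
    Measurement("Station A", "pm25", 12.0, "µg/m³",
                datetime(2019, 1, 1, 0, tzinfo=timezone.utc), "government"),
    Measurement("Campus Lab", "pm25", 15.5, "µg/m³",
                datetime(2019, 1, 1, 1, tzinfo=timezone.utc), "research",
                attribution_url="https://example.org/campus-lab-metadata"),
    Measurement("Station A", "pm25", 9.0, "µg/m³",
                datetime(2019, 2, 1, 0, tzinfo=timezone.utc), "government"),
]

january = query_by_period(
    records,
    datetime(2019, 1, 1, tzinfo=timezone.utc),
    datetime(2019, 2, 1, tzinfo=timezone.utc),
)
source_types = sorted({r.source_type for r in january})

for r in january:
    print(r.date_utc.isoformat(), r.source_type, r.location, r.value, r.unit)
```

Because both source types share the same fields, the January query needs no special handling for research data; if that data lived only in separate S3 files, the same question would require a second lookup in a different place.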

@ReaRuiRay - thanks again for starting this discussion; whether you think your concept will fit in as this discussion is fleshed out or not, adding in broad functionality for researchers to upload research data in our system is a really good and pertinent conversation for us to have right now. (And if you did decide to go a different way and do a flavor of this project separately, we will shout your project out to our network loudly.)

ReaRuiRay commented 4 years ago

Hey @RocketD0g , thanks for your detailed reply!

Indeed, there are lots of issues I haven't thought of. I totally agree that it would be extremely difficult to verify the collage of data and any misleading data may lead to potential problems. I need to think more about it and contact a few groups to dry run the idea.

My question is well addressed and I am closing this issue.

RocketD0g commented 4 years ago

Well...I don't know that I fully agree with me! :D

On one hand, it seems a major waste to completely disregard the data scraped by third parties in the past; on the other, I'm really not sure how to handle the verification/attribution piece appropriately....anyways, keep me posted on your thinking.

And like I said, no matter what, finding a way to upload research-type data is on our radar for the near term, and this discussion is helpful for us to think it through. And I'm still interested in @jflasher's (or anyone else's) thoughts.