openaq / openaq-fetch

A tool to collect data for OpenAQ platform.
MIT License

Iran (Tehran) - Data Sources #51

Open RocketD0g opened 8 years ago

RocketD0g commented 8 years ago

This is listed as high priority b/c Tehran, Iran is experiencing very bad AQ. Media outlets report that schools are shut or will be shutting down due to AQ. Plus, it is in a region where we currently have minimal to no coverage in our system.

Most useful info:

  - List of stations and coordinates: http://31.24.238.89/home/station.aspx
  - Hourly physical concentration data: http://31.24.238.89/home/DataArchive.aspx (also downloadable as CSV)

Other info:

  - General map (in AQI format): http://air.tehran.ir/Default.aspx?tabid=193
  - General map with hourly data (in AQI format): http://31.24.238.89/home/OnlineAQI.aspx

It appears they are using the US EPA scale. I assume this means they are calculating the AQI the same way the US EPA does.

olafveerman commented 8 years ago

The above commit scrapes this page: http://31.24.238.89/home/AQITable.aspx A long list of stations, but they only seem to report values at 11am local time.

We could get hourly data for a sub-set of the stations from: http://31.24.238.89/home/OnlineAQI.aspx

In the chart container, we can scrape the latest data from elements like:

<span id="ContentPlaceHolder1_OnlineDetailCO_lblCurrent" class="lblAQICurValue">45</span>

and the timestamp from:

<span id="ContentPlaceHolder1_lblCalculate" style="color:Green;">Update in 2/22/2016 at 16 o'clock</span>
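To make the scraping idea concrete, here is a minimal standalone sketch in Python that pulls the value and the timestamp out of the two sample elements above. It only uses the standard library. The class and function names are made up for illustration, and it assumes the label always follows the "Update in M/D/YYYY at H o'clock" pattern and that the hour is local time, neither of which is confirmed in this thread.

```python
import re
from datetime import datetime
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Collect the text of every <span>, keyed by its id attribute."""

    def __init__(self):
        super().__init__()
        self.texts = {}
        self._current_id = None

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._current_id = dict(attrs).get("id")

    def handle_data(self, data):
        if self._current_id is not None:
            self.texts[self._current_id] = self.texts.get(self._current_id, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._current_id = None


def parse_update_time(label):
    """Turn "Update in 2/22/2016 at 16 o'clock" into a naive datetime.

    Assumption: month/day/year order (the only order the sample fits)
    and an hour given in local time, with no timezone attached.
    """
    match = re.search(r"(\d{1,2})/(\d{1,2})/(\d{4}) at (\d{1,2})", label)
    if match is None:
        raise ValueError(f"unrecognized timestamp label: {label!r}")
    month, day, year, hour = map(int, match.groups())
    return datetime(year, month, day, hour)


# The two sample elements quoted in the comment above.
html = (
    '<span id="ContentPlaceHolder1_OnlineDetailCO_lblCurrent" class="lblAQICurValue">45</span>'
    '<span id="ContentPlaceHolder1_lblCalculate" style="color:Green;">'
    "Update in 2/22/2016 at 16 o'clock</span>"
)

parser = SpanTextParser()
parser.feed(html)
co_value = int(parser.texts["ContentPlaceHolder1_OnlineDetailCO_lblCurrent"])
updated_at = parse_update_time(parser.texts["ContentPlaceHolder1_lblCalculate"])
```

The parsed number is still whatever the page reports, so the open question about units applies to `co_value` as well.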

@RocketD0g Basically the option is between:

  1. 36 stations reporting once a day. 20 of them have coordinates
  2. 20 stations reporting every hour. all of them with coordinates

Option 2 would be better I guess.

@jflasher What if we scrape both? Given two measurements with the same timestamp, param and location, will the system pick the one with the lowest averagingPeriod, or simply store the one that's reported first?
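To spell out the two behaviours the question distinguishes, here is a small Python sketch. It is not how the platform actually resolves duplicates (that is exactly what is being asked); the dictionary layout, the `source` tags, the station name, and the plain-number `averagingPeriod` in hours are all simplifications made up for illustration.

```python
def measurement_key(m):
    """Two measurements collide when location, parameter and timestamp all match."""
    return (m["location"], m["parameter"], m["timestamp"])


def keep_first(measurements):
    """Policy A: store whichever measurement for a key is reported first."""
    kept = {}
    for m in measurements:
        kept.setdefault(measurement_key(m), m)
    return list(kept.values())


def keep_shortest_averaging(measurements):
    """Policy B: for each key, keep the measurement with the lowest averagingPeriod."""
    kept = {}
    for m in measurements:
        key = measurement_key(m)
        if key not in kept or m["averagingPeriod"] < kept[key]["averagingPeriod"]:
            kept[key] = m
    return list(kept.values())


# Hypothetical collision: the daily table and the hourly page report the
# same station, parameter and timestamp.
measurements = [
    {"location": "Station A", "parameter": "co", "timestamp": "2016-02-22T11:00",
     "averagingPeriod": 24, "source": "daily-table"},
    {"location": "Station A", "parameter": "co", "timestamp": "2016-02-22T11:00",
     "averagingPeriod": 1, "source": "hourly-page"},
]

first = keep_first(measurements)
shortest = keep_shortest_averaging(measurements)
```

Under policy A the result depends on scrape order; under policy B the hourly value wins regardless of order.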

olafveerman commented 8 years ago

Before re-writing the scraper, we should figure out what units they're reporting in. My fear is that they're already converting the values to some sort of scale:

image

Couldn't find anything on the English site around methodology. @scisco Can we go through the Iranian version and see if we can find out more about the units?

jflasher commented 8 years ago

A bit more context for @scisco if he can take a look. We're only storing physical units, so things like µg/m³ or ppm. Another way to present info is using a unitless Air Quality Index. If that's what's being presented, we may not be able to store it.

RocketD0g commented 8 years ago

I think on the 'Data Archive' page they are storing physical values hourly for some locations. I'm inferring this only from the scale of the values shown (especially for PM2.5) and the fact they are using non-integers for reporting some values (e.g. CO); I don't see any units.

http://31.24.238.89/home/DataArchive.aspx

For a specific example, go to the Data Archive above and see: 'Pirozi' on 2/21.

olafveerman commented 8 years ago

The non-integers may stem from the fact that they're averaging the values on that page. If you select a date range that spans multiple days, they still print 1 hourly average. They also don't provide timestamps on that page.

RocketD0g commented 8 years ago

Yeah, good point.

If we can confirm that they are using the US EPA methodology for calculating AQI (it appears to be so, given they are using the precise same scale, breakpoints and categorizations), we could contemplate back-calculating the hourly data to get to the physical values. I think we're going to eventually need to do this if we want to get some other large datasets (e.g. China and some other SE Asia spots). If we go that route, we'll need to be able to indicate in our data format somewhere that we're back-calculating.

scisco commented 8 years ago

@olafveerman The numbers in the charts you posted are index numbers, which I guess means they don't have a unit. There is a map on the Farsi homepage that says what the index means:

screenshot 2016-02-23 06 12 07

Anything from 0 to 50 is clean, 51-100 is still healthy, 101-150 is unhealthy for the young & elderly, and so on.

Unfortunately, it seems that they only show the index, not the actual readings, for all the stations across the city. From a quick look (not thorough research), I couldn't find any mention of how they convert the actual readings to this index.

I'll try to look around and see if we can get the actual readings somehow and let you know.

scisco commented 8 years ago

This is the unit of the numbers: https://en.wikipedia.org/wiki/Air_quality_index

screenshot

olafveerman commented 8 years ago

Thanks @scisco. I opened a separate issue to discuss back-calculating AQI.

RocketD0g commented 8 years ago

As a GitHub novice, not sure if there's a way to reference one comment in an issue, but here is a section of #79 relevant to this issue:


I think we'll eventually need to back-calculate to get some data sources, but here's an issue specific to non-US places using the US EPA method and us attempting to back calculate:

- In the calc referenced above by olafveerman, Ip is essentially a function of Cp, so the main issue (in addition to confirming they are using that method) will be to determine which time average a given place has decided to use for Cp.
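The calc from #79 is not reproduced in this thread, so the sketch below assumes it is the standard US EPA piecewise-linear equation, Ip = (IHi - ILo) / (BPHi - BPLo) * (Cp - BPLo) + ILo, and inverts it within a breakpoint band. The table is the US EPA PM2.5 24-hour breakpoint set that was in effect when this thread was written; EPA has revised it since, so check the current table before relying on it. Function names are illustrative only.

```python
import math

# (BP_lo, BP_hi, I_lo, I_hi) for PM2.5 in µg/m³, 24-hour average.
# Assumption: US EPA breakpoints from the 2012 revision (since updated).
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def concentration_to_aqi(cp, table=PM25_BREAKPOINTS):
    """Forward calculation: concentration Cp -> integer index Ip."""
    # Truncate to 0.1 so values such as 12.05 do not fall between bands;
    # the tiny epsilon guards against floating-point representation error.
    cp = math.floor(cp * 10 + 1e-9) / 10
    for bp_lo, bp_hi, i_lo, i_hi in table:
        if bp_lo <= cp <= bp_hi:
            return round((i_hi - i_lo) / (bp_hi - bp_lo) * (cp - bp_lo) + i_lo)
    raise ValueError(f"concentration {cp} is outside the breakpoint table")


def aqi_to_concentration(aqi, table=PM25_BREAKPOINTS):
    """Back-calculation: index Ip -> concentration Cp within the matching band."""
    for bp_lo, bp_hi, i_lo, i_hi in table:
        if i_lo <= aqi <= i_hi:
            return (aqi - i_lo) * (bp_hi - bp_lo) / (i_hi - i_lo) + bp_lo
    raise ValueError(f"index {aqi} is outside the breakpoint table")
```

Because the published index is rounded to an integer, the back-calculated value is only an approximation of the original concentration, and it says nothing about which averaging period produced Cp.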

US EPA says, "The AQI reported for ground-level ozone and fine particles (PM2.5) is based on an average of hourly data. For ozone, the AQI is based on the maximum observed 8-hour average from midnight to midnight. For PM2.5, the AQI is simply the 24-hour average. For AQI values reported in real-time, before a full day's data are available, the AQI is based on a surrogate calculation*." Source: https://docs.airnowapi.org/aq101
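As a rough illustration of the two averages named in the EPA quote, here is a Python sketch that computes them from one day of hourly values. It is a simplification: it only considers 8-hour windows that lie fully inside the midnight-to-midnight day and ignores any data-completeness rules. Function names are illustrative.

```python
def daily_mean_24h(hourly):
    """24-hour average of one midnight-to-midnight day of hourly values."""
    if len(hourly) != 24:
        raise ValueError("expected exactly 24 hourly values")
    return sum(hourly) / 24


def max_8h_average(hourly):
    """Highest mean over any 8 consecutive hours within the day.

    Simplification: windows that would run past midnight are not considered.
    """
    if len(hourly) != 24:
        raise ValueError("expected exactly 24 hourly values")
    windows = (hourly[start:start + 8] for start in range(len(hourly) - 7))
    return max(sum(window) / 8 for window in windows)


# A day where the value simply rises by 1 each hour (0, 1, ..., 23).
rising_day = list(range(24))
pm25_basis = daily_mean_24h(rising_day)    # 11.5
ozone_basis = max_8h_average(rising_day)   # 19.5, the mean of hours 16..23
```

The same hourly series gives different inputs to the index depending on which average is used, which is why the time average behind Cp matters for any back-calculation.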

My sense is many places will plug in the Cp for whatever time resolution they measured, not necessarily following the straight US method. But it's so hard to say if they don't explicitly say on their website. I guess I'd hold off on building this back calculating tool until we have a concrete place to apply it.

Looking back at the Iran data** in #51 and playing around with their site some more, I get different daily average values depending on where I look: the map of reporting stations viewed one way versus another, the 'daily' archive, and the daily value listed for a station in the AQI archive. Therefore, I'm not even sure what we would use as the daily AQI for a given pollutant, let alone whether we can assume they are using the formula above when back-calculating from it.

For back calculating, if we were to build a tool, I'd say we should go for the Chinese AQI back calculation, as most of the websites reporting AQ in China report only the AQI (that I have seen), and we will know we are back-calculating correctly since we will be using the Chinese AQI algorithm on Chinese data. I will need to track down their AQI but I believe it is publicly available somewhere and will make an issue/assign myself to it.

*This is the surrogate calculation referenced by EPA: (http://www3.epa.gov/airnow/ani/pm25_aqi_reporting_nowcast_overview.pdf), which essentially takes a 3-12 hour average of PM2.5 data, depending on how much the data is fluctuating. Given the weighting system, and that missing data for a given hour would be hard to spot, I think it is mathematically impossible to back-calculate to physical concentrations for a given hour, even if you know the last 12 hours of reported NowCast AQI. I thought it would be possible, but I could never get it to pan out mathematically; maybe I made a mistake while playing around with the NowCast equation in that document. At any rate, on the bright side, my sense is that most places don't adopt the NowCast formula for their real-time data, so you can likely back-calculate using an hourly interval.
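For reference, here is a minimal Python sketch of the forward NowCast calculation for PM as I understand it from the EPA overview: a weighted average over up to 12 hours, where the weight factor is the ratio of the minimum to the maximum concentration in the window, floored at 0.5. Treat it as a sketch under those assumptions, not a verified implementation; the function name is illustrative.

```python
def nowcast_pm(hourly_recent_first):
    """NowCast concentration from up to 12 hourly values, most recent first.

    None marks a missing hour. Returns None when fewer than two of the
    three most recent hours are available.
    """
    window = hourly_recent_first[:12]
    if sum(value is not None for value in window[:3]) < 2:
        return None

    valid = [value for value in window if value is not None]
    c_max, c_min = max(valid), min(valid)
    if c_max == 0:
        return 0.0

    # Stable data -> weight near 1 (long average); volatile data -> weight 0.5
    # (recent hours dominate, effectively about a 3-hour average).
    weight = max(1 - (c_max - c_min) / c_max, 0.5)

    numerator = sum(weight ** i * value for i, value in enumerate(window) if value is not None)
    denominator = sum(weight ** i for i, value in enumerate(window) if value is not None)
    return numerator / denominator


steady = nowcast_pm([10.0] * 12)                    # weight 1, plain average
falling = nowcast_pm([40.0, 20.0] + [None] * 10)    # weight 0.5, two hours only
```

A steady series at the value of `falling` would yield the same NowCast as the two-hour series, so a single reported value does not identify the hourly concentrations behind it.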

**I actually do think that the 'data archive' (http://31.24.238.89/home/DataArchive.aspx) is likely physical concentrations, given the titles of the other sections and playing around with the numbers over multiple days. But there's no timestamp and no way to verify this (trying to re-calculate the AQI reported on other parts of the site from these numbers comes out close, but not exact).

scisco commented 8 years ago

We can also contact the people in charge of the site and ask them about how they are calculating the AQI or if there is a way to get the actual numbers.

RocketD0g commented 8 years ago

Absolutely! I think in general that's the #1 route to go when possible. I haven't seen a way to do that from the site (at least in the English version). Do you, @scisco? They reference a company that maybe we could contact: http://sdra.co.ir/

majesticio commented 1 year ago

Reopening with new source, data, table same problems: