openaq / openaq-quality-checks


OpenAQ Community: What is your OpenAQ data quality experience? #2

Open abarciauskas-bgse opened 6 years ago

abarciauskas-bgse commented 6 years ago

We would love your input!

As part of the quality check design process (outlined in the first issue), we are seeking feedback from the OpenAQ community.

Please provide answers to any of the following questions by adding a comment to this issue:

  1. What is your use case for OpenAQ data?
  2. Have you come across data quality issues? What types of issues and do you have examples?
  3. What types of data are being filtered out in your custom scripts?
  4. What type of functionality would be helpful in a tool for providing quality data?


RocketD0g commented 6 years ago

I will kick it off with a few thoughts from previous discussions. (This is so exciting.)

1. Data issues we have noticed that would be good to flag:

a. Negative data values (these could appear for many reasons, e.g. because the real value is low and within the instrument's margin of error, or because of a calibration issue).

Example for a: [screenshot, 2018-02-06: negative values reported for the ARB OER location]

API call to see this example (you may need to specify the specific date shown in the screenshot if accessing this later): https://api.openaq.org/v1/latest?location=ARB%20OER

b. Specifically, '-999' being reported to signify an instrument is not reporting data.

Example for b: [screenshot, 2018-02-06: -999 values reported for US Diplomatic Post: Addis Ababa School]

API call to see this example: https://api.openaq.org/v1/measurements?location=US%20Diplomatic%20Post:%20Addis%20Ababa%20School&date_from=2018-02-02&date_to=2018-02-06

c. Zeros that may not be true zero values but rather indicate non-reporting instruments - hard to know, but especially suspect when they repeat or appear at regular intervals.

This particular case was reported by a user.

Example for c: [screenshot, 2018-02-06: repeated zero values at Sector16A, Faridabad - HSPCB]

API call to see this example: https://api.openaq.org/v1/measurements?location=Sector16A,%20Faridabad%20-%20HSPCB&date_from=2018-02-01&date_to=2018-02-03&parameter=pm25&limit=3500

QUESTION: What threshold should we use for flagging a zero? Only when the neighboring data points in time are above a certain threshold? All zeros?

d. Repeating positive or zero values, because an instrument is 'stuck' reporting the same value over long intervals (e.g. over several hours or even days and weeks).

Example for d: [screenshot, 2018-02-06: repeated identical values at Maninagar, Ahmedabad - GSPCB]

API call to see this example: https://api.openaq.org/v1/measurements?location=Maninagar,%20Ahmedabad%20-%20GSPCB&date_from=2018-02-02&date_to=2018-02-06&parameter=pm25

QUESTION: What should the threshold be for flagging the value? After 3 repeats? (Could we make it an adjustable parameter set by the user, with a default of x?) A rough sketch of checks a-d appears at the end of this comment.

2. It'd be cool to be able to specify a time interval at the country, city, or location (i.e. station) level (e.g. all measurements in Peru between 30-12-2017 and 15-01-2018), have checks run to flag suspicious measurements, and have the output data returned in both json and csv.

3. It'd be cool to get data output:

4. Likely outside the scope of this initial project, it'd be cool if users could specify removing data points that are x-times a certain standard deviation in measurements taken from a single location (station) over a time interval.

Tagging a few folks who have worked a lot with data from the platform and who may have noticed other data quality issues discernible from a general level (e.g. not requiring location-specific meta data or other knowledge): @dolugen @maelle @dhhagan @jflasher
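A minimal sketch, in Python, of what per-measurement checks for a-d above could look like (the sentinel list, the flag-every-zero rule, the default repeat threshold, and the function name are all placeholders for the decisions still being discussed here):

```python
# Sketch of per-measurement flags for cases a-d above.
# All sentinels and thresholds are placeholders, not decisions.
SENTINELS = {-999, -9999, -999999}   # assumed "no data" codes (case b)

def flag_measurements(values, repeat_threshold=3):
    """Return one set of flag names per value in `values` (ordered in time)."""
    flags = [set() for _ in values]
    for i, v in enumerate(values):
        if v in SENTINELS:
            flags[i].add("sentinel")      # case b
        elif v < 0:
            flags[i].add("negative")      # case a
        elif v == 0:
            flags[i].add("zero")          # case c (open question above)
    # case d: runs of identical values ("stuck" instrument)
    run_start = 0
    for i in range(1, len(values) + 1):
        if i == len(values) or values[i] != values[run_start]:
            if i - run_start >= repeat_threshold:
                for j in range(run_start, i):
                    flags[j].add("repeat")
            run_start = i
    return flags

print(flag_measurements([12.0, -999, 0, 0, 0, 15.2]))
# e.g. [set(), {'sentinel'}, {'zero', 'repeat'}, {'zero', 'repeat'}, {'zero', 'repeat'}, set()]
```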

dhhagan commented 6 years ago

@RocketD0g Agree with most of these. In addition, it may be nice if there are instrument-level flags for stations that are known to be problematic (there are at least a couple in Delhi that have errors >= 3 orders of magnitude). Number four above is nice, but may be better left to the data analysis part (off server, or off-serverless now, I guess) - it may just be easier/better to show users how to do this via R/python/js.

Re: number two above - what is the purpose of returning the data both in json and csv? Must have missed it...

As for the implementation of the status flags, I've always stored flags as integers and used bit math to handle more complicated/multi-flag scenarios. For example, if a negative value flag is 0b01 and a stagnant value flag is 0b10, then a station could simultaneously be both negative and stagnant, which would result in 0b01 | 0b10 = 0b11 (i.e. 1 | 2 = 3). This way, you're only storing one integer in the database. On the output side of things, you could either serve it to the customer this way and let them decode it (does it have a negative value flag? flag_value & 0b01) or you could decode it and send a string with a description.
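A rough illustration of that scheme, assuming Python and made-up flag names (`enum.IntFlag` just does the bit bookkeeping described above):

```python
from enum import IntFlag

class QCFlag(IntFlag):      # hypothetical flag names
    NEGATIVE = 0b001
    STAGNANT = 0b010
    SENTINEL = 0b100

# a measurement that is both negative and stagnant
stored = int(QCFlag.NEGATIVE | QCFlag.STAGNANT)   # -> 3, the single integer kept in the database

# decoding on the way out: bitwise AND tells you whether a given flag is set
has_negative = bool(stored & QCFlag.NEGATIVE)     # True
has_sentinel = bool(stored & QCFlag.SENTINEL)     # False
print(stored, has_negative, has_sentinel)
```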

RocketD0g commented 6 years ago

Thanks, @dhhagan - appreciate all of these thoughts.

Question on the idea of station-level flags for locations/instruments known to have shown problematic output or otherwise not trusted: do you (or others reading this) have ideas on how to designate "problematic" stations on an ongoing basis or across the board? What sorts of thresholds should be in place to designate a spot "problematic"? E.g. we could flag anything anyone reports as being suspicious (perhaps a code for "user reports potential issue") - but of course, if an instrument is replaced or the problem corrected, we won't know.

I'm guessing we won't have the bandwidth to investigate every station that someone reports, to verify whether it truly is acting up - at first blush, the best we could do is note that a user reported a potential issue. Finding a systematic way to have a station flagged - and also a way for information to flow back to us so we know it can be unflagged - seems tricky to do well across the board. I will think on it more because it's an important point; I just don't know how to handle it yet.

For number two - it was just about keeping the same data format outputs that the API currently provides.

EbenCross commented 6 years ago

A few thoughts: (1) I think we should focus on developing flags on a per measurement (rather than per station) basis. That way we can justify 'trustworthiness' on the basis of what is known/expected from a specific technique and avoid some of the negative political consequences of black-listing specific sites altogether.

(2) Repeat-data flags are a good idea; I like the adjustable parameter here.

(3) Zero points are only problematic if exactly 0.00000; otherwise they are likely just low values. This gets back into understanding the limits of detection for the specific kit.

(4) There is potentially useful AQ context in the different negative values (-999 or -9999 or -999999) in the raw output data, so it's definitely worth retaining access to these values.

(5) Other than -999 or 0 data, there's often straight-up missing data (data not found at a given interval). The fact that the data is simply removed leaves some open questions, including: how (if at all) does the missing data impact the resultant 'average' values reported? What are the start and stop times for each reported average, and what fraction of this time was removed? In this context 'missing' data should be flagged, but not necessarily as -999.
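One way to surface that coverage question is to report, per averaging window, what fraction of the expected measurements is actually present. A small sketch, assuming hourly reporting and a pandas time series (neither of which is specified above), with sentinels already converted to NaN:

```python
import numpy as np
import pandas as pd

def daily_coverage(series):
    """Fraction of the 24 expected hourly measurements present on each day.

    `series` is a pandas Series of measurements indexed by timestamp;
    sentinel values (e.g. -999) should already be converted to NaN.
    """
    hourly = series.resample("1H").mean()           # regular hourly grid; gaps become NaN
    return hourly.notna().resample("1D").sum() / 24

# toy example: only 2 of 24 hours actually reported on 2018-02-02
idx = pd.to_datetime(["2018-02-02 01:00", "2018-02-02 05:00", "2018-02-02 13:00"])
print(daily_coverage(pd.Series([12.0, np.nan, 15.0], index=idx)))
```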

abarciauskas-bgse commented 6 years ago

Thank you @RocketD0g @dhhagan and @EbenCross for sharing all this great brainstorming! Looking forward to incorporating these ideas into the tool specifications.

nickolasclarke commented 6 years ago

Some thoughts:

@dhhagan's idea of using bit math to store a single value is quite good, but decoding it and providing more explicit values in the returned json would be preferred.
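For the "more explicit values in the returned json" part, the decode step could be as small as this sketch (flag names and bit positions are placeholders):

```python
# placeholder flag names/bits, mirroring the bit-math scheme above
FLAG_NAMES = {0b001: "negative", 0b010: "stagnant", 0b100: "sentinel"}

def decode_flags(stored: int) -> list:
    """Turn the stored flag integer into explicit strings for the JSON response."""
    return [name for bit, name in FLAG_NAMES.items() if stored & bit]

print(decode_flags(3))   # ['negative', 'stagnant']
```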

In prior work we've attached weights to values based on how the actual reporting delta compares to the expected reporting delta, to help produce more accurate averages. That may be a good way to help expose more accurate averaging information if data is flagged through this tool.
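One possible reading of that weighting, sketched with made-up names and a placeholder weighting function (down-weight values whose gap to the previous measurement is longer than the expected reporting interval):

```python
import numpy as np

def delta_weighted_average(timestamps, values, expected_delta_s=3600):
    """Average `values`, down-weighting points that arrive later than the
    expected reporting interval (in seconds). The weighting function is an
    assumption for illustration, not the scheme described above."""
    t = np.asarray(timestamps, dtype="datetime64[s]").astype("int64")
    v = np.asarray(values, dtype=float)
    deltas = np.diff(t, prepend=t[0] - expected_delta_s)               # first point gets the expected gap
    weights = expected_delta_s / np.maximum(deltas, expected_delta_s)  # 1.0 when on time, <1.0 when late
    return np.average(v, weights=weights)

print(delta_weighted_average(
    ["2018-02-02T00:00", "2018-02-02T01:00", "2018-02-02T05:00"],
    [10.0, 12.0, 40.0]))
```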

Building on @RocketD0g's 4th point: agreed that this is probably outside the scope of the initial cut of this tool. However, we use OpenAQ data and similar sources to get broad ideas of air quality in a geographical region, say an entire city or district. It would be interesting if we could devise a proper way to filter out data from individual stations when they are N x standard deviation from their own historical data, as described, or when they exceed a similar deviation from nearby stations. Even if the data is valid (say someone is smoking near the station), it is not valid for our purposes when we are trying to generalize. I'd rather exclude the data than give individuals a false impression of the air quality.
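A hedged sketch of the per-station half of that filter (N, the column names, and the "compare against the whole supplied period" choice are all assumptions; the nearby-stations comparison would need geo-grouping on top of this):

```python
import pandas as pd

def filter_station_outliers(df, n_std=2.0):
    """Drop rows whose value is more than `n_std` standard deviations away
    from that station's own mean over the supplied period.

    `df` is assumed to have 'location' and 'value' columns, roughly like the
    OpenAQ measurements output."""
    stats = df.groupby("location")["value"].agg(["mean", "std"])
    merged = df.join(stats, on="location")
    keep = (merged["value"] - merged["mean"]).abs() <= n_std * merged["std"].fillna(0)
    return df[keep]

example = pd.DataFrame({
    "location": ["A"] * 11,
    "value": [10, 11, 9, 10, 12, 9, 11, 10, 10, 11, 500],  # 500: the "someone smoking next to the sensor" case
})
print(filter_station_outliers(example))   # the 500 row is dropped
```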

Otherwise, you've all hit on what I'd like to see and more!

RocketD0g commented 6 years ago

Another item to flag:

To get around stations that have had multiple names over time, we could group stations by location (geo coordinates) so that they all inherit the same canonical name as a new field or something.

This is based on an issue someone from AER brought up, shown below, regarding the same station having different names at different times:

[screenshot, 2018-02-14: measurements from the same Delhi stations appearing under multiple names, e.g. "Mandir Marg" vs "Mandir Marg Delhi- DPCC" and "R K Puram" vs a renamed variant]

To note: those locations with slightly different station names are, in truth, the same locations/stations with the same coordinates. We pull the names directly from the source, so how the source decides to name (or later rename) a station dictates how it appears in the system. For instance, in this case, the underlying source renamed or otherwise re-input the station name from "Mandir Marg" to "Mandir Marg Delhi- DPCC" at some point. Same thing with R K Puram. Also to flag: there should be no overlapping data points, e.g. the same actual measurements + timestamp for the same location but under different names.
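A sketch of that coordinate-based grouping (column names, coordinates, and the "canonical name = most recent name" choice are all assumptions for illustration):

```python
import pandas as pd

# toy rows standing in for the Mandir Marg case above; values are made up
df = pd.DataFrame({
    "location": ["Mandir Marg", "Mandir Marg Delhi- DPCC"],
    "latitude": [28.6364, 28.6364],
    "longitude": [77.2011, 77.2011],
    "date_utc": pd.to_datetime(["2017-06-01", "2018-02-01"]),
    "value": [80.0, 95.0],
})

# group by (rounded) coordinates and attach the most recent name as a new field
df["lat_r"] = df["latitude"].round(4)
df["lon_r"] = df["longitude"].round(4)
df = df.sort_values("date_utc")
df["canonical_location"] = df.groupby(["lat_r", "lon_r"])["location"].transform("last")
print(df[["location", "canonical_location"]])
```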

RocketD0g commented 6 years ago

Specifically to @EbenCross's point quoted here from above (and after hearing him talk about it, so I get it better):

(1) I think we should focus on developing flags on a per measurement (rather than per station) basis. That way we can justify 'trustworthiness' on the basis of what is known/expected from a specific technique and avoid some of the negative political consequences of black-listing specific sites altogether.

I really like his idea of being able to look at past station-level data over some time period (perhaps user-set?) to report statistics on how often a station reports no data (e.g. NaN). It could point to stations prone to data-reporting issues and, if looked at across an entire source, to sources with particular data-reporting issues too.

I wonder if this could be expanded so that a user could specify a particular value (e.g. NaN, 0, -999) or set of values to get back such a report.
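A sketch of that kind of report (the default suspect values and the column names are assumptions):

```python
import numpy as np
import pandas as pd

def value_report(df, suspect_values=(np.nan, 0, -999)):
    """For each location, report the fraction of measurements equal to each
    user-specified suspect value. `df` is assumed to have 'location' and
    'value' columns."""
    out = {}
    for sv in suspect_values:
        is_nan = isinstance(sv, float) and np.isnan(sv)
        mask = df["value"].isna() if is_nan else df["value"].eq(sv)
        out[f"frac_{sv}"] = mask.groupby(df["location"]).mean()
    return pd.DataFrame(out)

example = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "value": [np.nan, 0, 12.5, -999, -999],
})
print(value_report(example))
```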

nsuberi commented 6 years ago

Hi all! While working for the Resource Watch project (https://resourcewatch.org/data/explore) I wrote some functions to address @RocketD0g's point 4 above:

  4. Likely outside the scope of this initial project, it'd be cool if users could specify removing data points that are x-times a certain standard deviation in measurements taken from a single location (station) over a time interval.

https://github.com/nsuberi/ResourceWatchCode/blob/master/Notebooks%20for%20Exploring%20Specific%20Datasets/OpenAQ%20Windowing%20For%20Anomalies.ipynb

Best regards, ~ Nathan

jflasher commented 6 years ago

Very cool, thanks @nsuberi!

maschu09 commented 6 years ago

Great discussion. We have also been thinking about auto-QA a bit and came up with a plan to check measurement quality (it's indeed per timeseries and not per station) in 4 steps:

  1. remove absolute crap: this will be a simple range test with very relaxed limits; for example, temperature (in K) could be tested against the range 180 to 350. Such a test will catch (some) unit errors (temperature listed in degrees C, etc.) or other substantial failures.
  2. simple statistical tests for outliers, negative values, constant-value episodes (and possibly more) - there are plenty of such tests available in the statistical literature (a rough sketch of steps 1-2 follows this list)
  3. consistency tests: these would look at how well new data fits to old data from the same place (similarity of frequency distribution, seasonal cycles, etc.) - sometimes hard to know if data are wrong if they don't match "reasonably well", but at least a hint; this type of test could also involve multi-species correlations. For example, if CO and PM correlate with a slope of 2 in one year, it is very unlikely that this slope should be 1 or 4 in another year. Defining such tests will require more analysis of such correlations, but it should be possible to define a (large) set of conditions under which certain ratios or slopes apply.
  4. regional similarity tests: check similarity of timeseries from nearby sites. Again, this will require more analysis to find conditions under which similarity should occur, but in a simple setup one could look for example at the frequency distribution of differences between two sites and then test if these distributions change over time.
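A minimal sketch of what steps 1 and 2 could look like (the range limits and the particular outlier test here are placeholders, not the ones being developed):

```python
import numpy as np

# step 1: very relaxed physical range limits per parameter (placeholder values)
RANGE_LIMITS = {"temperature_k": (180.0, 350.0), "pm25": (0.0, 2000.0)}

def range_test(values, parameter):
    """Flag 'absolute crap': anything outside the relaxed physical range."""
    lo, hi = RANGE_LIMITS[parameter]
    values = np.asarray(values, dtype=float)
    return (values < lo) | (values > hi)

def simple_outlier_test(values, n_std=4.0):
    """Step 2, one of many possible tests: flag points far from the series mean."""
    values = np.asarray(values, dtype=float)
    return np.abs(values - np.nanmean(values)) > n_std * np.nanstd(values)

temps = [293.0, 295.0, 22.0, 291.0]   # 22.0 looks like degrees C slipped into a Kelvin series
print(range_test(temps, "temperature_k"))   # [False False  True False]
```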

A postdoc in my group has begun writing some Python code for this - we will share it when it works ;-)