maelle / openaq_figures

:mask: some figures about OpenAQ growth :mask:

How to describe growth of the platform? #1

Open maelle opened 8 years ago

maelle commented 8 years ago

@RocketD0g I'm copy-pasting things from yesterday evening's email. All the figures are here: https://github.com/masalmon/openaq_figures/tree/master/code_files/figure-html. Not sure it's a good way to share them; next time I'll try to be smarter.

Regarding the growth of the platform, I think the graph with the number of locations over time on a log scale is the best one. In the legend one could say that, for now, nearly every leap corresponds to the addition of the sources of a country. However, the figure does not show well how different the level of detail is for each country: the US is much better represented than, say, India.

RocketD0g commented 8 years ago

These are fantastic! 💯 ! I like the same one best as you do. I think it might be nice to show it on a log-lin scale like this: https://en.wikipedia.org/wiki/Semi-log_plot#/media/File:LogLinScale.svg

I think it might be easier/more intuitive for non-scientists to interpret immediately with that sort of scale (which is secondary to the purpose of a scientific paper, I know, but might be nice for others viewing, including policy folks).

A similar idea is showing the number of measurements over time. Maybe it could be overlaid on the same plot (since with a log scale on y, it should not dwarf the locations-vs-time plot)?

cc: @jflasher, in case you're interested in checking these out!

maelle commented 8 years ago

@jflasher, is there an easy way for me to get the number of measurements per day since the start of the platform? I guess querying all measurements via the measurements endpoint is not the right idea :laughing:

maelle commented 8 years ago

Note to myself: use scale_y_log10() in ggplot2 for getting the log-lin scale.
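Something like this (a sketch, with hypothetical data frame and column names):

```r
library(ggplot2)

ggplot(locations_over_time, aes(x = date, y = n_locations)) +
  geom_line() +
  scale_y_log10() +  # log y-axis + linear x-axis = log-lin scale
  labs(x = "Date", y = "Number of locations (log scale)")
```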

maelle commented 8 years ago

@RocketD0g I wouldn't plot them on the same plot, but it could be a plot with two horizontal panels so that they are aligned on the x-axis, as in https://learnr.files.wordpress.com/2009/05/3_time_panel3.png?w=600 (see the sketch below).
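For the record, a ggplot2 sketch of that layout, assuming a long-format data frame `growth_long` with hypothetical `date`, `value`, and `series` columns:

```r
library(ggplot2)

# Two vertically stacked panels sharing the x-axis, one per series
ggplot(growth_long, aes(x = date, y = value)) +
  geom_line() +
  facet_grid(series ~ ., scales = "free_y")
```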

jflasher commented 8 years ago

@masalmon Easiest thing is for me to get these from the database directly for you. I was thinking this would be a cool view to have as well. Will try and get them to you soon.

maelle commented 8 years ago

I expect the number of measurements = f(t) to be several linear segments with an ever-increasing slope. I'm looking forward to seeing how it looks.

@jflasher I guess a good aggregation period would be daily counts?

jflasher commented 8 years ago

Yep, I think daily would be good. Data is here: https://www.dropbox.com/s/g8jeiu3footz3pl/openaq_daily_numbers.csv?dl=0. You can also see when the platform has had issues. :(

maelle commented 8 years ago

Thanks, I'll look at it later.

With counts added per hour or minute (the new measurements, not the cumulative sum), one could get a nice graph with "heartbeats"... including the heart attacks.

jflasher commented 8 years ago

Hourly can be found at https://www.dropbox.com/s/qzua981ir0jml9u/openaq_hourly_numbers.csv?dl=0. It looks roughly similar, just less smooth; I don't think we can go any finer than hourly.

maelle commented 8 years ago

(By station one expects a heartbeat with the same frequency over time -> you could actually have a control board: http://blog.jupo.org/static/img/metrics2-large.png -- nice deco for a living room, right?)

maelle commented 8 years ago

I've made this new candidate (it's also in the figures folder):

[Figure: growth]

@RocketD0g Now I actually think it looks better on a "normal" scale (I tested the log scale as well and didn't like it that much).

@jflasher actually, if you have a log of the platform's good/bad days, I could add orange segments to the plot where there were issues. I can see a flat segment, but that's the only one I spot at first glance.

jflasher commented 8 years ago

@masalmon if you look at the counts added per hour or day, you'll see when it drops very low or to 0; that'll be the time when the system had problems fetching new data. It's more obvious when not looking at it cumulatively. Also, I think we might want to change "observations" to "measurements" to match the platform.

maelle commented 8 years ago

@jflasher oh, my bad, I did want to change "observations" because I realized it was not the right word. I had switched to it at one point because "measurements" was too long, and then I got better at formatting, hehe.

Would you want me to add something like blue = good day, orange = day with technical difficulties?

RocketD0g commented 8 years ago

I like this plot and the way it's laid out a lot. I like the blue/orange segments idea as well. Knowing what % of the time it is blue vs. orange might be useful to quote in the paper.

maelle commented 8 years ago

Is a day "with technical difficulties" if there is at least one hour with very low counts?

RocketD0g commented 8 years ago

Could we define it as a day with x% fewer counts than the previous day's? Debatable what x% should be...

jflasher commented 8 years ago

I guess really what you'd want to do is see if it's some % lower than the average of the ones up to, say, 12 hours on either side of it. But if that's a pain, there can just be some other approximation. This is a bit hard to say because lower counts aren't necessarily on us (though the 0's definitely are!): when EPA was down, our counts would be lower, but not because of us.

RocketD0g commented 8 years ago

So perhaps go with a marker that shows system-wide downtime on our end, so like 50% of data missing relative to the previous day?

I wouldn't look 12 hours forward (or forward at all): we could have added new data to the system then, and that's not 'fair' to the previous 12 hours. For instance, the data prior to the EPA add could look like an 'outage' when it's really not; there was just legitimately less data coming into the system.

maelle commented 8 years ago

Anything is possible. So for each day I'd calculate the average/median/min number of counts? And compare it to the average/median/min number in the last and next 12 hours, and it should be X% lower? To avoid the new-data issue, all of this could be divided by the number of locations in each hour, so the best thing would be to have a "no. of locations" column in the hourly count data. I'll let you argue over a formula and then implement it :-D
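A rough sketch of what I mean, assuming the hourly data gets that hypothetical "no. of locations" column (here `n_locations`), and with X% left as a knob to argue over:

```r
library(dplyr)
library(zoo)

x_pct <- 0.5  # X%, to be argued over

flagged <- hourly %>%
  mutate(
    rate = count / n_locations,  # counts normalised by number of locations
    # centred 25-hour window = the hour itself plus the last and next 12 hours
    ref  = zoo::rollapply(rate, width = 25, FUN = median,
                          fill = NA, align = "center"),
    too_low = rate < (1 - x_pct) * ref
  )
```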

jflasher commented 8 years ago

> I wouldn't look 12 hours forward (or forward at all): we could have added new data to the system then, and that's not 'fair' to the previous 12 hours.

Good point!

RocketD0g commented 8 years ago

(Thanks, but @masalmon makes the better point: we can just adjust by # of locations, haha)

maelle commented 8 years ago

Statistics power! :-D

RocketD0g commented 8 years ago

Although, thinking on it, I guess looking forward still might be weird b/c a given location (or batch of locations) we add in could have 1-7 (or in reality 1-6) data points associated with it over a given time interval. The batch of data we add in could then have a different average number of data points per location per time interval than the platform average before that. I'm guessing that normalizing to data points per location per time interval still might not fully normalize for comparison, though it might only be a potential issue for really large data add-ins relative to the size of the platform?

RocketD0g commented 8 years ago

To @masalmon's comment quoted below: whether we use +/-12 hours or -24 hours, it might be neat to plot it entirely separately from the other plot above: something like time versus y% of expected data in the system. Then we don't have to pick an arbitrary x% loss and can just see how it plots out? Totally realize that might be difficult, @masalmon.

So I could also imagine instead just sticking with that same original plot above and picking the x = 10% loss (say, light orange) and x = 90% loss (say, a dark orange) cases. This could show when data conks out system-wide (and is therefore on us: the 90% loss case) and when we have local but significant issues (could be us, could be the source site: the 10% loss case). This would be simpler than my above thought, I presume (and maybe just about as insightful...).

Hmm, thoughts?

> Anything is possible. So for each day I'd calculate the average/median/min number of counts? And compare it to the average/median/min number in the last and next 12 hours, and it should be X% lower? To avoid the new-data issue, all of this could be divided by the number of locations in each hour, so the best thing would be to have a "no. of locations" column in the hourly count data. I'll let you argue over a formula and then implement it :-D

maelle commented 8 years ago

Calculating the expected number of measurements is not that difficult: it's the sum over all stations of the 24 * 60 minutes in a day divided by the aggregation period of the measurements (in minutes). I've just had another look at the locations endpoint: I don't see info about the aggregation period? Moreover, could the aggregation period differ between the various parameters at a location? How could I find this information?

I think it'd be nice to know the expected number of measurements each day/hour (we might have to use hours to be precise, since a location can be added in the middle of a day). At each hour of the platform's life there must be a way to know how many measurements are expected, and we can compare this to the actual number of measurements (and check that this number per day is the slope of the curve of the cumulative number of measurements). We can then calculate a daily proportion for a plot.
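As a sketch, with a hypothetical `stations` data frame whose `avg_period_min` column gives each station's aggregation period in minutes:

```r
# Expected measurements per day = sum over stations of
# (minutes in a day) / (aggregation period in minutes)
expected_per_day <- sum(24 * 60 / stations$avg_period_min)
```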

In general, how do you guys monitor the fetch thing to see whether it is working well?

Ah, this is a very large discussion, but using data about the actual number of measurements / expected number of measurements we could learn so much about data availability in the different locations... Is there a location where missing measurements are nearly always due to OpenAQ (i.e. something like a control)?

jflasher commented 8 years ago

You can find the averaging period for the measurements by asking for it to be included, like https://api.openaq.org/v1/measurements?include_fields=averagingPeriod. Note that not all measurements have this included at the moment.
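From R, for example (a sketch; the exact response shape is assumed here, following the usual v1 layout with a `results` field):

```r
library(jsonlite)

# Peek at the averaging periods of a batch of measurements
res <- fromJSON(
  "https://api.openaq.org/v1/measurements?include_fields=averagingPeriod&limit=100"
)
head(res$results$averagingPeriod)
```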

My gut feeling is that the data is going to be too noisy to really tell anything exact from this calculation. Different parameters can be reported at different times and locations can pop in and out, so getting the exact number for any given hour may be difficult.

For our monitoring of the fetching, we look at two things. The first is that the fetch is actually happening; if no fetch has happened, then that's the first problem. The second is to look at errors when new data is inserted. There is a record of those errors in the /fetches endpoint, in the results for an individual adapter. But right now, if something goes wrong with the adapter in such a way that it doesn't even try to insert any new data, we'd miss it unless we saw that some country hadn't been updated in a very long time (that is something that should be added to Fluffy).

maelle commented 8 years ago

I had an idea while running! Between two "location births" we expect the number of measurements per day to be more or less constant. Bad days are outliers inside each of these periods. We have a time series (daily counts) with changepoints, and we know where the changepoints are! Outliers could be defined via a quantile, using maybe a robust negative binomial regression inside each period. I know the tools; it is easier than it sounds.

I just have to figure out how to assess days when one or several stations were added. I guess it is hard to know how many measurements we expect for such a day, but we do not have that many days where the total number of locations was not the same in the morning vs. in the evening, right? So maybe I can ignore them!
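A sketch of the within-period rule, using a plain (not robust) negative binomial fit to keep it short; `counts` is the vector of daily counts in one period:

```r
library(MASS)

flag_low_days <- function(counts, prob = 0.05) {
  fit <- glm.nb(counts ~ 1)  # constant mean within the period
  mu  <- exp(coef(fit)[1])
  # Flag days below the 'prob' quantile of the fitted distribution
  counts < qnbinom(prob, size = fit$theta, mu = mu)
}
```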

RocketD0g commented 8 years ago

I would think that you could even leave the 'add periods' in, if you do the previous-24hr retrospective look (instead of -12/+12) and plot y% of expected data for the day vs. time. In that case, wouldn't the big-add days show up as >100% of expected relative to the day before, thereby identifying themselves?
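In code terms, something like this sketch, with a `daily` data frame that has a `count` column:

```r
# Each day's count as a % of the previous day's count; big-add days
# show up as > 100%, conk-outs as deep dips
pct_of_previous <- 100 * daily$count / c(NA, head(daily$count, -1))
```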

This system will also probably make the true 'conk out' periods look kind of funny, with large spikes above 100% for the 'recovery day' after (and it may return an infinite number for the next day if we have any days with 0 counts...). But this would make the 'conk outs' and the add days look distinct from each other, given that a conk out will have a major dip % and then a major spike %, while the add days will spike %, but less majorly?

maelle commented 8 years ago

My idea with transitions was actually bad because there are too many of them. I'll implement the % idea in a minute!

On this graph, count = number of daily measurements, and each vertical line = a day on which at least one station was added to the platform. The colour of the line depends on the number of stations added. It's a useless figure for publication, but I like seeing how much is happening in the platform these days!

[Figure: transitions]

maelle commented 8 years ago

So here it is (below and in figures/). In this figure I do prospective monitoring: I compare each count to the last count that was not already detected as an outlier (comparing a day to the previous day would be bad if the previous day was a low-count day; in that case, for a period of bad days, only the first one would be detected). Bad day = less than 90% of the last normal count.

[Figure: toolow]

I'm not sure 90% is a good number. Were there many problems at the beginning of the platform, or does it show how wrong this current monitoring figure is?
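For concreteness, here is the rule above as a sketch (daily counts as a vector; the first day is assumed to be normal):

```r
flag_bad_days <- function(counts, cutoff = 0.9) {
  bad <- logical(length(counts))
  ref <- counts[1]  # last count not flagged as an outlier
  for (i in seq_along(counts)[-1]) {
    if (counts[i] < cutoff * ref) {
      bad[i] <- TRUE    # bad day: keep comparing to the old reference
    } else {
      ref <- counts[i]  # normal day: it becomes the new reference
    }
  }
  bad
}
```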

maelle commented 8 years ago

Besides, I've been thinking about monitoring (monitoring count time series was my PhD subject, so I might be obsessed!). Would it make sense to add monitoring of the daily number of counts to the platform: at the platform level (very bad day -> everything fails, low count of measurements), at the fetch-adapter level (a problem here might indicate that the thing the adapter connects to has changed/isn't working that day?), and at the station level (if there are fewer counts at the location level but not at the adapter level, maybe it shows a problem at the location itself rather than an OpenAQ problem?). Or even at the station-parameter level... Plus, does one currently detect whether a location gives the same value for a parameter for too long a time? (Though there's nothing to be done about it...)

Moreover, I wonder how a user would know how much data is available for each location without having to query all measurements. I guess it's impossible?

Well, maybe we should keep this discussion to the figures for the article; there's enough work on this subject! Sorry about all the "side questions"!

maelle commented 8 years ago

Last thing: in the article you could maybe discuss whether the not-fetched data is lost forever, or whether there's hope to recover part of it when adding historical data to the platform one day. Do we know how many locations have an available archive vs. how many locations show data online before it is lost forever?

RocketD0g commented 8 years ago

Awesome! Thanks, @masalmon! This is interesting. My guess is that at the beginning, one of the main sources (I think there were just two sources, reporting ~10-20 stations) likely wasn't super stable in reporting spots, and so it was easy to get down to the 90% mark. What's a 10% mark look like?

Yeah, I think we should mention that the plan is to include, down the road, the ability for entities to insert historical data not gathered in real time.

No, we don't know how many locations have an available archive vs. being otherwise lost. It's difficult to assess accurately and to approach logistically, even just in terms of language.

> So here it is (below and in figures/). In this figure I do prospective monitoring: I compare each count to the last count that was not already detected as an outlier (comparing a day to the previous day would be bad if the previous day was a low-count day; in that case, for a period of bad days, only the first one would be detected). Bad day = less than 90% of the last normal count.
>
> I'm not sure 90% is a good number. Were there many problems at the beginning of the platform, or does it show how wrong this current monitoring figure is?

maelle commented 8 years ago

@RocketD0g I put figures with different cut-offs in the figures folder, e.g. 10% is in https://github.com/masalmon/openaq_figures/blob/master/figures/toolow0.1.png

I have no idea why it took me so long to open the project and write the loop, sorry!

RocketD0g commented 8 years ago

The various cut-off figs are awesome to check out, and sorry for my very delayed response to this, @masalmon. How easy would it be to label the 90% cut-offs, say, a lighter blue or green, distinct from the 'true values', while the 10% cut-offs still stay orange? Then we could just show the 'extreme' cut-offs on the same graph and convey the range of 'full-blown' outages as well as smaller issues. What do you think?

maelle commented 8 years ago

Thanks, @RocketD0g!

@jflasher could you please put a new version of the daily measurements counts somewhere? Thanks a lot in advance.

maelle commented 8 years ago

Growth figure, out-of-date for measurements:

[Figure: growth]

maelle commented 8 years ago

Status figure with the data I now have. For each timepoint, the reference is the latest maximal value before that timepoint that was not an outlier ("issue" or "outage" in the current word choice, which could be changed, @RocketD0g). Limits are below 90% and 10% for issues and outages respectively.

I have dropped the OpenAQ colours on this graph, using the viridis scale instead. The viridis package vignette states: "Use the color scales in this package to make plots that are pretty, better represent your data, easier to read by those with colorblindness, and print well in grey scale."
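The switch itself is one line; a sketch assuming a ggplot object `p` with a discrete colour mapping:

```r
library(viridis)

# viridis colours for a discrete variable (here, the status)
p + scale_colour_viridis(discrete = TRUE)
```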

Here the x-axis labels could be better too.

[Figure: status]

jflasher commented 8 years ago

@masalmon here are latest daily numbers. https://www.dropbox.com/s/fbno67zckcfloar/openaq_daily_counts.csv?dl=1

For future reference in case I need to do this again, the query is:

```sql
select date_trunc('day', date_utc) as date, count(*)
from measurements
group by date_trunc('day', date_utc);
```
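Reading it back into R is then just (a sketch; column names per the query above):

```r
daily <- read.csv("openaq_daily_counts.csv")  # columns: date, count
daily$date <- as.Date(daily$date)
```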
maelle commented 8 years ago

Cool, thanks a lot @jflasher !

You will probably have to do this again, so that we can submit the latest version ;-)

maelle commented 8 years ago

New version of the two "best" figures so far, @jflasher and @RocketD0g:

[Figure: growth]

[Figure: status]

jflasher commented 8 years ago

I think we should take out the last day since it is always incomplete.

maelle commented 8 years ago

Oh yes, that's true. I corrected that, thanks!

[Figure: status]

RocketD0g commented 8 years ago

@masalmon: I think this fig is great! I'd like to put this version in the draft of the paper - sound good?

If so, what about a y-axis label of "Data Points Aggregated per Day"? Perhaps for "Status" it could say 'Normal' instead of 'alright'?

cc: @jflasher

maelle commented 8 years ago

Here you are! True story: I was too lazy to order the levels of the status variable at first, so I chose names whose alphabetical order would be easy to deal with :smile:
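For the record, the non-lazy way would be to set the level order explicitly; the level names here are hypothetical:

```r
# Order the status levels by hand instead of relying on alphabetical order
status$status <- factor(status$status,
                        levels = c("Normal", "Issue", "Outage"))
```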

[Figure: status]

maelle commented 8 years ago

Oh wait, the x-axis could be nicer...
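One possible fix, as a sketch, assuming the x variable is a Date:

```r
library(ggplot2)

# Explicit date breaks and labels on the x-axis
p + scale_x_date(date_breaks = "6 months", date_labels = "%b %Y")
```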

maelle commented 8 years ago

[Figure: status]

maelle commented 8 years ago

I can change the y-axis label if you prefer to have the word "aggregated" in it, of course!

RocketD0g commented 8 years ago

Woo! Thanks, @masalmon!!