wadefagen / 91-DIVOC

Source code for 91-DIVOC
https://91-divoc.com/
GNU General Public License v3.0

Trend lines are too noisy to be useful #30

Closed andyross closed 4 years ago

andyross commented 4 years ago

The new trend line feature is great and very helpful, but... not very trustworthy at all.

It looks like it's implemented by drawing a line on the log chart between the current value at day N and the value at day N-7, ignoring all the data points in between.

That's... kinda just wrong. It ends up being very sensitive to the sampling error at both of those points, which for some data sets is really high (e.g. France's occasional burps where they seem to batch 2-3 days into one day of reporting).

What you need to be doing is fitting a line to those 8 data points. Just a dumb least squares fit of a line to the log chart would be infinitely better than what we have.
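A minimal sketch of what such a fit could look like, in plain Python. This is not the site's implementation; the function name `log_linear_fit` and the sample numbers are made up for illustration, and it assumes every value in the window is positive so the logarithm is defined.

```python
import math

def log_linear_fit(values):
    """Least-squares line through (i, log10(values[i])); returns (slope, intercept)."""
    n = len(values)
    xs = range(n)
    ys = [math.log10(v) for v in values]  # requires every value > 0
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

# Made-up cumulative counts for days N-7 .. N, doubling every day.
last_8_days = [100 * 2 ** d for d in range(8)]
slope, intercept = log_linear_fit(last_8_days)

# A slope in log10 units per day converts directly to a doubling time.
doubling_days = math.log10(2) / slope
```

The trend line on the log chart is then `intercept + slope * i`, with `i = 0` at day N-7, and every one of the 8 points contributes to it.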

danie906 commented 4 years ago

I would fit the last 21 days and use that to project the next 7 days.

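import numpy as np
from scipy.optimize import curve_fit

# Assumes N (index of today) and y (the 21 values for days N-21..N-1) are already defined.
x = np.arange(N - 21, N)   # days used for the fit
xx = np.arange(N, N + 7)   # days to project

def func(x, a, b, c, d):
    return a * np.exp(-c * (x - b)) + d

popt, pcov = curve_fit(func, x, y, p0=[1, 1, 1e-6, 1])
yy = func(xx, *popt)       # projected values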

But I like the idea of using a logistic fit too.

flbuddymooreiv commented 4 years ago

> What you need to be doing is fitting a line to those 8 data points. Just a dumb least squares fit of a line to the log chart would be infinitely better than what we have.

This breaks down once the curves inflect and start to look logistic. You can turn the trend lines off if you want. Personally, I think they do bring something to the table, but you're right about them being of limited use once the derivative transitions from growth to decay.

andyross commented 4 years ago

Fitting 21 days is way too many; the curves are changing on timescales much shorter than that. I think 7 days is fine.

And regarding algorithm: meh. Extrapolative fitting techniques are an infinitely deep bikeshed. People with serious needs will be using their own tooling. I just want the trend lines to not get wonky when the inevitable badly-reported day gets a line drawn through it.

flbuddymooreiv commented 4 years ago

Where are you seeing 21 day trend lines? I only see 7 day trend fits.

But I am wondering: you are aware you can disable the trend line, correct?

I fundamentally disagree that a logistic regression fit is an infinitely deep subject. It comes with an uncertainty just like other regression fits. But honestly for the sake of granularity, maybe we shouldn't even be discussing regression curves here anyway. I had already made a feature request for one on a different issue and they have nothing to do with your actual issue here.

Edit: very sorry - I realize now the 21 day is a proposed alternative. I agree 21 days is too broad of a time scale to project the trajectory of a virus with a 4-15 days incubation period.

danie906 commented 4 years ago

Having looked at the extrapolated fit in the log scale plot, I agree it is not the right equation to use. However, for a logistic equation, 21 days looks to be about right to calculate the correct coefficients. There is a lot of noise in the data. I wouldn't go longer than 21 days and 14 might be a better duration to fit. I will take a look at it and get back with a pull request.

flbuddymooreiv commented 4 years ago

> Having looked at the extrapolated fit in the log scale plot, I agree it is not the right equation to use. However, for a logistic equation, 21 days looks to be about right to calculate the correct coefficients. There is a lot of noise in the data. I wouldn't go longer than 21 days and 14 might be a better duration to fit. I will take a look at it and get back with a pull request.

Please discuss here https://github.com/wadefagen/91-DIVOC/issues/29

andyross commented 4 years ago

To clarify: this bug report isn't about choosing the best fitting algorithm. Objectively, real experts are never going to agree on that.

This bug report is that the trend lines as implemented are using only two data points out of a very noisy data set, and are thus not useful. They should be fit (using any algorithm, but like I said I'd be happy with a least squares fit in the log space) to the full 8 points they're supposed to be covering.
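To make the difference concrete, here is a rough comparison on invented data, assuming the two-point line really is drawn between day N-7 and day N as described earlier in the thread. The function names and the numbers are hypothetical; the glitch modeled is a final day that reports no new cases because its count gets batched into a later day.

```python
import math

def two_point_slope(logs):
    # Slope of the line through the first and last points only.
    return (logs[-1] - logs[0]) / (len(logs) - 1)

def least_squares_slope(logs):
    # Ordinary least-squares slope over every point in the window.
    n = len(logs)
    x_mean = (n - 1) / 2
    y_mean = sum(logs) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(logs))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    return sxy / sxx

# Hypothetical window: steady 10% daily growth, except the final day
# repeats the previous day's count (nothing new was reported).
true_slope = math.log10(1.1)
counts = [1000 * 1.1 ** d for d in range(8)]
counts[-1] = counts[-2]
logs = [math.log10(c) for c in counts]

two_point_error = abs(two_point_slope(logs) - true_slope)
least_squares_error = abs(least_squares_slope(logs) - true_slope)
```

In this made-up case the two-point slope is off by one seventh of the true slope and the least-squares slope by one twelfth. The fit dampens a bad endpoint rather than removing it, but it no longer lets a single day decide the line.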

pricelessbrewing commented 4 years ago

Yeah, just using today and 3/7 days ago as the only relevant points, and ignoring everything that happens in between, is pretty useless.

Fit it using any method. Could even combine several, and use whichever fits the current trend best.