lemeb opened this issue 4 years ago
This is a fantastic bit of feedback, thanks for writing it up. I am unhappy with the smoothing as it stands, not because it's centered per se but because of the `min_periods=1` arg. Effectively, every point is a (weighted) average of 7 days, except the last few days, whose averages become concentrated on the final values. The problem I have with just uncentering is that you effectively get the same values delayed by a few days, which would in turn delay Rt's value by a few days. I think this is right, but I'm posting my thoughts so there can be a discussion. I'm also wondering if there might be another kind of filter that would be better than a simple windowed technique (e.g. Kalman, etc.)
Notice the delay effect below on CT's new cases:
The red line suffers from overweighting of the last value whereas the black does not, but the black is just the red line delayed by half the window size.
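For reference, here is a minimal sketch of the two smoothings being compared; the toy `new_cases` series is only a stand-in for the real state data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a state's daily new-case counts.
new_cases = pd.Series(np.random.poisson(lam=200, size=60))

# "Red" line: centered Gaussian window with min_periods=1. The last few points
# have no future samples to average over, so the final values get overweighted.
centered = new_cases.rolling(7, win_type="gaussian",
                             center=True, min_periods=1).mean(std=2)

# "Black" line: same window, trailing (uncentered). No end-of-series artifact,
# but every value is delayed by roughly half the window (~3 days).
trailing = new_cases.rolling(7, win_type="gaussian",
                             center=False, min_periods=1).mean(std=2)
```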
Thoughts?
Here's a very simple g-h filter on the data, which actually looks quite promising and suffers from none of the windowing effects. Note that I zoomed in so you can see the effect.
Notice how the GH tracks closer to the true value without the uptick at the end?
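The exact implementation behind the plot above isn't shown, but a bare-bones g-h filter looks something like the sketch below; the `g`, `h` values and the toy data are illustrative only:

```python
import numpy as np

def gh_filter(data, x0, dx0, g, h, dt=1.0):
    """Basic g-h filter: track a level (x) and a trend (dx) through noisy counts.

    g and h control how strongly each new observation corrects the level and
    the trend; smaller values smooth more but react more slowly.
    """
    x, dx = x0, dx0
    estimates = []
    for z in data:
        # Predict one step ahead using the current trend.
        x_pred = x + dx * dt
        residual = z - x_pred
        # Correct the trend and the level by fractions h and g of the residual.
        dx = dx + h * residual / dt
        x = x_pred + g * residual
        estimates.append(x)
    return np.array(estimates)

# Toy daily case counts; in practice this would be a state's new-case series.
cases = np.array([120, 135, 150, 170, 160, 180, 210, 205, 230, 260])
smoothed = gh_filter(cases, x0=cases[0], dx0=0.0, g=0.4, h=0.1)
```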
> the model is re-writing history every day. That's not great for something that is meant to be tracking Rt incrementally
The reported cases are re-writing history every day: positive tests reported on day N could easily be for a person who became infected on day N-7. Here are a couple of highlighted screenshots I happen to have at hand (for WA and TN):
This is a different issue: that one is about taking current counts and moving them from reporting time to onset time (which would shift them backwards). I believe it's recorded as a separate issue (I agree with that issue, btw).
I think @k-sys's suggestion to use a Kalman filter for this is a good one. Depending on whether or not you are concerned with centering, you can use the filter or the smoother from the Pykalman package for this, and handle this in a couple of lines of code. I would personally recommend the smoother, as I am not concerned with older estimations changing for these purposes.
A g-h filter is fine too, as is `pd.ewm`; they all get you to the same place for this problem. Pykalman is nice because it has a built-in optimization method for the parameters via the `.em()` method, and it works as a smoother as well.
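For example, a minimal Pykalman sketch along these lines; the simple local-level setup and the toy `observed` series are assumptions, and the real model would need tuning:

```python
import numpy as np
from pykalman import KalmanFilter

# Toy daily new-case series standing in for a real state's data.
observed = np.array([100., 120., 90., 150., 160., 155., 180., 210., 190., 240.])

# Simple local-level model: one hidden state, observed with noise.
kf = KalmanFilter(initial_state_mean=observed[0], n_dim_obs=1)

# Fit the noise parameters via EM.
kf = kf.em(observed, n_iter=10)

# Smoother: uses the whole series, so past estimates shift as new data arrive.
smoothed_means, smoothed_covs = kf.smooth(observed)

# Filter: causal, uses only data up to each day, so history is never rewritten.
filtered_means, filtered_covs = kf.filter(observed)
```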
I'm going to play around with this today and see what results I get for different states' situations. I'm still open to suggestions.
I think the principles for adopting something are that it deals with outliers well, generally centers on the data / doesn't introduce unnecessary delay, and ideally is resilient to new values (the smoothing/filtering tradeoff).
> This is a different issue - this is about taking current counts and moving them from reporting time to onset time (which would shift them backwards)
My point here was that @lemeb's argument for a model that doesn't change previously publicized values isn't really feasible, since each day's reports should affect the calculation for multiple prior days.
A g-h filter is a great choice. I didn't know about it until you brought it up, but it is indeed much better than a trailing average and it preserves older estimations. So that's great.
After playing around with a simple g-h filter for a couple of hours, I'm a little baffled about which g and h to choose, as it seems that some seemingly reasonable g and h make the `highest_density_interval` function crash because of a lack of good highs and lows... But that's solvable anyway.
Came here since I was unhappy with the end-of-series artifacts of the Gaussian filter used in this approach.
I'm thinking that a filter that holds on to the near-end Rt estimates is valuable (*), especially given the growth factor exp((Rt-1)/7) is applied to day-on-day case arrival values. To meet this, I've chosen to do a centered Gaussian filter on the day-to-day log ratios. This filter is somewhat capricious compared to filtering the raw case rate if you look at actual case values, but since the growth model is scale-independent, that is (I think) acceptable for the purpose of estimating Rt; i.e., for an Rt of 2.5, say, it wouldn't make much of a difference to Rt itself whether k is 100 or 1000, though it will likely affect the spread around the estimate.
edit: (*) This is because Gaussian filtering (and perhaps most filtering techniques that operate on daily case arrivals) typically holds the end value constant beyond the end and therefore biases the Rt estimates near the end towards 1.0, which is terribly misleading given we're trying to track it in "real time".
(pseudo Julia code)

```julia
# Day-over-day log ratios: element i is log(x[i] / x[i+1]).
logratio(x) = [log(x[i] / x[i+1]) for i in 1:(length(x) - 1)]

# Smooth the log ratios with a (centered) Gaussian filter -- `gaussian` is a
# placeholder here -- then rebuild the series from the last raw value by
# accumulating the smoothed ratios backwards.
logfilter(x) = vcat([x[end] * exp(a) for a in reverse(cumsum(reverse(gaussian(logratio(x)))))], [x[end]])
```
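For readers who want to try this in Python, a rough transcription of the pseudocode above, using SciPy's `gaussian_filter1d` as a stand-in for the unspecified `gaussian` smoother; the sigma and the toy data are assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def logratio(x):
    # Day-over-day log ratios: element i is log(x[i] / x[i+1]).
    x = np.asarray(x, dtype=float)
    return np.log(x[:-1] / x[1:])

def logfilter(x, sigma=2.0):
    # Smooth the log ratios, then rebuild the series from the last raw value
    # by accumulating the smoothed ratios backwards.
    x = np.asarray(x, dtype=float)
    smoothed_ratios = gaussian_filter1d(logratio(x), sigma=sigma)
    cumulative = np.cumsum(smoothed_ratios[::-1])[::-1]
    return np.concatenate([x[-1] * np.exp(cumulative), [x[-1]]])

# Toy case series (strictly positive, so the logs stay finite).
cases = np.array([50, 60, 75, 90, 110, 130, 160, 150, 180, 210], dtype=float)
smoothed = logfilter(cases)
```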
Given that I'm willing to accept history rewrites, is there another kind of systematic bias that this approach would introduce which I'm missing?
Update to the previous comment: the log-ratio filter is a bit too sensitive to near-end fluctuations, so if we get a spike on the last day, that spike becomes significant as it gets extended. To counter that, I'm now using an exponential filter followed by the log-ratio filter. Fairly content with it for the moment.
Would you be able to show the code with the exp filter and log ratio filter implemented? I've also been having issues near the ends (like log(0) issues, etc.) and would like to see how you implemented this. Thanks!
> Would you be able to show the code with the exp filter and log ratio filter implemented?
You can find the code here: https://github.com/Imaginea/covid19/blob/master/rt-turing.ipynb . It's in Julia, since I used Turing.jl to implement the model, which makes for much shorter code.
Please see the "true filter" section (no filter is "true" - see note below - but the name is to make the intention clear :). Here's an explanation -
The filters are implemented by hand as repeatedly applied simple FIR (finite impulse response) filters. The filter(x,n)
function does a [0.25,0.5,0.25]
centered FIR filter on the array x
, and does it n
times. This approximates a centered gaussian filter if done a few times. Similarly, the fir2filter(x)
does a [0.25,0.75]
FIR convolution, which gets you causal finite geometric decay if repeated a few times before itself starting to look like a gaussian. The truefilter(x,n)
applies the fir2filter(x)
on the raw daily counts n
times , followed by the gaussian approximation on the log ratio, also n
times. In my case, I chose n=5
.
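A rough NumPy reading of that description follows; the real implementation is in the linked Julia notebook, and the kernel orientation and boundary handling here are guesses:

```python
import numpy as np

def fir_centered(x, n=1):
    # Repeatedly apply a centered [0.25, 0.5, 0.25] kernel; a few passes
    # approximate a centered Gaussian filter. Edges are held constant here,
    # which is an assumption about the boundary handling.
    x = np.asarray(x, dtype=float)
    for _ in range(n):
        padded = np.concatenate([[x[0]], x, [x[-1]]])
        x = 0.25 * padded[:-2] + 0.5 * padded[1:-1] + 0.25 * padded[2:]
    return x

def fir_causal(x, n=1):
    # Repeatedly apply a causal [0.25, 0.75] kernel (which tap gets the 0.75
    # weight is a guess); iterating it gives a finite geometric decay.
    x = np.asarray(x, dtype=float)
    for _ in range(n):
        x = np.concatenate([[x[0]], 0.25 * x[:-1] + 0.75 * x[1:]])
    return x

def truefilter(x, n=5):
    # Causal passes on the raw counts, then centered passes on the day-over-day
    # log ratios, rebuilding the series from the last value as in the earlier
    # log-ratio sketch.
    x = fir_causal(np.asarray(x, dtype=float), n)
    ratios = np.log(x[:-1] / x[1:])
    smoothed = fir_centered(ratios, n)
    cumulative = np.cumsum(smoothed[::-1])[::-1]
    return np.concatenate([x[-1] * np.exp(cumulative), [x[-1]]])
```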
I think the bottom line is that, given the weekly cycles embedded in the case reporting, we have to live with about a week's latency for a reliable Rt estimate. Any attempt to reduce that latency can only mean we have a filter that's already capable of predicting Rt a week into the future, which, unless there has been no variation in interventions over the previous couple of weeks or so and none is planned for the coming week, is close to clairvoyance, or at the least requires rich data from diverse sources. This means we're essentially estimating Rt for a week ago, which I think is still valuable. I only want to avoid giving the false impression that "Rt has been going down over the past few days" just because I used an inappropriate assumption for the filter.
edit: I just added this note to the notebook itself.
First off: this is an absolutely kick-ass and necessary project. Thank you so much for this, and for the transparency that you're putting behind this.
I have two big concerns with this initiative. The first, about color coding on the website, is addressed in a tweet here; the obvious fix would be to adopt a color scheme on the website similar to the one used in the Matplotlib `plot_rt()` function, which is not nearly as binary. The second concern, about smoothing (or rolling average, or time-series Gaussian filter, which are used interchangeably here), is detailed below.
### The problem with centering the rolling average
Try executing this code in your notebook, first with line 17 (`center=True`) present, and then commented out.

Slightly modified notebook cell (toggled for space):
```python
state_name = 'NY'

# We will compute the smoothed lines, and the resulting R_ts, 6 times
# First with the data up to the latest day, then up to one day before
# that day, and so on...
day_lags = [0,1,2,3,4,5]

# Using a different name to avoid interfere with prepare_cases()
def prepare_cases_new(cases, cutoff=25):
    # Unrelated note: new cases data are available in the COVID
    # Tracking Project in the column "positiveIncrease", so this line
    # could go, in theory.
    new_cases = cases.diff()
    smoothed = new_cases.rolling(7,
        win_type="gaussian",
        # The offending line is below.
        center=True,
        min_periods=1).mean(std=2).round()
    idx_start = np.searchsorted(smoothed, cutoff)
    smoothed = smoothed.iloc[idx_start:]
    original = new_cases.loc[smoothed.index]
    return original, smoothed

def prepare_cases_wrap(day_lag):
    lagidx = -day_lag if day_lag != 0 else None
    # We skip cases that are after our cutoff (or lag)
    cases = states.xs(state_name).rename(f"{state_name} cases")[:lagidx]
    # The new function is inserted here
    original, smoothed = prepare_cases_new(cases)
    return {"original": original, "smoothed": smoothed}

data = list(map(prepare_cases_wrap, day_lags))

original = data[0]["original"]
original[-21:].plot(title=f"{state_name} New Cases per Day",
                    c='k',
                    linestyle=':',
                    alpha=.5,
                    label='Actual',
                    legend=True,
                    figsize=(500/72, 300/72))

for day_lag in day_lags:
    smoothed = data[day_lag]["smoothed"]
    smoothed[-21+day_lag:].plot(
        label=f"Smoothed with {day_lag} day lag",
        legend=True)

fig, ax = plt.subplots(figsize=(600/72,400/72))

# Display the Rt history with the different time cut-offs
for day_lag in day_lags:
    smoothed = data[day_lag]["smoothed"]
    posteriors, log_likelihood = get_posteriors(smoothed, sigma=.25)
    hdis = highest_density_interval(posteriors, p=.9)
    most_likely = posteriors.idxmax().rename('ML')
    result = pd.concat([most_likely, hdis], axis=1)
    plot_rt(result, ax, state_name)

ax.set_title(f'Real-time $R_t$ for {state_name}')
ax.set_xlim(original.index.get_level_values('date')[-1]+pd.Timedelta(days=-21),
            original.index.get_level_values('date')[-1]+pd.Timedelta(days=1))
ax.xaxis.set_major_locator(mdates.WeekdayLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
```
Note: the graphs below only show the 3 most recent weeks, for clarity's sake. There is no change to the underlying computation otherwise.
That code will first produce the following graph:
Each colored line represents the smoothed cases computed with data up to [5, 4, ..., 0] days before the latest data. The issue here is that the lines vary quite wildly.
For instance, here is what the model says about the number of new cases on April 16:
Code to find the precise figures (click to toggle)
```python
for date in [16, 15, 14]:
    print(f"For April {date}:")
    for x in range(6):
        try:
            new_cases = data[x]["smoothed"][f"2020-4-{date}"]
            diff = new_cases - data[x]["smoothed"][f"2020-4-{date-1}"]
        except KeyError:
            continue
        else:
            print(f"with {x} day lag: {int(new_cases)} new cases. change: {int(diff)}")
    print(f"")
```
In other words, the model is re-writing history every day. That's not great for something that is meant to be tracking Rt incrementally. Especially since, in turn, the estimates for Rt vary wildly as well:
Note: which line corresponds to which time cutoff / lag should be pretty self-explanatory.
So, someone checking rt.live:
Same story for Rt for April 16:
Code for the confidence intervals (click to toggle)
```python
for day_lag in day_lags:
    smoothed = data[day_lag]["smoothed"]
    posteriors, log_likelihood = get_posteriors(smoothed, sigma=.25)
    # Use .975 for 95% CI, .9 for 80%, .995 for 99%, etc.
    hdis = highest_density_interval(posteriors, p=.975)
    most_likely = posteriors.idxmax().rename('ML')
    result = pd.concat([most_likely, hdis], axis=1)
    try:
        print(day_lag, result.loc["2020-4-14"])  # or "2020-4-16"
    except KeyError:
        pass
```
I don't really think you can run a model that updates every day like that. Think about the models from 538: when you look at them every day, you would expect the latest numbers to change because new polling has come in. What you would not expect is the previous numbers to have changed as well, unless the underlying model had been revised overnight.
I think that's the impression that many readers will have when looking at the website day to day and seeing data from previous days change all of a sudden. They will think that the model is being massively tweaked all the time, when in fact it is working as intended.
### The perks of not centering the rolling average
The solution is simple: remove the `center=True` line from the previous code. These are the results:

What changes in the COVID Tracking Project data every day at 4 PM is not the data from previous days, merely the data from that specific day. The smoothing line should behave accordingly, and it does so here.
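Concretely, the smoothing call in `prepare_cases_new` becomes the following; this is a sketch of just the changed statement, with everything else in the cell above staying the same:

```python
# new_cases as defined in prepare_cases_new above; the window is now trailing,
# so only the most recent days can change when new data arrive.
smoothed = new_cases.rolling(7,
    win_type="gaussian",
    min_periods=1).mean(std=2).round()
```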
As far as the Rt calculations are concerned, they are also much more stable. (I haven't done the math, but I do suspect that the confidence intervals might have been incorrect before, and are correct now.)
Note: as the code listed at the top of this comment makes clear, I did not modify sigma across runs. #23 mentions a change in sigma as something that might modify the curves, but at first look its impact seems smaller than that of centering the rolling average.
### Alternative
I'm not an epidemiologist or a statistician by trade. Maybe centering the rolling average is the right thing to do, but the inevitable result is that Rt values will change in a way that will be incomprehensible to readers who expect this to be an incremental dashboard.
This is all the more true because you're using a Bayesian approach, so it's not really great to revisit your priors' priors. But maybe I'm wrong! I didn't see any formal justification for the centering, though. If you do decide to go with the centering, however, then please freeze the history.
In other words, make sure that the Rt generated for a state on April 16 will be the same whether you're checking rt.live on April 16, 17, 18, 19, and so on. That would make the charts even more unstable, and the CIs would have to be revised, but this is better than seeing the history of Rt change drastically every day.
### Appendix: more results
Other results with no centering
For reference, same images generated with `center=True`