k-sys / covid-19

A collection of work related to COVID-19

Worth attempting to handle the undercounting/testing issue? #32

Open kmedved opened 4 years ago

kmedved commented 4 years ago

First, I want to say this is a wonderful project @k-sys - thank you very much for doing this. As I think everyone is aware, however, there are issues with calculating R in real time from confirmed case counts because of changes in testing capacity. R could be falling dramatically, but if testing is only capable of capturing 5% of actual cases, you may not see a measurable decrease in confirmed case counts, since changes in testing capacity swamp the underlying signal.

As a result, tracking R based on confirmed cases is mostly just tracking testing capacity. As an example, here's the chart of New York's R based on the current approach.

[chart: New York's R over time, estimated from confirmed case counts]

That spike around March 18 coincides perfectly with when New York massively ramped up their testing capacity, going from an average of about 1,200 tests a day over the preceding 5 days to about 11,000 a day (almost a 10x increase).

One simple fix is to switch to counting deaths, which should be more reliable, but this will introduce a lag of about 2 weeks into the data, which obviously defeats the purpose of tracking R in real time.

As an alternative, I'd like to propose another method I've been experimenting with which seems to give more sensible results, albeit at the cost of introducing some editorial judgment into the data. This approach is to:

  1. Use a 14-day lag between deaths and confirmed case counts, together with some estimate of the IFR (e.g. 0.5%), to back into an estimate of how many 'true' cases there were up to 14 days ago.

  2. Within the most recent 14-day window, estimate the number of new cases that would be coming in if testing were flat, by tracking changes in the percentage of tests coming back positive. So if, using Step 1 above, we estimate that New York had 100K new cases on April 7th at a 43% positive rate, and the positive rate dropped to 41% on April 8th, we would estimate approximately 95.3K cases on April 8th (100K × 41/43). Repeat daily.

Note that while IFR estimates are heavily debated, this approach is actually not sensitive to the IFR estimate, since R tracks case growth rather than total cases: a constant multiplicative error in the case counts cancels out of the growth ratios. It is very sensitive to the 14-day window, however.
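For concreteness, here's a minimal sketch of the two steps in pandas. The IFR value, the 14-day lag, and the input series names (`deaths`, `positivity`) are illustrative assumptions, not anything from the notebook:

```python
import pandas as pd

IFR = 0.005  # assumed infection fatality rate (illustrative)
LAG = 14     # assumed days from confirmed case to death

def estimate_true_cases(deaths: pd.Series, positivity: pd.Series) -> pd.Series:
    """Sketch of the two-step reconstruction described above.

    deaths, positivity: daily series sharing a DatetimeIndex (hypothetical inputs).
    """
    # Step 1: deaths shifted back LAG days and scaled by 1/IFR give
    # 'true' new cases up to 14 days ago; the last LAG days come out NaN.
    true_cases = (deaths / IFR).shift(-LAG)

    # Step 2: inside the most recent window, chain day-over-day positivity
    # ratios forward from the last anchored estimate. Chaining
    # est[t] = est[t-1] * pos[t] / pos[t-1] telescopes to the ratio below.
    anchor = true_cases.last_valid_index()
    recent = positivity.loc[positivity.index > anchor]
    true_cases.loc[recent.index] = true_cases[anchor] * recent / positivity[anchor]
    return true_cases
```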

Using this approach generates a chart like this for New York:

[chart: New York's R over time, estimated from the adjusted 'true' case counts]

For context, the NBA season and most mass events were suspended around March 11th. Many businesses also went work-from-home around then, although formal shelter-in-place orders were not issued until later. I find this chart somewhat more plausible, especially for early March.

I'm using a Kalman filter here to smooth out the day-to-day noise in testing percentages. This approach could be further improved by using a Gaussian distribution to estimate 'true' cases in the past from deaths, rather than assuming a flat 14-day window.
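In case it helps, here's one possible local-level Kalman filter for the positivity series; this is just a sketch of the kind of smoothing I mean, and the process/observation variances `q` and `r` are made-up values:

```python
import numpy as np

def kalman_smooth_positivity(obs, q=1e-4, r=1e-2):
    """Local-level (random-walk) Kalman filter over daily positivity.

    obs: array of daily positive-test fractions.
    q:   assumed process variance (how fast true positivity drifts).
    r:   assumed observation variance (day-to-day reporting noise).
    """
    n = len(obs)
    x = np.empty(n)  # filtered state estimates
    p = np.empty(n)  # state variances
    x[0], p[0] = obs[0], 1.0
    for t in range(1, n):
        # predict: state follows a random walk
        x_pred, p_pred = x[t - 1], p[t - 1] + q
        # update with today's observation
        gain = p_pred / (p_pred + r)
        x[t] = x_pred + gain * (obs[t] - x_pred)
        p[t] = (1 - gain) * p_pred
    return x
```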

It may also be possible to create a better estimate of 'true' cases within the 14-day window than assuming they vary linearly with the percentage of tests coming back positive (e.g., by adding features such as the number of tests performed on a given day, or hospitalizations recorded some number of days ago). I have experimented with some GBDT models along these lines, but have not been impressed with the results so far. @kpelechrinis has suggested using a Gompertz curve to project deaths forward as well, which performs well out-of-sample, but is still fundamentally using lagging data.
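For reference, a Gompertz projection of cumulative deaths might look something like the following; the data, initial guesses, and 14-day horizon here are all hypothetical:

```python
import numpy as np
from scipy.optimize import curve_fit

def gompertz(t, a, b, c):
    """Cumulative deaths: a = asymptote, b and c control shape and rate."""
    return a * np.exp(-b * np.exp(-c * t))

# hypothetical cumulative death counts, one value per day
deaths = np.array([3, 7, 16, 30, 55, 90, 140, 200, 270, 340])
t = np.arange(len(deaths))

params, _ = curve_fit(gompertz, t, deaths, p0=(1000.0, 5.0, 0.1), maxfev=10000)
projection = gompertz(np.arange(len(deaths) + 14), *params)  # 14 days ahead
```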

I appreciate there are possible downsides to this approach, namely that this will turn this into an opinionated model of real-time R, rather than just observing some ground truth. But I think there's also value to trying to account for these testing issues.

I don't know whether that's outside the scope of this project's vision, however. So I'm interested in hearing thoughts on whether this would be worth trying to incorporate.

cayleytorgeson commented 4 years ago

@kmedved gets to the heart of the challenges with this implementation. Given the variance in testing, the flaw in relying on reported cases is too big to overcome.

California went from an R of 1.1 a week ago to 0.73 yesterday to 0.44 today. The model is not robust.

Deaths are a lagging but sadly reliable input to consider.

The model currently gives a dangerously false sense of precision and is highly misleading at a time when states are considering easing restrictions. I would urge you to reconsider publishing the current model at this time.

kmedved commented 4 years ago

Not that it's my place to weigh in, but I disagree with the notion that this model should be taken down; I agree the current implementation has limitations due to data quality, but that's a reason to do more work on processing the data. I think it would be useful to have a discussion about the best ways to do that.

cayleytorgeson commented 4 years ago

I would agree if the variable were called something other than R. As it is currently implemented, it isn't R in any meaningful sense of the term.

kpelechrinis commented 4 years ago

@cayleytorgeson I see where you are coming from. However, the method implemented here really is what's out there in the epi literature. This is how real-time R is estimated (one of the papers is actually cited in the notebook, and another I have been looking at is below), and the data will always be undercounted - that is not unique to covid-19. So of course anyone doing anything with this kind of messy data needs to be cautious, but this is what the epi literature does during an epidemic. (After the epidemic there are obviously other ways to get better estimates of the basic reproductive number, etc.) Just my 2c

[1] https://www.pnas.org/content/pnas/early/2018/11/21/1811115115.full.pdf

cayleytorgeson commented 4 years ago

@kpelechrinis Your points are all good and I am not disagreeing. My comment is that, despite being a faithful implementation of an academic approach, the current model is not working well under real-world conditions. What one does about that is an interesting discussion, but that is up to the owners.

My 2c is that leaving it as it stands will confine it to an academic exercise.

Nectarineimp commented 4 years ago

I've been doing data science since 2005. Two of those years were with Native American casino data (the most complete and robust data available, thanks to regulatory requirements). Even that data has problems. To tell the casinos they can't do anything with it because of imperfections is, of course, preposterous. The same goes here. The quality of the data is not perfect; as I've shown, even the most rigorous data collection is imperfect. You work with what you have. In this case, we have the same data that every decision maker is working with, and the assumed error is shown. I think this is very useful on many levels. It should not be taken down. It should remain up and be constantly improved.

Here is a true story. NASA had a tool bag they wanted to certify for flight on spaceships. If the zipper fails, dangerous tools could be flying about in zero gravity, or under high acceleration, which could quickly terminate the flight in a total-loss situation. They had one example bag to test, and they automated the test to open and close the bag 8,000 times. With a sample size of 1, 8,000 tests, and 0 failures, did they do enough to certify that the bag was safe for 50 flights? The data is imperfect, and if they certify it and it fails, it's billions of dollars in a lost mission, and lives lost as well. The answer is yes, they did. Risk analysis is not easy, but this is a pretty simple case. Imperfect data did not prevent NASA from feeling confident about this bag, for the purpose it was to serve and its place on the mission.

As a final note, I actually think the data is good. Anyone with serious symptoms is being counted. Some may be dying at home, but hospital capacity did not hit overload. Anyone sick with symptoms has been treated. The reporting on this, while not perfect, is still very good. What we don't have good data on, yet, is how many people are asymptomatic. That is why there is a great push for testing. My original model did work to determine how many were asymptomatic from known data. It also contained a wide spread based upon two different studies, one of which I've since rejected as have many others who are more knowledgeable about virology than I am. The current model does a good job of determining a basis for a forecast of how many people will be infected and show symptoms. That's my defense of this exemplary work.

nealmcb commented 4 years ago

I agree that it's problematic to trust testing data, given how much testing rates are going both up and down depending on capacity, policy, etc.

The recommendation in this very nice Lin Lab overview, updated April 2, is to incorporate hospitalization data (about 2 weeks behind infection data), and validate it with mortality data (about 3-4 weeks behind infection data): https://drive.google.com/file/d/1ZaiDO87me4puBte-8VytcSRtpQ3PVpkK/view

I would think that a Bayesian approach could incorporate all that, but I don't know quite how.

missing-semicolon commented 4 years ago

I've been giving some thought to this issue and was thinking that the model could be tweaked to allow for a time-fluctuating variance that is itself a function of tests administered. There are additional things we could do, such as assuming that sparse testing induces an indeterminate bias, but an interesting first step would be to see how different the results would be if we allowed this variance to enter.
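To make the idea concrete, one hypothetical form of such a likelihood is sketched below; the functional form of the variance and the parameter values are purely illustrative, not a proposal for the notebook's actual likelihood:

```python
import numpy as np

def heteroskedastic_loglik(observed, predicted, tests, base_var=0.05):
    """Gaussian log-likelihood whose per-day variance shrinks as more
    tests are administered (all names and values are hypothetical).

    observed, predicted: arrays of daily case-growth estimates.
    tests: number of tests administered each day.
    """
    var = base_var / np.sqrt(tests)  # time-fluctuating observation variance
    resid = observed - predicted
    return -0.5 * np.sum(resid**2 / var + np.log(2 * np.pi * var))
```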

dsjoerg commented 4 years ago

@missing-semicolon did you see this on the rt.live home page? "4/26 model update: new Rt graphs reflect corrections for the amount of testing done over time in any given state. An increase or decrease in testing should not affect accuracy of Rt values in the future. This correction has significantly improved Rt values in most states."

They have not released the code for this yet but looking forward to seeing it, then we can revisit this issue if it's not thoroughly resolved.

ivandebono commented 4 years ago

Data on deaths is not an accurate indicator of infections. Some countries allow doctors to put Covid as the cause of death based on mere suspicion, without testing for the virus, so those figures over-report Covid deaths. On the other hand, if there are undetected cases where the patient dies, then the figures under-report. But undetected cases bring us back to the problem of reported cases and testing.

An additional problem with using deaths is that there is no reason why the time interval between infection and death should be constant throughout the epidemic. It changes depending on diagnosis, medical treatment, available facilities, shifting patient demographics, and various other factors.

In the absence of accurate error models, I would suggest keeping reported infections as the observable. That way at least we can identify our source of error.