Outdated testcount data for Germany, incorrect Prophet forecast

w-flo commented 3 years ago

Hi,

Thanks for the interesting analysis on rtlive.de!

I realize detailed testcount data is not publicly available for Germany, so it's probably difficult to get a data update.

However, I'm afraid the Prophet forecast is not a great method to get a forecast for current testcount based on 4 week old data. As you can see, rtlive.de indicates a growing number of tests for Germany and the Bundesländer in the past 4 weeks, even though RKI numbers published for those weeks clearly shows the opposite trend: number of tests has been going down in recent weeks. This probably results in underestimating the infection count and R_t.

I'm not sure if there's a good way to fix this unless RKI is willing to give you new data.

Maybe just copy the last week of available data for each Bundesland and scale it according to the change in country-wide test numbers published by RKI for each week? That would probably under-estimate tests in Bavaria and over-estimate tests in other Bundesländer, since Bavaria still has free tests for all, but since there is no public data on a Bundesland level other than https://ars.rki.de/Content/COVID19/Main.aspx (which doesn't appear to be very useful), I don't know a better workaround.

w-flo commented 3 years ago

Just had an idea: The RKI lab surveillance reports (see link in my issue report) indicate a test positivity rate for each Bundesland on page 5. It's a bit hard to read, but should be possible, maybe with some GIMP help. The test positivity rate combined with number of cases reported by the Bundesland in that week could be used to estimate the number of tests.

Edit: Note the table on page 4 is not useful because it aggregates numbers for 10 weeks. Current positivity rate needs to be extracted from graphs on page 5.

michaelosthege commented 3 years ago

Hi @w-flo, It is clearly a data availability problem. As far as I know the weekly RKI reports are based on that same underlying data. Yesterday we got an update.

From a modeling perspective it should be possible to build a more complete PyMC3 model, maybe using the timeseers package, that models all of Germany in a hierarchical way, while also incorporating the different sums from the RKI report. It could be a fun challenge to build such a model and would certainly be a valuable contribution. However, all of this data already exists, so an alternative approach would be to scrape such numbers from local health authorities. At least for Berlin I saw such a plot once.

sdschulze commented 3 years ago

Is the nation-wide data easier to get, such that it would be possible to integrate it in the nation-wide plot?

michaelosthege commented 3 years ago

@sdschulze I am not aware of any data source other than the weekly RKI reports. And those are just weekly numbers and based on essentially the same underlying data. As I wrote earlier, one could incorporate this in a more sophisticated model.

MaPePeR commented 3 years ago

It seems like the weekly test counts are published a lot more frequent than the daily test counts (which is kind of sad/hilarious?). So maybe one can:

Take the known daily counts.
Make the prediction of daily counts as before.
Take the weekly test count (Potential Source: https://data.humdata.org/dataset/total-covid-19-tests-performed-by-country) and scale the predicted daily counts in a way the week-count of tests matches.
Repredict the test count for days for which we don't have a weekly test count using the sum-corrected data.

michaelosthege commented 3 years ago

@MaPePeR Yes. If you want to start, take a look at the ourworldindata.py data source. It also covers Germany.

Unfortunately I cannot share the CSV of daily German testcounts, but for comparison you can just use the data loader from Belgium (https://github.com/rtcovidlive/rtlive-global/blob/master/notebooks/Tutorial_dataloading.ipynb).

By the way, there are many other countries from the OWID dataset that we could support. If anybody contributes a data loader for the UK, Ireland or Spain, that would be very welcome!

MaPePeR commented 3 years ago

I can give it a try. My experience with panda is minimal, though. And this correction does not really work for the Bundesländer, because we don't have that data.

I think I'm going to ignore them for now and only try to correct the country data.

Other possibilities would be:

Scaling the Bundesländer with the same factors as the summed data
Using the last 2 weeks with daily data to estimate the percentage of tests in each Bundesland and multiply it with the known country sum.
Do two prophet predictions: one for the average daily tests in a week in Germany and one for daily tests in Bundesland divided by the average daily test in a week. Then multiply the predictions to get values for each Bundesland.

Biggest problem I see is making this correction somewhat transparent to the readers. What do you think?

EDIT: At the moment I'm tending to implement the latest idea of predicting the weekly tests and the distribution of tests across weekdays and regions separately. But someone else should take a look at it and think about if that's really a good idea.

MaPePeR commented 3 years ago

I attempted that 3rd Idea, but the prediction of weekly average tests didn't work out as planned. The prediciton of the "region factor" also doesn't look promising and prophet raised warnings i don't understand.

I kind of expected the prediction to look like the predicted tests for netherlands. They do not. (See [14] in my Notebook) prediction for netherlands from rtlive.de

michaelosthege commented 3 years ago

@MaPePeR you'll probably want to set this kwarg to False: https://github.com/rtcovidlive/rtlive-global/blob/master/rtlive/preprocessing.py#L108

MaPePeR commented 3 years ago

I don't think this change improved the prediction: With keep_data=True: With keep_data=False:

michaelosthege commented 3 years ago

@MaPePeR in your cell [12] the good looking yellow zig-zag line is df_be.diff(), but in [13] you're training the prophet model with f_be_weekly_avg_per_day

MaPePeR commented 3 years ago

For that part I don't want to predict the yellow zig-zag line, but the blue staircase, so I assume df_be_weekly_avg_per_day to be correct. I expected the result to kind of follow the staircase as seen for the netherland test prediction.

I cannot reproduce the forecast for netherlands:

result_df, results = data_nl.forecast_NL(ourworldindata.create_loader_function("NL")(pandas.Timestamp('today')))
pyplot.plot(result_df.xs('all')['predicted_new_tests'], label='predicted')
pyplot.plot(result_df.xs('all')['new_tests'], label='tests')
pyplot.legend()

So maybe there is something wrong with my prophet installation or they behave differently when called from a notepad? I followed the miniconda env instructions from the README.

michaelosthege commented 3 years ago

The plot in this comment looks like prophet did not converge.

Your last plot here looks correct. On Rtlive.de/global we use process_testcounts(..., future_days=10) which is why the orange line in the "Netherlands, 2021-01-08" plot goes further into the future than yours.

Maybe I'm not seeing the problem you're referring to.

MaPePeR commented 3 years ago

I now tried:

nl_data = ourworldindata.create_loader_function("NL")(pandas.Timestamp('today'))
result_df, results = data.process_testcounts('NL', nl_data, 10)
pyplot.plot(result_df.xs('all')['predicted_new_tests'], label='predicted')
pyplot.plot(result_df.xs('all')['new_tests'], label='tests')
pyplot.legend()

I was not expecting a saw-like prediction. But I now realized, that this might be a wrong assumption from my part and one only does not see this saw on rtlive.de/global because of graph scaling.

But I also realized, that my approach is wrong: I shouldn't use the prediction on data that is the same for all weekdays and then expect the prediction to also be the same on all weekdays. Instead I probably should use a predictor, that predicts the real weekly data instead of "fake" daily data.

MaPePeR commented 3 years ago

I've gone back to my first idea of scaling the prediction directly and I think it looks promising: https://github.com/MaPePeR/rtlive-global/blob/correct_prediction_with_weekly_sum/notebooks/Correct_daily_count_with_weekly_sum.ipynb

Again, this is on belgian data, but simulating the german missing data.

michaelosthege commented 3 years ago

@MaPePeR indeed, this looks very promising! Do I understand it correctly that you used a shortened daily data timeseries from BE to train and afterwards scale the prediction with the weekly sums of the full series?

Now to "get this into production" you'll have to make this a function with a very clearly defined signature (type hints + docstring). I suppose this function should take two dataframes - one with daily data and one with weekly sums?

MaPePeR commented 3 years ago

Yes, you understood correctly. I only shorten the daily data from BE to simulate the data situation we have in germany. I only scale the part between the dotted lines, because I assume the weekly data is also not available completely. So I don't scale the full series.

At the moment I'm thinking if it would be more robust to don't base it on weeks and instead allow for arbitrary gaps in the data. It needs two Series: the daily data and one with "weekly" sums or the weekly total/running sums. These could be provided in separate dataframes or in a single dataframe. To keep in the style of the existing methods I'd go for a single dataframe and adopt the get_testcounts_DE function to also add the oiwd-series to the frame.

Then it kind of depends if this method only should do the scaling or also the predicting. If it should not do the predicting itself, but act as a "postprocessing" step to the prediction it needs the predicted data and needs to know at which point the prediction starts/the known data ends.

At the moment this can only be used for the all region, because we don't have data for the other regions. (One could try to estimate those, but that's a totally different topic, so I would be happy to first improve the all region) So it should also be region-specific and only work on a single region, by either supplying the region name or only supplying the Series for the specific region.

MaPePeR commented 3 years ago

I choose the 3-Series method signature and made the necessary changes and created yet another notebook in order to test it with belgian data again.

Looks like it works, but this needs review and testing, as I wasn't able to test the real data_de source, obviously.

Changes to the data_de source The method is called in [11] in the Test-Notebook. This time i called the "real" get_data_BE andforecast_BE from data_be.

Not sure, if and when I should make a PR. Should I include any of the notebooks? I feel like they are not clean enough to include them.

MaPePeR commented 3 years ago

I also now attempted to improve the region total testcounts in that same Notebook. The results also look very promising to me.

Two attempts:

Red: Just scaling the other regions with the same factors as the all-region
Black: I estimated the distribution of tests across the regions by calculating the percentage of total tests in the last week of known daily data for each region. That percentage was then used to scale the week-reports-daily change, which was then added as projected growth on top of the cumulated sum of tests for each region. This only works if the cumulated sum of daily data and report data are identical. I don't know if that's the case for the german data. ~~It also probably still needs some work if the daily-data reports and weekly-data reports don't align on the same day.~~ (I think I fixed this)

The region with the worst approximation:

One of the better predictions:

See notebook for all others.

w-flo commented 3 years ago

Thanks for your work on this! I'm unable to review it though since I don't know anything about the implementation of this project right now, so I hope project owners will take a look. :-)

A similar approach could probably be used for each German state, but the weekly testcount report is not machine-readable (it's a PDF file) and they only publish the sum of tests for a sliding window of the past 10 weeks. By starting with their first report, testcount data for every single week could be inferred, I guess. Reports are at https://ars.rki.de/Content/COVID19/Main.aspx I guess the aggregate numbers in these PDF files won't necessarily match the "secret" daily testcount data the RKI provides rtlive.de with every couple of weeks, so there needs to be a state-specific scaling factor as well.

Right now the testcount prediction for SH is way off (really small because number of tests was decreasing in the weeks before Christmas) so rtlive.de estimates severe undertesting and a high R_t value, while testcount estimation for SN might be too high, leading to a smaller R_t and probably too small estimated infection rate. I hope RKI sends out new daily testcount data to rtlive.de soon to fix this.

lhelleckes commented 3 years ago

Thanks for opening the PR @MaPePeR and thanks for the great discussion here everyone. @michaelosthege and me will have a look at it as soon as possible. Regarding the data, we just made a request to RKI and hope to get the latest numbers soon.

michaelosthege commented 3 years ago

@MaPePeR we have since received new data from the RKI. The ArcGIS database is currently unavailable (they seem to have technical difficulties), which is why https://Rtlive.de was not yet updated with this new data.

We will check out your PR and compare the adjustments made by your function with the new data to see how much it could have improved the prediction.

sdschulze commented 3 years ago

There seems to be a fluke right now, with the prediction about twice as high as the data.

michaelosthege commented 3 years ago

There seems to be a fluke right now, with the prediction about twice as high as the data.

It's actually not a bug, but we just did not get around to update the plot with an informative legend. The change is related to #12 where we implemented correcting the testcounts with weekly sums.

tl;dr: The dataset of daily testcounts we get from the RKI does not cover all tests, because not all laboratories report their testing volume. We get this data about every month. In the weekly reports the RKI publishes a sum of all tests done per week for all of Germany. The 2x scaling is because we use this sum to scale up the testing profile to match the weekly sums reported publicly.

sdschulze commented 3 years ago

Thanks for the explanation! I'm not sure if it's related, but now the Rt values by RKI have disappeared from the plot.

michaelosthege commented 3 years ago

@sdschulze no, that's a new issue. See https://github.com/rtcovidlive/rtlive-global/issues/17 Any help would be greatly appreciated!

rtcovidlive / rtlive-global

Outdated testcount data for Germany, incorrect Prophet forecast #10