reichlab / covid19-forecast-hub

Projections of COVID-19, in standardized format
https://covid19forecasthub.org

Truth data inconsistent with JHU CSSE data #528

Closed youyanggu closed 4 years ago

youyanggu commented 4 years ago

Hi,

I'm wondering where you guys are getting the truth data from. We assumed it is from the CSSE Daily Reports: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports

But looking at the truth data here: https://github.com/reichlab/covid19-forecast-hub/blob/master/data-truth/truth-Cumulative%20Deaths.csv, I noticed numerous inconsistencies with the JHU CSSE data.

For example, the truth file recorded 32,944 US deaths for 2020-04-15, but JHU shows 28,325. That's a difference of 4,619 deaths. As a recent example, the truth file recorded 848 deaths for Missouri for 2020-06-06, but JHU recorded 815 deaths.

While migrating from our own truth file to the one in this repository, I noticed around 300 rows where the difference is greater than 10. I was wondering if there is something that I am misunderstanding about the truth data. Having the correct "ground truth" values is very important in making forecasts, so I want to make sure that I correctly understand how to compute those values.

I attached a file below with all differences larger than 10 compared to our own source of truth. Note that our source of truth may differ slightly but not significantly from JHU truth (e.g. we don't count the 3 deaths from Grand Princess, and we avoid negative incident deaths).

Thanks, Youyang

truth_diff_yyg_reich.txt
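The comparison described above (list every location and date where two truth sources differ by more than 10) can be sketched roughly as follows. This is only an illustration, not the script used to produce the attached file: the function name `find_discrepancies`, the dictionary layout, and the "Example State" row are hypothetical, while the US and Missouri figures are the ones quoted in this comment.

```python
from typing import Dict, List, Tuple

# Key is (location, date as YYYY-MM-DD); value is cumulative deaths.
Key = Tuple[str, str]


def find_discrepancies(
    a: Dict[Key, int], b: Dict[Key, int], threshold: int = 10
) -> List[Tuple[Key, int, int, int]]:
    """Return (key, a_value, b_value, a - b) for keys present in both
    sources whose absolute difference is strictly greater than `threshold`."""
    rows = []
    for key in sorted(a.keys() & b.keys()):
        diff = a[key] - b[key]
        if abs(diff) > threshold:
            rows.append((key, a[key], b[key], diff))
    return rows


# Values for US and Missouri are the ones quoted in the comment above.
# "Example State" is a made-up row included only to show a difference
# that falls under the threshold.
hub_truth = {
    ("US", "2020-04-15"): 32944,
    ("Missouri", "2020-06-06"): 848,
    ("Example State", "2020-06-06"): 105,
}
daily_reports = {
    ("US", "2020-04-15"): 28325,
    ("Missouri", "2020-06-06"): 815,
    ("Example State", "2020-06-06"): 100,
}

discrepancies = find_discrepancies(hub_truth, daily_reports)
for (location, date), hub_value, daily_value, diff in discrepancies:
    print(f"{location} {date}: {hub_value} vs {daily_value} (diff {diff})")
```

Only keys present in both sources are compared, so rows missing from one file are silently skipped; a fuller check would probably report those separately.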

nickreich commented 4 years ago

Hi @youyanggu - thanks for raising this issue/question. Some details about what truth data we are using can be found here. It does seem like we are maybe using different files from the same source.

I think the data that we are using is the official record that INCLUDES REVISIONS to the data. As JHU states on their website, the daily reports are not updated, in order to maintain a record of the raw data, but the time series data that we are using is updated as the states update and backfill their reporting. Specifically, they say:

This folder contains daily time series summary tables, including confirmed, deaths and recovered. All data is read in from the daily case report. The time series tables are subject to be updated if inaccuracies are identified in our historical data. The daily reports will not be adjusted in these instances to maintain a record of raw data.

This issue of revisions to surveillance data is a very common theme in infectious disease modeling and is an issue that we have had to contend with in forecasting dengue fever and flu as well.

I am pretty sure that the timeseries version of the JHU data is the correct data to be using as ground truth data, so I am closing this issue. Happy to discuss more or revisit if you have other thoughts.
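To make the distinction concrete, here is a minimal sketch of how a frozen raw record and a revised time series can end up disagreeing for the same date. It is not JHU's actual pipeline: the helper `apply_revisions` and the dictionary layout are hypothetical, and the two US values for 2020-04-15 are the ones quoted earlier in this thread.

```python
from typing import Dict


def apply_revisions(raw: Dict[str, int], revisions: Dict[str, int]) -> Dict[str, int]:
    """Return a revised copy of `raw`; the raw record itself is left untouched."""
    revised = dict(raw)
    revised.update(revisions)
    return revised


# Raw daily-report value for the US, as quoted earlier in this thread.
raw_daily = {"2020-04-15": 28325}

# A later backfill for the same date (value from the hub truth file).
backfill = {"2020-04-15": 32944}

time_series = apply_revisions(raw_daily, backfill)

# Dates where the frozen raw record and the revised series disagree.
changed = {
    date: (raw_daily[date], time_series[date])
    for date in raw_daily
    if raw_daily[date] != time_series[date]
}
print(changed)
```

The raw record keeps what was reported on the day, while the revised series reflects later corrections, so a comparison between the two would be expected to surface exactly this kind of gap.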

youyanggu commented 4 years ago

That makes sense - thanks Nick! I guess the file I attached would be a view of the differences between the daily reports and the time series.

It may be helpful to clarify in the README that the ground truth is the JHU time series data based on the daily reports (csse_covid_19_time_series) rather than the raw daily reports (csse_covid_19_daily_reports). I know the link is to the time series so that's my fault for not clicking on it earlier.
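For anyone building a truth file from the time series rather than the daily reports, a rough sketch of the reshaping step might look like the following. It assumes a wide layout with one row per sub-region, a `Province_State` column, and one column per date in `M/D/YY` form; that layout, the function name `wide_to_long`, and the sample rows are assumptions for illustration, so check them against the actual files before relying on this.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime
from typing import Dict, Tuple


def wide_to_long(csv_text: str, id_column: str = "Province_State") -> Dict[Tuple[str, str], int]:
    """Sum wide-format rows into {(location, YYYY-MM-DD): cumulative value}."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            try:
                date = datetime.strptime(column, "%m/%d/%y").date().isoformat()
            except ValueError:
                continue  # not a date column (e.g. the id column itself)
            totals[(row[id_column], date)] += int(value)
    return dict(totals)


# Made-up sample: two sub-region rows for "Alpha", one for "Beta".
sample = """Province_State,4/14/20,4/15/20
Alpha,10,12
Alpha,1,2
Beta,5,5
"""

long_format = wide_to_long(sample)
for (location, date), value in sorted(long_format.items()):
    print(location, date, value)
```

Summing rows per location is one plausible way to get state-level totals from sub-region rows; whether it matches how the hub builds its truth file is not confirmed in this thread.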