ulklc / covid19-timeseries

Covid19 timeseries data store
MIT License

Reported number of deaths for USA too high compared with other sources #17

Closed: nokyotsu closed this issue 4 years ago

nokyotsu commented 4 years ago

The number of deaths for the US appears too high: since April 16 it's deviating by more than 4K from the latest data on Wikipedia. Is there a mistake somewhere?

            ulklc  wikipedia   diff
day                               
2020-04-09  16636      16466   +170
2020-04-10  18695      18544   +151
2020-04-11  20555      20454   +101
2020-04-12  22101      21936   +165
2020-04-13  23610      23398   +212
2020-04-14  25975      25776   +199
2020-04-15  28506      28214   +292
2020-04-16  34580      30355  +4225
2020-04-17  37133      32435  +4698
2020-04-18  38937      34178  +4759

ulklc commented 4 years ago

Hi @nokyotsu, the sources don't match each other. For example:

https://www.worldometers.info/coronavirus/country/us/ -> 39k
https://bnonews.com/index.php/2020/04/the-latest-coronavirus-cases/ -> 38k
https://en.wikipedia.org/wiki/Template:2019%E2%80%9320_coronavirus_pandemic_data/United_States_medical_cases_chart -> 34k
https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv -> 38k

Not sure, but several of these sources almost match our data.

Thanks

nokyotsu commented 4 years ago

I understand the values from various sources may not match exactly, but the values show a very large sudden jump on April 16 which, I believe, is not there in any of the sources.
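To make the jump concrete, here is a minimal sketch that recomputes day-over-day increases from the table I posted above. The helper `daily_increase` and the variable names are just mine for illustration, not anything from this repo; the numbers are copied from the table.

```python
# Cumulative US deaths copied from the table earlier in this thread:
# (day, ulklc, wikipedia)
rows = [
    ("2020-04-09", 16636, 16466),
    ("2020-04-10", 18695, 18544),
    ("2020-04-11", 20555, 20454),
    ("2020-04-12", 22101, 21936),
    ("2020-04-13", 23610, 23398),
    ("2020-04-14", 25975, 25776),
    ("2020-04-15", 28506, 28214),
    ("2020-04-16", 34580, 30355),
    ("2020-04-17", 37133, 32435),
    ("2020-04-18", 38937, 34178),
]


def daily_increase(values):
    """Return day-over-day increases of a cumulative series (hypothetical helper)."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


days = [r[0] for r in rows]
ulklc_new = daily_increase([r[1] for r in rows])
wiki_new = daily_increase([r[2] for r in rows])

# How much the gap between the two sources changed on each day.
# A single large value points at a one-off bulk addition in one source.
gap_change = [u - w for u, w in zip(ulklc_new, wiki_new)]

for day, u, w, g in zip(days[1:], ulklc_new, wiki_new, gap_change):
    print(f"{day}  ulklc +{u:5d}  wikipedia +{w:5d}  gap change {g:+5d}")

# Day on which the gap moved the most.
spike_index = max(range(len(gap_change)), key=lambda i: abs(gap_change[i]))
spike_day = days[1:][spike_index]
print("largest gap change on", spike_day)
```

On these numbers the gap moves by less than 100 on every day up to April 15 and then by several thousand on April 16, which is the sudden jump I mean.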

chrisjbillington commented 4 years ago

It seems likely to be the roughly 3,700 extra deaths that NY recently added as a revision:

https://www.theguardian.com/us-news/2020/apr/15/new-york-city-coronavirus-death-toll-jumps-revised-count

Not sure why some data sources have it and some don't. You can see some discussion of the discrepancy on the talk page of the Wikipedia link.

Perhaps there is legitimate disagreement over whether data sources should count only the outcomes of tested cases. Since these individuals were never tested, they presumably aren't in the case counts. That leads to odd possibilities, such as deaths and recoveries eventually adding up to more than confirmed cases, or the case fatality rate right now being biased upward because these deaths are in the numerator but not the denominator.
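A rough sketch of the numerator/denominator point, using made-up round numbers (not reported figures) purely to show the direction of the bias. The function name `case_fatality_rate` and all three input values are assumptions for illustration.

```python
def case_fatality_rate(deaths, cases):
    """Naive CFR: cumulative deaths divided by cumulative confirmed cases."""
    return deaths / cases


# Hypothetical illustrative values, NOT real data.
confirmed_cases = 100_000   # people who tested positive
tested_deaths = 5_000       # deaths among those confirmed cases
untested_deaths = 500       # probable deaths that were never tested

# Deaths and cases drawn from the same population (tested only).
cfr_consistent = case_fatality_rate(tested_deaths, confirmed_cases)

# Untested deaths added to the numerator only, as in the revised counts.
cfr_biased = case_fatality_rate(tested_deaths + untested_deaths, confirmed_cases)

# Untested deaths added to both deaths and cases.
cfr_both = case_fatality_rate(
    tested_deaths + untested_deaths, confirmed_cases + untested_deaths
)

print(f"tested only:          {cfr_consistent:.4f}")
print(f"numerator only:       {cfr_biased:.4f}")
print(f"numerator and cases:  {cfr_both:.4f}")
```

With any inputs where deaths are fewer than cases, the numerator-only figure comes out highest, which is the bias described above.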