joachim-gassen / tidycovid19

{tidycovid19}: An R Package to Download, Tidy and Visualize Covid-19 Related Data
https://joachim-gassen.github.io/tidycovid19/

Israel and Serbia data seem wrong #3

Closed DataStrategist closed 4 years ago

DataStrategist commented 4 years ago

image

Plotting by days since the first reported case shows both Serbia and Israel reverting to a different reported number of deaths. Should I also open an issue with Johns Hopkins?

joachim-gassen commented 4 years ago

Hi there. Not sure what snippet of data you are showing here. In any case, there are some inconsistencies in the JHU CSSE case data, and they are already being discussed in their repo. To get a systematic overview, you can do something like

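library(tidycovid19)
library(dplyr)

# Download the cached version of the JHU CSSE data
df <- download_jhu_csse_covid19_data(cached = TRUE, silent = TRUE)

# The counts are cumulative, so within a country a value should never be
# below the previous day's value or above the next day's value.
# Sort by date first so that lag()/lead() compare adjacent days.
df <- df %>% arrange(iso3c, date)

# Rows on either side of a decrease in recovered cases
df %>%
  group_by(iso3c) %>%
  filter(recovered < lag(recovered) |
           recovered > lead(recovered)) -> odd_recovered

# Rows on either side of a decrease in deaths
df %>%
  group_by(iso3c) %>%
  filter(deaths < lag(deaths) |
           deaths > lead(deaths)) -> odd_deaths

# Rows on either side of a decrease in confirmed cases
df %>%
  group_by(iso3c) %>%
  filter(confirmed < lag(confirmed) |
           confirmed > lead(confirmed)) -> odd_confirmed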

Serbia and Iceland (ISL) are both included in these lists. I decided against arbitrarily 'fixing' these issues, as they might reflect honest data misclassifications.
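To see at a glance which countries are affected, the flagged rows can be condensed into one line per country. The following is only a rough sketch on made-up toy data (the country codes "AAA" and "BBB" and all values are invented), and it counts only the day on which the value drops rather than both neighboring rows as the filters above do:

```r
library(dplyr)

# Toy data standing in for the JHU CSSE download (values are made up).
toy <- tibble(
  iso3c  = rep(c("AAA", "BBB"), each = 4),
  date   = rep(as.Date("2020-03-01") + 0:3, times = 2),
  deaths = c(0, 2, 1, 3,   0, 1, 2, 4)
)

# Keep only rows where the cumulative count is below the previous day's value.
decreases <- toy %>%
  arrange(iso3c, date) %>%
  group_by(iso3c) %>%
  mutate(prev_deaths = lag(deaths)) %>%
  filter(deaths < prev_deaths) %>%  # NA comparisons (first day) are dropped
  ungroup()

# One row per affected country with the number of downward revisions.
summary_tbl <- decreases %>% count(iso3c, name = "n_decreases")
```

Replacing `toy` with the downloaded `df` would apply the same summary to the real data, and the same pattern carries over to `confirmed` and `recovered`.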

DataStrategist commented 4 years ago

Should this be an optional argument that lets users download cleaned data? It could default to TRUE (i.e., remove the odd data), with the option to keep the data for those who know what they are doing. I think the risk of people not spotting this strange data and using it in their analysis outweighs the benefit of preserving the data as reported. I'd be happy to send you a PR if you want.
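A minimal sketch of what such an option could look like, written as a user-side helper. The function name `drop_odd_rows()` and its `clean` argument are hypothetical and not part of tidycovid19, the toy data is made up, and "cleaning" is assumed here to mean dropping the rows on either side of a decrease in deaths:

```r
library(dplyr)

# Hypothetical helper, NOT part of tidycovid19.
drop_odd_rows <- function(df, clean = TRUE) {
  if (!clean) return(df)  # opt out: return the data exactly as reported
  df %>%
    arrange(iso3c, date) %>%
    group_by(iso3c) %>%
    # Drop a row if it sits on either side of a day-over-day decrease.
    # coalesce() treats the NA from lag()/lead() at the first/last day
    # of each country as "not odd".
    filter(!coalesce(deaths < lag(deaths), FALSE),
           !coalesce(deaths > lead(deaths), FALSE)) %>%
    ungroup()
}

# Made-up example: the count drops from 2 to 1 on the third day.
toy <- tibble(
  iso3c  = rep("AAA", 4),
  date   = as.Date("2020-03-01") + 0:3,
  deaths = c(0, 2, 1, 3)
)

cleaned   <- drop_odd_rows(toy)                 # drops the rows with 2 and 1
untouched <- drop_odd_rows(toy, clean = FALSE)  # keeps all four rows
```

Dropping rows is only one possible reading of "cleaned"; a variant could add a flag column and leave the values in place.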

joachim-gassen commented 4 years ago

Thank you for your feedback. I thought about your suggestion, but for now I would prefer not to modify the data (not even optionally). The key reason is that these data might well be correct as of the time they were reported. A death that has been attributed to corona can later be found to be due to other causes. "Fixing history" should be done, if feasible, by the authoritative data source and not by an R package collecting data from public sources...