Closed DataStrategist closed 4 years ago
Hi there. I'm not sure which snippet of data you are showing here. In any case, there are some inconsistencies in the JHU CSSE case data, and they are already being discussed in their repo. To get a systematic overview you can do something like:
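library(tidycovid19)
library(dplyr)

# Download the JHU CSSE data (cached copy)
df <- download_jhu_csse_covid19_data(cached = TRUE, silent = TRUE)

# Cumulative counts should never decrease within a country. Each filter keeps
# the rows on either side of a drop: a value lower than the previous row or
# higher than the next row (this relies on rows being ordered by date).
odd_recovered <- df %>%
  group_by(iso3c) %>%
  filter(recovered < lag(recovered) |
           recovered > lead(recovered))

odd_deaths <- df %>%
  group_by(iso3c) %>%
  filter(deaths < lag(deaths) |
           deaths > lead(deaths))

odd_confirmed <- df %>%
  group_by(iso3c) %>%
  filter(confirmed < lag(confirmed) |
           confirmed > lead(confirmed))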
Serbia and Iceland (ISL) are both included in these lists. I decided against arbitrarily 'fixing' these issues, as they might reflect honest data misclassifications.
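To see at a glance which countries are affected, the checks above can be condensed into a per-country count of drops. The following is only a minimal sketch on made-up toy data (the `toy` data frame and its values are illustrative assumptions, not JHU CSSE figures); it assumes the same `iso3c`, `date`, `confirmed` and `deaths` columns used above.

```r
library(dplyr)

# Toy stand-in for the downloaded data; all values are invented
toy <- data.frame(
  iso3c = rep(c("AAA", "BBB"), each = 4),
  date = rep(as.Date("2020-04-01") + 0:3, times = 2),
  confirmed = c(10, 12, 15, 20, 5, 8, 7, 9),
  deaths = c(0, 1, 1, 2, 1, 2, 2, 1)
)

# Count, per country, the days on which a cumulative series dropped
decreases <- toy %>%
  arrange(iso3c, date) %>%
  group_by(iso3c) %>%
  summarise(
    confirmed_drops = sum(confirmed < lag(confirmed), na.rm = TRUE),
    deaths_drops = sum(deaths < lag(deaths), na.rm = TRUE),
    .groups = "drop"
  )

print(decreases)
```

Countries with a non-zero count in any column are the ones worth inspecting by hand.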
Should this be an optional input that could be used to download cleaned data? It could default to TRUE (i.e. remove the odd data) and be switched off by anyone who knows what they are doing. I think the risk of people not noticing this strange data and using it in their analysis outweighs the benefit of preserving the raw data. I'd be happy to PR you something if you want.
Thank you for your feedback. I thought about your suggestion, but currently I would prefer not to modify the data (not even optionally). The key reason is that these data might well have been correct at the time they were reported. A death that has been attributed to corona can later be found to be due to other causes. "Fixing history" should be done, if feasible, by the authoritative data source and not by an R package collecting data from public sources...
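Anyone who still wants to handle these observations can do so on their own side after downloading. As a rough sketch only: `flag_decreases()` below is a hypothetical helper, not part of tidycovid19, and the toy data is invented. It marks the days on which a cumulative series dropped and leaves it to the user to decide what to do with them.

```r
library(dplyr)

# Hypothetical user-side helper (not a tidycovid19 function): adds a logical
# column `decreased` that is TRUE where `var` is lower than on the previous day
flag_decreases <- function(df, var) {
  df %>%
    arrange(iso3c, date) %>%
    group_by(iso3c) %>%
    mutate(decreased = coalesce({{ var }} < lag({{ var }}), FALSE)) %>%
    ungroup()
}

# Toy data; all values are invented
toy <- data.frame(
  iso3c = rep(c("AAA", "BBB"), each = 5),
  date = rep(as.Date("2020-04-01") + 0:4, times = 2),
  deaths = c(0, 1, 3, 2, 3, 0, 0, 1, 1, 2)
)

flagged <- flag_decreases(toy, deaths)

# One possible choice: drop the flagged rows. Whether the lower value or the
# earlier higher value is the wrong one cannot be told from the data alone.
cleaned <- filter(flagged, !decreased)
```

Flagging rather than deleting keeps the original values available, which is in line with the concern that they may have been correct when reported.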
Taking days since the first reported case shows both Serbia and Israel going back to a different reported number of deaths. Should I also open an issue with Johns Hopkins?
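For reference, the alignment described above could be reproduced roughly as follows. This is a sketch on invented toy data, assuming the same column names as in the earlier snippet; `days_since_first_case` and `deaths_change` are names introduced here for illustration.

```r
library(dplyr)

# Toy data; all values are invented
toy <- data.frame(
  iso3c = rep(c("AAA", "BBB"), each = 4),
  date = rep(as.Date("2020-03-01") + 0:3, times = 2),
  confirmed = c(0, 1, 2, 3, 2, 3, 4, 5),
  deaths = c(0, 0, 1, 0, 0, 1, 1, 2)
)

# Align each country on the day of its first reported case and
# compute the day-to-day change in cumulative deaths
aligned <- toy %>%
  filter(confirmed > 0) %>%
  group_by(iso3c) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    days_since_first_case = as.integer(date - min(date)),
    deaths_change = deaths - lag(deaths)
  ) %>%
  ungroup()

# Days on which the reported death count went down
reversals <- filter(aligned, deaths_change < 0)
```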