timriffe / covid_age

COVerAGE-DB: COVID-19 cases, deaths, and tests by age and sex

Time series of cumulative count could have bumps, and not always increasing #123

Open liuyanguu opened 3 years ago

liuyanguu commented 3 years ago

I just noticed this interesting issue when my colleagues asked for time series of cases/deaths from the database. Because the database was built cumulatively, it is possible to get time series for some countries that (luckily) have frequent data updates.

This is not common, as data for most countries are quite consistent, but it happens in several countries, for example in the deaths for Indonesia. I plotted the deaths against the different update dates; they should increase monotonically, as they do in most countries. I assume this bump is caused by a change in the data source? image

The effect is less obvious when zooming out to all age groups: image

Just to give some other examples: image image image

library(covidAgeData)
library(data.table)
library(ggplot2)
dt5_ori <- covidAgeData::download_covid(data = "Output_5", temp = TRUE,
                          verbose = FALSE, progress = FALSE, return = "data.table")
c_list <- sort(unique(dt5_ori$Country))
plot_by_c <- function(country0, measure0 = "Deaths"){
  dt5_c <- dt5_ori[Country == country0 & Sex=="b" & Region == "All"  & !is.na(get(measure0))]
  # dt5_c <- dt5_ori[Country == country0 & Sex=="b" & Region == "All" & Age <=20 & !is.na(get(measure0))]
  if(nrow(dt5_c) == 0) return(NULL)
  dt5_c[, Date:= as.Date(Date, format = "%d.%m.%Y")]
  dt5_c[, Age:=factor(as.factor(Age), levels = seq(0, 100 ,by = 5))]
  g_IDN <- ggplot(data = dt5_c, aes_string(x = "Date", y = measure0, color = "Age", group = "Age")) +
    # geom_bar(stat="identity", width=0.5, show.legend = FALSE, color = "#0058AB") +
    geom_line() +
    labs(x = "", y = "") + 
    scale_x_date(date_labels = "%Y-%m") +
    scale_y_continuous(expand = c(0,0)) +
    ggtitle(paste(country0, "-", measure0)) + 
    theme_classic() 
  return(g_IDN)
}
plot_by_c("Indonesia", measure0 = "Deaths")
# to see all the countries
plist <- invisible(lapply(c_list, plot_by_c, measure0 = "Deaths"))
plist <- plist[!sapply(plist, is.null)]
plist <- lapply(plist, ggplotGrob)
ggsave(filename = "time series of deaths_all_age.pdf",
       plot = gridExtra::marrangeGrob(grobs = plist, ncol = 2, nrow = 2), width = 20, height = 15)
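
Besides inspecting the plots, the bumps could also be listed programmatically. The following is only a minimal sketch in base R: `flag_decreases()` is a hypothetical helper (not part of covidAgeData), and the toy values are made up for illustration. It assumes a long table with `Country`, `Age`, `Date` (already converted to `Date`) and one cumulative measure column, i.e. the layout of `dt5_c` after the filtering and date conversion above.

```r
# Toy data in the same long layout as Output_5 (values are made up).
toy <- data.frame(
  Country = "X",
  Age     = rep(c(0, 5), each = 4),
  Date    = rep(as.Date("2020-06-01") + 0:3, times = 2),
  Deaths  = c(1, 2, 2, 3,    # age 0: never decreases
              4, 6, 5, 7)    # age 5: drops from 6 to 5 on the third date
)

# Hypothetical helper: return the rows where a cumulative measure is lower
# than at the previous date within the same country and age group.
flag_decreases <- function(d, measure = "Deaths") {
  d <- as.data.frame(d)
  d <- d[!is.na(d[[measure]]), ]
  d <- d[order(d$Country, d$Age, d$Date), ]
  grp <- interaction(d$Country, d$Age, drop = TRUE)
  # Change since the previous date within each series; NA for the first date.
  d$change <- ave(d[[measure]], grp, FUN = function(v) c(NA, diff(v)))
  d[!is.na(d$change) & d$change < 0, ]
}

flagged <- flag_decreases(toy)
print(flagged)
```

On the toy data this returns the single row for age 5 on 2020-06-03. Applied to the real data, the rows returned would be candidates for a closer look, not confirmed errors.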
timriffe commented 3 years ago

Thanks @liuyanguu for reporting. These artifacts seem to have different causes: a mix of inconsistent sources and surely a few manual data entry errors (Jordan). To pick out the manual entry errors, a good indicator is a jump in the daily age-specific fractions where one age moves in the opposite direction of the others. If there is a daily spike but no change in the age-specific fractions, then it's potentially an error in the registered total. In that case, scaling to an external, consistent series of totals would cure it (we can do this internally once it's identified). Kazakhstan looks like it needs its own investigation. Let's leave this issue open as cases are addressed. cc-ing @jessicadonzowa @kikeacosta
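
The two checks described above could be sketched roughly as follows. This is not the project's internal procedure: `age_fractions()` and `rescale_to_totals()` are hypothetical helpers, "fraction" is assumed to mean each age group's share of that date's total, and both the toy counts and the `external` series of totals are made up.

```r
# Toy data: the total spikes on the second date, but age shares do not move.
toy <- data.frame(
  Date   = rep(as.Date("2020-06-01") + 0:2, each = 2),
  Age    = rep(c(0, 50), times = 3),
  Deaths = c(10,  90,
             20, 180,
             12, 108)
)

# Hypothetical external, consistent series of totals (made-up numbers).
external <- data.frame(Date  = as.Date("2020-06-01") + 0:2,
                       Total = c(100, 110, 120))

# Share of each age group in the total of its date.
age_fractions <- function(d, measure = "Deaths") {
  d <- as.data.frame(d)
  total <- ave(d[[measure]], format(d$Date), FUN = sum)
  d$fraction <- d[[measure]] / total
  d
}

# Keep the observed age shares, but distribute the external total over them.
rescale_to_totals <- function(d, totals, measure = "Deaths") {
  d <- age_fractions(d, measure)
  idx <- match(format(d$Date), format(totals$Date))
  d$scaled <- d$fraction * totals$Total[idx]
  d
}

# Check 1: change in each age's fraction between consecutive dates.
fr <- age_fractions(toy)
fr <- fr[order(fr$Age, fr$Date), ]
fr$d_fraction <- ave(fr$fraction, fr$Age, FUN = function(v) c(NA, diff(v)))
max_shift <- max(abs(fr$d_fraction), na.rm = TRUE)

# Check 2: fractions are stable here, so rescale to the external totals.
res <- rescale_to_totals(toy, external)
print(res)
```

In this toy case `max_shift` is essentially zero although the total jumps, which is the pattern that points to the registered total and not to a single age group. A large `d_fraction` with opposite sign for one age would instead suggest a manual entry error in that age.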