idem-lab / conmat

Create Contact Matrices from Population Data
https://idem-lab.github.io/conmat/dev/

population size not conserved with infinite age bin #143

Open MikeLydeamore opened 1 year ago

MikeLydeamore commented 1 year ago

The age function has a linear decay model for the number of individuals in age classes beyond the final bin. This results in an unexpected (always lower?) number of people in the final bin.

The result of this is a non-symmetric contact matrix, which is especially unexpected in the context of #140.

The fix will involve re-doing this population model, probably the same fix as in #52 and #75.

A gist with a mostly working reprex that shows the unexpected behaviour: https://gist.github.com/MikeLydeamore/2e67f703d812aba4d2d3bd6edf926b8f

The error is in predict_to_long_age_ranges.
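A rough way to see the non-conservation directly (just an illustrative sketch with made-up numbers, using the exported get_age_population_function; not part of the fix):

``` r
library(conmat)

# made-up population: nine 10-year age bands, 200 people in each,
# with the last row (age 80) implicitly being the open-ended 80+ bin
pop <- data.frame(
  age = seq(0, 80, by = 10),
  population = rep(200, 9)
) |>
  as_conmat_population(
    age = age,
    population = population
  )

# 1-year interpolator used when expanding age bins
pop_fun <- get_age_population_function(pop)

# the 80+ bin was given 200 people, but because of the linear decay beyond
# the final break the predicted 1-year populations for 80+ sum to something
# different -- this is the non-conservation described above
sum(pop_fun(80:120))
```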

njtierney commented 1 year ago

It would be really great if fixing this also helps resolve #75, as that has been doing my head in slightly.

njtierney commented 1 year ago

Here's a reprex of your gist with smaller age groups.

If I understand correctly, the plot at the end summarises the issue - the ages beyond the range of the data drop off a cliff.

library(tidyverse)
library(conmat)

age_breaks <- seq(0, 5, by = 1)

population <- data.frame(
  population = rep(200, times = length(age_breaks)),
  age = age_breaks
) |>
  as_conmat_population(
    age = age,
    population = population
  )

population
#> # A tibble: 6 × 2 (conmat_population)
#>  - age: age
#>  - population: population
#>   population   age
#>        <dbl> <dbl>
#> 1        200     0
#> 2        200     1
#> 3        200     2
#> 4        200     3
#> 5        200     4
#> 6        200     5

age_breaks <- c(age_breaks, Inf)
contact_model <- polymod_setting_models$home

age <- age(population)
age_var <- age_label(population)
population <- population %>% dplyr::arrange(!!age)

# this could be changed to a function for lower age limit
age_min_integration <- min(population[[age_var]])
bin_widths <- diff(population[[age_var]])
final_bin_width <- bin_widths[length(bin_widths)]
age_max_integration <- max(population[[age_var]]) + final_bin_width

# need to check we are not predicting to 0 populations (interpolator can
# predict 0 values, then the aggregated ages get screwed up)
pop_fun <- get_age_population_function(population)
ages <- age_min_integration:age_max_integration
valid <- pop_fun(ages) > 0
age_min_integration <- min(ages[valid])
age_max_integration <- max(ages[valid])

pred_1y <- predict_contacts_1y(
  model = contact_model,
  population = population,
  # these two arguments could be changed by just taking in the age vector
  # and then doing that step above internally
  age_min = age_min_integration,
  age_max = age_max_integration
)

pred_1y
#> # A tibble: 49 × 4
#>    age_from age_to  contacts se_contacts
#>       <dbl>  <dbl> <dbl[1d]>   <dbl[1d]>
#>  1        0      0     1.32       0.104 
#>  2        0      1     1.35       0.0899
#>  3        0      2     1.34       0.0759
#>  4        0      3     1.28       0.0636
#>  5        0      4     1.19       0.0538
#>  6        0      5     0.723      0.0312
#>  7        0      6     0.322      0.0141
#>  8        1      0     1.35       0.0899
#>  9        1      1     1.46       0.0971
#> 10        1      2     1.51       0.0824
#> # ℹ 39 more rows

# get function for 1y age populations in this country
age_population_function <- get_age_population_function(population)

a <- pred_1y %>%
  dplyr::mutate(
    age_group_to = cut(
      pmax(0.1, age_to),
      age_breaks,
      right = FALSE
    )
  ) %>%
  dplyr::filter(
    !is.na(age_group_to)
  ) %>%
  # sum the number of contacts to the 'to' age groups, for each integer
  # participant age
  dplyr::group_by(
    age_from,
    age_group_to
  ) %>%
  dplyr::summarise(
    contacts = sum(contacts),
    .groups = "drop"
  ) %>%
  # *average* the total contacts within the 'from' age groups, weighted by the
  # population distribution (to get contacts for the population-average member
  # of that age group)
  dplyr::mutate(
    pop_age_from = age_population_function(age_from),
    age_group_from = cut(
      pmax(0.1, age_from),
      age_breaks,
      right = FALSE
    )
  ) %>%
  dplyr::filter(
    !is.na(age_group_from)
  ) %>%
  dplyr::group_by(
    age_group_from,
    age_group_to
  )

ggplot(a,
       aes(age_from,
           pop_age_from)) + 
  geom_line()

Created on 2023-04-12 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.3 (2023-03-15)
#>  os       macOS Ventura 13.2
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Australia/Hobart
#>  date     2023-04-12
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date (UTC) lib source
#>  cli           3.6.1      2023-03-23 [1] CRAN (R 4.2.0)
#>  colorspace    2.1-0      2023-01-23 [1] CRAN (R 4.2.0)
#>  conmat      * 0.0.2.9000 2023-04-12 [1] local
#>  curl          5.0.0      2023-01-12 [1] CRAN (R 4.2.0)
#>  digest        0.6.31     2022-12-11 [1] CRAN (R 4.2.0)
#>  dplyr       * 1.1.1      2023-03-22 [1] CRAN (R 4.2.0)
#>  evaluate      0.20       2023-01-17 [1] CRAN (R 4.2.0)
#>  fansi         1.0.4      2023-01-22 [1] CRAN (R 4.2.0)
#>  farver        2.1.1      2022-07-06 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.1      2023-02-24 [1] CRAN (R 4.2.0)
#>  forcats     * 1.0.0      2023-01-29 [1] CRAN (R 4.2.0)
#>  fs            1.6.1      2023-02-06 [1] CRAN (R 4.2.0)
#>  generics      0.1.3      2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2     * 3.4.2      2023-04-03 [1] CRAN (R 4.2.0)
#>  glue          1.6.2      2022-02-24 [1] CRAN (R 4.2.0)
#>  gtable        0.3.3      2023-03-21 [1] CRAN (R 4.2.0)
#>  highr         0.10       2022-12-22 [1] CRAN (R 4.2.0)
#>  hms           1.1.3      2023-03-21 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.5      2023-03-23 [1] CRAN (R 4.2.0)
#>  httr          1.4.5      2023-02-24 [1] CRAN (R 4.2.0)
#>  knitr         1.42       2023-01-25 [1] CRAN (R 4.2.0)
#>  labeling      0.4.2      2020-10-20 [1] CRAN (R 4.2.0)
#>  lattice       0.21-8     2023-04-05 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.3      2022-10-07 [1] CRAN (R 4.2.0)
#>  lubridate   * 1.9.2      2023-02-10 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  Matrix        1.5-4      2023-04-04 [1] CRAN (R 4.2.0)
#>  mgcv          1.8-42     2023-03-02 [1] CRAN (R 4.2.3)
#>  mime          0.12       2021-09-28 [1] CRAN (R 4.2.0)
#>  munsell       0.5.0      2018-06-12 [1] CRAN (R 4.2.0)
#>  nlme          3.1-162    2023-01-31 [1] CRAN (R 4.2.3)
#>  pillar        1.9.0      2023-03-22 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr       * 1.0.1      2023-01-10 [1] CRAN (R 4.2.0)
#>  R.cache       0.16.0     2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3   1.8.2      2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo          1.25.0     2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils       2.12.2     2022-11-11 [1] CRAN (R 4.2.0)
#>  R6            2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
#>  readr       * 2.1.4      2023-02-10 [1] CRAN (R 4.2.0)
#>  reprex        2.0.2      2022-08-17 [1] CRAN (R 4.2.0)
#>  rlang         1.1.0      2023-03-14 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.21       2023-03-26 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.14       2022-08-22 [1] CRAN (R 4.2.0)
#>  scales        1.2.1      2022-08-20 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.12     2023-01-11 [1] CRAN (R 4.2.0)
#>  stringr     * 1.5.0      2022-12-02 [1] CRAN (R 4.2.0)
#>  styler        1.9.1      2023-03-04 [1] CRAN (R 4.2.0)
#>  tibble      * 3.2.1      2023-03-20 [1] CRAN (R 4.2.0)
#>  tidyr       * 1.3.0      2023-01-24 [1] CRAN (R 4.2.0)
#>  tidyselect    1.2.0      2022-10-10 [1] CRAN (R 4.2.0)
#>  tidyverse   * 2.0.0      2023-02-22 [1] CRAN (R 4.2.0)
#>  timechange    0.2.0      2023-01-11 [1] CRAN (R 4.2.0)
#>  tzdb          0.3.0      2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8          1.2.3      2023-01-31 [1] CRAN (R 4.2.0)
#>  vctrs         0.6.1      2023-03-22 [1] CRAN (R 4.2.0)
#>  withr         2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.38       2023-03-24 [1] CRAN (R 4.2.0)
#>  xml2          1.3.3      2021-11-30 [1] CRAN (R 4.2.0)
#>  yaml          2.3.7      2023-01-23 [1] CRAN (R 4.2.0)
#>
#>  [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

I'm not sure of the best way to solve this, but as you pointed out, this interpolation happens in predict_to_long_age_ranges,

https://github.com/njtierney/conmat/blob/master/R/get-age-population-function-internals.R#L109-L118

which takes a model fit with smooth.spline:

https://github.com/njtierney/conmat/blob/master/R/get-age-population-function-internals.R#L130-L139

I need to explore these steps; I wonder whether the issue is in the smooth spline or in predict_to_long_age_ranges.
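For what it's worth, here's a standalone sketch (made-up numbers, not the conmat fit itself) of why a smooth.spline would behave like this: as far as I can tell, predict on a smooth.spline extrapolates linearly beyond the data, since natural cubic splines are linear outside the boundary knots, so predictions past the last observed age just keep following the end slope.

``` r
# made-up 1-year population counts that decline with age
ages <- 0:80
pops <- 1500 - 5 * ages + rnorm(length(ages), sd = 30)
fit <- stats::smooth.spline(ages, pops)

# beyond age 80 the natural spline is linear, so the predictions keep
# decreasing at the boundary slope rather than levelling off
predict(fit, x = seq(80, 120, by = 10))$y
```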

MikeLydeamore commented 1 year ago

We've isolated this down to build_lookup_populations, which takes in a vector of age breaks and returns the appropriate population size. The difficulty is that predict_to_long_age_ranges takes the "infinite" final bin and expands it out to a finite one, but then you have more age breaks than you're expecting.

This, I think, is because age_max inside predict_contacts_1y can be less than the maximum produced by predict_to_long_age_ranges (which is called in the same function!).

A decision would need to be made as to what to do here - we could:

a) Throw an error that the age_max is too low
b) Throw a warning that age_max is too low and adjust it
c) Throw a warning that age_max is too low, and redistribute those beyond the maximum to the remaining categories proportionally.

My intuition is that c) is the easiest and most sensible. So, for your example, only age 7 is beyond the maximum, so we just add those individuals into the age 6 class.
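To make the proportional version of (c) concrete, here's a hypothetical sketch (redistribute_overflow isn't a conmat function, just an illustration of the idea):

``` r
# hypothetical helper: move the population predicted beyond age_max back into
# the retained 1-year ages, proportional to their size, so the total is kept
redistribute_overflow <- function(ages, pops, age_max) {
  keep <- ages <= age_max
  overflow <- sum(pops[!keep])
  pops[keep] + overflow * pops[keep] / sum(pops[keep])
}

# e.g. everyone predicted beyond age 6 gets shared out across ages 0-6
redistribute_overflow(
  ages = 0:7,
  pops = c(200, 200, 200, 200, 200, 200, 150, 80),
  age_max = 6
)
```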

However, this leads to some... interesting behaviour if the tail is long. If I replace your age_breaks with age_breaks <- seq(0, 80, by=10) then I get a plot that looks like this:

(image: plot of pop_age_from against age_from for the wider age breaks)

which is because there's a bunch of population going out a long way to the right... Here 90 is the integration cutoff (10 years past the last bin):

r$> a %>% ungroup() %>% select(age_from, pop_age_from) %>% distinct() %>% tail()                                                                                               
# A tibble: 6 × 2
  age_from pop_age_from
     <int>        <dbl>
1       85         14.3
2       86         13.3
3       87         12.4
4       88         11.4
5       89         10.5
6       90         52.4

So, that leads me to think that c) isn't an option, and I don't see b) as being an option because it will cause problems elsewhere, so that only leaves a), which perhaps isn't the most ideal...

The only other thing I can think of is forcing max_age to be large - we filter off anything with zero or negative population anyway, so we're left with only >0 population boxes.
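A sketch of that idea, reusing pop_fun and the >0 filter from the reprex above (assuming, as noted there, that the interpolator eventually predicts zero or negative values at implausibly old ages):

``` r
# force the integration range to be generously wide, then let the existing
# "> 0 population" filter trim it back to ages that actually have people
ages <- 0:200
valid <- pop_fun(ages) > 0
age_min_integration <- min(ages[valid])
age_max_integration <- max(ages[valid])
```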

njtierney commented 8 months ago

So, one of the potential reasons this can cause problems or concerns is when the user is inputting a conmat population into conmat for simulation purposes, say an SIR model.
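As a concrete example of why that matters: in an age-structured SIR model the total contacts between two groups should balance, i.e. C[i, j] * N[i] should equal C[j, i] * N[j], and a final age bin whose population has quietly shrunk can break that. A hypothetical check (check_reciprocity is not a conmat function):

``` r
# hypothetical helper: measure how far a contact matrix and population are
# from satisfying reciprocity, C[i, j] * N[i] == C[j, i] * N[j]
check_reciprocity <- function(contact_matrix, n_by_age) {
  # multiplying by a length-nrow(contact_matrix) vector scales row i by N[i],
  # giving the total contacts from age group i to age group j
  total_contacts <- contact_matrix * n_by_age
  max(abs(total_contacts - t(total_contacts)))
}
```

If the population in the final bin isn't conserved, that discrepancy shows up in the rows and columns involving the oldest age group.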