Open MikeLydeamore opened 1 year ago
It would be really great if fixing this also helps resolve #75 as that has been doing my head in slightly
here's a reprex of your gist with smaller age groups.
If I understand correctly, the plot at the end summarises the issue - the ages beyond the range of the data drop off a cliff.
library(tidyverse)
library(conmat)
age_breaks <- seq(0, 5, by = 1)
population <- data.frame(
population = rep(200, times = length(age_breaks)),
age = age_breaks
) |>
as_conmat_population(
age = age,
population = population
)
population
#> # A tibble: 6 × 2 (conmat_population)
#> - age: age
#> - population: population
#> population age
#> <dbl> <dbl>
#> 1 200 0
#> 2 200 1
#> 3 200 2
#> 4 200 3
#> 5 200 4
#> 6 200 5
age_breaks <- c(age_breaks, Inf)
contact_model <- polymod_setting_models$home
age <- age(population)
age_var <- age_label(population)
population <- population %>% dplyr::arrange(!!age)
# this could be changed to a function for lower age limit
age_min_integration <- min(population[[age_var]])
bin_widths <- diff(population[[age_var]])
final_bin_width <- bin_widths[length(bin_widths)]
age_max_integration <- max(population[[age_var]]) + final_bin_width
# need to check we are not predicting to 0 populations (interpolator can
# predict 0 values, then the aggregated ages get screwed up)
pop_fun <- get_age_population_function(population)
ages <- age_min_integration:age_max_integration
valid <- pop_fun(ages) > 0
age_min_integration <- min(ages[valid])
age_max_integration <- max(ages[valid])
pred_1y <- predict_contacts_1y(
model = contact_model,
population = population,
# these two arguments could be changed by just taking in the age vector
# and then doing that step above internally
age_min = age_min_integration,
age_max = age_max_integration
)
pred_1y
#> # A tibble: 49 × 4
#> age_from age_to contacts se_contacts
#> <dbl> <dbl> <dbl[1d]> <dbl[1d]>
#> 1 0 0 1.32 0.104
#> 2 0 1 1.35 0.0899
#> 3 0 2 1.34 0.0759
#> 4 0 3 1.28 0.0636
#> 5 0 4 1.19 0.0538
#> 6 0 5 0.723 0.0312
#> 7 0 6 0.322 0.0141
#> 8 1 0 1.35 0.0899
#> 9 1 1 1.46 0.0971
#> 10 1 2 1.51 0.0824
#> # ℹ 39 more rows
# get function for 1y age populations in this country
age_population_function <- get_age_population_function(population)
a <- pred_1y %>%
dplyr::mutate(
age_group_to = cut(
pmax(0.1, age_to),
age_breaks,
right = FALSE
)
) %>%
dplyr::filter(
!is.na(age_group_to)
) %>%
# sum the number of contacts to the 'to' age groups, for each integer
# participant age
dplyr::group_by(
age_from,
age_group_to
) %>%
dplyr::summarise(
contacts = sum(contacts),
.groups = "drop"
) %>%
# *average* the total contacts within the 'from' contacts, weighted by the
# population distribution (to get contacts for the population-average ember of
# that age group)
dplyr::mutate(
pop_age_from = age_population_function(age_from),
age_group_from = cut(
pmax(0.1, age_from),
age_breaks,
right = FALSE
)
) %>%
dplyr::filter(
!is.na(age_group_from)
) %>%
dplyr::group_by(
age_group_from,
age_group_to
)
ggplot(a,
aes(age_from,
pop_age_from)) +
geom_line()
Created on 2023-04-12 with reprex v2.0.2
I'm not sure the best way to solve this, but as you pointed out, this interpolation is with predict_to_long_age_ranges
,
https://github.com/njtierney/conmat/blob/master/R/get-age-population-function-internals.R#L109-L118
which takes a model fit with smooth.spline
:
https://github.com/njtierney/conmat/blob/master/R/get-age-population-function-internals.R#L130-L139
I need to explore these steps, I wonder if the issue is in the smooth spline or predict_to_long_age_ranges
We've isolated this down to build_lookup_populations
which takes in a vector of age breaks and returns the appropriate population size. The difficulty is predict_to_long_age_ranges
takes the "infinite" final bin and expands it out to a finite one, but then you have more age breaks than you're expecting.
This, I think, is because age_max
inside predict_contacts_1y
can be less than the maximum produced by predict_to_long_age_ranges
(which is called in the same function!).
A decision would need to be made as to what to do here - we could:
a) Throw an error that the age_max
is too low
b) Throw a warning that age_max
is too low and adjust it
c) Throw a warning that age_max
is too low, and redistribute those beyond the maximum to the remaining categories proportionally.
My intuition is that c) is the easiest and most sensible. So, for your example, only age 7 is beyond the maximum, so we just add those individuals into the age 6 class.
However, this leads to some... interesting behaviour if the tail is long. If I replace your age_breaks
with age_breaks <- seq(0, 80, by=10)
then I get a plot that looks like this:
which is because there's a bunch of population going out a long way to the right... Here 90 is the integration cutoff (10 years past the last bin):
r$> a %>% ungroup() %>% select(age_from, pop_age_from) %>% distinct() %>% tail()
# A tibble: 6 × 2
age_from pop_age_from
<int> <dbl>
1 85 14.3
2 86 13.3
3 87 12.4
4 88 11.4
5 89 10.5
6 90 52.4
So, that leads me to think that c) isn't an option, and I don't see b) as being an option because it will cause problems elsewhere, so that only leaves a), which perhaps isn't the most ideal...
The only other thing I can think of is forcing max_age
to be large - we filter off anything negative anyway so we're left with only >0 population boxes.
So, one of the potenyial reasons this can cause problems or concerns is when the user is inputting a conmat population into conmat for simulation purposes, say an SIR model.
The age function has a linear decay model for the number of individuals in age classes beyond the final bin. This results in an unexpected (always lower?) number of people in the final bin.
The result of this is a non-symmetric contact matrix, which is especially unexpected in the context of #140.
Fix will involve re-doing this population model, probably the same as #52 and #75.
A gist with a mostly working reprex that shows the unexpected behaviour: https://gist.github.com/MikeLydeamore/2e67f703d812aba4d2d3bd6edf926b8f
The error is in
predict_to_long_age_ranges
.