Age-disaggregated outputs can be slow

pwinskill commented 10 months ago

Specifying incidence of clinical or severe malaria disaggregated by age can slow down the model a lot. Having spoken with @giovannic, this is likely more to do with the implementation of estimating age-disaggregated outputs, rather than the overhead of rendering more output.

An example comparing a single age-band with 100 age bands:

library(malariasimulation)

# Parameters that specify more output with a single age group
p_single_age <- get_parameters(
  overrides = list(
    human_population = 1000
  )
) |>
  set_equilibrium(
    init_EIR = 5
  )
p_single_age$clinical_incidence_rendering_min_ages <- 0
p_single_age$clinical_incidence_rendering_max_ages <- 100 * 365

# Parameters that specify output with multiple age groups
p_multiple_age <- p_single_age
p_multiple_age$clinical_incidence_rendering_min_ages <- 0:99 * 365
p_multiple_age$clinical_incidence_rendering_max_ages <- 1:100 * 365

system.time({
  s_single_age <- run_simulation(
    timesteps = 365 * 2,
    parameters = p_single_age
  )
})

system.time({
  s_multiple_age <- run_simulation(
    timesteps = 365 * 2,
    parameters = p_multiple_age
  )
})

giovannic commented 10 months ago

Yes. The current implementation calls birth$get_index_of(a = timestep - upper, b = timestep - lower) for each age group each timestep.

I'd guess a c++ implementation which computes all of the stats from all the births and intervals will save ~99 RCpp calls per iteration.

giovannic commented 10 months ago

Found an R implementation for a personal project which is 6x faster feat/grid_render.

But it makes assumptions about the age groups which will be output (year wide age groups from 0 to 100). And some kinks need to be resolved, it doesn't output the same values as the existing method for some reason 🤔

Test, outputs and plots are attached below:

# Parameters that specify more output with a single age group
p_single_age <- get_parameters(
  overrides = list(
    human_population = 1000
  )
) |>
  set_equilibrium(
    init_EIR = 5
  )
p_single_age$prevalence_rendering_min_ages <- 0
p_single_age$prevalence_rendering_max_ages <- 100 * 365

# Parameters that specify output with multiple age groups
p_multiple_age <- p_single_age
p_multiple_age$prevalence_rendering_min_ages <- 0:99 * 365
p_multiple_age$prevalence_rendering_max_ages <- 1:100 * 365 - 1

p_grid <- p_single_age
p_grid$render_grid <- TRUE

set.seed(1)

system.time({
  s_single_age <- run_simulation(
    timesteps = 365 * 2,
    parameters = p_single_age
  )
})

set.seed(1)

system.time({
  s_grid <- run_simulation(
    timesteps = 365 * 2,
    parameters = p_grid
  )
})

jpeg('grid.jpg')
for (i in c(1, 50, 100)) {
  if (i == 1) {
    plot(s_grid$grid$timestep, s_grid$grid$n_detect[,i], type = 'l')
  } else {
    lines(s_grid$grid$timestep, s_grid$grid$n_detect[,i])
  }
}
dev.off()

set.seed(1)

system.time({
  s_multiple_age <- run_simulation(
    timesteps = 365 * 2,
    parameters = p_multiple_age
  )
})

jpeg('multiple.jpg')
for (i in c(1, 50, 100)) {
  if (i == 1) {
    plot(
      as.numeric(s_multiple_age$timestep),
      as.numeric(s_multiple_age[[paste0(
        'n_detect_',
        p_multiple_age$prevalence_rendering_min_ages[i],
        '_',
        p_multiple_age$prevalence_rendering_max_ages[i]
      )]]),
      type = 'l'
    )
  } else {
    lines(
      as.numeric(s_multiple_age$timestep),
      as.numeric(s_multiple_age[[paste0(
        'n_detect_',
        p_multiple_age$prevalence_rendering_min_ages[i],
        '_',
        p_multiple_age$prevalence_rendering_max_ages[i]
      )]])
    )
  }
}
dev.off()

outputs:

root@c40d0e961b76:/opt# Rscript test.R
   user  system elapsed              # single
  4.994   0.000   4.995
   user  system elapsed              # grid
  6.935   0.000   6.910
   user  system elapsed              # multiple
 46.173   0.000  46.169

Multiple:

Grid: grid

giovannic commented 10 months ago

It was an RNG issue

multiple grid

mrc-ide / malariasimulation

Age-disaggregated outputs can be slow #275