mrc-ide / helios

Simulating far UVC for pathogen control
https://mrc-ide.github.io/helios/

integrating USA data into household and age group generation #97

Closed cwhittaker1000 closed 4 months ago

cwhittaker1000 commented 5 months ago

Partially addresses https://github.com/mrc-ide/helios/issues/83 - using household and age data from https://fred.publichealth.pitt.edu/syn_pops (which is in turn a synthetic population including household and age data that was created by RTI).

You have to pick a specific county so I've picked San Francisco somewhat arbitrarily to begin with. File sizes are huge so it'll be tough to do anything more general than that at this stage.

@athowes would love your thoughts on the PR and the above when you get a mo :)

cwhittaker1000 commented 5 months ago

Get started with trying this out by running

# Loading library
library(helios)

# Checking country is present as new argument and can be overridden 
model_params <- get_parameters(overrides = list(household_distribution_country = "UK"))
model_params$household_distribution_country
model_params <- get_parameters()
model_params$household_distribution_country

# Checking create variables works with USA and UK
uk_variables <- create_variables(get_parameters(overrides = list(household_distribution_country = "UK")))
uk_household_id <- uk_variables$variables_list$household$get_categories()
uk_household_sizes <- vector()
uk_households <- 1:max(as.numeric(uk_variables$variables_list$household$get_categories()))
for(i in uk_households) {
  uk_household_sizes[i] <- uk_variables$variables_list$household$get_size_of(as.character(i))
}
hist(uk_household_sizes)

usa_variables <- create_variables(get_parameters(overrides = list(household_distribution_country = "USA")))
usa_household_id <- usa_variables$variables_list$household$get_categories()
usa_household_sizes <- vector()
usa_households <- 1:max(as.numeric(usa_variables$variables_list$household$get_categories()))
for(i in usa_households) {
  usa_household_sizes[i] <- usa_variables$variables_list$household$get_size_of(as.character(i))
}
hist(usa_household_sizes)

## Checking run_simulation works with it
usa_results <- run_simulation(get_parameters(overrides = list(household_distribution_country = "USA"), archetype = "sars_cov_2"))
uk_results <- run_simulation(get_parameters(overrides = list(household_distribution_country = "UK"), archetype = "sars_cov_2"))
plot(usa_results$timestep, usa_results$E_new, type = "l")
lines(uk_results$timestep, uk_results$E_new, col = "red")
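
The household-size loops above can also be written as a single tabulation. This is a minimal sketch that assumes you already have a vector holding one household label per individual; the vector `household_of_individual` is hypothetical, made up for illustration, and is not something helios is known to return in this form.

```r
# Hypothetical input: one household label per individual (not helios output)
household_of_individual <- c("1", "1", "2", "3", "3", "3", "4")

# Number of individuals in each household
household_sizes <- table(household_of_individual)

# Number of households of each size (what the bar plot displays)
size_distribution <- table(household_sizes)

# Mean household size
mean_household_size <- mean(as.vector(household_sizes))
```

Tabulating twice avoids growing a vector inside a `for` loop and gives the size distribution directly.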

cwhittaker1000 commented 5 months ago

Note the SF household size distribution looks quite different to the UK. I've checked and the mean HH size and general shape matches that described here: image

It might be we want to pick somewhere else. Let me know thoughts.

cwhittaker1000 commented 5 months ago

Note - have updated so that the parameter country becomes household_distribution_country and we specify which country each of the locations' distributions are drawn from. Felt more exact and clearer that way.

athowes commented 5 months ago

Thanks @cwhittaker1000! Looks good to me. I verified that running the code works as expected.

Two things I picked up in my review:

  1. We could do some things to standardise across the UK and the US data such as naming the UK data with _uk and transforming the UK data in the data-raw folder. I think these are quite minor. If you agree that they're an improvement happy for them to be a new issue.

  2. Need to call devtools::document() to render the documentation for these things below before merging:

Writing schools_england.Rd
Writing baseline_household_demographics.Rd
Writing baseline_household_demographics_usa.Rd

By the way, I'd recommend using barplot(table(uk_household_sizes)) rather than hist(uk_household_sizes) for integers. Here is UK:

image

And here is US:

image
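
To illustrate the plotting suggestion, here is a minimal sketch on made-up integer data. The `sizes` vector and its probabilities are invented for illustration and are not helios output. `hist()` picks bin edges as it would for continuous data, so integer values can end up merged or unevenly spaced, whereas `table()` counts each distinct value and `barplot()` draws one bar per value.

```r
# Made-up household sizes (illustrative only, not helios output)
set.seed(1)
sizes <- sample(1:6, size = 1000, replace = TRUE,
                prob = c(0.30, 0.34, 0.16, 0.13, 0.05, 0.02))

# Bin edges chosen by hist(); these need not line up with the integers
h <- hist(sizes, plot = FALSE)
h$breaks

# One count per distinct household size
counts <- table(sizes)
barplot(counts, xlab = "Household size", ylab = "Number of households")
```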

For me it is confusing that in the UK nothing exists larger than a household of 6. The US data doesn't look too outlandish but I don't have a lot of domain expertise. The mean is pretty similar to the UK:

> mean(uk_household_sizes)
[1] 2.373606
> mean(usa_household_sizes)
[1] 2.220742

I guess we can expect more variance in epidemics with some large households. (By the way, do we have a way to track the location where people were infected? It could be interesting, especially with turning on far UVC in some locations, to see how the distribution of location of infection changes.)
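
If infection locations were recorded, summarising them could look something like the sketch below. Everything here is an assumption: the `infection_log` data frame, its column names, and the location labels are hypothetical and do not describe an existing helios output.

```r
# Hypothetical infection log: one row per infection event (not a helios output format)
infection_log <- data.frame(
  timestep = c(3, 5, 5, 8, 9, 12),
  location = c("household", "workplace", "household", "school", "leisure", "household")
)

# Share of infections occurring in each location
location_share <- prop.table(table(infection_log$location))
location_share
```

Computing the same shares for runs with and without far UVC would show how the distribution of infection locations shifts.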

cwhittaker1000 commented 4 months ago

Thanks for all of the above @athowes - have standardised the data naming and pulled all the transformation into the DATASET.R file. Have also run devtools::document() now.

Get started with this PR by:

# Loading library
library(helios)

# Checking country is present as new argument and can be overridden 
model_params <- get_parameters(overrides = list(household_distribution_country = "UK"))
model_params$household_distribution_country
model_params <- get_parameters()
model_params$household_distribution_country

# Checking create variables works with USA and UK
uk_variables <- create_variables(get_parameters(overrides = list(household_distribution_country = "UK")))
uk_household_id <- uk_variables$variables_list$household$get_categories()
uk_household_sizes <- vector()
uk_households <- 1:max(as.numeric(uk_variables$variables_list$household$get_categories()))
for(i in uk_households) {
  uk_household_sizes[i] <- uk_variables$variables_list$household$get_size_of(as.character(i))
}
barplot(table(uk_household_sizes))

usa_variables <- create_variables(get_parameters(overrides = list(household_distribution_country = "USA")))
usa_household_id <- usa_variables$variables_list$household$get_categories()
usa_household_sizes <- vector()
usa_households <- 1:max(as.numeric(usa_variables$variables_list$household$get_categories()))
for(i in usa_households) {
  usa_household_sizes[i] <- usa_variables$variables_list$household$get_size_of(as.character(i))
}
barplot(table(usa_household_sizes))

## Checking run_simulation works with it
usa_results <- run_simulation(get_parameters(overrides = list(household_distribution_country = "USA"), archetype = "sars_cov_2"))
uk_results <- run_simulation(get_parameters(overrides = list(household_distribution_country = "UK"), archetype = "sars_cov_2"))
plot(usa_results$timestep, usa_results$E_new, type = "l")
lines(uk_results$timestep, uk_results$E_new, col = "red")

With this working and @athowes's comments addressed, I'll merge this shortly.