Running simulations using a table/dataframe of parameter values

mrc-ide / helios

Simulating far UVC for pathogen control

https://mrc-ide.github.io/helios/

Other

5 stars 0 forks source link

Running simulations using a table/dataframe of parameter values #80

Closed tbreweric closed 5 months ago

tbreweric commented 5 months ago

In this PR I've created a function called run_simulations_from_table() (model.R) which allows us to collate a series of simulations into a dataframe, where each column is a parameter we want to vary from the get_parameters() default and each row contains the values for a single simulation, and apply run_simulation() to each one.

The current implementation is a little crude - it creates a default parameters_list then manually overwrites the parameters of interest using name matching. It stores both the parameters lists and the simulation outputs for each simulation (row of the dataframe) and the user can decide what they want to output (parameters, simulations, or both). The function also appends columns to each dataframe for the parameters that have been varied.

A number of improvements/additions could be made:

Parallelisation - I haven't done this in a while so don't know immediately how I'd do it, but sure it's possible
Post-processing functionality - have a function, or functions, that automatically collate all simulation outputs into a single dataframe (and possibly convert to long form?) for analysis/plotting/etc
Find a way to specify parameters of interest through get_parameters() and set_uvc() - currently we just overwrite the entries for them after a base version has been generated.
Write some decent unit tests that make us confident the simulations run are those we are specifying using the input dataframe

Keen for both of you @athowes and @cwhittaker1000 to have a look / play with this to see where we could improve it / where it breaks and needs fixing!

tbreweric commented 5 months ago

Thanks going through this PR and for the comments @athowes - I'll try and get around to them on the weekend!

athowes commented 5 months ago

I tried running this and got:

> parameter_table <- readr::read_csv("data/parameter_table_example.csv")
Rows: 4 Columns: 9                                                                                                                                                                                                    
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): far_uvc_workplace_coverage_type
dbl (6): human_population, number_initially_exposed, simulation_time, far_uvc_workplace_coverage, far_uvc_workplace_efficacy, far_uvc_workplace_timestep
lgl (2): render_diagnostics, far_uvc_workplace

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
> example <- helios::run_simulations_from_table(parameters_table = parameter_table)
Error in sample.int(length(x), size, replace, prob) : 
  invalid 'size' argument

I'm unsure that this is something which is specific to this PR. I'm using the version of the package as of this PR. Can you reproduce this?

Here is the traceback, so it looks like it's something happening within generate_initial_disease_states for at least one of the parameters_list entries:

Error in sample.int(length(x), size, replace, prob) : 
  invalid 'size' argument
6.
sample.int(length(x), size, replace, prob)
5.
sample(x = 1:length(initial_disease_states), size = parameters_list$number_initially_exposed,
replace = FALSE, prob = NULL) at variables.R#139
4.
generate_initial_disease_states(parameters_list = parameters_list) at variables.R#11
3.
create_variables(parameters_list) at model.R#11
2.
run_simulation(parameters_list = parameters_lists[[i]]) at model.R#70
1.
helios::run_simulations_from_table(parameters_table = parameter_table)

athowes commented 5 months ago

Using read.csv rather than read_csv I get:

> example <- helios::run_simulations_from_table(parameters_table = parameter_table)
There are less than or equal to 2 schools. Consider the population size may be too small!
Error in bind_cols(simulation_outputs[[i]], as_tibble(parameters_table[i,  : 
  could not find function "bind_cols"

I think this is because I need to have called library(dplyr). When I do that it works.

> example <- helios::run_simulations_from_table(parameters_table = parameter_table)
There are less than or equal to 2 schools. Consider the population size may be too small!
[1] "1th simulation complete (25% of simulations complete)"
[1] "2th simulation complete (50% of simulations complete)"
[1] "3th simulation complete (75% of simulations complete)"
[1] "4th simulation complete (100% of simulations complete)"

So my suggestion is to (I think) add dplyr to the Imports in the DESCRIPTION file. We may want to be putting dplyr:: also in the code. I don't know the best practise here exactly / have done this in the past by precident without knowing the full theory. Have a look at: https://r-pkgs.org/description.html#sec-description-imports-suggests

Also, suggest changing from 1th, 2th, etc. Can just use "Simulation 1", "Simulation 2", etc. to make it easy (not have to deal with st, nd, rd, th).

athowes commented 5 months ago

Oh also, with the error when using a tibble I don't think we should allow that to happen. So options are:

Fix for tibble input
Convert any tibble input within function
Give warning that it's a tibble and ask user to change it themselves

tbreweric commented 5 months ago

Using read.csv rather than read_csv I get:
> example <- helios::run_simulations_from_table(parameters_table = parameter_table)
There are less than or equal to 2 schools. Consider the population size may be too small!
Error in bind_cols(simulation_outputs[[i]], as_tibble(parameters_table[i,  : 
  could not find function "bind_cols"
I think this is because I need to have called library(dplyr). When I do that it works.
> example <- helios::run_simulations_from_table(parameters_table = parameter_table)
There are less than or equal to 2 schools. Consider the population size may be too small!
[1] "1th simulation complete (25% of simulations complete)"
[1] "2th simulation complete (50% of simulations complete)"
[1] "3th simulation complete (75% of simulations complete)"
[1] "4th simulation complete (100% of simulations complete)"
So my suggestion is to (I think) add dplyr to the Imports in the DESCRIPTION file. We may want to be putting dplyr:: also in the code. I don't know the best practise here exactly / have done this in the past by precident without knowing the full theory. Have a look at: https://r-pkgs.org/description.html#sec-description-imports-suggests

Also, suggest changing from 1th, 2th, etc. Can just use "Simulation 1", "Simulation 2", etc. to make it easy (not have to deal with st, nd, rd, th).

I think I would avoid adding dependencies where we can as it can create issues with updates to packages (happened to me on another project this week)? Agree that we should add package:: wherever necessary!

In response to your suggestions below - all good ones! I think the conversion to data.frame with a warning that you've had to do this might be the way to go - means we try and correct the error but also flag that if there are issues the user may want to do this themselves before feeding the parameter table to the function?

athowes commented 5 months ago

I think I would avoid adding dependencies where we can as it can create issues with updates to packages (happened to me on another project this week)? Agree that we should add package:: wherever necessary!

Agree that there would be downsides to adding to Depends. Mainly I'd just like to avoid a situation where someone calls a function and it gives an error about not finding a certain function, and they don't know what function that package might be from. If there is a way to do this without adding to Depends (is it Imports or Suggests?) then we can do that.

In response to your suggestions below - all good ones! I think the conversion to data.frame with a warning that you've had to do this might be the way to go - means we try and correct the error but also flag that if there are issues the user may want to do this themselves before feeding the parameter table to the function?

Yes I think convert to data.frame internally. To be honest I'm confused about the difference between tibble and data.frame and why one would work and the other not. I think almost not worth warning about this, aside from that if you output the object (parameter_table) that someone would expect it to be in the same format they input it. You could just code this, but I think it's a bit clunky (i.e. returning it as a tibble if they input it as a tibble).

tbreweric commented 5 months ago

I think I would avoid adding dependencies where we can as it can create issues with updates to packages (happened to me on another project this week)? Agree that we should add package:: wherever necessary!

Agree that there would be downsides to adding to Depends. Mainly I'd just like to avoid a situation where someone calls a function and it gives an error about not finding a certain function, and they don't know what function that package might be from. If there is a way to do this without adding to Depends (is it Imports or Suggests?) then we can do that.

In response to your suggestions below - all good ones! I think the conversion to data.frame with a warning that you've had to do this might be the way to go - means we try and correct the error but also flag that if there are issues the user may want to do this themselves before feeding the parameter table to the function?

Yes I think convert to data.frame internally. To be honest I'm confused about the difference between tibble and data.frame and why one would work and the other not. I think almost not worth warning about this, aside from that if you output the object (parameter_table) that someone would expect it to be in the same format they input it. You could just code this, but I think it's a bit clunky (i.e. returning it as a tibble if they input it as a tibble).

Have added dplyr to imports and the dplyr:: to the bind_cols() call