numalariamodeling / covid-chicago

Simulating and analyzing Covid-19 transmission and hospital trends per region in Illinois.
https://illinoiscovid.org/
Apache License 2.0

run input csv with saved fitted parameters per region #738

Open · ManuelaRunge opened this issue 3 years ago

ManuelaRunge commented 3 years ago

trace_selection.py followed by simulate_traces.py generates a 'fitted_parameters_besttrace.csv' that contains the single best parameter set for the parameters that were varied in a simulation (or fitted_parameters_ntraces27.csv for the best x fitting combinations), as well as sample_parameters_besttrace.csv (or sample_parameters_ntraces27.csv), which combines the sample parameters with the fitted parameters and can be used as the input csv for a new simulation via runScenarios.py [...] --sample_csv sample_parameters_besttrace.csv. However, there are some issues with using this approach on a routine basis:

  1. it was designed assuming that only a few specific parameters (i.e. just the latest transmission parameter) need to be fitted, given a fixed set of sample parameters that already fit the data
  2. it therefore automatically attaches an 'EMS-x' suffix to the parameter names that were identified to vary, hence assumed to be fitted per region (i.e. time_to_critical becomes time_to_critical_EMS-2), which makes using the file as input inappropriate, as the emodl file only defines time_to_critical and not time_to_critical_EMS-2.
    • one approach would be to simulate the regions separately (no change in parameter names needed)
    • another approach would be to make all parameters region-specific in the emodl generator
  3. When moving to using an input csv per default, the biweekly fitting of the Ki multiplier, or any other scenario, will need a linkage between the saved sample_parameters.csv and the 'additional varying parameters' csv, which is what sample_parameters.py was designed for; however, in its current state it is neither user-friendly nor integrated into the workflow.
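The naming mismatch in point 2 can be illustrated with a minimal sketch; the values and the emodl_params set are toy assumptions, only the suffix behaviour is taken from the description above:

```python
import pandas as pd

# A fitted parameter as trace selection writes it for region EMS-2
fitted = pd.DataFrame({'time_to_critical': [4.5]})
region = 'EMS-2'
fitted = fitted.rename(columns={c: f'{c}_{region}' for c in fitted.columns})

# The emodl template only defines the unsuffixed name, so the lookup fails
emodl_params = {'time_to_critical'}
missing = [c for c in fitted.columns if c not in emodl_params]
```

Every column in `missing` would be rejected (or silently ignored) when the csv is fed back into runScenarios.py.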
kokbent commented 3 years ago

Re: 1&2 — Want to check that I'm interpreting this correctly. From what I understand, when, say, I fit a baseline locale model of EMS_1, EMS_5 and EMS_11 with 20 samples, the program randomly draws 20 uniform random numbers based on the YAML specification for, say, fraction_dead_change8. In a single trajectory (scen_num in the code), this fraction_dead_change8 is the same for all three regions. After running the program, we run trace_selection and simulate_traces to select the best set of parameters (corresponding to the best trajectories). However, because the trace selection process is on a per-region basis, scen_num 1 with fraction_dead_change8 of 0.15 might be the best for EMS_1, but scen_num 8 with fraction_dead_change8 of 0.14 might be the best for EMS_5. simulate_traces eventually creates the "best" parameters with fraction_dead_change8_EMS-1 = 0.15 and fraction_dead_change8_EMS-5 = 0.14.

As a result, we have three best fraction_dead_change8 values after trace selection, one for each region. The original model does not allow regional variation in fraction_dead_change8, so the result of trace selection is incompatible with the original model.

Question: Is it a good idea to allow fraction_dead_change8 to vary among the regions? If yes, then the original model might need to be rewritten to allow regional variation. If not, then the trace selection process might be incorrect (the likelihood should be calculated using all three regions instead of on a per-region basis).
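The per-region selection described above can be sketched as follows; the score table is hypothetical (lower means a better fit) and only illustrates how each region can end up with a different "best" value of a parameter the model treats as shared:

```python
import pandas as pd

# Two sampled trajectories, each with one shared fraction_dead_change8 value
traces = pd.DataFrame({
    'scen_num': [1, 8],
    'fraction_dead_change8': [0.15, 0.14],
})

# Hypothetical per-region goodness-of-fit scores (lower = better fit)
scores = {'EMS_1': {1: 0.2, 8: 0.9},
          'EMS_5': {1: 0.7, 8: 0.1}}

# Per-region selection picks a different scen_num, and hence a different
# parameter value, for each region -- which a shared parameter cannot encode
best = {region: traces.loc[traces['scen_num'] == min(s, key=s.get)].iloc[0]
        for region, s in scores.items()}
```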

kokbent commented 3 years ago

Re: 3 — Since this is for fitting purposes, should the additional_varying_parameters be sampled from a distribution? Currently sample_parameters.py assumes the additional column is a fixed value. Perhaps I'll work on allowing these additional column(s) to be randomly generated.
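A minimal sketch of the idea being proposed here; the spec layout, key names, and parameter name are illustrative assumptions, not the project's actual YAML schema:

```python
import numpy as np

# Hypothetical YAML-like spec for an additional varying parameter
spec = {'rollback_multiplier': {'distribution': 'uniform',
                                'low': 0.4, 'high': 0.7}}
rng = np.random.default_rng(0)

def sample_additional(spec, n):
    """Draw n values per additional parameter instead of using a fixed value."""
    out = {}
    for name, d in spec.items():
        if d['distribution'] == 'uniform':
            out[name] = rng.uniform(d['low'], d['high'], size=n)
        else:
            raise ValueError(f"don't know how to sample parameter {name}")
    return out

samples = sample_additional(spec, 5)
```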

ManuelaRunge commented 3 years ago

Re 1&2: Yes, exactly, that is how it works and why it ends up 'incompatible'. simulate_traces.py could be modified to not attach the region suffix EMS-x when grp_list == 1 (L171), i.e. when running a simulation for a single region, or it could be modified to write out 11 different csv's. Maybe using a single region is a good start before scaling up to all regions in one model, since in the end, ideally, both ways should be possible.

Re question: Good point. fraction_dead_change8, like all the other sample parameters in the yaml files, is not per se treated as a fitting parameter, and when fitting the transmission rate (Ki) multiplier, we fixed all sample parameters to their mean (using --paramdistribution uniform_mean (here)) and only included the whole range in the final simulations. In that final simulation, the uncertainty ranges and medians for all regions would correspond to the same sampled parameters.

Therefore we use trace_selection.py not only to fit a specific parameter, but also for thinning the trajectories, to select the best n unique parameter sets per region.

All parameter values would still be within a reasonable range, as pre-defined in the yaml file. It would be useful to check how much these vary per region, and whether some of them should be excluded from trace_selection.py, although that would complicate the combination of parameters to simulate.
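The thinning step mentioned above can be sketched with toy data; the ranking is hypothetical, only the keep-best-n-unique-sets logic reflects the description:

```python
import pandas as pd

# Hypothetical ranked traces for one region (best rank first)
ranked = pd.DataFrame({'sample_num': [3, 7, 3, 9, 1],
                       'rank': [1, 2, 3, 4, 5]})
n_traces_to_keep = 3

# Thinning: keep the n best unique parameter sets for this region
best = ranked.drop_duplicates('sample_num').head(n_traces_to_keep)
```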

This script (extract_sample_param.py) generates histograms from the sample_parameters.csv and could be modified to read in the filtered sample_parameters.csv.

ManuelaRunge commented 3 years ago

Re 3: Yes, when using --param_dict, sample_parameters.py requires a single fixed value; this was intended for changing, say, rollback_multiplier from 0.5 to 0.6 when running simulations for specific intervention scenarios. sample_parameters.py also allows the additional column to have multiple values, or even multiple columns, which is handled via --csv_name_combo and gen_combos(csv_base, csv_add) (here).

Is that what you meant? I think it goes in the direction of what we intend to do in the IEMS project for the different mitigation scenarios, where currently the ki_mitigation parameter distribution needs to be generated in a separate python script that produces the csv read into sample_parameters.py. I like the idea of letting sample_parameters.py generate the parameter distribution automatically, although that would require many more arguments. Also, would the additional parameter distribution be attached to each unique set of sample parameters, or repeated for each set? Both could be desired depending on the purpose of the simulation.
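The two combination modes raised in the question can be sketched as follows; the column names are toy values, and the cross join (the gen_combos-style behaviour, here via pandas' `merge(how='cross')`, available in pandas ≥ 1.2) is contrasted with a paired attach:

```python
import pandas as pd

# A base sample csv and an additional-parameter csv (toy values)
csv_base = pd.DataFrame({'sample_num': [0, 1], 'time_to_critical': [4.0, 5.0]})
csv_add = pd.DataFrame({'ki_mitigation': [0.5, 0.6]})

# Repeated for each set: every additional value crossed with every base sample
cross = csv_base.merge(csv_add, how='cross')

# Attached to the unique set: one additional value per base sample
paired = pd.concat([csv_base.reset_index(drop=True),
                    csv_add.reset_index(drop=True)], axis=1)
```

The cross join multiplies the number of scenarios, while the paired attach keeps it constant, so the right choice depends on the purpose of the simulation.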

kokbent commented 3 years ago

Great, I missed the --csv_name_combo option.

I was thinking of using a YAML as input to specify the distribution of additional parameters. I think there's an existing framework to translate YAML into sample parameters, so it could be a straightforward modification.

I agree that both completely random and repeated random are desirable; I should be able to implement both with ease...

kokbent commented 3 years ago

Made small changes in PR #748

As mentioned in the PR note, the current examples in the README don't seem to work. I created some simplistic examples so that the YAML ones work. I will look into the problem with the other examples a little more.

ManuelaRunge commented 3 years ago

The other examples are likely outdated, since the runScenarios.py arguments have been updated and the referred emodl file does not necessarily exist (we previously had hardcoded emodl files for each scenario).

python sample_parameters.py -rl Local -r IL --model locale --experiment_config spatial_EMS_experiment.yaml --emodl_template extendedmodel_EMS.emodl -save sampled_parameters2.csv

could be simplified to python sample_parameters.py --experiment_config spatial_EMS_experiment.yaml -save sampled_parameters2.csv

however, I am getting a yamlordereddictloader package error on my side, will fix that

As a note for example 5: python sample_parameters.py -e "..\experiment_configs\example\example.emodl" -load "..\example\csv_base.csv" -yaml ".\experiment_configs\example\samp_params_combos_example.yaml"

adding the path to the arguments should not be required, as it is also not required in runScenarios.py; it should be checked why it is required here.

kokbent commented 3 years ago

> The other examples are likely outdated, since the runScenarios.py arguments have been updated and the referred emodl file does not necessarily exist (we previously had hardcoded emodl files for each scenario).
>
> python sample_parameters.py -rl Local -r IL --model locale --experiment_config spatial_EMS_experiment.yaml --emodl_template extendedmodel_EMS.emodl -save sampled_parameters2.csv
>
> could be simplified to python sample_parameters.py --experiment_config spatial_EMS_experiment.yaml -save sampled_parameters2.csv
>
> however, I am getting a yamlordereddictloader package error on my side, will fix that
>
> As a note for example 5: python sample_parameters.py -e "..\experiment_configs\example\example.emodl" -load "..\example\csv_base.csv" -yaml ".\experiment_configs\example\samp_params_combos_example.yaml"
>
> adding the path to the arguments should not be required, as it is also not required in runScenarios.py; it should be checked why it is required here.

Yeah, I remember having to add import yamlordereddictloader somewhere to make it work.

I put all the simplified emodl, csv and yaml files into an example folder (to reduce clutter in the important folders), that's why a path is needed. Depending on how you feel about it, we can remove the example-folder approach. For now the additional-parameters yaml does not have a specific folder to live in, so a path is required.

ManuelaRunge commented 3 years ago

Ah, I see! Makes sense. For that reason the emodl and experiment_config/input_csv folders are ignored, with exceptions for the main files, to allow collecting custom input files that are not required by anyone else. The experiment_config folder with the yamls is not yet ignored, but I would prefer adjusting the gitignore over introducing subfolders, since e.g. yaml and emodl files are copied over or into other locations in some other scripts.

kokbent commented 3 years ago

> Ah, I see! Makes sense. For that reason the emodl and experiment_config/input_csv folders are ignored, with exceptions for the main files, to allow collecting custom input files that are not required by anyone else. The experiment_config folder with the yamls is not yet ignored, but I would prefer adjusting the gitignore over introducing subfolders, since e.g. yaml and emodl files are copied over or into other locations in some other scripts.

Yes, OK, will have a look and see how to consolidate all the examples.

ManuelaRunge commented 3 years ago

To follow up on this, could

python sample_parameters.py -e "..\experiment_configs\snippets\example.emodl" -load "..\snippets\csv_base.csv" -yaml ".\experiment_configs\snippets\samp_params_combos_example.yaml"

become

python sample_parameters.py -e "example.emodl" -load "csv_base.csv" -yaml "samp_params_combos_example.yaml"

by moving the files to the respective default folders (emodl, experiment_config, and input_csv)?
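One way the default-folder lookup proposed above could work, as a sketch; the folder names follow the layout mentioned in this thread, and the function name is mine:

```python
import os

# Conventional folder per file type (folder names are assumptions)
DEFAULT_DIRS = {'.emodl': 'emodl',
                '.yaml': 'experiment_configs',
                '.csv': 'input_csv'}

def resolve(path, repo_root='.'):
    """Resolve a bare file name against its default folder; explicit paths win."""
    if os.path.dirname(path):
        return path
    ext = os.path.splitext(path)[1]
    return os.path.join(repo_root, DEFAULT_DIRS.get(ext, ''), path)
```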

Also, introducing a new python module will require modifying the virtual environment used on Quest, as well as the requirements for the local setup. Would it be sufficient to raise a simple ValueError instead of a warning here?

    warnings.warn(parameter + ': List length different from replicate_number and factorial_after is not True.')
    warnings.warn("Parameter " + parameter + " skipped: don't know how to sample this parameter.")
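A hypothetical helper mirroring the first of the two warnings quoted above, but failing fast so a bad sample specification cannot slip through a batch run unnoticed (the function name and signature are mine, not the script's):

```python
def check_param(parameter, values, replicate_number, factorial_after):
    """Raise instead of warn when the value list cannot be sampled."""
    if len(values) != replicate_number and not factorial_after:
        raise ValueError(
            parameter + ': list length differs from replicate_number '
            'and factorial_after is not True.')
```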

The yamlordereddictloader module has often caused installation issues among users (see related issue #716); therefore I would suggest using try/except, or just loading the yaml without specifying a Loader. (In the future we might want to switch to a new yaml loader, but that would also require modifying the Quest python environment, so no priority there.)
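The try/except fallback suggested here could look like this; note that on Python 3.7+ plain dicts preserve insertion order anyway, so the ordered loader is only needed for older environments:

```python
import yaml  # PyYAML

try:
    import yamlordereddictloader
    loader = yamlordereddictloader.Loader
except ImportError:
    loader = yaml.SafeLoader  # fall back when the package is missing

config = yaml.load("a: 1\nb: 2", Loader=loader)
```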

And once this is working (it seems it already does, I got the sampled_parameters.csv), it would be useful to apply this to the covid-chicago setup, i.e. save a fitted_sample_parameters_region_X.csv per region from the last fitting iteration, set up an additional yaml (or possibly reuse the spatial yaml?) to make modifications to the csv for running new simulations, and have batch files to facilitate automation where possible. I can also have a go at it (i.e. add a script for extracting the region-specific sample parameter csvs; then all that is left are the batch files with the appropriate file specification?). Does that make sense? (Not sure if we still want to fully integrate this into the current fitting-simulation workflow, but those would be the remaining steps.)

ManuelaRunge commented 3 years ago

I think this would do

(example for region 11)

        # requires: import os, re; import pandas as pd
        for e, grp in enumerate(grp_list):
            grp_nr = grp_numbers[e]
            df_samples = pd.read_csv(os.path.join(output_path, 'sampled_parameters.csv'))
            rank_export_df = pd.read_csv(os.path.join(output_path, f'traces_ranked_region_{str(grp_nr)}.csv'))
            rank_export_df_sub = rank_export_df[0:n_traces_to_keep]
            df_samples = df_samples[df_samples['sample_num'].isin(rank_export_df_sub.sample_num.unique())]

            # Drop the columns of all other regions; the lookahead regex avoids
            # the EMS_1 vs EMS_10/EMS_11 substring clash noted in the FIXME
            other_regions = [n for n in grp_numbers if n != grp_nr]
            cols_to_drop = [c for c in df_samples.columns
                            if any(re.search(rf'EMS_{n}(?!\d)', c) for n in other_regions)]

            df_samples = df_samples.drop(cols_to_drop, axis=1)
            df_samples['scen_num_orig'] = df_samples['scen_num']
            df_samples['scen_num'] = range(0, len(df_samples))
            df_samples.to_csv(os.path.join(output_path, f'sample_parameters_region_{str(grp_nr)}_{n_traces_to_keep}.csv'), index=False)

adapted from functions in simulate_traces.py; however, the setup/purpose is slightly different, so I would add it via a new python file. Testing now whether it produces the same output as the 'original' one.

Follow up sim can run via python runScenarios.py -sr EMS_11 -csv sample_parameters_region_11_100.csv -n "n100bestfitsamples" --scenario bvariant_vaccine

Adding a comment: when running from an input csv that is already fitted, the trace_selection step needs to be removed from the postprocessing.sh files.

kokbent commented 3 years ago

Incorporated your suggestions and merged my PR. Still have to deal with the yamlordereddictloader issues and the integration into the covid-chicago setup; more tinkering is on the way! 😃