quaquel / EMAworkbench

workbench for performing exploratory modeling and analysis
BSD 3-Clause "New" or "Revised" License

Potential contributions to the Workbench #101

Open steipatr opened 3 years ago

steipatr commented 3 years ago

Hi Jan,

Jason and I have now completed our project for the Energy Modelling Initiative. We used the Workbench extensively, but also added some stuff to it. Looking at our code, we've identified a few things that might be interesting to incorporate into the Workbench. If you could take a look and let us know which fit your vision and ideas, then we will submit PRs (and separate issues?) for those.

1. Saving

We developed a standardized file naming scheme "date_samplingmethod_numberofruns_numberofoutcomes" (e.g. 2021-02-21T14-29_lhs_600000x34.tar.gz) that we used together with save_results. We found it useful to have standardized names for the different data sets we had generated. This could be added to EMA as a function make_tar_name in utilities.py and used something like

from ema_workbench import perform_experiments, save_results

results = perform_experiments(model, scenarios=1000)
save_results(results, make_tar_name())

or maybe even as the default in save_results if no file_name string is given. Potential criticism: are file names the right place to store metadata?
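A minimal sketch of what such a helper could look like (the signature is just an illustration, not existing Workbench code; the sampling method has to be passed in explicitly since it is not recorded in the results, while the run and outcome counts can be read off the results tuple):

from datetime import datetime

def make_tar_name(results, sampling_method):
    """Standardized archive name: <date>_<samplingmethod>_<n_runs>x<n_outcomes>.tar.gz"""
    experiments, outcomes = results
    stamp = datetime.now().strftime("%Y-%m-%dT%H-%M")
    return f"{stamp}_{sampling_method}_{len(experiments)}x{len(outcomes)}.tar.gz"

# make_tar_name(results, "lhs") -> e.g. "2021-02-21T14-29_lhs_600000x34.tar.gz"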

2. Experiment stats

We added some code to perform_experiments that printed various stats once the experiments had completed, e.g.

Run stats

Cores: 8
Runs: 370000
Elapsed time: 11:35:44.293863
time / runs: 0.113 seconds
time / (runs / cores): 0.903 seconds

This was useful because it allowed us to compare different cluster configurations and get an idea of how long future experiments might take. This could be added to perform_experiments. We just had it as a print statement, but it could probably integrate with ema_logging.
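Roughly what we had, as a sketch (the function name and arguments are ours and purely illustrative; the core and run counts were passed in by hand):

import datetime

def print_run_stats(start, end, n_runs, n_cores):
    """Print elapsed wall-clock time plus per-run and per-core-normalized timings."""
    elapsed = end - start
    per_run = elapsed.total_seconds() / n_runs
    print(f"Cores: {n_cores}")
    print(f"Runs: {n_runs}")
    print(f"Elapsed time: {elapsed}")
    print(f"time / runs: {per_run:.3f} seconds")
    print(f"time / (runs / cores): {per_run * n_cores:.3f} seconds")

# start = datetime.datetime.now()
# results = perform_experiments(model, scenarios=370000)
# print_run_stats(start, datetime.datetime.now(), n_runs=370000, n_cores=8)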

3. Outcomes dict to dataframe

For analysis, we found it convenient to convert the outcomes dict to a Pandas dataframe. There's a question of dimensionality here because EMA outcomes can have different dimensions based on the type of model and outcomes of interest, but it might be useful to explore this. It could be added to EMA as a function outcomes_to_df in utilities.py. @jasonrwang can probably say more about this if necessary.
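For the scalar-outcome case, a sketch could be as simple as this (outcomes_to_df is just the proposed name, not existing API, and time-series outcomes would need different handling):

import pandas as pd

def outcomes_to_df(outcomes):
    """Convert a dict of 1-D outcome arrays (scalar outcomes only) into a
    DataFrame with one row per experiment and one column per outcome."""
    return pd.DataFrame(dict(outcomes))

# experiments, outcomes = results
# df = outcomes_to_df(outcomes)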

4. Splitting results

Since some parameter combinations caused integration errors with our model, we had to parse the results after the fact and identify/remove runs with integration errors. For this, we wrote a small utility called split_results that would take one results object and split it into two objects based on e.g. a dict with keys = ("A", "B") and values containing lists of run numbers, or a list containing a set of run numbers. This is basically the reverse operation of EMA's merge_results, and could also be added to utilities.py.
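A sketch of the list-of-run-numbers variant, assuming results is the usual (experiments, outcomes) tuple and each outcome is a numpy array indexed by run:

import numpy as np

def split_results(results, indices):
    """Split results into (runs whose position is in `indices`, all other runs)."""
    experiments, outcomes = results
    mask = np.zeros(len(experiments), dtype=bool)
    mask[list(indices)] = True
    selected = (experiments[mask], {k: v[mask] for k, v in outcomes.items()})
    remainder = (experiments[~mask], {k: v[~mask] for k, v in outcomes.items()})
    return selected, remainder

# failed, ok = split_results(results, error_run_numbers)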

5. External parameters file

This is more a conceptual thing, but we found it super useful to define our parameters in an external .py file and then import them into the notebook or script for experiments or analysis. Saved us a lot of copy-pasting and made version management of parameter ranges way easier. We're not sure if this is really something that could be "added" to EMA since it's already doable, just thought we would share it here in case it sparks ideas somehow.
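For illustration, roughly what this looked like for us (the file and parameter names here are made up):

# parameters.py -- single shared definition of the uncertainty space
from ema_workbench import RealParameter, IntegerParameter

uncertainties = [
    RealParameter("discount_rate", 0.01, 0.08),
    IntegerParameter("lifetime", 10, 40),
]

# in the experiment notebook or analysis script:
# from parameters import uncertainties
# model.uncertainties = uncertainties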

Happy to hear whether any of this would be useful within EMA, in the presented or a modified form. We can also share our codebase with you if that's useful. Let us know, happy to contribute where possible.

quaquel commented 3 years ago

I like these various ideas. A few quick reactions:

  1. Metadata is also stored in a json file within the tarball. Presently, it does not contain exactly the same info as you are putting in the file name; see utilities.py, lines 231-232 for further details. Part of the problem here is that it is not easy to extract much of this metadata from the results returned by perform_experiments: e.g. the sampler that was used and the number of experiments are not explicitly logged within either the experiments or outcomes dict.
  2. Good idea. Are you using the logging system for this? Relatedly, I have been considering adding a progress bar using tqdm instead of the current progress log messages.
  3. I recently added a to_dict to OutcomesDict myself for exactly this reason. However, I deliberately did not add a to_dataframe, because that is only valid if your outcomes are scalars. More broadly, I am still unsure about the move to OutcomesDict and wonder whether an alternative approach to persistence might be preferable. What were your experiences with using OutcomesDict?
  4. I like the idea. This seems similar to the group_by operations that exist for most of the plotting functions, so perhaps a lot of the code that handles group_by operations can be moved from analysis to utilities. See group_results in plotting_util.py.
  5. Again a good idea. There have been some attempts at doing this using e.g. Excel. I can also imagine using json or indeed .py files for this. In any case it would be good to be able to go both ways, so something like

uncertainties = parameters_from_json(filename)
parameters_to_json(uncertainties, filename)

and equivalent functions for other parameter file formats we might want to support.
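As a rough sketch for the json case (restricted to RealParameter, and assuming its lower_bound/upper_bound attributes; other parameter types and file formats would need their own handling):

import json
from ema_workbench import RealParameter

def parameters_to_json(parameters, filename):
    """Write RealParameters out as a list of name/lower/upper records."""
    records = [{"name": p.name, "lower": p.lower_bound, "upper": p.upper_bound}
               for p in parameters]
    with open(filename, "w") as f:
        json.dump(records, f, indent=2)

def parameters_from_json(filename):
    """Recreate RealParameters from a json file written by parameters_to_json."""
    with open(filename) as f:
        records = json.load(f)
    return [RealParameter(r["name"], r["lower"], r["upper"]) for r in records]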

steipatr commented 3 years ago

OK, I will take a stab at 1 and 2 first. Would you prefer separate issues to discuss in more detail (and close this one)?

quaquel commented 3 years ago

probably better off as separate issues.

might also be good to move these five ideas into the TODO file (which I can do)