Permuting reference input sets; Tracking data history

We need more robust methods for re-using reference input sets, that lets us: a) specify permutations of inputs for exploring a wider space b) not duplicate data on disk c) allows clean and compact diffs d) readily deployable e) track development, history, and and stakeholder approval process f) easily enables derivative work

This issues of permutation and history tracking may have distinct solutions, but I am wondering if we could design a way to use git and data organizational conventions to accomplish all of these.

Matthias has many constant use cases for a-c.

Sergio and his team are actively working through d as they compile data for Switch-Mexico. They have a chance to do it well, and could use some help in figuring out how to navigate tools. They are using google drive, I suggested moving to git (and github if their repositories don't have a size restriction).

I've had separate conversations with Mark and Ana about these issues lately.

That's it for now. I wanted to start a thread on this topic before leaving on vacation for the week.

-Josiah

I have some suggestions for c)-f) (no miracles, but fairly workable), but they'll take a little while to writeup.

My solution to a) so far has run counter to b). i.e., I have an easy way to create different inputs directories for each permutation, but that creates a lot of data (e.g., four full inputs directories to study two different price trajectories and two different demand response trajectories). I've been thinking about three solutions to this, and I think I prefer the third. Here they are:

Have a master "inputs" directory, and within that have multiple inputs subdirectories (probably only one level). An inputs subdirectory can be chosen via an --inputs_subdir argument. Within each inputs subdirectory there can be a subset of the normal data files. If a file is missing, it will look for it in the main inputs directory. There are a couple of potential problems with this -- proliferation of directories (need one dir for every permutation, but at least it's lightly filled; unless you start permuting the capacity factors); and what to do if an inputs subdir is meant to omit a file that exists in the parent (could probably handle this by putting in an empty file).
Make one full directory or subdirectory for each permutation, but define a special data file format that acts as a link to other files, rather than containing data itself. e.g., if variable_capacity_factors.tab contains '\n../variable_capacity_factors_sunny.tab', then the corresponding file will be read instead. Something similar could be achieved with real symlinks, but this avoids OS-specific challenges in creating symlinks, and it avoids the problem of Dropbox converting symlinks to real copies when you sync between machines. However, it still gives rise to directory proliferation, and we'd probably have to treat "/" and "\" as equivalent in the link specification, to avoid cross-platform problems.
Allow specification of aliases for standard data files on the command line, with something like "--data-alias variable_capacity_factors.tab=variable_capacity_factors_sunny.tab". In this case, we could keep all our main data files and alternative sensitivity versions in the same directory, and then just construct permutations as needed via command line arguments. This would prevent proliferation of directories and fits very well with the solve-scenarios system, where each scenario is defined entirely by command line arguments. In this particular case, instead of using a --inputs or --inputs-subdir argument to choose a different inputs directory, you would just use --data-alias to choose a different version of one (or more) input files. (A couple of lines in the load_data code should be able to handle resolving these aliases.)

Now that I think more about implementing this, I'm leaning more toward option 1: allow subdirectories within each inputs directory, each of which contains a subset of the files needed for a model run. Then load_aug() will check the main directory if a required file doesn't exist in the subdirectory. I may do this by defining a model.open_input_file(tab_file_name) method, which can be used instead of open(os.path.join(inputs_dir, tab_file_name)). Then the same logic can be used by modules that directly open their own input files.

Disadvantages of the inputs subdir approach:

a little inefficient if you want to do a lot of permutations, because you have to create one subdir for each permutation (the aliasing approach would let you mix and match individual data files as needed to create each permutation, e.g., different fuel price scenarios with different EV adoption scenarios)

Advantages:

maintains consistency between different files that may all be affected by the same model setup parameter (e.g., a filter on the available technologies)
users don't have to keep track of all the individual files that are changed by one model setup parameter and specify aliases for each one when setting up a scenario
allows me to write model setup code that automatically creates a subdir with all the files affected by (an) alternative model setup parameter(s)
maps pretty well onto the way we actually run switch, i.e., we normally run scenarios, not permutations, and this puts all the data needed for one scenario in one place
fairly efficient data storage, because the big data files (loads and capacity factors) will normally only be in the main directory, not the scenario subdirectories

So your basic pattern is specify one reference scenario, then allow permutations to selectively replace individual input files. Of your choices, I would favor 1 or 3, depending on use cases. For 3, I generally prefer storing a long series of arguments in a config file rather than passing them through the command line (easier record keeping), but that's an implementation detail. The main differences between 1 & 3 is whether you look for the list of diffs in a directory or in an config file/command line arguments.

Option 1 would probably be a bit easier for casual users to dig through .. both options require you to open a file browser, but option 3 also requires you to open a text document. Option 3 is a bit more optimized for computers, allowing comprehensive permutations without data duplication or needing to wait for OS reads of directory contents (which can add up to a significant overhead for network drives when computations are relatively fast and disk scans are frequent).

Another option I was starting to think through is to make data repositories in git, and use branches and/or github forks to track permutations? This seems like a good strategy for most issues, but could be harder for casual users to grasp. A static set of folders & files could be easier for casual users to deal with than git branches.

Different use cases may require different strategies, but I wanted to explore if git repos & branches might work for everything.

I was leaning toward option 3 originally, but when I contemplated implementing it, I got scared off.

Take an easy example -- suppose I want to run two different scenarios which use two different fuel cost series. In my back-end database, I have a fuel_costs table with a column showing fuel_scen_id. In my model setup script, I can specify a fuel_scen_id as an argument, and the extraction query(ies) use that to pull out the right fuel cost series.

If I use option 1, I can make one pass through all the queries with my default fuel_scen_id, and dump those files into the main directory. Then I can make another pass with (only) a new scenario name and fuel_scen_id. In this pass, any queries that are affected by the fuel_scen_id argument will automatically dump their output into a subdir matching the scenario name (e.g., "high_fuel_costs").

If I use option 3, the user would need to give a name to each alternative version of each data setup parameter (e.g., "high") and then the model setup script would have to munge that into the file name ("fuel_costs_high.tab"). Then the user would need to identify all the file(s) that are affected by this change, and refer to the alternative version when they setup the scenario (e.g., "--alias fuel_costs.tab=fuel_costs_high.tab"). To make this work, the user will need to look through the data extraction module or the inputs directory to find out which .tab files are affected by each argument. This is probably workable in this case, but it's messy. It will be harder to use and error-prone if one parameter affects multiple files, e.g., a model setup parameter that excludes certain technologies.

So I'm leaning toward option 1 basically because it creates a more natural pipeline for scenario definition: the user can say in their model setup script, "I want a scenario called 'high_high' with fuel_scen_id='high' and ev_scen_id='high'". Then they can run that scenario by saying "switch solve --inputs-subdir high_high" instead of "switch solve --alias fuel_costs.tab=fuel_costs_high_fuel_cost.tab --alias ev_adoption.tab=ev_adoption_high_ev_adoption.tab".

For most of my use cases, I don't think that git branches and forks would work very well for permuting the data files. Usually I use different permutations to analyze different policies or risks as part of a single study, i.e., I would usually want to have both datasets on disk at the same time, so I can compare the data files, present them as a coherent set of scenarios, run the scenarios in parallel, etc.

I would use commits and tags to represent different versions of the same basic study (e.g., if I change my solar dataset and begin doing new studies with that). If I had multiple qualitatively different studies that shared the same code, I might use forks or branches for that. But more likely I'd just promote the shared code up to the regional code repository (switch_mod.hawaii) and maintain separate repositories for the separate datasets.

By the way, this is the general file structure I have been moving towards:

switch-model (organization)
    switch/ (repo)
        - could be anywhere; shared across all models
        - sometimes branched to test new features
        - installed via "python setup.py develop"
        - run via "switch solve ..." or "switch solve-scenarios ..."

Switch-Hawaii/ (organization)
    database/ (repo)
        - complete code and data used to create the back-end database 
        - (still patchy, but getting there)
        - someday this may store the data as small, discrete files
        - instead of in a database, so then we could distribute the
        - back-end "database" via github too. But at the moment I'm
        - leaning away from that, because a database is a nice place
        - to hold multiple versions of the same data, and the code to
        - split and merge so many tables as needed could get ugly.
        - (basically reinventing a database in the file system, to enable
        - distribution via github).
    main/ (repo)
        - used for my mainline model runs
        inputs/
            ev_slow/    (alternative scenario)
            ev_full/
        inputs_tiny/    (used for testing)
        options.txt
        modules.txt
        scenarios.txt
        get_scenario_data.py 
        - get_scenario_data.py is a lightweight script which calls 
        - switch_mod.hawaii.scenario_data.write_tables() with appropriate arguments
        - to construct the inputs/*/*.tab files
    pha2/ (repo)
        - used for risk-oriented modeling.
        - actually has one set of data files for a short-term hedged model 
        - and another set for a long-term progressive hedging model,
        - but they're both in the same repository because they share some code
        - (that may get moved up to switch/switch_mod/hawaii, then I could 
        - split this into two repos)
        inputs/         (30-year pha model)
        inputs_tiny/    (for testing)
        inputs_short/   (hedged, 10-year model)
        get_scenario_data.py 
        get_pha_data_monthly_logistic.py (generate multiple fuel price trajectories)
        psip.py (code used to model the Power Supply Improvement Plan)
        lng_conversion.py (logic for conversion to LNG)
        - note: psip.py and lng_conversion.py will get moved to switch/switch_mod/hawaii soon)
        options.txt
        modules.txt
        scenarios.txt
        options_pha.txt
        modules_pha.txt
        scenarios_pha.txt
    ge_validation/ (repo)
        - each inputs directory corresponds to one scenario previously run in GE MAPS
        source_data/
            - holds all files used to construct datasets for this model
            - (this one does not use our back-end database)
        inputs_01/
        ...
        inputs_09/
        build_scenario_data.py  (creates inputs*/*.tab from source_data/)
        options.txt
        modules.txt
        scenarios.txt (e.g., "--scenario-name scen01 --inputs-dir inputs_01")
    scuc/ (repo)
        3bus.xlsx
        IEEE RTS 1999 case 1 data.xlsx
        inputs/
        inputs_3bus/
        - note: data are copied and pasted from the xlsx files to the inputs/*.tab files
        options.txt
        modules.txt
        iterate.txt
        trans_branch_flow.py (will eventually be promoted to switch/switch_mod)
    demand_system/
        - study of a nonlinear demand response system
        - (nested constant elasticity of substitution economic model)
        inputs_2007_15/     (data for "backcast" model of 2007)
        inputs_2045_15/     (data for forward-looking model of 2045)
        iterate.txt         (uses hawaii.demand_response)
        modules.txt         (loads hawaii.demand_response and hawaii.r_demand_system)
        scenarios.txt
        options.txt
        - specifies --dr-demand-module hawaii.r_demand_system --dr-r-script nestedcespy.R
        get_scenario_data.py
        nestedcespy.R       (R code to implement the demand system)

In this setup, there is one repository for each different category of study that I do. Each repository either holds a complete set of source data (often in Excel files) and code to make that into .tab files, or it holds a lightweight script (get_scenario_data.py) which passes arguments to a shared script (switch_mod.hawaii.scenario_data.write_tables()) which creates all the .tab files by extracting data from our back-end database. This allows me to have multiple studies going at once, which aren't really related to each other. Even the studies that draw on the same back-end database just use a lightweight script to say which data they want, so there's no real reason to create these as branches or forks of some "standard" study repository.

Managing each of these with git/github enables goals c, d and e. There is a straightforward evolution of the data used for a particular study, tracked in git. And to run a particular study, people just need to install switch, clone a study repository (possibly a particular commit/tag/release of that repository), cd into it and then run "switch solve" or "switch solve-scenarios".

I'm not sure about goal f). I haven't found that I need to do a lot of derivative work based on individual studies. It's more like my back-end database, data extraction scripts and main repository evolve in tandem, and then it's pretty easy to tweak the other study repositories to use the revised dataset (since there's not much code in each repository). But I suppose if you wanted to start with the "main" or "pha" repository and tweak a few parameters to make a derived model, that would be easy to do by branching or forking the repository. That could even allow a combination of automated .tab file creation (as I use) with manual changes (as an interested outside party might do.

As I said, there are no miracles here, but it seems to work well enough for me.

Regarding ease of implementation and use of option 1 vs 3, you could make option 3 about as easy as option 1. This is under the assumption that command-line parameters can be stored in a text file, which I assume is straightforward.

In both cases each individual scenario is defined in terms of a reference dataset and a set of data diffs. The reference dataset will probably be stored in a single directory for simplicity. The set of data diffs will be a set of paths that resolves to the diff files. fuel_costs.tab: high_scenario/fuel_costs.tab or fuel_costs_high.tab You could find these paths by scanning a single directory or by reading a text file. Each fuel cost scenario in the database needs to have a name and id field, so unique path names for files or directories can be auto-generated by concatenating the name and id with the file base name.

For fastest performance on random file systems of random clusters, you'll probably want to have a text file that specifies the paths to each scenario's diff files. A few years ago, I ran into significant and persistent disk lag during the secondary production-cost simulation on a UC Berkeley EECS cluster when scanning a directory of that held ~365 folders. My solution was to tweak my database export script to write a text files of the each relevant paths while it was creating the directories. If option 1 includes that performance detail, it starts to resemble option 3 with the added convention of 1 set of diffs per subdirectory.

In your example, the user would execute the scenario like with: "switch solve --scenario high_high", and switch would look for the appropriate aliases in the scenarios.cfg file; failing to find descriptions in the text file, it could look for a subdirectory named high_high before giving up.

Yeah, the need to keep multiple semi-human-readable scenarios on disk at the same time is pretty crucial. If we have a git repository for the compiled data and branches for each scenario, then setting up the local runtime data directories could entail:

Download the compiled reference scenario from main branch (do not clone, just download a snapshot)
Download the diffs between the reference branch and a scenario branch
Repeat for each target scenario

This is a different process than just cloning the entire repository and relevant branches. If some script is automating the grunt-work, then using git vs direct database export to set up input data doesn't seem like a bit deal. The next question is then, does git offer enough functionality to bother using it to package data as well as track changes & authorship.

I gotta run, but I'll reply to your file structure thread soon.

Maybe we should move these kinds of discussions to our google groups? It has better features to track dialogue. https://groups.google.com/forum/#!forum/switch-model

switch-model / switch

Permuting reference input sets; Tracking data history #27