Track build options alongside input data

@metasoarous and I have been discussing a better way to track and define a particular run of CFT.

Some goals:

- Run on a subset of samples, seeds, etc. from a given dataset. (Currently, datasets are described by partis info.yaml files, and CFT is generally run on all samples and seeds described by such a file.)
- Run the same dataset (or the same subset of it) twice with different processing options for comparison, and have this transparently reflected in the output.
- Make it harder to overwrite existing output.

One solution we have discussed so far is a script that takes a partis info.yaml along with all build options (options specific to what we want to build from the dataset and how), including subset information (e.g. we only care about one seed). It would output a modified version of the info.yaml file that has keys specifying our options and contains only the subset of data on which we want to run the pipeline.
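To make the idea concrete, here is a minimal sketch of the subsetting step. It works on an already-loaded dict rather than on a file, and it assumes a layout where `samples` maps sample names to entries that each hold a `seeds` mapping. That layout, the `build` key, and the function name `subset_dataset` are all hypothetical, not the actual partis or CFT schema.

```python
import copy


def subset_dataset(info, build_options, samples=None, seeds=None):
    """Return a copy of `info` restricted to the requested samples and seeds.

    Assumed (hypothetical) layout: info["samples"][name]["seeds"][seed_id].
    The build options are recorded under a hypothetical "build" key.
    """
    out = copy.deepcopy(info)  # never modify the caller's data
    if samples is not None:
        out["samples"] = {
            name: sample
            for name, sample in out["samples"].items()
            if name in samples
        }
    if seeds is not None:
        for sample in out["samples"].values():
            sample["seeds"] = {
                seed_id: seed
                for seed_id, seed in sample.get("seeds", {}).items()
                if seed_id in seeds
            }
    out["build"] = dict(build_options)
    return out


# Toy input standing in for a loaded info.yaml (made-up names).
info = {
    "id": "example-dataset",
    "samples": {
        "s1": {"seeds": {"seedA": {}, "seedB": {}}},
        "s2": {"seeds": {"seedC": {}}},
    },
}
subset = subset_dataset(
    info, {"label": "one-seed"}, samples={"s1"}, seeds={"seedA"}
)
```

Reading and writing the YAML itself is left out on purpose; whatever loader the pipeline already uses could wrap this function.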

It seems like there is no perfect way to contain all info about a build in the output naming (regardless of nesting dirs or concatenating names) so we will need to decide which information is most important to reflect in output, and the rest will have to be summarized using some label option passed.

This means overwriting output is certainly still possible (even though scons sometimes saves us from this). If we are still concerned about this, I think it's worth opening a separate issue for a feature that checks whether the outdir being used already exists and, if so, requires an --overwrite flag before proceeding.
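A rough sketch of what such a check could look like, using plain argparse rather than the way CFT currently parses its options. The `--outdir` option name, `build_parser`, and `check_outdir` are assumptions made for illustration; only the `--overwrite` flag comes from the proposal above.

```python
import argparse
import os


def build_parser():
    """Hypothetical command-line parser with an explicit overwrite opt-in."""
    parser = argparse.ArgumentParser(description="outdir overwrite guard")
    parser.add_argument("--outdir", default="output")
    parser.add_argument(
        "--overwrite",
        action="store_true",
        help="allow writing into an outdir that already exists",
    )
    return parser


def check_outdir(outdir, overwrite=False):
    """Return `outdir` if it is safe to use, otherwise raise FileExistsError."""
    if os.path.exists(outdir) and not overwrite:
        raise FileExistsError(
            "%s already exists; pass --overwrite to reuse it" % outdir
        )
    return outdir
```

A caller would run `args = build_parser().parse_args()` followed by `check_outdir(args.outdir, args.overwrite)` before any output is written.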

This seems like it would require significant changes to how options are parsed in the code base and is worth discussing, so please weigh in if you like while I tackle some more pressing changes (see https://github.com/matsengrp/cft/projects/1).

Thanks for putting this up @eharkins.

> It seems like there is no perfect way to contain all info about a build in the output naming (regardless of nesting dirs or concatenating names) so we will need to decide which information is most important to reflect in output, and the rest will have to be summarized using some label option passed.

I think the only thing we want to have automatically reflected in the output paths is whether or not it's a test run. This makes it quick and easy to run --test when debugging without having to worry about accidentally overwriting something important (since --test runs the pipeline on a small subset of the data, which is perfect for sanity checking modified pipeline code).

For virtually any other situation, I think whoever runs the pipeline should edit in their own dataset_id in the yaml file if they want a separate output path (or, as you suggest, customize dataset_id via the dataset "army knife" script). As you point out, there are just too many potential variations in how folks might be subsetting/configuring to do this automatically, and users will have better context for how to craft naming conventions based on their individual needs.
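As a small illustration of that rule, the output path could depend on just two things: the user-chosen dataset_id and whether this is a test run. The function name, the `test` subdirectory, and the `output` base directory below are hypothetical, not CFT's actual layout.

```python
import os


def output_dir(base, dataset_id, test=False):
    """Build an output path from dataset_id, kept separate for test runs.

    Hypothetical layout: <base>/<dataset_id> for real runs and
    <base>/test/<dataset_id> for --test runs.
    """
    parts = [base]
    if test:
        parts.append("test")  # test output never lands in the real directory
    parts.append(dataset_id)
    return os.path.join(*parts)


real_path = output_dir("output", "my-dataset")
test_path = output_dir("output", "my-dataset", test=True)
```

Anything else that should distinguish two runs, such as a subset or a different set of processing options, would be encoded by the user in dataset_id itself.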

There is still the question of how to track any top-level options describing how the dataset was run. We should probably include these in the CFT output and find a way to display this information (along with the scons command options).
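One possible way to do this, sketched under assumptions: write a small JSON file next to the rest of the output that records the options and the command line used. The file name `build_info.json`, the function name, and the keys are hypothetical, and the example options are made up.

```python
import json
import os
import sys


def write_build_info(outdir, options, argv=None):
    """Record the run's options and command line in <outdir>/build_info.json.

    `options` is any JSON-serializable mapping of top-level build options;
    `argv` defaults to the current process's command line.
    """
    info = {
        "command": list(sys.argv if argv is None else argv),
        "options": dict(options),
    }
    os.makedirs(outdir, exist_ok=True)
    path = os.path.join(outdir, "build_info.json")
    with open(path, "w") as handle:
        json.dump(info, handle, indent=2, sort_keys=True)
    return path
```

For example, `write_build_info("output/my-dataset", {"test": False, "label": "one-seed"})` would leave a record that whatever displays the results could read back and show alongside them.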

Thanks again

matsengrp / cft

Track build options alongside input data #288