Add `cache_root_dir` attribute

jburos commented 7 years ago

When analyzing multiple cohorts in the context of a single paper, we often want to be able to group the cache-dirs of those cohorts so that they can be copied, deleted, or referenced collectively.

Some use cases include:

- If we are versioning the caches for a given analysis, e.g. we might have one cache with and one without certain settings (such as BQSR) for somatic mutation calling.
- Or, if we want to "wipe" or force-refresh the cache(s) for a particular analysis.

Rationale for cache_root_dir

The simplest version of this functionality (until a more robust cache-file management solution is implemented - see #175) is currently to include an optional cache_root_dir attribute in the Cohort object.

The idea is that related Cohort objects would store their caches within the same cache_root_dir.

Note: this is optional, in that cohorts created without cache_root_dir, or initialized with an absolute path for cache_dir, should completely ignore the input for cache_root_dir.

Specifically, in both scenarios:

- the cache_root_dir attribute is set to None, and
- cache_dir is set to the value of cache_dir used to create the Cohort.
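
To make the rule above concrete, here is a minimal sketch of how the two attributes might be resolved. The helper name `resolve_cache_dirs` is hypothetical and not part of the library, and joining a relative cache_dir onto cache_root_dir is an assumption about what "store their caches within the same cache_root_dir" means in practice.

```python
import os.path


def resolve_cache_dirs(cache_dir, cache_root_dir=None):
    """Hypothetical helper: return the (cache_root_dir, cache_dir) pair to store."""
    if cache_root_dir is None or os.path.isabs(cache_dir):
        # No root given, or cache_dir is already absolute:
        # ignore cache_root_dir entirely and keep cache_dir as passed in.
        return None, cache_dir
    # Assumption: a relative cache_dir is placed inside the shared root.
    return cache_root_dir, os.path.join(cache_root_dir, cache_dir)
```

With this sketch, related cohorts that share a cache_root_dir end up with sibling cache directories, so the whole root can be copied or deleted in one step.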

Also: standard cache-dir naming

As a secondary feature, it would be helpful to be able to name the cache-dirs for each cohort according to a standard protocol.

This would support either or both of the following:

- consistent naming for each cohort's cache_dir within the cache_root_dir
- allowing the cache-dir to depend on input parameters

The second feature is implemented by the optional cache_dir_kwargs parameter.

For example, say we have a function that returns a Cohort of TCGA data, where the composition of the cohort depends on an input parameter "tumor_type". In this case, we might want a different cache-dir for each tumor type.
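
A rough sketch of that idea follows. It assumes that cache_dir is treated as a format template and that cache_dir_kwargs supplies the values to fill in; the helper `format_cache_dir` and the template string are hypothetical, shown only to illustrate the tumor_type case.

```python
def format_cache_dir(cache_dir, cache_dir_kwargs=None):
    """Hypothetical helper: fill a cache_dir template from keyword arguments."""
    # Assumption: cache_dir may contain str.format-style placeholders.
    return cache_dir.format(**(cache_dir_kwargs or {}))


# One cache-dir per tumor type, all following the same naming scheme.
blca_dir = format_cache_dir("cache-tcga-{tumor_type}", {"tumor_type": "BLCA"})
luad_dir = format_cache_dir("cache-tcga-{tumor_type}", {"tumor_type": "LUAD"})
```

A cache_dir without placeholders passes through unchanged, so the parameter stays optional.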

tavinathanson commented 7 years ago

Given that the Cohort object is the highest level that cohorts knows about, I feel like we'd want to introduce knowledge of multiple cohorts into the library before adding cache_root_dir?

tavinathanson commented 7 years ago

I see that you already have a PR; suppose it's reasonable to start with that and go from there 👍

jburos commented 7 years ago

yeah - ok. Am open to other implementations, but also wanted to write down thought process behind that PR for now

hammerlab / cohorts

Add `cache_root_dir` attribute #214
