Closed jburos closed 7 years ago
Given that the `Cohort` object is the highest level that `cohorts` knows about, I feel like we'd want to introduce knowledge of multiple cohorts into the library before adding `cache_root_dir`?
I see that you already have a PR; suppose it's reasonable to start with that and go from there 👍
yeah - ok. Am open to other implementations, but also wanted to write down thought process behind that PR for now
When analyzing multiple cohorts in the context of a single paper, we often want to be able to group the cache-dir for each cohort so that they can be copied, deleted, or referenced collectively.
Some use cases include:
Rationale for cache_root_dir
The simplest version of this functionality (until a more robust cache-file management solution is implemented - see #175) is currently to include an optional `cache_root_dir` attribute in the `Cohort` object.
The idea is that related `Cohort` objects would store their caches within the same `cache_root_dir`.
note: This is optional, in that cohorts created without `cache_root_dir`, or that are initiated with an absolute path for the `cache_dir`, should _completely ignore_ the input for `cache_root_dir`.
Specifically, in both scenarios:
- the `cache_root_dir` attribute is set to None, and
- `cache_dir` is set to the value of `cache_dir` used to create the `Cohort`
Also: standard cache-dir naming
As a secondary feature, it would be helpful to be able to name the cache-dirs for each cohort according to a standard protocol.
This would either support
cache_root_dir
The second feature is implemented by the optional `cache_dir_kwargs` parameter.
For example, say we have a function that returns a `Cohort` containing TCGA data, where the composition of the cohort depends on an input attribute "tumor_type". In this case, we might want a different cache-dir for each tumor-type.
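
To make the two ideas concrete, here is a rough sketch of how the fallback rule for `cache_root_dir` and the per-tumor-type naming could fit together. This is not the code in the PR and not the `cohorts` API: the helper names `resolve_cache_dirs`, `format_cache_dir` and `tcga_cache_dirs`, the `"tcga-{tumor_type}"` template, and the assumption that `cache_dir_kwargs` is a plain dict used to fill a name template are all hypothetical and only illustrate the intended behaviour.

```python
import os


def resolve_cache_dirs(cache_dir, cache_root_dir=None):
    """Return (cache_root_dir, cache_dir) as they would be stored on a cohort.

    Hypothetical helper, not part of the cohorts library.
    """
    # No root given, or cache_dir is already absolute:
    # ignore cache_root_dir completely and keep cache_dir as passed in.
    if cache_root_dir is None or os.path.isabs(cache_dir):
        return None, cache_dir
    # Otherwise this cohort's cache lives under the shared root.
    return cache_root_dir, os.path.join(cache_root_dir, cache_dir)


def format_cache_dir(template, cache_dir_kwargs=None):
    """Fill a cache-dir name template from cache_dir_kwargs (assumed to be a dict)."""
    if not cache_dir_kwargs:
        return template
    return template.format(**cache_dir_kwargs)


def tcga_cache_dirs(tumor_type, cache_root_dir=None):
    """Hypothetical stand-in for a function that builds a TCGA cohort.

    Only the cache-dir handling is shown; no Cohort is constructed here.
    """
    name = format_cache_dir("tcga-{tumor_type}", {"tumor_type": tumor_type})
    return resolve_cache_dirs(name, cache_root_dir)
```

With something like this, `tcga_cache_dirs("blca", "paper-cache")` and `tcga_cache_dirs("luad", "paper-cache")` would give two sibling directories under `paper-cache`, so the caches for one paper can be copied or deleted together, while a call without `cache_root_dir` behaves exactly as it does today.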