hammerlab / cohorts

Utilities for analyzing mutations and neoepitopes in patient cohorts
Apache License 2.0

Add `cache_root_dir` attribute #214

Closed jburos closed 7 years ago

jburos commented 7 years ago

When analyzing multiple cohorts in the context of a single paper, we often want to be able to group the cache-dir for each cohort so that they can be copied, deleted, or referenced collectively.

Some use cases include:

Rationale for cache_root_dir

The simplest version of this functionality (until a more robust cache-file management solution is implemented; see #175) is to add an optional cache_root_dir attribute to the Cohort object.

The idea is that related Cohort objects would store their caches within the same cache_root_dir.

Note: this is optional, in that cohorts created without cache_root_dir, or initialized with an absolute path for cache_dir, should _completely_ ignore the input for cache_root_dir.
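To make the intended rule concrete, here is a minimal sketch of how the two settings could be combined. The helper name `resolve_cache_dir` is hypothetical and is not part of the cohorts API; it only illustrates the behaviour described above (a relative cache_dir is placed under cache_root_dir, while a missing cache_root_dir or an absolute cache_dir leaves cache_dir untouched).

```python
import os


def resolve_cache_dir(cache_dir, cache_root_dir=None):
    """Hypothetical helper (not the cohorts API) illustrating the rule above."""
    # No root given, or cache_dir is already absolute: ignore cache_root_dir.
    if cache_root_dir is None or os.path.isabs(cache_dir):
        return cache_dir
    # Otherwise group this cohort's cache under the shared root.
    return os.path.join(cache_root_dir, cache_dir)


# Example paths are made up for illustration.
shared_root = os.path.join(os.sep, "data", "paper-cache")
cohort_a = resolve_cache_dir("cohort-a", shared_root)
cohort_b = resolve_cache_dir("cohort-b", shared_root)
absolute = resolve_cache_dir(os.path.join(os.sep, "scratch", "cohort-c"), shared_root)
```

With this rule, copying or deleting the shared root affects cohort-a and cohort-b together, while the cohort with an absolute cache_dir stays where it was.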

Specifically, in both scenarios:

Also: standard cache-dir naming

As a secondary feature, it would be helpful to be able to name the cache-dirs for each cohort according to a standard protocol.

This would either support

The second feature is implemented by the optional cache_dir_kwargs parameter.

For example, say we have a function that returns a TCGA Cohort containing TCGA data, where the composition of the cohort depends on an input attribute "tumor_type". In this case, we might want a different cache-dir for each tumor-type.
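As a rough sketch of the tumor_type example, one way this could work is to treat cache_dir as a `str.format` template that is filled in from cache_dir_kwargs. That interpretation is an assumption on my part, and both the helper `format_cache_dir` and the template string are hypothetical rather than taken from the PR.

```python
def format_cache_dir(cache_dir, cache_dir_kwargs=None):
    """Hypothetical helper: fill a cache_dir template from cache_dir_kwargs."""
    # Without kwargs, use the cache_dir name exactly as given.
    if not cache_dir_kwargs:
        return cache_dir
    # Assumption: cache_dir is a str.format-style template.
    return cache_dir.format(**cache_dir_kwargs)


# One cache-dir per tumor type, all following the same naming pattern.
template = "cache-tcga-{tumor_type}"
blca_dir = format_cache_dir(template, {"tumor_type": "blca"})
luad_dir = format_cache_dir(template, {"tumor_type": "luad"})
```

Here the two cohorts would get the cache-dirs `cache-tcga-blca` and `cache-tcga-luad`, so the composition of each cohort is visible in its directory name.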

tavinathanson commented 7 years ago

Given that the Cohort object is the highest level that cohorts knows about, I feel like we'd want to introduce knowledge of multiple cohorts into the library before adding cache_root_dir?

tavinathanson commented 7 years ago

I see that you already have a PR; suppose it's reasonable to start with that and go from there 👍

jburos commented 7 years ago

yeah - ok. Am open to other implementations, but also wanted to write down thought process behind that PR for now