Multi-sample API - Githubissues

tavinathanson commented 7 years ago

We haven't necessarily wanted to incorporate multiple samples into one Cohort object, as that complicates every part of the library. For example:

- Which samples does missense_snv_count count?
- How would we refer to the tumor_sample and normal_sample?

For many use cases, different Cohort objects can just be created with different sets of samples.

However, questions pop up like:

- How can we make it easier to iterate through all samples?
- How can we take advantage of Cohorts/Discohorts when we don't have clinical data?

One thought is a separate SampleCollection for specifying a bunch of samples and optionally creating Cohorts from those samples. And perhaps a SampleGroup (where Patient can extend SampleGroup) to link samples together when we don't have the clinical data appropriate for a Patient. Something like:

group = SampleGroup(id="1033")
sample_1 = Sample(group=group, label="pre", bam_path_dna=...)
sample_2 = Sample(group=group, label="post", bam_path_dna=...)
samples = SampleCollection([sample_1, sample_2])
samples.run() # Use Discohorts to run over the samples rather than patients
# TODO: If running Epidisco, how do we know which is tumor/normal/RNA?
# Cohort objects give us that, which is how Discohorts currently does it.
cohort = samples.as_cohort() # Works better if `group` is a `Patient`
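
To make the proposal above concrete, here is a minimal, runnable sketch of how the three classes could relate to each other. It assumes plain dataclasses; SampleGroup, Sample, and SampleCollection do not exist in cohorts, the by_group helper is hypothetical, and the BAM paths are placeholders. run() and as_cohort() are left out because their behavior is exactly what is still undecided.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class SampleGroup:
    """Links related samples when no clinical data is available (hypothetical)."""
    id: str
    # Excluded from repr/compare because samples point back at their group.
    samples: List["Sample"] = field(default_factory=list, repr=False, compare=False)


@dataclass
class Sample:
    """One sequenced sample; registers itself with its group (hypothetical)."""
    group: SampleGroup
    label: str
    bam_path_dna: Optional[str] = None

    def __post_init__(self) -> None:
        self.group.samples.append(self)


class SampleCollection:
    """A flat, iterable set of samples that can be regrouped on demand."""

    def __init__(self, samples: List[Sample]) -> None:
        self.samples = list(samples)

    def __iter__(self) -> Iterator[Sample]:
        return iter(self.samples)

    def by_group(self) -> Dict[str, List[Sample]]:
        """Map each group id to the samples in this collection that belong to it."""
        grouped: Dict[str, List[Sample]] = {}
        for sample in self.samples:
            grouped.setdefault(sample.group.id, []).append(sample)
        return grouped


group = SampleGroup(id="1033")
sample_1 = Sample(group=group, label="pre", bam_path_dna="pre.bam")    # placeholder path
sample_2 = Sample(group=group, label="post", bam_path_dna="post.bam")  # placeholder path
samples = SampleCollection([sample_1, sample_2])

# Iterating over all samples no longer requires a Patient or a Cohort.
labels = [sample.label for sample in samples]
```

Keeping the collection flat and deriving groups on demand means the same object can serve both "run a pipeline over every sample" and "build one patient-like unit per group".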

jburos commented 7 years ago

I like the idea of a sample-group, though I don't know how well that extends to @julia326's use case(s).

It might be helpful here to document the analysis that one would want to enable when there are multiple samples. This might help to motivate the API for using multi-sample data.

E.g.: some change in status, comparing pre-tx vs. post-tx?

Could there be different settings on the samples (e.g. bqsr vs not)?

In each of these cases, I can imagine one sample might be the "default" (pre & with-bqsr) and another might be referenced on-demand.

tavinathanson commented 7 years ago

@jburos definitely! I suppose my motivating "analysis" to start with was running Epidisco/other pipelines for all samples.

tavinathanson commented 7 years ago

Will also be useful to talk to @julia326 about what types of analyses we could enable here.

jburos commented 7 years ago

@tavinathanson great, I can understand wanting to run the pipelines, but then... do what with the results?

I am bringing this up again b/c we're approaching this problem from the other end, so to speak, for a different project. For this cohort, we have a subset of patients with pre/post-tx RNA samples and already have Epidisco pipeline results for these samples. I am now thinking about how to extend cohorts in order to process them.

In my use case (granted, parts of this aren't yet supported by cohorts, but putting it here for the record), I'd like to be able to:

run a command using "pre-tx" samples, "post-tx" samples, whichever is available, or the difference between the two timepoints.
- for something like a differential expression analysis, the interpretation of the above should be clear
- for expressed mutation count, I might want to restrict to pre-tx samples, restrict to post-tx samples, or look at the number of mutations that went from expressed to non-expressed or vice versa.

Seems to me that a lot of the above could be facilitated with a sample label or keyword. Again, not thinking here about discohorts, just cohorts.
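
As a rough illustration of the label idea, the sketch below selects a patient's expressed mutations by sample label, falls back to another label when the preferred one is missing, and computes the change between two timepoints. Everything here is an assumption made for illustration: the function names, the dict-of-sets layout, and the mutation identifiers are made up, and none of it is part of cohorts.

```python
from typing import Dict, Optional, Set

# Hypothetical per-patient data: sample label -> set of expressed mutation ids.
ExpressedByLabel = Dict[str, Set[str]]


def select_expressed(samples: ExpressedByLabel, label: str = "pre-tx",
                     fallback: Optional[str] = None) -> Optional[Set[str]]:
    """Return the mutations for `label`, or for `fallback` if `label` is missing."""
    if label in samples:
        return samples[label]
    if fallback is not None:
        return samples.get(fallback)
    return None


def expression_changes(samples: ExpressedByLabel, before: str = "pre-tx",
                       after: str = "post-tx") -> Optional[Dict[str, Set[str]]]:
    """Mutations that lost or gained expression between two timepoints."""
    if before not in samples or after not in samples:
        return None  # the difference is undefined without both samples
    pre, post = samples[before], samples[after]
    return {"lost": pre - post, "gained": post - pre}


# Toy identifiers, not real data.
patient = {"pre-tx": {"m1", "m2", "m3"}, "post-tx": {"m2", "m4"}}

pre_count = len(select_expressed(patient, "pre-tx"))
changes = expression_changes(patient)
post_only = select_expressed({"post-tx": {"m9"}}, "pre-tx", fallback="post-tx")
```

Returning None when a label is absent, rather than an empty set, keeps "no sample at this timepoint" distinguishable from "sample present, nothing expressed".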

hammerlab / cohorts

Multi-sample API #211