hammerlab / cohorts

Utilities for analyzing mutations and neoepitopes in patient cohorts
Apache License 2.0

Resolve overlap between epidisco and cohorts #154

Open tavinathanson opened 7 years ago

tavinathanson commented 7 years ago

From @tavinathanson:

I'm fairly unclear right now about how best to use epidisco results when using cohorts; for example, using epidisco neoepitopes rather than https://github.com/hammerlab/cohorts/blob/master/cohorts/cohort.py#L767

One obstacle is that the results would be different!

I'm open to the idea of removing that functionality from cohorts, which would also simplify cohorts and better clarify what it is and isn't. As a counterpoint, I think @arahuja mentioned a while ago that this part of cohorts might be useful even for the non-epidisco use case.

From @tavinathanson on some other thread:

Sounds like this thread is addressing two separate issues (possibly my fault for hijacking)?

  1. Where to put the source files for cohorts, e.g. VCFs?
  2. What to do about overlapping functionality between cohorts and epidisco, e.g. neoepitope output?

From @tavinathanson:

Related to 2: what are all the outputs that epidisco gives us? How easy is it to get all predicted neoepitopes vs. only expressed neoepitopes? I presume that it doesn't output the varcode.Effects, and we'd still need to do that in cohorts when e.g. counting missense mutations?

From @jburos:

@tavinathanson I agree with you - there are two separate issues here.

  1. Create a Cohort from the epidisco files, which has several sub-tasks #16
    • how to store/share source files for Cohorts, etc.
  2. resolve overlap between functionality of Cohorts & epidisco (e.g. as you point out, neoantigens).
    • seems like a Cohorts-related issue (feature-request: enable import of predicted neoantigens from epidisco). Would be curious to know how different these neoepitope prediction pipelines are.

From @tavinathanson:

@jburos also re 2: I worry about versioning conflicts. For example, if varcode Effect prediction doesn't pop out of epidisco, we would do it within cohorts, while epidisco's neoepitope prediction might use a different version of varcode.

tavinathanson commented 7 years ago

Also related: tracking provenance. cohorts tracks Python package versions in PROVENANCE files inside the cache directory. I'm not sure how to pull out the software versions that epidisco used?
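
To make the provenance concern concrete, here is a minimal sketch of recording installed package versions and comparing two such records. It is not the actual cohorts implementation: the JSON layout and the function names `collect_versions`, `write_provenance`, and `diff_provenance` are assumptions made for illustration, and how to obtain the versions epidisco used remains the open question.

```python
import json
from importlib import metadata


def collect_versions(package_names):
    """Return {package: installed version, or None if not installed}."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def write_provenance(path, versions):
    """Write the version record as JSON (hypothetical file layout)."""
    with open(path, "w") as handle:
        json.dump(versions, handle, indent=2, sort_keys=True)


def diff_provenance(recorded, other):
    """Return {package: (recorded, other)} for shared packages whose versions differ."""
    return {
        name: (recorded[name], other[name])
        for name in sorted(set(recorded) & set(other))
        if recorded[name] != other[name]
    }
```

If a version record for an epidisco run were available in the same shape, `diff_provenance` would flag a package such as varcode being used at two different versions across the two tools.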

armish commented 7 years ago

@tavinathanson: definitely huge overlap between epidisco and cohorts. epidisco does neoepitope prediction, but the versioning conflicts you mentioned might become an issue.

When we were discussing this issue during the hackathon, we just thought that it would be great to leave the heavy lifting to epidisco and have the option to easily build a cohorts analysis on top of it. Re-mapped BAM files, HLA typing results and VCF files can all be consumed by cohorts if we simply mount the NFS that biokepi writes the results into; but I don't think the neoepitope predictions epidisco spits out will be that useful for the RCC project overall, given the iterative, exploratory nature of these checkpoint studies.

arahuja commented 7 years ago

> When we were discussing this issue during the hackathon, we just thought that it would be great to leave the heavy lifting to epidisco and have the option to easily build a cohorts analysis on top of it. Re-mapped BAM files, HLA typing results and VCF files

I'd agree with this: get the large compute portions from epidisco (alignment, VCF, etc.), use those as inputs, and let cohorts continue as-is to create the effects and annotations. This will make it easier to re-run effect annotation when bugs arise, or re-run isovar with different parameter settings, etc.

Probably a separate issue for biokepi will be how to handle dependencies that are not explicitly specified. For example, vaxrank was recently bumped to 0.2.5, which requires varcode>=0.5.1. However, varcode 0.5.8 has a particular bug fix in it. What's the best way to specify this?
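
One possible answer, offered as a sketch and not as what biokepi or epidisco does: the consuming project can declare its own stricter lower bound (for example `varcode>=0.5.8` in its requirements) alongside vaxrank's `varcode>=0.5.1`, so the effective minimum becomes the larger of the two. The toy code below only illustrates that arithmetic for plain dotted numeric versions; the helper names are hypothetical and this is not pip's resolver.

```python
def parse_version(text):
    """Parse a plain dotted numeric version such as '0.5.8' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, minimum):
    """True if `version` meets a '>=' lower bound of `minimum`."""
    return parse_version(version) >= parse_version(minimum)


def effective_minimum(lower_bounds):
    """Combine several '>=' lower bounds: the strictest (largest) one wins."""
    return max(lower_bounds, key=parse_version)


# vaxrank's bound alone admits a release without the bug fix
print(satisfies("0.5.7", "0.5.1"))            # True
# adding our own bound raises the effective minimum
print(effective_minimum(["0.5.1", "0.5.8"]))  # 0.5.8
print(satisfies("0.5.7", "0.5.8"))            # False
```

Comparing tuples of integers instead of raw strings matters here: as strings, "0.5.10" would sort before "0.5.8".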

tavinathanson commented 7 years ago

Certainly seems like that's the easiest thing for now: namely using epidisco for everything that cohorts doesn't do.

tavinathanson commented 7 years ago

Recent thoughts from @hammer (correct me if I'm paraphrasing incorrectly): epidisco for anything often generated, cohorts for anything exploratory.

Our current strategy doesn't quite fall into that description, since we're currently doing e.g. neoantigen calling in cohorts. Our current reasoning for doing that in cohorts is to be able to look at various intermediates that I don't believe are easily accessible from epidisco in its current form.

hammer commented 7 years ago

epidisco for anything automatically generated

tavinathanson commented 7 years ago

From @hammer:

Cohorts and Epidisco share some functionality. It would be nice to centralize the discussion of how we'll ensure they generate consistent results and ultimately separate concerns.

Currently my biggest concern is that Epidisco uses vaxrank but cohorts uses topiary. hammerlab/vaxrank#31 might solve this issue.