Open jburos opened 7 years ago
Thanks @tavinathanson! Not sure your earlier review went through. Either way I can't find it.
In general I think the exonic filter is the only one that might be redundant, and if so, it is redundant only within certain variant types, not all. Since my analysis compares rates of mutation types, it's important that I am 100% confident all counts fall within exonic regions. Moving them to `dups` would be undesirable, since a user would then have to know whether a function was a 'dup' (and which version is considered the dup and which the primary) in order to import it.
@tavinathanson I did a fair amount of refactoring to the functions, along the lines of the `expressed_of` filter you suggested, but taking it a bit further. The idea is to allow most of the variant/effects functions to be composable, so a user could do the following:

    missense_snv_count = only_missense(snv_count)
    expressed_exonic_frameshift_indel_count = only_expressed(only_exonic(only_frameshift(indel_count)))

This wouldn't yet work for neoantigens, but could if we refactored `load_neoantigens` and related code a bit (essentially, to limit variants/effects first and then compute neoantigens from that set).
Also, I'm not sure yet about the naming convention proposed here, but as an approach I think this could give users (us) the flexibility we sometimes need without having to create every combination of effects we might want in `cohorts.functions.py`.
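
To make the composition idea concrete, here is a minimal sketch of how such `only_*` wrappers could be built. It is not the code in this PR: the `make_filter` helper, the dictionary-based effect records, and their keys (`variant_type`, `effect_type`, `exonic`, `expressed`) are all hypothetical stand-ins for the real effect collections.

```python
# Hypothetical sketch: each effect is a plain dict here; the real code
# works with effect objects loaded per patient.

def snv_count(effects):
    """Count effects that come from SNVs."""
    return sum(1 for e in effects if e["variant_type"] == "snv")

def indel_count(effects):
    """Count effects that come from indels."""
    return sum(1 for e in effects if e["variant_type"] == "indel")

def make_filter(predicate):
    """Build an only_* wrapper that narrows the effects before counting."""
    def only(count_fn):
        def filtered(effects):
            return count_fn([e for e in effects if predicate(e)])
        return filtered
    return only

# Assumed predicates; the real filters would inspect the effect objects.
only_missense = make_filter(lambda e: e["effect_type"] == "missense")
only_frameshift = make_filter(lambda e: e["effect_type"] == "frameshift")
only_exonic = make_filter(lambda e: e["exonic"])
only_expressed = make_filter(lambda e: e["expressed"])

# Made-up example data for illustration only.
effects = [
    {"variant_type": "snv", "effect_type": "missense", "exonic": True, "expressed": True},
    {"variant_type": "snv", "effect_type": "silent", "exonic": True, "expressed": False},
    {"variant_type": "indel", "effect_type": "frameshift", "exonic": True, "expressed": True},
    {"variant_type": "indel", "effect_type": "frameshift", "exonic": False, "expressed": True},
]

missense_snv_count = only_missense(snv_count)
expressed_exonic_frameshift_indel_count = only_expressed(
    only_exonic(only_frameshift(indel_count))
)
```

Because every wrapper takes a counting function and returns a counting function with the same signature, the filters can be nested in any order without a dedicated function for each combination.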
Finally, to make this work relatively seamlessly, I had to modify `count_variants_function_builder` to load effects instead. This way, any future filtering function won't have to know whether it's working on a variant or an effect. I'm not sure what downstream implications this might have for validity/performance.
Note that I put in an auto-naming feature here, so that one could compute these on the fly, e.g. `cohort.plot_benefit(on=[only_exonic(only_frameshift(indel_count))])`, and the result would be named "exonic_frameshift_indel_count" by default. One could specify a custom `name` parameter, but may not want to.
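
One way the auto-naming could work is sketched below, assuming each wrapper prefixes its own label onto the wrapped function's `__name__`. The `named_filter` helper, the dict-based effects, and their keys are hypothetical and only illustrate the naming behaviour described above, not the actual implementation.

```python
# Hypothetical sketch of auto-naming: each wrapper derives a default name
# from its label plus the name of the function it wraps.

def indel_count(effects):
    """Count effects that come from indels (effects are plain dicts here)."""
    return sum(1 for e in effects if e["variant_type"] == "indel")

def named_filter(label, predicate):
    """Build an only_* wrapper that also names the function it returns."""
    def only(count_fn, name=None):
        def filtered(effects):
            return count_fn([e for e in effects if predicate(e)])
        # Default name: "<label>_<wrapped name>", unless a custom one is given.
        filtered.__name__ = name or "{}_{}".format(label, count_fn.__name__)
        return filtered
    return only

only_frameshift = named_filter("frameshift", lambda e: e["effect_type"] == "frameshift")
only_exonic = named_filter("exonic", lambda e: e["exonic"])

auto_named = only_exonic(only_frameshift(indel_count))
custom_named = only_exonic(only_frameshift(indel_count), name="my_indel_count")

print(auto_named.__name__)    # exonic_frameshift_indel_count
print(custom_named.__name__)  # my_indel_count
```

Since the default name is built from the inside out, the nesting order determines the order of the labels in the resulting name.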
Curious to know your thoughts on the general approach before I go too far down this road.
@tavinathanson just merged in your latest changes from master - would you mind reviewing again for a quick sanity check? I updated the description above to reflect all the various things that ended up in here.
A few changes ended up in this PR:
- `only_nonsynonymous(snv_count)` returns a function equivalent to `nonsynonymous_snv_count`. Similarly, `only_exonic(only_nonsynonymous(snv_count))` returns the count of exonic, nonsynonymous SNVs.
- `utils.py`
- `cohorts.cohort.caching` logger, to aid cache-debugging (vs other debugging)

Also deleted a random `quick-start.Rmd` file that accidentally found its way into our master branch.