hammerlab / cohorts

Utilities for analyzing mutations and neoepitopes in patient cohorts
Apache License 2.0
20 stars 4 forks source link

Add indel effects #243

Open jburos opened 7 years ago

jburos commented 7 years ago

A few changes ended up in this PR:

Also deleted a random quick-start.Rmd file that accidentally found its way into our master branch.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.2%) to 52.86% when pulling 95da13eafba18927c8bdb4a321a2b298f3bc7744 on add-indel-effects into c5844cd91a5e27308b59d870b4bf34dfe331e80e on master.

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-0.3%) to 52.409% when pulling fd3b28398951eb88d4ae763387bed9eacd1dbca7 on add-indel-effects into c5844cd91a5e27308b59d870b4bf34dfe331e80e on master.

jburos commented 7 years ago

Thanks @tavinathanson! Not sure your earlier review went through. Either way I can't find it.

In general I think the exonic filter is the only one that might be redundant - if so, it is only redundant within certain variant types but not all. In my analysis since I'm comparing rates of types of mutations, it's important that I am 100% confident all counts are within exonic regions. Moving them to dups would be undesirable since a user would then have to know if it was a 'dup' or not (and which version is considered the dup & which the primary) in order to import it.

jburos commented 7 years ago

@tavinathanson did a fair amount of refactoring to the functions, along the lines of the expressed_of filter you suggested but taking it a bit further. The idea is to allow most of the variant/effects functions to be composable -- so a user could do the following:

missense_snv_count = only_missense(snv_count)
expressed_exonic_frameshift_indel_count = only_expressed(only_exonic(only_frameshift(indel_count)))

This wouldn't yet work for neoantigens, but could if we refactored the load_neoantigens & related code a bit (essentially to limit variants/effects first & then compute neoantigens from that set).

Also, not sure yet about the naming convention proposed here, but as an approach I think this could give users (us) the flexibility sometimes required without having to create every combination of effects we might want in cohorts.functions.py.

Finally, to make this work relatively seamlessly, I had to modify count_variants_function_builder to instead load effects -- this way any future filtering function wouldn't have to know if it's working on a variant or an effect. Not sure what downstream implications this might have for validity/performance.

Note that I put in here an auto-naming feature, so that one could compute these on-the-fly.

IE cohort.plot_benefit(on=[only_exonic(only_frameshift(indel_count))]) and the result would be named as "exonic_frameshift_indel_count" by default. One could specify a custom "name" parameter, but may not want to.

Curious to know your thoughts on the general approach before I go too far down this road.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+1.5%) to 54.238% when pulling d242c443afd49384783d27f713d4a913003a6eed on add-indel-effects into c5844cd91a5e27308b59d870b4bf34dfe331e80e on master.

jburos commented 7 years ago

@tavinathanson just merged in your latest changes from master - would you mind reviewing again for a quick sanity check? I updated the description above to reflect all the various things that ended up in here.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+1.7%) to 54.42% when pulling 1c0429fb330029365a073621016fff6063a81449 on add-indel-effects into f9f4af2384760a656acadbfd74c5ccd8fade522a on master.