Don't cache after only top-priority effects; post-filter instead

tavinathanson commented 7 years ago

Oof, I found what seems to be a gnarly bug in Cohorts. It shouldn't impact any conclusions we've made, but it does impacts effect/variant-with-effect counts.

Notebook with example: https://github.com/hammerlab/bladder-analyses/blob/master/analyses/notebooks/Surfacing%20a%20load_effects%20bug%20in%20Cohorts.ipynb

Up until this PR, we never saved all effects; only the top-priority effects. Currently, we do save all effects, but usually we only use the top-priority effects.

What if we have a Substitution that is lower priority than another effect, but we end up filtering to just Substitutions? I ran into this while trying to generate all missense SNVs in the bladder cohort ( (and verify # == missense_snv_count) for Alex S.

Note: I could see an argument for this not being a bug in some scenarios, when we only care about the top priority effect.

The solution seems clear: we need to apply top-priority filtering as a post-filter rather than caching it.

tavinathanson commented 7 years ago

The good news is that missense_* and nonsynonymous_* functions use only_nonsynonymous=True, which is cached separately and helps a lot (it filters by priority after filtering by silent/noncoding).

This bug could explain why we sometimes find discrepancies between functions that should be equivalent.

tavinathanson commented 7 years ago

Oops, this is worse than I thought. https://github.com/hammerlab/cohorts/pull/165 made it so only_nonsynonymous uses the effect collection that was previously filtered to top priority, and then filters further. That is definitely wrong.

tavinathanson commented 7 years ago

Fixed in #254

hammerlab / cohorts

Don't cache after only top-priority effects; post-filter instead #252