Closed tavinathanson closed 7 years ago
The good news is that missense_*
and nonsynonymous_*
functions use only_nonsynonymous=True
, which is cached separately and helps a lot (it filters by priority after filtering by silent/noncoding).
This bug could explain why we sometimes find discrepancies between functions that should be equivalent.
Oops, this is worse than I thought. https://github.com/hammerlab/cohorts/pull/165 made it so only_nonsynonymous
uses the effect collection that was previously filtered to top priority, and then filters further. That is definitely wrong.
Fixed in #254
Oof, I found what seems to be a gnarly bug in Cohorts. It shouldn't impact any conclusions we've made, but it does impacts effect/variant-with-effect counts.
Notebook with example: https://github.com/hammerlab/bladder-analyses/blob/master/analyses/notebooks/Surfacing%20a%20load_effects%20bug%20in%20Cohorts.ipynb
Up until this PR, we never saved all effects; only the top-priority effects. Currently, we do save all effects, but usually we only use the top-priority effects.
What if we have a Substitution that is lower priority than another effect, but we end up filtering to just Substitutions? I ran into this while trying to generate all missense SNVs in the bladder cohort ( (and verify # ==
missense_snv_count
) for Alex S.Note: I could see an argument for this not being a bug in some scenarios, when we only care about the top priority effect.
The solution seems clear: we need to apply top-priority filtering as a post-filter rather than caching it.