omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)
MIT License

Feature request: flag to control how to process indels #66

Closed: jdblischak closed this issue 3 years ago

jdblischak commented 3 years ago

This is a follow-up to Issue #41 from @jerome-f.

When the allele order for an indel at the same coordinate differs between the sumstats file and the reference genotypes, there are 2 possible reasons:

  1. The distinction between A1 and A2 is biologically meaningful, and one indicates an insertion and the other a deletion (e.g. A1: C A2: TT vs A1: TT A2: C)
  2. The distinction is arbitrary. There are many reasons why GWAS summary statistics may swap the 2 alleles, e.g. A1 might always be the allele with higher frequency, or the effect allele

The current version of set_snpid_index() assumes the first scenario, and treats the swapped alleles as if they were completely different variants. This prevents any mistakes arising from misinterpreting insertions and deletions.
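To make the two scenarios concrete, here is a minimal sketch of an allele-order-sensitive variant ID. It is not the actual set_snpid_index() code: the function names, the ID format, the `indel_order_is_arbitrary` parameter, and the rule that SNP alleles are sorted while indel alleles are not are all assumptions made for illustration.

```python
def is_indel(a1: str, a2: str) -> bool:
    """Call a variant an indel if either allele is longer than one base."""
    return len(a1) > 1 or len(a2) > 1


def variant_id(chrom, pos, a1, a2, indel_order_is_arbitrary=False):
    """Build a hypothetical string ID for a variant.

    SNP alleles are always sorted, so A/G and G/A get the same ID (assumption).
    Indel alleles are sorted only if the caller says their order is arbitrary
    (scenario 2); otherwise the order is kept (scenario 1).
    """
    if not is_indel(a1, a2) or indel_order_is_arbitrary:
        a1, a2 = sorted((a1, a2))
    return f"{chrom}.{pos}.{a1}.{a2}"


# The same indel, with alleles in opposite order in the two files.
strict_match = variant_id(1, 1000, "C", "TT") == variant_id(1, 1000, "TT", "C")
relaxed_match = (
    variant_id(1, 1000, "C", "TT", indel_order_is_arbitrary=True)
    == variant_id(1, 1000, "TT", "C", indel_order_is_arbitrary=True)
)
# A SNP with swapped alleles matches either way in this sketch.
snp_match = variant_id(1, 2000, "A", "G") == variant_id(1, 2000, "G", "A")
```

Under scenario 1 the two indel IDs differ, so the variant looks like it is absent from the reference; under scenario 2 they collapse to one ID.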

However, it has other consequences for scenario 2. These indels are removed prior to fine-mapping, thus removing potential causal variants:

https://github.com/omerwe/polyfun/blob/b4655d0cdea44da39bbc60e664b2146228b241ad/finemapper.py#L270-L273

Also, it always invalidates the cached LD matrix file, and thus the LD matrix is always re-calculated:

https://github.com/omerwe/polyfun/blob/b4655d0cdea44da39bbc60e664b2146228b241ad/finemapper.py#L352-L354

I don't have a good sense for how often there are polymorphic parallel insertions/deletions at the same base pair coordinate. It seems to me like it would be a rare event, but I don't have any data to back up my intuition. And I understand why you would want to be cautious when combining sumstats with a reference panel such as the UKBB. However, in the case of using an in-sample LD matrix, this seems to only have downsides. If you are fine-mapping with the exact same genotypes you used for the original GWAS, it seems safe to assume that the 2 alleles were simply re-ordered.

Would you be open to adding a flag to finemapper.py to toggle this behavior? The default behavior would remain the same, but users could specify a flag such as --flip-indel-alleles to prevent removing these indels when fine-mapping with an in-sample LD matrix. I'm happy to implement everything, but I wanted to get your approval first.

omerwe commented 3 years ago

Hi John,

Sure, this sounds like a very reasonable idea. I mostly worked with UKBB sumstats, so I'm also not sure how often this might happen in practice. I'd rather keep the default behavior as it currently is to stay on the safe side. However, it might make sense to emit a warning in case we observed a "flipped" indel (with a suggestion to invoke the new flag), so that the users are aware of the potential loss of information.
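As a rough sketch of what an opt-in flag plus a warning could look like: this is not code from finemapper.py. The flag name is the one floated in the request above, and the helper names and the warning text are hypothetical.

```python
import argparse
import logging


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser with an opt-in flag; the default stays conservative."""
    parser = argparse.ArgumentParser(description="illustrative sketch only")
    parser.add_argument(
        "--flip-indel-alleles",
        action="store_true",
        default=False,
        help="treat indels with swapped alleles as the same variant",
    )
    return parser


def warn_about_swapped_indels(num_swapped: int, flag_enabled: bool):
    """Warn when swapped indels would be dropped; return the message or None."""
    if num_swapped > 0 and not flag_enabled:
        message = (
            f"{num_swapped} indel(s) have swapped alleles and will be removed; "
            "consider --flip-indel-alleles if the allele order is arbitrary"
        )
        logging.warning(message)
        return message
    return None


# No flag on the command line: default behavior is kept, and a warning is emitted.
default_args = build_parser().parse_args([])
default_message = warn_about_swapped_indels(3, default_args.flip_indel_alleles)

# Flag given: nothing is dropped for this reason, so no warning.
flagged_args = build_parser().parse_args(["--flip-indel-alleles"])
flagged_message = warn_about_swapped_indels(3, flagged_args.flip_indel_alleles)
```

Using `action="store_true"` keeps the current behavior unless the user opts in explicitly, which matches the "stay on the safe side" default.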

I'm pretty constrained for time at the moment. However, if you can introduce a pull request with this flag, I'll be happy to accept it. Otherwise I'll try to get around to it in one of the next few weekends.

jdblischak commented 3 years ago

> I'd rather keep the default behavior as it currently is to stay on the safe side.

Agreed

> However, it might make sense to emit a warning in case we observed a "flipped" indel (with a suggestion to invoke the new flag), so that the users are aware of the potential loss of information.

That's a good idea. I'll include some warnings

> I'm pretty constrained for time at the moment. However, if you can introduce a pull request with this flag, I'll be happy to accept it. Otherwise I'll try to get around to it in one of the next few weekends.

I've already started working on it. I'll send the PR when it's ready.

Thanks!

jerome-f commented 3 years ago

Hi John,

We might have to have both the flags available to be set, --allow-swapped-indel-alleles and --allow-missing, assuming we can have indels flipped and a few SNPs that are genuinely missing in the summary stats.

jdblischak commented 3 years ago

> We might have to have both the flags available to be set, --allow-swapped-indel-alleles and --allow-missing, assuming we can have indels flipped and a few SNPs that are genuinely missing in the summary stats.

You can set both flags at the same time. Any variant that is truly missing from the LD matrix (be it a SNP or an indel) will still be removed prior to fine-mapping. The flag --allow-swapped-indel-alleles only saves those indels where the chromosome and position are identical and the alleles are swapped.
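A minimal sketch of that matching rule, using plain tuples rather than the real data structures. The function name and the `(chrom, pos, a1, a2)` layout are hypothetical, and matching swapped SNP alleles unconditionally is an assumption of this sketch. It only shows which variants are kept or dropped, and does not model what --allow-missing itself does.

```python
def split_by_ld_match(sumstats, ld_variants, allow_swapped_indel_alleles=False):
    """Split sumstats variants into (kept, removed) relative to the LD matrix.

    Each variant is a (chrom, pos, a1, a2) tuple. An indel whose alleles are
    swapped relative to the LD matrix is kept only when the flag is set; a
    variant with no match at all is always removed.
    """
    ld_set = set(ld_variants)
    kept, removed = [], []
    for chrom, pos, a1, a2 in sumstats:
        exact = (chrom, pos, a1, a2) in ld_set
        swapped = (chrom, pos, a2, a1) in ld_set
        indel = len(a1) > 1 or len(a2) > 1
        # Assumption: swapped SNP alleles always count as a match.
        swap_ok = swapped and (not indel or allow_swapped_indel_alleles)
        if exact or swap_ok:
            kept.append((chrom, pos, a1, a2))
        else:
            removed.append((chrom, pos, a1, a2))
    return kept, removed


sumstats = [
    (1, 100, "A", "G"),   # exact match
    (1, 200, "C", "TT"),  # indel, alleles swapped in the LD matrix
    (1, 300, "G", "GA"),  # indel truly missing from the LD matrix
    (1, 400, "T", "C"),   # SNP, alleles swapped in the LD matrix
]
ld_variants = [(1, 100, "A", "G"), (1, 200, "TT", "C"), (1, 400, "C", "T")]

kept_default, removed_default = split_by_ld_match(sumstats, ld_variants)
kept_flag, removed_flag = split_by_ld_match(
    sumstats, ld_variants, allow_swapped_indel_alleles=True
)
```

With the flag set, only the indel at position 300 is removed, since nothing in the LD matrix shares its coordinates; the swapped indel at position 200 is rescued.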