omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)
MIT License

is_indel #41

Closed jerome-f closed 3 years ago

jerome-f commented 3 years ago

Hi Omer,

The following code block in _polyfunutils.py checks for indels:

def set_snpid_index(df, copy=False, allow_duplicates=False):
    if copy:
        df = df.copy()
    is_indel = (df['A1'].str.len()>1) | (df['A2'].str.len()>1)
    df['A1_first'] = (df['A1'] < df['A2']) | is_indel
    df['A1s'] = df['A2'].copy()

The logic here creates a new snpid index, which fails for indels whose alleles are flipped in the summary stats relative to the LD panel. Example below:

snpid
22.51164013.T.TGTG                          22_51164013_226995   22  51164013      T                          TGTG 

This ends up throwing the warning "X variants with sumstats were not found in the LD file and will be omitted". I am not sure whether this is the intended behavior or a bug. If I comment out the is_indel term in _polyfunutils.py, then finemapper.py identifies the indels that are flipped between the LD panel and the summary stats. Any help here is appreciated.
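To make the mismatch concrete, here is a minimal sketch of the ID construction. Only the is_indel and A1_first logic comes from the snippet quoted above. The helper name `canonical_snpid`, the `CHR.BP.A1.A2` ID format (inferred from the example ID 22.51164013.T.TGTG), and the second variant at position 51164100 are assumptions for illustration, not necessarily what the repository does.

```python
import pandas as pd

def canonical_snpid(df):
    """Hypothetical helper: sort alleles for SNPs, keep the given order for indels."""
    is_indel = (df['A1'].str.len() > 1) | (df['A2'].str.len() > 1)
    # SNP alleles are sorted alphabetically; indel alleles keep their given order
    a1_first = (df['A1'] < df['A2']) | is_indel
    a1s = df['A1'].where(a1_first, df['A2'])
    a2s = df['A2'].where(a1_first, df['A1'])
    # Assumed ID format: CHR.BP.A1.A2
    return df['CHR'].astype(str) + '.' + df['BP'].astype(str) + '.' + a1s + '.' + a2s

# One indel (from the example above) and one made-up SNP for comparison
sumstats = pd.DataFrame({'CHR': [22, 22], 'BP': [51164013, 51164100],
                         'A1': ['T', 'G'], 'A2': ['TGTG', 'A']})
# Same variants, but with the allele order swapped, as in a flipped LD panel
ld_panel = sumstats.rename(columns={'A1': 'A2', 'A2': 'A1'})

ids_ss = canonical_snpid(sumstats)
ids_ld = canonical_snpid(ld_panel)
matched = ids_ss.isin(ids_ld)
print(pd.DataFrame({'sumstats_id': ids_ss, 'ld_id': ids_ld, 'matched': matched}))
```

Under these assumptions the flipped SNP gets the same ID on both sides, while the flipped indel gets 22.51164013.T.TGTG in one file and 22.51164013.TGTG.T in the other, so it is reported as missing from the LD file.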

Best Jerome

omerwe commented 3 years ago

Hi Jerome,

This is the intended behavior. Indels cannot be flipped like regular SNPs, because flipping an indel changes its meaning. Specifically, T->TGTG is an insertion, whereas TGTG->T is a deletion. If we allow for flips, we cannot distinguish between them. For example, if the reference has TGTG and you see a variant T->TGTG in your sumstats, and you allow for flips in the allele order, it could mean two things:

  1. An individual has TGTGGTG in that locus
  2. An individual has T in that locus

You cannot distinguish between these two options, unless the order of the alleles is preserved between the sumstats and the reference file.
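The two readings can be sketched by applying each one to the reference sequence. The helper `apply_variant` is hypothetical and assumes the variant is anchored at the first base of the reference string; it is only meant to illustrate the ambiguity, not how PolyFun handles alleles.

```python
def apply_variant(reference, ref_allele, alt_allele):
    """Hypothetical helper: replace ref_allele at the start of reference with alt_allele."""
    if not reference.startswith(ref_allele):
        raise ValueError("reference does not start with the ref allele")
    return alt_allele + reference[len(ref_allele):]

reference = "TGTG"
as_insertion = apply_variant(reference, "T", "TGTG")  # read as T -> TGTG
as_deletion = apply_variant(reference, "TGTG", "T")   # read as TGTG -> T
print(as_insertion, as_deletion)
```

The same allele pair yields two different sequences depending on which allele is treated as the reference, which is why the allele order has to be preserved for indels.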

Cheers,

Omer

jerome-f commented 3 years ago

Got it. Thanks for the clarification Omer. I appreciate it.

Best Jerome
