omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)
MIT License
85 stars 21 forks source link

The rationale behind set_snpid_index #147

Closed Ojami closed 1 year ago

Ojami commented 1 year ago

Hi,

I may be quite wrong, but I was wondering why set_snpid_index in parse function behaves in a way that flips A1/A2 in the resulting dataframe index for some variants? I have a summary stat file, with this particular SNP (output of set_snp_index):

         SNP         CHR      BP          Z         N      A1     A2     A1_first    A1s    A2s         snpid      
    _____________    ___    _______    _______    _____    ___    ___    ________    ___    ___    ________________

    "rs572757938"    21     9660864    0.47446    35146    "A"    "G"     true       "A"    "G"    "21.9660864.A.G"

Next, when the function parses baselineLF2.2.UKB.21.l2.ldscore.parquet, it flips A1/A2 (annotation columns have been removed):

    CHR         SNP           BP       A1     A2     A1_first    A1s    A2s         snpid      
    ___    _____________    _______    ___    ___    ________    ___    ___    ________________

    21     "rs572757938"    9660864    "G"    "A"     false      "A"    "G"    "21.9660864.A.G"

So, from PolyFun's perspective, this is a match, however A1/A2 is different between the sumstat and ld.score file. Shouldn't the function flip the sign of Z column for such SNPs as well? Am I missing something?

Thanks Best/Oveis

omerwe commented 1 year ago

@Ojami set_snpid_index doesn't flip A1 or A2, it just creates a unique id for each SNP. This is just our way to address the fact that rsids aren't necessarily unique 🤦 When the software needs to look at Z scores, it does take care to look at which variant is in A1 and which is in A2 (regardless of the unique SNP index).

I hope it's clear, please let me know if not

Ojami commented 1 year ago

Thanks for your answer Omer!

Yes that cleared it up! I was thinking more from a matching perspective. As an example, if I flip A1/A2 columns in my input sumstat file (and change BETA and A1FREQ columns), the number of overlapping variants will differ compared to the original file (--allow-missing). So, ideally (given the same population as UKBB in my example), one would opt for the maximum overlap between sumstat and ldscore files (?).

I guess smart_merge can do that if finds such non-matching variants between sumstat and ldscore:

 if x.index.name == 'snpid' and y.index.name == 'snpid':
            out = pd.merge(x, y.drop(columns=['SNP']), how='inner', left_index=True, right_index=True)

            df_nonmatching = y[~y.index.isin(x.index)]
            x = x[~x.index.isin(out.index)]
            df_nonmatching.index = df_nonmatching.index.str.split('.').str[:2].str.join(
                '.') + '.' + df_nonmatching.index.str.split('.').str[2:].str[::-1].str.join('.')

            df_matching = pd.merge(x, df_nonmatching.drop(columns=['SNP']),
                                                how='inner', left_index=True,
                                                right_index=True)

            # flip A1, A2 and sing of Z
            if df_matching.shape[0] > 0:
                df_matching['Z'] = -df_matching['Z']
                df_matching['A1'], df_matching['A2'] = df_matching['A2'], df_matching['A1']

            out = pd.concat([out, df_matching], axis=0)
            out = out.loc[np.sort(out.index)]
omerwe commented 1 year ago

@Ojami flipping A1 and A2 for SNPs shouldn't change the number of matching variants. However, please note that for small indels, the identities of A1 and A2 actually make a real difference, because they allow it to distinguish between deletion (AA->A) and insertion (A->AA). I suspect that the different numbers you've seen are because of small indels.

If you can come up with a simple example that demonstrates these changing numbers (and isn't due to indels), I'm happy to look at it.

Ojami commented 1 year ago

@omerwe Yes they are all indels in fact (apologies I forgot to mention that). my snippet above can handle such indels, but it's not computationally friendly. I guess in the end, user must take care of such indels beforehand and flip them if necessary. On a similar and related note, compute_ld_bgen would throw error when trying to assert df_snp.shape[0]==1 because of non-unique rsids, which can be fixed by:

        df_z['ID'] = df_z[['CHR', 'BP', 'A1', 'A2']].apply(lambda row: '.'.join(row.values.astype(str)), axis=1)
        rsids = bfile.fetch(self.chr, locus_start, locus_end)

        for snp_i, rsid in enumerate(rsids):
            if rsid.rsid not in df_z['SNP'].values: continue
            snp_alleles = rsid.alleles
            snp_chrom = rsid.chrom
            snp_pos = rsid.pos
            assert len(snp_alleles) == 2, 'cannot handle SNPs with more than two alleles'
            df_snp = df_z.query('SNP == "%s"' % (rsid.rsid))

            try:
                assert df_snp.shape[0] == 1
            except:
                # handle non-unique rsids in sumstat
                rsid_index = [f"{rsid.chrom}.{rsid.pos}.{alleles}" for alleles in [".".join(rsid.alleles)]]
                df_snp = df_snp.loc[df_snp.ID.isin(rsid_index)]

Best/Oveis

jdblischak commented 1 year ago

Yes they are all indels in fact (apologies I forgot to mention that). my snippet above can handle such indels, but it's not computationally friendly. I guess in the end, user must take care of such indels beforehand and flip them if necessary.

@Ojami I think you might be interested in the flag --allow-swapped-indel-alleles, which I implemented in #67. This changes the default behavior to be less stringent when matching indels (and thus you wouldn't have to manually flip them beforehand).

Ojami commented 1 year ago

@jdblischak Thanks for this! Yes I checked this out as well, but as far as I see, this flag has not ben implemented in polyfun and is only available in finemapper. Am I mistaken?

jdblischak commented 1 year ago

this flag has not ben implemented in polyfun and is only available in finemapper

That's correct. I needed this flag when I was performing non-functional fine-mapping. It makes sense to add it to polyfun.py as well, since it is calling the same underlying function polyfun_utils.set_snpid_index()

Ojami commented 1 year ago

I see! I'll try to add it to polyfun as well. This will fix the problem with the indels matching. Thanks a bunch!

jdblischak commented 1 year ago

I'll try to add it to polyfun as well.

@Ojami Just checking in for a status update. Do you still want to implement the flag and send a Pull Request?

Ojami commented 1 year ago

@jdblischak not yet, didn't have much time to check it. But did a quick try and looks polyfun_utils.set_snpid_index() is not enough and parse.set_snpid_index should be edited accordingly, since this is the function that poylfun calls first.

Ojami commented 1 year ago

@jdblischak there may be an issue with allow-swapped-indel-alleles flag when both A1/A2 and A2/A1 (in and del) are present in the sumstats file:

    CHR      BP            SNP           N        MAF         A2         A1           Z 
    ___    _______    _____________    _____    ________    _______    _______    _________ 

     1     5810611    "rs541092568"    35146     0.16259    "T"        "TTGTC"    -0.056526
     1     5810611    "rs541092568"    35146    0.011225    "TTGTC"    "T"          0.13351 

Since allow_duplicates is False by default (which makes sense), set_snpid_index would throw an error in such cases. Maybe user must dedup such indels beforehand (maybe keep the one with lowest P). What do you think?

Best/Oveis

jdblischak commented 1 year ago

there may be an issue with allow-swapped-indel-alleles flag when both A1/A2 and A2/A1 (in and del) are present in the sumstats file:

@Ojami Now I'm a bit confused. Your data clearly distinguish between the insertion and deletion. Each indel has its own MAF and Z score. If these were measuring the same allele, then the MAF should be the same, and the Z score would only have a different sign.

I created the flag --allow-swapped-indel-alleles (PR #67) for the use case where both my GWAS data and my reference LD matrix only had a single indel variant at the position, and the indel was getting unnecessarily removed simply because the A1/A2 designations were swapped.

Based on skimming the above thread, I think you should be using the default behavior that @omerwe described above:

please note that for small indels, the identities of A1 and A2 actually make a real difference, because they allow it to distinguish between deletion (AA->A) and insertion (A->AA). I suspect that the different numbers you've seen are because of small indels.

But in general, I am only comfortable advising you on the technical aspects of running PolyFun. In other words, I can tell you which flags to use to get a certain behavior. But I don't have the time to determine which behavior is correct for your particular data set. Either you or a colleague will need to determine how to handle these indel variants for your fine-mapping application. For example, if you are most interested in fine-mapping a few particular regions, and those regions don't have these indel variants, then you could save yourself the trouble by simply removing them