omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)
MIT License

Issues with precalculated UKBB LD matrices #138

Closed liqingbioinfo closed 1 year ago

liqingbioinfo commented 1 year ago

Hi Omer,

 I have some issues with the precalculated UKBB LD matrices. Many SNPs in the *.gz files (downloaded from https://alkesgroup.broadinstitute.org/UKBB_LD/) appear to have allele1 and allele2 reversed compared to dbSNP. For example, the following two lines appear in the "chr1_45000001_48000001.gz" file.

rsid        chromosome  position  allele1  allele2
rs10890343  1           46157922  C        T
rs6665808   1           46162283  C        T

 However, I checked dbSNP (https://www.ncbi.nlm.nih.gov/snp/) and found that the reference and alternative alleles are as follows:

rsid        chromosome  position  reference  alternative
rs10890343  1           46157922  C          T
rs6665808   1           46162283  C          T

These observations suggest that "allele1" should be the reference allele and "allele2" the alternative allele. Is that correct? Should I do the same in my summary statistics files, i.e. name all reference alleles "allele1" and all alternative alleles "allele2"?

 I ask because I got strange SuSiE results after running "finemapper.py". rs10890343 and rs6665808 are in strong LD (0.99) according to the "chr1_45000001_48000001.npz" file, so SuSiE should have difficulty distinguishing them and should group them into one credible set. However, I got two different credible sets, and each SNP has a PIP of 1. The only explanation I have found so far is that my "ALLELE2" matches "allele1" in your chr1_45000001_48000001.gz file.

SNP         CHR  BP        ALLELE2  ALLELE1  P           PIP  CREDIBLE_SET
rs10890343  1    46157922  C        T        0.00239711  1    chr1:45000001-48000001:3
rs6665808   1    46162283  C        T        0.00244866  1    chr1:45000001-48000001:4

  ALLELE2 and ALLELE1 in my summary statistics files match the dbSNP reference and alternative alleles, respectively, and I think that is what most people do. So, to make sure the precalculated UKBB LD matrices are used correctly with my summary statistics, I would need to rename the "allele1" column in your *.gz file to "ALLELE2". Alternatively, I could change the column names in my summary statistics files, but since those files match dbSNP, they should be correct. What do you think?
 I really appreciate your detailed explanations and your time.

omerwe commented 1 year ago

Hi @liqingbioinfo, I think you're asking two separate questions:

  1. Should the definitions of "reference" and "alternative" alleles match dbSNP? No, you don't need to worry about that. The definition of a reference allele is arbitrary anyway. PolyFun aligns the alleles across all the input files, so it doesn't matter how you defined each allele. Just keep in mind that the effect sizes are estimated with respect to the alternative allele (ALLELE2 in the SuSiE file).

  2. Why do two highly correlated SNPs get assigned to two different credible sets? I don't know. I would diagnose this in more detail. For example, do you get similar results with FINEMAP? What happens if you keep only these two SNPs in the analysis? And so on.
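
To illustrate what "aligning the alleles" in point 1 means in general, here is a minimal sketch. It is not PolyFun's actual code: the function name `align_zscore` is hypothetical, and it assumes both files report alleles on the same strand, so strand flips are deliberately not handled. The idea is only that when the two alleles of a SNP are listed in the opposite order, the sign of the z-score (or effect size) is flipped rather than the SNP being treated as a different variant.

```python
def align_zscore(sumstat_alleles, ld_alleles, z):
    """Express a summary-statistics z-score relative to the LD file's allele order.

    sumstat_alleles and ld_alleles are (first_allele, second_allele) tuples.
    Returns z unchanged if the order matches, -z if the alleles are swapped,
    and None if the allele pairs do not match at all (illustrative convention).
    Strand flips (e.g. C/T vs. G/A) are NOT handled in this sketch.
    """
    a1, a2 = (a.upper() for a in sumstat_alleles)
    b1, b2 = (a.upper() for a in ld_alleles)
    if (a1, a2) == (b1, b2):
        return z       # same order: keep the sign
    if (a1, a2) == (b2, b1):
        return -z      # swapped order: flip the sign
    return None        # different alleles: cannot be aligned here


# Hypothetical z-score of 3.0 for a C/T SNP
same = align_zscore(("C", "T"), ("C", "T"), 3.0)
swapped = align_zscore(("T", "C"), ("C", "T"), 3.0)
mismatch = align_zscore(("C", "G"), ("C", "T"), 3.0)
```

With this convention it makes no difference which allele a file calls "allele1" or "reference", as long as the sign of the effect is interpreted relative to the same allele everywhere.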

If you think this is a bug, you may want to ask the authors of SuSiE what they think...
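
One way to start the two-SNP diagnosis suggested above is a rough consistency check between the z-scores and the LD value. This is a sketch under stated assumptions, not part of PolyFun or SuSiE: it assumes the two z-scores are approximately bivariate normal with unit variances and correlation equal to the LD coefficient r, in which case z1 given z2 has mean r * z2 and variance 1 - r^2. The function name `conditional_z` and the z-scores below are hypothetical.

```python
import math


def conditional_z(z1, z2, r):
    """Standardized residual of z1 given z2, assuming a bivariate normal
    with unit variances and correlation r (an approximation, see lead-in)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must be strictly between -1 and 1")
    return (z1 - r * z2) / math.sqrt(1.0 - r * r)


# Hypothetical z-scores of equal magnitude for two SNPs with LD r = 0.99
consistent = conditional_z(3.0, 3.0, 0.99)   # same sign
flipped = conditional_z(3.0, -3.0, 0.99)     # opposite signs
```

Here `consistent` is well below 1 in absolute value, while `flipped` is above 40. A very large value like the second would suggest that the z-scores and the LD matrix disagree for this pair (for example because of a sign or allele mismatch), which is worth ruling out before concluding that the fine-mapping method itself misbehaves.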

Sorry I couldn't help more, I'm happy to answer more specific questions.

liqingbioinfo commented 1 year ago

Hi Omer

Thank you for clarifying the allele1 and allele2 definitions, and many thanks for including allele alignment in PolyFun. I will ask the SuSiE authors to help interpret my strange credible sets.

Best regards, Leah