omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)
MIT License

run S-LDSC to get the prior causal probabilities using self-defined annotations #136

Closed: wuyangf7 closed this issue 1 year ago

wuyangf7 commented 1 year ago

Hi Omer,

I just constructed my own annotations and computed LD scores following your wiki instructions. However, when I used the LD scores to compute the prior causal probabilities via an L2-regularized extension of S-LDSC, I got the error message "ValueError: Duplicate SNPs were found in the input data". I have checked that there are no duplicated SNPs in either my GWAS summary statistics or my LD reference. Could you please help me work out what the likely problem is?

omerwe commented 1 year ago

Hi @wuyangf7, does the message come from LDSC (parse.py) or from polyfun_utils.py? Can you please copy-paste the last few lines of the error message?

If you're somewhat familiar with Python, you could modify the code of the relevant script to insert a breakpoint right before the line that raises the Exception:

import ipdb
ipdb.set_trace()

Then, in the debugger you could look at the identities of the duplicated SNPs. I can write more detailed instructions once I understand where exactly the error is coming from.

wuyangf7 commented 1 year ago

Hi Omer,

I checked, and it comes from polyfun_utils.py. Please find the last few lines of the error message below.

File "/scratch/90days/uqywu16/devlop/PP2Pval/polyfun/polyfun/polyfun_utils.py", line 82, in set_snpid_index
    raise ValueError(error_msg)
ValueError: Duplicate SNPs were found in the input data:
                        SNP  CHR        BP  A1 A2
snpid
1.754182.A.G       rs3131969    1    754182   A  G
1.768448.A.G      rs12562034    1    768448   A  G
1.779322.A.G       rs4040617    1    779322   G  A
1.838555.A.C       rs4970383    1    838555   A  C
1.846808.C.T       rs4475691    1    846808   T  C
...                      ...  ...       ...  .. ..
22.51171497.A.G    rs2301584   22  51171497   A  G
22.51171693.A.G     rs756638   22  51171693   A  G
22.51175626.A.G    rs3810648   22  51175626   G  A
22.51178090.A.G    rs2285395   22  51178090   A  G
22.51219006.A.G   rs28729663   22  51219006   A  G

[1038220 rows x 5 columns]

omerwe commented 1 year ago

@wuyangf7 I forgot that the code already prints out the duplicate SNPs. As you can see, it clearly thinks (e.g.) that the SNP at chromosome 1 position 754182 is duplicated.

I don't think I can help unless you send me a reproducible example. Please double-check that you don't have two SNPs at chromosome 1 position 754182 with the same alleles. They may have different rsids, but the code looks at the SNP position + alleles, not the rsid (because rsids aren't necessarily unique). If you want to send me a reproducible example, please let me know and we'll arrange that.
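For anyone who wants to run this check outside PolyFun, here is a minimal sketch, assuming the data are loaded into a pandas DataFrame with the SNP, CHR, BP, A1 and A2 columns shown in the error output. The helper name `find_duplicate_snps` and the example rsids are hypothetical, and this is not PolyFun's own code. The key format (chromosome, position, then the two alleles in alphabetical order) is an assumption inferred from the printed snpid index, where rs4040617 with A1=G, A2=A appears as 1.779322.A.G.

```python
import pandas as pd


def find_duplicate_snps(df):
    """Return every row whose chromosome, position and unordered allele pair
    is shared with at least one other row (hypothetical helper)."""
    # Sort each allele pair so that A/G and G/A produce the same key.
    # Assumption: this mirrors the snpid format seen in the error output.
    keys = []
    for chrom, bp, a1, a2 in zip(df["CHR"], df["BP"], df["A1"], df["A2"]):
        first, second = sorted([a1, a2])
        keys.append(f"{chrom}.{bp}.{first}.{second}")
    keyed = df.assign(snpid=keys)
    # keep=False marks all members of a duplicated group, not just the repeats
    return keyed[keyed["snpid"].duplicated(keep=False)]


# Toy data with made-up rsids: the first two rows share position and alleles
example = pd.DataFrame({
    "SNP": ["rs_a", "rs_b", "rs_c"],
    "CHR": [1, 1, 1],
    "BP": [754182, 754182, 768448],
    "A1": ["A", "G", "A"],
    "A2": ["G", "A", "G"],
})
dups = find_duplicate_snps(example)
print(dups)
```

Running the same function on each input file separately (summary statistics, annotation files, LD-score files) may help narrow down which file introduces the duplicated keys. An empty result means no two rows in that file share a key.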

wuyangf7 commented 1 year ago

Hi Omer,

Thanks. I checked, and these SNPs are all unique and have unique positions. The printed output contains the entire summary data (its dimensions are 1038220 rows x 5 columns). I ran the same code with your suggested annotations and it worked well. The problem only occurs when I use the annotations I created myself following your instructions here: https://github.com/omerwe/polyfun/wiki/2.-Using-and-creating-functional-annotations. Could the problem be that my annotation files have two additional columns (allele information) compared to the standard LDSC format? Thanks for your help!