omerwe / polyfun

PolyFun (POLYgenic FUNctionally-informed fine-mapping)
MIT License

some discrepancy between original LDSC-S and polyfun LDSC in enrichment analysis? #81

Closed Ojami closed 2 years ago

Ojami commented 2 years ago

Hi Omer,

I ran and compared two partitioned heritability analyses, one with the original LDSC and the other with PolyFun's LDSC (py3):

./ldsc.py \
--h2 sumstat.gz \
--ref-ld-chr baselineLF_v2.2.UKB/baselineLF2.2.UKB. \
--chisq-max 9999999.0 \
--not-M-5-50  \
--out /enrichment \
--overlap-annot  \
--w-ld-chr baselineLF_v2.2.UKB/weights.UKB. 

and

./ldsc.py \
--h2 sumstat.parquet \
--ref-ld-chr baselineLF2.2.UKB/baselineLF2.2.UKB. \
--chisq-max 9999999.0 \
--not-M-5-50  \
--out /enrichment \
--overlap-annot  \
--w-ld-chr baselineLF2.2.UKB/weights.UKB. 

sumstat.parquet and sumstat.gz contain the same data (such a pity that LDSC still uses py 2.7, though). I expected identical output, but the enrichments (and therefore the p-values) differ for some terms, although the overall agreement is quite good. Is this normal and expected? For example, for the term H3K4me3_peaks_Trynka_common_0, the enrichment from LDSC is -5.4102, while PolyFun's LDSC gives -4.4186.

I also noticed that while lambda GC from the original LDSC makes sense (~1.07), PolyFun's LDSC gives a tiny value (<1e-7). Univariate LDSC also gives a reasonable lambda GC and ratio, but partitioned LDSC gives a ratio < 1 (PolyFun's LDSC outputs NA for the ratio).

Thanks! Oveis


UPDATE

I used baselineLF2.2.UKB for PolyFun and baselineLF_v2.2.UKB for LDSC. Is that the reason why they're different? (I assume it cannot be, since one seems to be just the parquet version of the other.) Even so, PolyFun still gives a strange lambda GC and doesn't estimate the ratio (I assume it reports NA on purpose because of the very low GC).
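For readers unfamiliar with the two summary quantities mentioned above, here is a minimal sketch of how they are conventionally defined: lambda GC as the median chi^2 statistic divided by about 0.4549 (the median of a 1-df chi-square distribution), and the ratio as (intercept - 1) / (mean chi^2 - 1), which is only meaningful when mean chi^2 exceeds 1. The function names and the simulated statistics are hypothetical illustrations, not code or data from LDSC or PolyFun.

```python
import numpy as np

# Approximate median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549


def lambda_gc(chisq):
    """Genomic-control inflation factor: median(chi^2) / 0.4549."""
    chisq = np.asarray(chisq, dtype=np.float64)
    return float(np.median(chisq) / CHI2_1DF_MEDIAN)


def ldsc_ratio(intercept, mean_chisq):
    """(intercept - 1) / (mean chi^2 - 1), or None when mean chi^2 <= 1."""
    if mean_chisq > 1:
        return (intercept - 1.0) / (mean_chisq - 1.0)
    return None  # reported as NA: the denominator is not meaningful


# Simulated null statistics (hypothetical data, not from this issue).
rng = np.random.default_rng(0)
null_chisq = rng.standard_normal(200_000) ** 2

null_lambda = lambda_gc(null_chisq)
print(null_lambda)              # close to 1 for null statistics
print(ldsc_ratio(1.02, 1.25))   # ratio is defined because mean chi^2 > 1
print(ldsc_ratio(1.02, 0.9))    # None, i.e. NA
```

Under these definitions, a lambda GC orders of magnitude below 1 for real GWAS summary statistics points to a reporting problem rather than to the data, and an NA ratio follows whenever the reported mean chi^2 is not above 1.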

omerwe commented 2 years ago

Hi @Ojami, thanks for the bug report. I looked into this, and there are two separate things going on here:

  1. There was a bug in the LDSC version of PolyFun that affected the reporting of lambda_GC and mean chi^2. I'm almost 100% sure it didn't affect anything except these two. I never noticed it because I never made use of these two quantities. I just pushed a fix to GitHub; can you please git pull and let me know if this solves the issue?

  2. There are some numerical differences between the PolyFun version of LDSC and the "official" version. These are probably because PolyFun uses a 32-bit representation to save memory, whereas LDSC uses a 64-bit representation. My guess is that the inconsistent estimates are mostly in annotations with large standard errors (so there's a lot of uncertainty to begin with). I personally think incurring some small inconsistencies is worth it to halve the memory use, but there's a small price to pay...
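To illustrate the second point, here is a minimal sketch of how the same least-squares fit can give slightly different coefficients in 32-bit and 64-bit floating point. It uses toy data and a plain normal-equations solve; `fit_least_squares` and all variable names are hypothetical, and this is not the regression code of either PolyFun or LDSC.

```python
import numpy as np


def fit_least_squares(X, y, dtype):
    """Solve the normal equations with all arithmetic done in `dtype`."""
    Xd = X.astype(dtype)
    yd = y.astype(dtype)
    xtx = Xd.T @ Xd  # sums over all rows are accumulated in `dtype`
    xty = Xd.T @ yd
    return np.linalg.solve(xtx, xty)


# Toy data: 5 correlated "annotation" columns (hypothetical, for illustration).
rng = np.random.default_rng(42)
n_snps, n_annot = 50_000, 5
shared = rng.standard_normal((n_snps, 1))
X = shared + rng.standard_normal((n_snps, n_annot))
true_coef = np.array([0.5, -0.2, 0.1, 0.0, 0.3])
y = X @ true_coef + rng.standard_normal(n_snps)

coef64 = fit_least_squares(X, y, np.float64)
coef32 = fit_least_squares(X, y, np.float32)
max_abs_diff = float(np.max(np.abs(coef64 - coef32.astype(np.float64))))

print(coef64)
print(coef32)
print(max_abs_diff)  # small but nonzero

# The 32-bit copy of the design matrix takes half the memory.
print(X.astype(np.float32).nbytes, X.nbytes)
```

In this well-conditioned toy problem the gap is tiny. With many correlated annotations and poorly determined coefficients, rounding differences of this kind can be amplified, which is consistent with the guess that the visible discrepancies sit mostly in annotations with large standard errors.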

I hope this resolves this problem, please close the issue if it does!

Ojami commented 2 years ago

Fixed now! Though I totally agree that S-LDSC users probably won't even look at GC, it may be useful for other standard LDSC functionality (the intercept, for instance), especially since PolyFun's LDSC works with py 3.

Thanks Omer!