statgen / emeraLD

tools to efficiently retrieve and calculate LD
32 stars 8 forks source link

LD difference between plink and emeraLD using unphased 1kg data #7

Open xtmgah opened 5 years ago

xtmgah commented 5 years ago

Hello:

I am trying to compare the result using 1kg unphased between plink and emeraLD, and I got totally different result. Also, I found a lot of LD generated by emeraLD have R vale >1 or <-1; Can you tell me the difference of LD calculation between plink and emeraLD? Here is one example; Thanks.

$emeraLD -i test.vcf.gz --stdout --dstats --no-phase emeraLD v0.1 (c) 2018 corbin quick (corbinq@gmail.com)

reading from vcf file...

assuming unphased data (reporting diploid genotype LD)...

processed genotype data for 503 individuals...

calculating LD for 2 SNPs...

CHR POS1 POS2 R Rsq D Dprime

9 133830619 134457580 -15.71375 246.92204 -0.66269 -99.00000 done!! thanks for using emeraLD

plink1.9 --vcf test.vcf.gz --ld rs3780269 rs111207562
PLINK v1.90b3.44 64-bit (17 Nov 2016) https://www.cog-genomics.org/plink2 (C) 2005-2016 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to plink.log. Options in effect: --ld rs3780269 rs111207562 --vcf test.vcf.gz

257652 MB RAM detected; reserving 128826 MB for main workspace. --vcf: plink-temporary.bed + plink-temporary.bim + plink-temporary.fam written. 2 variants loaded from .bim file. 503 people (0 males, 0 females, 503 ambiguous) loaded from .fam. Ambiguous sex IDs written to plink.nosex . Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 503 founders and 0 nonfounders present. Calculating allele frequencies... done. 2 variants and 503 people pass filters and QC. Note: No phenotypes present.

--ld rs3780269 rs111207562:

R-sq = 0.0145356 D' = 0.17131

Haplotype Frequency Expectation under LE


      AT      0.137429                0.165839
      GT      0.360583                0.332173
      AC      0.195573                0.167163
      GC      0.306415                0.334825

In phase alleles are AC/GT

corbinq commented 5 years ago

Thanks very much for the bug report.

The discrepancy in LD estimates appears to have been due to a bug processing unphased genotypes for variants with MAF > 50%.

Also, the "dstats" option is currently only supported for phased genotypes.

I've corrected the bug and added error messages accordingly. If you could confirm that the output looks OK, I'll go ahead and close this out.

corbinq commented 5 years ago

Also: To calculate the same r2 statistic using PLINK, use --r2 rather than --ld. Running the PLINK command below should produce output equivalent to emeraLD:

plink-1.9 --vcf test.vcf.gz --r2 --ld-snps rs3780269 rs111207562 --ld-window-r2 0