related-sciences / gwas-analysis

GWAS data analysis experiments
Apache License 2.0
24 stars 6 forks source link

Determine why Hail population stratification results have minor, irritating differences from PLINK results #1

Closed eric-czech closed 4 years ago

eric-czech commented 4 years ago

The PLINK population stratification tutorial results in a projection like this:

Screen Shot 2020-01-20 at 2 29 22 PM

zooming in on EUR (1KG Europeans) and OWN (HapMap individuals):

Screen Shot 2020-01-20 at 2 29 31 PM

The comparable Hail results look like this:

Screen Shot 2020-01-20 at 2 31 29 PM

zoom on EUR + OWN:

Screen Shot 2020-01-20 at 2 31 44 PM

In both cases, the HapMap samples can be isolated with the 1KG EUR samples w/o including samples from some other 1KG population, but the extra structure inherent to the HapMap samples that Hail is implying is intriguing. The PLINK results show almost perfect compatibility between the two datasets and do not indicate any meaningful difference.

The PLINK projection is using IBS to get a distance measure and then passing this to an MDS implementation. The Hail version, adopted from the Hail GWAS Tutorial, is using a HWE-normalized PCA directly over the call matrix.

I have so far tried using the identity_by_descent implementation as well as the genetic_relatedness_matrix function in Hail to get a precomputed dissimilarity matrix for sklearn MDS, but none of these are netting me satisfying results. For example, this MDS over the GRM shows greater compatibility between the EUR and OWN samples but less separation with all others:

Screen Shot 2020-01-20 at 2 41 02 PM

I think some potential steps forward are:

  1. Compare the IBS calculations from Hail and PLINK directly
  2. Try PLINK PCA instead of IBS + MDS and see how this compares to the Hail results
  3. Dive deeper into the PCA results, possibly using loadings to get a better sense of what makes the HapMap samples a little different from the 1KG EUR samples (but not so different as other ethnicities)

I would imagine that these minor differences are very common when comparing populations from separate datasets so though 1 and 2 may help get equivalent results, 3 is probably worth spending the most time on in order to gain familiarity with methods for doing this in a real analysis.

eric-czech commented 4 years ago

After narrowing down possibilities, I eventually found that it was an issue in how I did the variant liftover between the 1KG and HapMap datasets. I was intentionally avoiding using the liftover utilities in Hail since the tutorial does it completely differently (not that it would necessarily be easier that way), and I missed some things that PLINK was doing. My best impression of the PLINK version now infers strand and reference allele flips before inverting the associated genotype calls. Now the EUR and 1KG populations are much more compatible. I also found that the Hail hwe_normalized_pca results and the PLINK PCA results are equivalent outside of PC scale/sign.

Current Hail Results:

Screen Shot 2020-01-22 at 4 59 17 PM

PLINK PCA Results:

Screen Shot 2020-01-22 at 5 00 46 PM