petrelharp / local_pca

Methods for examining PCA locally along the genome.
71 stars 13 forks source link

Local PCA/population structure (lostruct)

If you use this method, please cite Li & Ralph 2019:

Local PCA Shows How the Effect of Population Structure Differs Along the Genome, Han Li and Peter Ralph, Genetics January 1, 2019 vol. 211 no. 1 289-304.


@article {li2019local,
    author = {Li, Han and Ralph, Peter},
    title = {Local PCA Shows How the Effect of Population Structure Differs Along the Genome},
    volume = {211}, number = {1}, pages = {289--304}, year = {2019},
    doi = {10.1534/genetics.118.301747}, publisher = {Genetics}, journal = {Genetics}
}

Note: a prototype python implementation by Joseph Guhlin is available at github:jghulin/lostruct-py.

Installation

To install the package, make sure you have devtools (by doing install.packages("devtools")), and then running

install.packages("data.table")
devtools::install_github("petrelharp/local_pca/lostruct")
library(lostruct)

Note: the library is called lostruct.

Using the R package

The example scripts in the directories above mostly work without the R package. To start using the code on your own data, have a look at these files:

Prerequisites

Quick start

The three steps are:

  1. Compute the local PCA coordinates - done with eigen_windows().
  2. Compute the distance matrix between windows - done with pc_dist() on the output of eigen_windows().
  3. Visualize - whatever you want; MDS is implemented in R with cmdscale().

Data formats:

The function eigen_windows() basically wants your data in a numeric matrix, with one row per variant and one column per sample (so that x[i,j] is the number of alleles that sample j has at site i). If your data are already in this form, then you can use it directly.

We also have two methods to get data in from standard formats, tped and vcf. Neither are extensively tested: double-check what you are getting out of them.

In any case, the next step is:

pcs <- eigen_windows(snps,k=2)
pcdist <- pc_dist(pcs,npc=2)

which gives you pcs, a matrix whose rows give the first two eigenvalues and eigenvectors for each window, and pcdist, the pairwise distance matrix between those windows.

Standalone code

Also included in this repository is code we used to analyze the datasets in the paper (before the R package was written). The general order to see the code in each directory is

  1. recode : turn bases into numbers
  2. PCA : find local PCs
  3. distance : compute distance matrix between windows from local PCs
  4. MDS : visualize the result

There are standalone examples for each of the three datasets studied:

POPRES (Homo sapiens, SNP chip data from a few worldwide populations)

Chromosome 1 is the example given. See also popres_example.R for an example of some steps using the package.

DPGP (Drosophila melanogaster population genome project)

Chromosome 3L is the example given .

Medicago (Medicago truncatula hapmap)

For Medicago, it calculates the pairwise distance for all 8 chromosome together and then apply MDS and use subset of the whole MDS result for each chromosome.

A note on implementation:

This method works through the genome doing something (PCA on the covariance matrix) one window at a time. Because of this, it can be frustratingly slow to first load the entire dataset into memory. There are several methods implemented here to avoid this; for instance, vcf_windower() which is used to compute PCs for the medicago data. The interface is via a function that takes an integer, n, and returns a data frame of the genomic data in the nth window.