Local PCA/population structure (lostruct)

If you use this method, please cite Li & Ralph 2019:

Local PCA Shows How the Effect of Population Structure Differs Along the Genome, Han Li and Peter Ralph, Genetics January 1, 2019 vol. 211 no. 1 289-304.


@article {li2019local,
    author = {Li, Han and Ralph, Peter},
    title = {Local PCA Shows How the Effect of Population Structure Differs Along the Genome},
    volume = {211}, number = {1}, pages = {289--304}, year = {2019},
    doi = {10.1534/genetics.118.301747}, publisher = {Genetics}, journal = {Genetics}
}

Note: a prototype python implementation by Joseph Guhlin is available at github:jghulin/lostruct-py.

Installation

To install the package, make sure you have devtools (by doing install.packages("devtools")), and then running

install.packages("data.table")
devtools::install_github("petrelharp/local_pca/lostruct")
library(lostruct)

Note: the library is called lostruct.

Using the R package

The example scripts in the directories above mostly work without the R package. To start using the code on your own data, have a look at these files:

A quick example : in four lines of code, reads in chromosome 22 from a TPED, and does local PCA.
Setting up data : after documenting where the data are from, does local PCA on a small subset of the whole dataset, to establish how the functions work.
Drop in your own data and make a report with the scripts provided in the templated/ directory, modified from the Medicago analysis.

Prerequisites

To use the functions to read in windows from a VCF or BCF file, you will need bcftools.
To compile the example report, you probably want templater.

Quick start

The three steps are:

Compute the local PCA coordinates - done with eigen_windows().
Compute the distance matrix between windows - done with pc_dist() on the output of eigen_windows().
Visualize - whatever you want; MDS is implemented in R with cmdscale().

Data formats:

The function eigen_windows() basically wants your data in a numeric matrix, with one row per variant and one column per sample (so that x[i,j] is the number of alleles that sample j has at site i). If your data are already in this form, then you can use it directly.

We also have two methods to get data in from standard formats, tped and vcf. Neither are extensively tested: double-check what you are getting out of them.

TPED: the read_tped() function will read in a tped file and output a numeric matrix like the above. For instance:
```
snps <- read_tped("mydata.tped")
```
VCF, in memory: the read_vcf() function does the same. For instance:
```
snps <- read_vcf("mydata.vcf")
```
BCF, not in memory: eigen_windows() instead of a matrix can take a function that when called returns the submatrix corresponding to the appropriate window. (see documentation) Since we only need one window in memory at a time, this reduces the memory footprint. We use bcftools to extract the windows, so you need bcftools, and your vcf file must be converted to bcf (or bgzipped) and indexed. To do this, for instance, run:
```
bcftools convert -O b mydata.vcf > mydata.bcf
bcftools index mydata.bcf
```
Once you have this, the function vcf_windower() will create the window extractor function. For instance,
```
snps <- vcf_windower("my_data.bcf",size=1e3,type='snp')
snps(10)
```
will return the 10th window of 1,000 SNPs in the file my_data.vcf.

In any case, the next step is:

pcs <- eigen_windows(snps,k=2)
pcdist <- pc_dist(pcs,npc=2)

which gives you pcs, a matrix whose rows give the first two eigenvalues and eigenvectors for each window, and pcdist, the pairwise distance matrix between those windows.

Standalone code

Also included in this repository is code we used to analyze the datasets in the paper (before the R package was written). The general order to see the code in each directory is

recode : turn bases into numbers
PCA : find local PCs
distance : compute distance matrix between windows from local PCs
MDS : visualize the result

There are standalone examples for each of the three datasets studied:

POPRES (Homo sapiens, SNP chip data from a few worldwide populations)

Chromosome 1 is the example given. See also popres_example.R for an example of some steps using the package.

POPRES_SNPdata_recode12.R : recodes the TPED as numeric
POPRES_cov.R : computes covariance matrix for the entire chromsome 1
POPRES_PCA_win100.R : computes local PCs
POPRES_jackknife_var.R : estimates SE of local PCs
POPRES_distance.R : computes distance matrix from local PCs
POPRES_MDS.R : finds and plots MDS visualization of distance matrix

DPGP (Drosophila melanogaster population genome project)

Chromosome 3L is the example given .

DPGP_recode_and_cov.R : recodes data as numeric, removes individuals with more than 8% missing data, sites with more than 20% missing data, and computes whole-chromosome covariance matrix
DPGP_PCA_plot.R : plots PCs for entire 3L
DPGP_PCA_win103.R : computes local PCs along 3L in windows of 1000 SNPs
DPGP_var_between_win.R : computes variance of PCs between windows
DPGP_jackknife_var.R : does jackknife estimate of SE for PCs on windows of 1000 SNPs
DPGP_distance.R : computes distance matrix from local PCs
DPGP_MDS_1d.R : computes and plots MDS plots from distance matrix
DPGP_get_extreme_points.R : identifies extreme points (with interaction)
DPGP_combine_extremes_and_get_cov.R : combines each of three sets of extreme windows and computes covariances for each

Medicago (Medicago truncatula hapmap)

For Medicago, it calculates the pairwise distance for all 8 chromosome together and then apply MDS and use subset of the whole MDS result for each chromosome.

Medicago_VCF_recode.py : recodes VCF file as numeric
Medicago_recode_and_cov.R : computes covariance matrix for (entire) chromosome 1
Medicago_PCA_win104.R : computes local PCs for chromosome 1
Medicago_distance_all_chr.R : computes a distance matrix from PC information
Medicago_MDS.R : computes and plots MDS plots from the distance matrix

A note on implementation:

This method works through the genome doing something (PCA on the covariance matrix) one window at a time. Because of this, it can be frustratingly slow to first load the entire dataset into memory. There are several methods implemented here to avoid this; for instance, vcf_windower() which is used to compute PCs for the medicago data. The interface is via a function that takes an integer, n, and returns a data frame of the genomic data in the nth window.

petrelharp / local_pca

readme