If you use this method, please cite Li & Ralph 2019:
Local PCA Shows How the Effect of Population Structure Differs Along the Genome, Han Li and Peter Ralph, Genetics January 1, 2019 vol. 211 no. 1 289-304.
@article {li2019local,
author = {Li, Han and Ralph, Peter},
title = {Local PCA Shows How the Effect of Population Structure Differs Along the Genome},
volume = {211}, number = {1}, pages = {289--304}, year = {2019},
doi = {10.1534/genetics.118.301747}, publisher = {Genetics}, journal = {Genetics}
}
Note: a prototype python implementation by Joseph Guhlin is available at github:jghulin/lostruct-py.
To install the package, make sure you have devtools
(by doing install.packages("devtools")
),
and then running
install.packages("data.table")
devtools::install_github("petrelharp/local_pca/lostruct")
library(lostruct)
Note: the library is called lostruct
.
The example scripts in the directories above mostly work without the R package. To start using the code on your own data, have a look at these files:
A quick example : in four lines of code, reads in chromosome 22 from a TPED, and does local PCA.
Setting up data : after documenting where the data are from, does local PCA on a small subset of the whole dataset, to establish how the functions work.
Drop in your own data and make a report with the scripts provided in the templated/
directory,
modified from the Medicago analysis.
The three steps are:
eigen_windows()
.pc_dist()
on the output of eigen_windows()
.cmdscale()
.Data formats:
The function eigen_windows()
basically wants
your data in a numeric matrix,
with one row per variant and one column per sample
(so that x[i,j]
is the number of alleles that sample j
has at site i
).
If your data are already in this form, then you can use it directly.
We also have two methods to get data in from standard formats,
tped
and vcf
. Neither are extensively tested:
double-check what you are getting out of them.
TPED: the read_tped()
function will read in a tped file and output a numeric matrix like the above.
For instance:
snps <- read_tped("mydata.tped")
VCF, in memory: the read_vcf()
function does the same.
For instance:
snps <- read_vcf("mydata.vcf")
BCF, not in memory: eigen_windows()
instead of a matrix can take a function that when called returns the submatrix
corresponding to the appropriate window. (see documentation)
Since we only need one window in memory at a time, this reduces the memory footprint.
We use bcftools
to extract the windows, so you need bcftools,
and your vcf file must be converted to bcf (or bgzipped) and indexed.
To do this, for instance, run:
bcftools convert -O b mydata.vcf > mydata.bcf
bcftools index mydata.bcf
Once you have this,
the function vcf_windower()
will create the window extractor function. For instance,
snps <- vcf_windower("my_data.bcf",size=1e3,type='snp')
snps(10)
will return the 10th window of 1,000 SNPs in the file my_data.vcf
.
In any case, the next step is:
pcs <- eigen_windows(snps,k=2)
pcdist <- pc_dist(pcs,npc=2)
which gives you pcs
, a matrix whose rows give the first two eigenvalues and eigenvectors for each window,
and pcdist
, the pairwise distance matrix between those windows.
Also included in this repository is code we used to analyze the datasets in the paper (before the R package was written). The general order to see the code in each directory is
There are standalone examples for each of the three datasets studied:
Chromosome 1 is the example given. See also popres_example.R for an example of some steps using the package.
Chromosome 3L is the example given .
For Medicago, it calculates the pairwise distance for all 8 chromosome together and then apply MDS and use subset of the whole MDS result for each chromosome.
This method works through the genome doing something (PCA on the covariance matrix)
one window at a time. Because of this, it can be frustratingly slow to first load
the entire dataset into memory. There are several methods implemented here to avoid this;
for instance, vcf_windower()
which is used to compute PCs for the medicago data.
The interface is via a function that takes an integer, n
,
and returns a data frame of the genomic data in the n
th window.