popgenmethods / pyrho

Fast inference of fine-scale recombination rates based on fused-LASSO
MIT License
42 stars 4 forks source link

pyrho

Fast demography-aware inference of fine-scale recombination rates based on fused-LASSO

Table of Contents

NB: in version 0.1.0 we switched everything from coalescent units to natural scale -- that is inputs should now be in terms of Ne, generations, and per-base, per-generation mutation rates. The output is now also automatically scaled to be the per-generation recombination rate.

Human Recombination Maps

We have inferred recombination maps for each of the 26 populations in phase 3 of the 1000 Genomes Project. Those maps are available at this link including maps inferred for the two most recent genome builds hg19/GRCh37 and hg38/GRCh38.

The recombination maps for hg38/GRCh38 are now also available for simulations using the wonderful stdpopsim package.

Human Population Sizes

When making the recombination maps for the 1000 Genomes Project populations we used smc++ to infer population size histories for each population. Those size histories are plotted in Figure 2 of the pyrho paper, and the data used to make those plots are available in smcpp_popsizes_1kg.csv. The x column is time in years assuming a generation time of 29 years, and the y column contains the population size in Ne.

Installation

pyrho is compatible with both python 2 and python 3. pyrho makes use of a number of external packages including the excellent numba, msprime, and cyvcf2 packages. As such it is recommended to install pyrho in a virtual environment.

If using conda this can be accomplished by running

conda create -n my-pyrho-env
conda activate my-pyrho-env

or using

virtualenv my-pyrho-env
source my-pyrho-env/bin/activate

Once you have set up and activated your virtual environment, you must first install ldpop. Change to a directory where you want to store ldpop and then run

git clone https://github.com/popgenmethods/ldpop.git ldpop
pip install ldpop/

Note that this will create a directory ldpop.

pyrho makes use of msprime, which requires gsl and hdf. pyrho also has a dependency on openssl. If you do not have these installed, these can be installed using apt-get, yum, conda, brew etc...

For example, to install openssl on Ubuntu run:

sudo apt-get install libssl-dev

You will also need to have cython installed. If you do not yet have it installed, run

pip install cython

You should be able to then just clone and install pyrho by running

git clone https://github.com/popgenmethods/pyrho.git pyrho
pip install pyrho/

You can check that everything is running smoothly by running

python -m pytest pyrho/tests/tests.py

NB: the first time you run pyrho, numba will compile and cache a number of functions, which can take up to ~30 seconds.

Usage

pyrho has a command line interface and consists of a number of separate commands. A typical workflow is to first use make_table to build a lookup table and then use hyperparam to find reasonable hyperparameter settings for your data. Finally, use optimize to infer a fine-scale recombination map from data. There is also a command, compute_r2, that computes statistics of the theoretical distribution of r2.

make_table

Before performing any inference, you must first compute a lookup table. A standard use case would be

pyrho make_table --samplesize <n> --approx  --moran_pop_size <N> \
--numthreads <par> --mu <mu> --outfile <outfile> \
--popsizes <size1>,<size2>,<size3> --epochtimes <breakpoint1>,<breakpoint2>

which indicates that we should compute a lookup table for a sample of size <n> , from a population where at present the size is <size1> , at <breakpoint1> generations in the past the size was <size2> and so on, with a per-generation mutation rate <mu> The --numthreads option tells pyrho to use <par> processors when computing the lookup table. Finally, --approx with --moran_pop_size tells pyrho to compute an approximate lookup table for a larger sample size <N> and then downsample to a table for size <n> . In general <N> should be about 25-50% larger than <n> . Without using the --approx flag, pyrho can compute lookup tables for <n> < ~50, whereas with the --approx flag, pyrho can handle sample sizes in the hundreds (e.g., <n> = 200, <N> = 256) with little loss in accuracy as long as <n> << <N>.

The output is an hdf format table containing all of the pre-computed likelihoods needed to run hyperparam, optimize, and compute_r2.

Note that make_table can consume significant amounts of memory (N=210 requires about 20G of RAM using the --approx flag).

To see a full list of options and their meaning, run pyrho make_table --help.

hyperparam

After computing a lookup table with make_table, it is a good idea to find reasonable settings for the main hyperparameters of pyrho: the window size and the smoothness penalty. This command simulates data and performs optimization under a number of different hyperparameter settings and outputs the accuracy in terms of a number of different metrics. A typical usage is

pyrho hyperparam --samplesize <n> --tablefile <make_table_output> \
--mu <mu> --ploidy <ploidy> \
--popsizes <size1>,<size2>,<size3> --epochtimes <breakpoint1>,<breakpoint2>  \
--outfile <output_file>

where <n> is your haploid sample size, <make_table_output> is the output of running make_table, and <mu>, <size1>..., and <breakpoint1>... are as above. <ploidy> should be set to 1 if using phased data and 2 for unphased genotype data. Ploidies other than 1 or 2 are not currently supported.

The output is a table containing various measures of accuracy of the inferred maps compared to the recombination maps used in the simulations (drawn from the hapmap recombination map). The optimal hyperparameters will depend on your application -- it may be important to maximize a particular measure of correlation or to minimize the L2 norm between the true and inferred maps. In any case, you can use the results in the output table to choose the hyperparameters for running optimize.

To see a full list of options and their meaning, run pyrho hyperparam --help.

optimize

After computing a lookup table using make_table and choosing reasonable hyperparameters (optionally using hyperparam) you are ready to infer recombination maps from real data.

pyrho supports data in fasta format and LDhat's sites and locs format, but it is easiest to directly use VCF formatted data. pyrho supports VCF, bgzipped VCF, and BCF format files. If using LDhat's formats see the note about using sites and locs.

A typical usage is

pyrho optimize --vcffile <data> --windowsize <w> --blockpenalty <bpen> \
--tablefile <make_table_output> --ploidy <ploidy> --outfile <output_file> \
--numthreads <par>

with <data> being a VCF, bgzipped VCF, or BCF file containing a single chromosome, <w> and <bpen> being hyperparameters chosen using hyperparam, and <ploidy> should be 1 for phased data and 2 for unphased data.

The output file has three columns -- the first column is the zero-indexed start of an interval, the second column is the end of the interval (non-inclusive), and the third column is r, which is the per-base per-generation recombination rate in that interval.

To see a full list of options and their meaning, run pyrho optimize --help.

pyrho optimize can be slow if there is a lot of missing data. Consider removing SNPs with more than a few missing genotypes to increase the speed. Another option is to use the --fast_missing option. This option uses a bit more memory, and throws away some data, but can be substantially faster for datasets with lots of missing data, and does not require pre-filtering SNPs for missingness.

A note about LDhat format input

The preferred input format for pyrho is a VCF format file containing a single chrmosome. LDhat format files may also be used with the following important caveat.

If using LDhat formatted data (i.e., a sites file and a locs file) note that we use a slighly different convention than LDhat (sorry!). For unphased data, LDhat uses the convention 0 = homozygous reference, 1 = homozygous alternate, 2 = heterozygous. We do not use this convention.

We use 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate, and N = missing. That is, each entry should be the number of alternate alleles, or N for missing. We otherwise follow the formatting (including the headers) of LDhat as described here.

compute_r2

compute_r2 computes statistics of the theoretical distribution of r2 using a lookup table generated using make_table. In particular, it can compute the mean and/or quantiles of the distribution of r2.

A typical usage is

pyrho compute_r2 --tablefile <make_table_output> --samplesize <n> \
--quantiles 0.25,0.5,0.75 --compute_mean

which will compute the 25th, 50th, and 75th percentiles as well as the mean of the distribution of r2 for the lookup table stored in <make_table_output>. It is possible to compute statistics for a smaller sample size by setting <n> to be less than the sample size for which <make_table_output> was computed.

Example

The example folder contains a well-commented shell script example.sh which runs through a typical use-case, taking a VCF file and the output of smc++ and ultimately computing a fine-scale recombination map.

FAQ

Does pyrho return the population-scaled recombination rate "rho" or the per-generation recombination rate "r"?

Short answer: pyrho infers the per-generation, per-base recombination rate "r".

Long answer: internally pyrho infers "rho" and then uses the user-specified Ne to convert "rho" to "r". This scaling works well in humans, but has not been tested extensively in species with different mutation rates or ratios of mutation rate to recombination rate. Overall, differences across the genome will be stable (i.e., relative rates) but the absolute scaling should be taken with some degree of skepticism, especially when working with species whose evolutionary parameters differ from those of humans.

Help! pyrho optimize has been running for 3 days and is not done yet!

Missing data makes the default implementation of pyrho optimize run very slowly. To get reasonable runtimes either remove sites with more that a few missing genotypes or use the --fast-missing flag. The --fast-missing flag causes pyrho to only use individuals who are genotyped at both loci when calculating two-locus likelihoods. This effectively throws away the information provided by individuals who are genotyped at only one of the two loci. This information is not particularly useful for inferring recombination rates, and ignoring it can result in a significant speedup in the presence of missing data.

Citation

If you use pyrho please cite

Spence, J.P. and Song, Y.S. Inference and analysis of population-specific fine-scale recombination maps across 26 diverse human populations. Science Advances, Vol. 5, No. 10, eaaw9206 (2019).

and

Kamm, J.A.*, Spence, J.P.*, Chan, J., and Song, Y.S. Two-locus likelihoods under variable population size and fine-scale recombination rate estimation. Genetics, Vol. 203 No. 3 (2016) 1381-1399.