popgenmethods / smcpp

SMC++ infers population history from whole-genome sequence data.
GNU General Public License v3.0
149 stars 32 forks source link

.. image:: https://travis-ci.org/popgenmethods/smcpp.svg?branch=master :target: https://travis-ci.org/popgenmethods/smcpp

SMC++ is a program for estimating the size history of populations from whole genome sequence data.

.. contents:: :depth: 2

Quick start guide

  1. Follow the installation instructions_.

  2. Convert your VCF(s) to the SMC++ input format with vcf2smc_::

    $ smc++ vcf2smc my.data.vcf.gz out/chr1.smc.gz chr1 Pop1:S1,S2

    This command will parse data for the contig chr1 for samples S1 and S2 which are members of population Pop1. You should run this once for each independent contig in your dataset, producing one SMC++ output file per contig.

  3. Fit the model using estimate_::

    $ smc++ estimate -o analysis/ 1.25e-8 out/example.chr*.smc.gz

      The first mandatory argument, 1.25e-8, is the per-generation mutation rate. The remaining arguments are the data files generated in the previous step. Depending on sample size and your machine, the fitting procedure should take between a few minutes and a few hours. The fitted model will be stored in JSON format in analysis/model.final.json.

  4. Visualize the results using plot_::

    $ smc++ plot plot.pdf analysis/model.final.json

SMC++ can also estimate and plot joint demographies from pairs of populations; see split_.

.. _latest release: https://github.com/popgenmethods/smcpp/releases/latest

Installation instructions

Beginning with v1.15.4, SMC++ is distributed as a Docker image. (Anaconda support has been discontinued.) To run the latest version of the program, type::

$ docker run --rm -v $PWD:/mnt terhorst/smcpp:latest [ARGUMENTS]

To run a specific version of the program, change latest to version- followed by the version number. For example, to run v1.15.4, type::

$ docker run --rm -v $PWD:/mnt terhorst/smcpp:version-1.15.4 [ARGUMENTS]

If docker is unavailable (for example, in a cluster environment where you do not have admin privileges), some users have reported success using singularity_ instead.

.. _singularity: https://sylabs.io/guides/3.7/user-guide/

Build instructions

SMC++ requires the following libraries and executables in order compile and run:

On Ubuntu (or Debian) Linux, the library requirements may be installed using the commmand::

$ sudo apt-get install -y python3-dev libgmp-dev libmpfr-dev libgsl0-dev

On OS X, the easiest way to install them is using Homebrew_::

$ brew install mpfr gmp gsl

After installing the requirements, SMC++ may be built by running::

$ pip install git+https://github.com/popgenmethods/smcpp

(Alternatively, git clone the repository and run the usual python setup.py install. You must clone. Downloading the source tarball will not work.)

.. _Homebrew: http://brew.sh .. _gmp: http://gmplib.org .. _mpfr: http://mpfr.org .. _gsl: https//www.gnu.org/software/gsl/

Note for OS X users

Versions of Clang shipping with Mac OS X do not currently support OpenMP_. In order to build SMC++ on OS X you must use a compiler that does, such as gcc::

$ brew install gcc
$ CC=gcc-6 CXX=g++-6 python setup.py install

.. _OpenMP: http://openmp.org

Virtual environment

SMC++ pulls in a fair number of Python dependencies. If you prefer to keep this separate from your main Python installation, or do not have root access on your system, you may wish to install SMC++ inside of a virtual environment_. To do so, first create and activate the virtual environment::

$ virtualenv -p python3 <desired location>
$ source <desired location>/bin/activate

Then, install SMC++ as described above.

.. _virtual environment: http://docs.python-guide.org/en/latest/dev/virtualenvs/

Usage

SMC++ comprises several subcommands which are accessed using the syntax::

$ smc++ <subcommand>

where <subcommand> is one of the following:

vcf2smc

This subcommand converts (biallelic, diploid) VCF data to the format used by SMC++.

Required arguments ^^^^^^^^^^^^^^^^^^

  1. An indexed VCF file <http://www.htslib.org/doc/tabix.html>_.
  2. An output file. Appending the .gz extension will cause the output to be compressed; the estimate_ command can read from both compressed and uncompressed data sources.
  3. A contig name. Each call to vcf2smc_ processes a single contig. VCFs containing multiple contigs should be processed via multiple separate runs.
  4. A list of population(s) and samples. Each population has an id followed by a comma-separated list of sample IDs (column names in the VCF). Up to two populations are supported.

For example, to convert contig chr1 of vcf.gz using samples NA12878 and NA12879 of population CEU, saving to chr1.smc.gz, use::

$ smc++ vcf2smc vcf.gz chr1.smc.gz chr1 CEU:NA12878,NA12879

Optional arguments ^^^^^^^^^^^^^^^^^^

Composite likelihood ^^^^^^^^^^^^^^^^^^^^ By varying -d over the same VCF, you can create distinct data sets for estimation. This is useful for forming composite likelihoods. For example, the following command will create three data sets from contig chr1 of myvcf.gz, by varying the identity of the distinguished individual and treating the remaining two samples as "undistinguished":

.. code-block:: bash

for i in {7..9}; 
    do smc++ vcf2smc -d NA1287$i NA1287$i myvcf.gz out.$i.txt chr1 NA12877 NA12878 NA12890; 
done

You can then pass these data sets into estimate_::

$ smc++ estimate -o output/ out.*.txt

SMC++ treats each file out.*.txt as an independently evolving sequence (i.e., a chromosome); the likelihood is simply the product of SMC++ likelihoods over each of the data sets. In the example above where the data sets are generated from the same chromosome but different distinguished individuals (different -d), this independence assumption is violated, leading to a so-called composite likelihood. The advantage of this approach is that it incorporates genealogical information from additional distinguished individuals into the analysis, potentially leading to improved estimates.

Since (a portion of) the computational and memory requirements of SMC++ scale linearly with the total analyzed sequence length, it is generally advisable to composite over a relatively small number of individuals. In practice we generally use 2-10 individuals, depending on genome length, sample size, etc., and have found that this leads to improved estimation without causing significant degeneracy in the likelihood.

Caveats ^^^^^^^ vcf2smc targets a common use-case but may not be sufficient for all users. In particular, you should be aware that:

Those wishing to implement their own custom conversion to the SMC++ data format should see the input data format_ description below.

estimate

This command will fit a population size history to data. The basic usage is::

$ smc++ estimate <mutation rate> <data file> [<data file> ...]

Required arguments ^^^^^^^^^^^^^^^^^^

  1. The per-generation mutation rate. Scientific notation is acceptable: use e.g. 1e-8 in place of .00000001.
  2. One or more SMC++-formatted data files, generated by vcf2smc_, for example.

Optional arguments ^^^^^^^^^^^^^^^^^^

A number of other arguments concerning technical aspects of the fitting procedure exist. To see them, pass the -h option to estimate.

cv

This command is similar to estimate, with the difference that it uses cross-validation to obtain sensible model parameters for use during estimation. The syntax and options for this command are nearly identical to estimate:

$ smc++ cv <mutation rate> <data file> [<data file> ...]

The optional --folds parameter can be used to specify the number of folds used for performing k-fold cross validation. The default is 2 and should be set higher in cases where you have more data.

Important: The cv command was not part of the original paper; in particular, it has not been peer-reviewed. Although it has proved useful in some contexts, manual parameter tuning may still be necessary.

plot

This command plots fitted size histories. The basic usage is::

$ smc++ plot plot.png model1.json model2.json [...] modeln.json

where model*.json are fitted models produced by estimate.

Required arguments ^^^^^^^^^^^^^^^^^^

  1. An output file-name. The output format is determined by the extension (.pdf, .png, .jpeg, etc.)
  2. One or more JSON-formatted SMC++ models (the output from estimate_).

Optional arguments ^^^^^^^^^^^^^^^^^^

split

This command fits two-population clean split models using marginal estimates produced by estimate_. To use split, first estimate each population marginally using estimate::

$ smc++ vcf2smc my.vcf.gz data/pop1.smc.gz <contig> pop1:ind1_1,ind1_2
$ smc++ vcf2smc my.vcf.gz data/pop2.smc.gz <contig> pop2:ind2_1,ind2_2
$ smc++ estimate -o pop1/ <mu> data/pop1.smc.gz
$ smc++ estimate -o pop2/ <mu> data/pop2.smc.gz

Next, create datasets containing the joint frequency spectrum for both populations::

$ smc++ vcf2smc my.vcf.gz data/pop12.smc.gz <contig> pop1:ind1_1,ind1_2 pop2:ind2_1,ind2_2
$ smc++ vcf2smc my.vcf.gz data/pop21.smc.gz <contig> pop2:ind2_1,ind2_2 pop1:ind1_1,ind1_2

Finally, run split to refine the marginal estimates into an estimate of the joint demography::

$ smc++ split -o split/ pop1/model.final.json pop2/model.final.json data/*.smc.gz
$ smc++ plot joint.pdf split/model.final.json

posterior

This command will export (and optionally visualize) the posterior distribution of the time to most recent common ancestor (TMRCA) in the distinguished pair from the given data set.

The output file is the result of::

>>> numpy.savez(output, hidden_states=hs, 
                **{'file1'=gamma1, 'file1_sites'=sites1, ...})

where:

There will be a gamma/sites entry for each data set decoded.

Required arguments ^^^^^^^^^^^^^^^^^^

Optional arguments ^^^^^^^^^^^^^^^^^^

cite

This command prints plain- and BibTex-formatted citation information for the accompanying paper_ to the console.

.. _accompanying paper: http://www.nature.com/ng/journal/vaop/ncurrent/ng.3748

Tips for using SMC++

SMC++ has several regularization parameters which affect the quality of the fits obtained using estimate and split. The default settings have proved useful for analyzing high coverage human sequence data from a few hundred individuals. For other types of data, you will likely need to experiment with different values of these parameters in order to obtain good estimates.

A useful diagnostic for understanding the final output of SMC++ are the sequence of intermediate estimates .model.iter<k>.json which are saved by --estimate in the --output directory. By plotting these, you can get a sense of whether the optimizer is overfitting and requires additional regularization.

Frequently asked questions

  1. The binary installer dies with the error message::

    ImportError: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ...).

    How can I fix this?

    Users of RedHat/CentOS clusters commonly report this error. It is due to a glibc version mismatch between your system and the build server used to create the binary installers. We are not able to create binaries for older versions of glibc. Your options are to either a) upgrade glibc on your system (which would probably require upgrading your operating system); or b) build SMC++ yourself by following the build instructions_. Please note that linking a different version of glibc at runtime is not supported, and will likely cause random crashes.

  2. SMC++ claims that my population crashed in the very recent past. What's going on?

    Typically this is due to long runs of homozygosity (ROH) in the data, which can arise for one of several reasons:

    1. The population legitimately experienced a recent crash, leading to inbreeding;
    2. One or more selective sweeps occured; or
    3. Uncalled regions in your VCF were not marked as such before running vcf2smc_.

    1 represents real signal, while #2 and #3 should be filtered out using the -m

    option of vcf2smc and/or the -c option of estimate.

  3. How do I get the estimated recombination rate?

    The model.final.json output file contains fields named rho and N0. rho is the estimated population-scaled recombination rate per base-pair. To convert it to units of generations, multiply by 2 * N0.

What to do if you encounter trouble

SMC++ is under active development and you may encounter difficulties in trying to use it. Always make sure that you have upgraded to the latest version <https://github.com/popgenmethods/smcpp/releases/latest>_, as the bug you have encountered may have already been fixed. If that does not work, then:

In both cases, you will receive a faster response if you include as much detail as possible about your data set (sample size, # of contigs, etc.), system and, where applicable, the .debug.txt log file saved by SMC++ in the output directory specified to the estimate command.

File formats

Input data format

The data files should be ASCII text and can optionally be gzipped. The format of each line of the data file is as follows::

<span> <d1> <u1> <n1> [<d1> <u2> <n2>]

Explanation of each column:

For example, consider the following set of genotypes at a set of 10 contiguous bases on three diploid individuals in one population::

dist.   ..1..N...2
        .....N...1
        2N....+...

The distinguished individual is row one. A . indicates that the individual is homozygous for the ancestral allele, while an integer indicates that that individual possesses (1,2) copies of the derived allele. An N indicates a missing genotype at that position. Finally, the + in column seven indicates that individual three possessed the ancestral allele on one chromosome, and had a missing observation on the other chromosome (this would be coded as 0/. in a VCF).

The SMC++ format for this input file is::

1   0   2   4
1   0   0   2
1   1   0   4
2   0   0   4
1   -1  0   2
1   0   0   3
2   0   0   4
1   2   1   4

The data files also include a custom metadata header with some additional information about the populations. For this reason, it is advised to please use the included vcf2smc_ tool in order to translate from the VCF to SMC format.

Output data format

Upon completion, SMC++ will write a JSON-formatted <https://en.wikipedia.org/wiki/JSON>_ model file into the into the analysis directory. The file is human-readable and contains various parameters related to the fitting procedure.