QUILT

Current Version: 1.0.5 Release date: Sept 11, 2023

Changes in latest version

  1. Added support for CRAM files

For details of past changes please see CHANGELOG.

QUILT is an R and C++ program for rapid genotype imputation from low-coverage sequence using a large reference panel.

QUILT-HLA is an R and C++ program for rapid HLA imputation from low-coverage sequence using a labelled reference panel.

Please use this README for general information about QUILT and QUILT-HLA, and specific information about QUILT. Please see the QUILT-HLA README for specific details about QUILT-HLA.

Table of contents

  1. Introduction
  2. Installation
    1. github
    2. conda
  3. Quick start run
  4. Input and output formats
    1. Input
    2. Output
  5. Help, options and parameters
  6. Separating reference panel processing from imputation
  7. Important parameters that influence run time and accuracy
  8. Examples
  9. License
  10. Citation
  11. Testing
  12. Bug reports

Introduction

QUILT is a program for rapid diploid genotype imputation from low-coverage sequence using a large reference panel. Statistically, the QUILT model works on a per-read basis, and is base quality aware, meaning it can accurately impute from diverse inputs, including noisy long read sequencing (e.g. Oxford Nanopore Technologies), and barcoded Illumina sequencing (e.g. Haplotagging). Accuracy using QUILT and lc-WGS meets or exceeds other methods for lc-WGS imputation, particularly for high diversity regions or genomes (e.g. MHC, or non-human species). Relative to DNA genotyping microarrays, QUILT offers improved accuracy at reduced cost, particularly for diverse populations, with the potential for accuracy to nearly double at rare SNPs (e.g. 2.0X lc-WGS vs microarrays for SNPs at 0.1% frequency). Further details and detailed evaluations are available in the QUILT paper.

Installation

QUILT is available to download either through this github repository, or through conda.

github

First, install STITCH, which is installed in a similar way to QUILT, as described on the STITCH website. Next, install QUILT as follows

git clone --recursive https://github.com/rwdavies/QUILT.git
cd QUILT
./scripts/install-dependencies.sh
cd releases
wget https://github.com/rwdavies/quilt/releases/download/1.0.5/QUILT_1.0.5.tar.gz ## or curl -O
R CMD INSTALL QUILT_1.0.5.tar.gz

conda

QUILT (as r-quilt) can be installed using conda. Full tutorials can be found elsewhere, but briefly, something like this should work

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda install r-quilt -c defaults -c bioconda -c conda-forge
source activate
R -e 'library("QUILT")'

Note that the command-line script QUILT.R is currently not included with the bioconda installation, so from the command line you can either run something like R -e 'library("QUILT"); QUILT(chr="chr19", etc)', or clone the repo to get QUILT.R.

Quick start run

A quick start to ensure QUILT is properly installed and working can be performed using the following

Download example data package, containing 1000 Genomes haplotypes, and NA12878 bams

wget http://www.stats.ox.ac.uk/~rdavies/QUILT_example_2021_01_15A.tgz ## or curl -O
tar -xzvf QUILT_example_2021_01_15A.tgz

Perform imputation. Note that reference panel data can be processed separately to speed up repeated imputation of the same region in independent jobs, see Separating reference panel processing from imputation.

rm -r -f quilt_output
./QUILT.R \
--outputdir=quilt_output \
--chr=chr20 \
--regionStart=2000001 \
--regionEnd=2100000 \
--buffer=10000 \
--bamlist=package_2021_01_15A/bamlist.1.0.txt \
--posfile=package_2021_01_15A/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.2100000.posfile.txt \
--phasefile=package_2021_01_15A/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.2100000.phasefile.txt \
--reference_haplotype_file=package_2021_01_15A/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.2100000.noNA12878.hap.gz \
--reference_legend_file=package_2021_01_15A/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.2100000.noNA12878.legend.gz \
--genetic_map_file=package_2021_01_15A/CEU-chr20-final.b38.txt.gz \
--nGen=100 \
--save_prepared_reference=TRUE

Successful completion of this run results in a VCF at quilt_output/quilt.chr20.2000001.2100000.vcf.gz. For a slightly longer version of this example, see Examples.
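As a quick sanity check on the result, the sketch below counts the variant records in the output VCF. It assumes only that gzip and grep are installed; the function name count_vcf_records is hypothetical and not part of QUILT.

```shell
# Hypothetical helper (not part of QUILT): count the variant records,
# i.e. the lines that are not header lines starting with '#',
# in a gzip- or bgzip-compressed VCF.
count_vcf_records() {
    gzip -dc "$1" | grep -vc '^#'
}

# Example usage after the quick start run:
# count_vcf_records quilt_output/quilt.chr20.2000001.2100000.vcf.gz
```

A count of zero, or an error from gzip, suggests the run did not complete as expected.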

Input and output formats

Input

For all of these, it can be useful to take a look at the example files provided as part of the quick start example above.

Output

Note that in QUILT, genotype posteriors (GP) and dosages (DS) are taken from the main Gibbs sampling, while the phasing results (GT and HD) are taken from an additional special phasing Gibbs sample. As such, phasing results (GT and HD) might not be consistent with genotype information (GP and DS). If consistency is necessary, note that you can create a consistent GP and DS from HD.
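One way to do this, sketched below, treats the two HD values as independent per-haplotype probabilities of carrying the alternate allele. That independence assumption and the helper name hd_to_gp_ds are illustrative choices, not something taken from the QUILT code.

```shell
# Hypothetical helper (not part of QUILT): derive a dosage (DS) and genotype
# posteriors (GP) that are consistent with a pair of haploid dosages (HD).
# Assumption: h1 and h2 are independent probabilities that haplotype 1 and
# haplotype 2 carry the alternate allele.
hd_to_gp_ds() {
    awk -v h1="$1" -v h2="$2" 'BEGIN {
        gp0 = (1 - h1) * (1 - h2)               # P(0/0): neither haplotype is alt
        gp1 = h1 * (1 - h2) + (1 - h1) * h2     # P(0/1): exactly one haplotype is alt
        gp2 = h1 * h2                           # P(1/1): both haplotypes are alt
        ds  = h1 + h2                           # expected number of alt alleles
        printf "DS=%.3f GP=%.3f,%.3f,%.3f\n", ds, gp0, gp1, gp2
    }'
}

# Example usage:
# hd_to_gp_ds 0.9 0.2
```

Under this construction DS equals GP(0/1) + 2 x GP(1/1), so the recomputed fields agree with each other and with the phased haploid dosages.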

Per-SNP annotation is available as follows

##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased genotypes">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Posterior genotype probability of 0/0, 0/1, and 1/1">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Diploid dosage">
##FORMAT=<ID=HD,Number=2,Type=Float,Description="Haploid dosages">

SNP annotation information

##INFO=<ID=EAF,Number=.,Type=Float,Description="Estimated allele frequency">
##INFO=<ID=HWE,Number=.,Type=Float,Description="Hardy-Weinberg p-value">
##INFO=<ID=ERC,Number=.,Type=Float,Description="Estimated number of copies of the reference allele from the pileup">
##INFO=<ID=EAC,Number=.,Type=Float,Description="Estimated number of copies of the alternate allele from the pileup">
##INFO=<ID=PAF,Number=.,Type=Float,Description="Estimated allele frequency using the pileup of reference and alternate alleles">

Separating reference panel processing from imputation

For large reference panels, and for many jobs involving imputing few samples, it can be computationally efficient to pre-process the reference panel once, save the output, and reuse it for multiple independent runs. Here is an example of how to do this for the quick start example. Note that any parameters available jointly in QUILT and QUILT_prepare_reference that inform how the reference panel is processed must be set in QUILT_prepare_reference (for example, maxRate bounds the recombination rate, and must be set when running QUILT_prepare_reference, as the recombination rate is processed in this step).

First, to re-format the reference panel

rm -r -f quilt_output
./QUILT_prepare_reference.R \
--outputdir=quilt_output \
--chr=chr20 \
--nGen=100 \
--reference_haplotype_file=package_2021_01_15A/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.2100000.noNA12878.hap.gz \
--reference_legend_file=package_2021_01_15A/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.2100000.noNA12878.legend.gz \
--genetic_map_file=package_2021_01_15A/CEU-chr20-final.b38.txt.gz \
--regionStart=2000001 \
--regionEnd=2100000 \
--buffer=10000

Second, to perform imputation

./QUILT.R \
--outputdir=quilt_output \
--chr=chr20 \
--regionStart=2000001 \
--regionEnd=2100000 \
--buffer=10000 \
--bamlist=package_2021_01_15A/bamlist.1.0.txt \
--posfile=package_2021_01_15A/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.2100000.posfile.txt \
--phasefile=package_2021_01_15A/ALL.chr20_GRCh38.genotypes.20170504.chr20.2000001.2100000.phasefile.txt

Note that when running multiple QUILT jobs against the same reference data, it is useful to set output_filename so that each job writes to its own file rather than the default filename, and to keep the temporary directories used independent (which is the default behaviour of tempdir).
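A minimal sketch of this pattern wraps the imputation command in a shell function that gives each job its own output file. The function name run_quilt_batch, the QUILT_CMD override, the bamlist.<batch>.txt naming and the output path are all assumptions made for illustration. The posfile and phasefile arguments are left out on the assumption that they are optional, so check ./QUILT.R --help before relying on this.

```shell
# Hypothetical helper: impute one batch of samples against the reference
# panel already prepared in quilt_output by QUILT_prepare_reference.R.
# QUILT_CMD is an assumed override, e.g. QUILT_CMD=echo prints the arguments
# instead of running QUILT (a dry run).
run_quilt_batch() {
    batch="$1"   # assumed naming: bamlist.<batch>.txt lists this batch's BAMs
    "${QUILT_CMD:-./QUILT.R}" \
        --outputdir=quilt_output \
        --chr=chr20 \
        --regionStart=2000001 \
        --regionEnd=2100000 \
        --buffer=10000 \
        --bamlist="bamlist.${batch}.txt" \
        --output_filename="quilt_output/quilt.${batch}.vcf.gz"   # assumed path, one per job
}

# Example usage (each call could equally be submitted as a separate job):
# for batch in batch1 batch2; do run_quilt_batch "$batch"; done
```

Keeping the region and outputdir arguments identical to those used in the preparation step is what lets each job find the prepared reference data.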

Help, options and parameters

For a full list of options, query ?QUILT::QUILT, or alternatively, type

./QUILT.R --help

Important parameters that influence run time and accuracy

These parameters are most likely to influence run time and accuracy

Examples

License

QUILT and the code in this repo are available under a GPL3 license. For more information please see the LICENSE.

Citation

Davies R. W., Kucka M., Su D., Shi S., Flanagan M., Cunniff C. M., Chan Y. F., Myers S. Rapid genotype imputation from sequence with reference panels. In press, Nature Genetics

Testing

Tests in QUILT are split into unit and acceptance tests, run using ./scripts/test-unit.sh and ./scripts/test-acceptance.sh respectively. To run all tests use ./scripts/all-tests.sh, which also builds and installs a release version of QUILT. To make compilation go faster do something like export MAKE="make -j 8".

Bug reports

The best way to get help is to submit a bug report on GitHub in the Issues section. Please also use the Issues section if you have a more general question; such issues will be left open for others to see. Similarly, please check the existing issues before posting to see if your issue has already been addressed.

For more detailed questions or other concerns please contact Robert Davies (robertwilliamdavies@gmail.com).