Multi-step pipeline dedicated to genetic imputation, from simulation to validation
nf-core/phaseimpute is a bioinformatics pipeline to phase and impute genetic data. Several steps are available, each corresponding to a dedicated mode.
The phaseimpute pipeline consists of the following main steps:
- Panel preparation: Phasing, QC, variant filtering, variant annotation of the reference panel
- Imputation: Impute the target dataset on the reference panel
- Simulate: Simulation of the target dataset from high quality target data
- Concordance: Concordance between the target dataset and a truth dataset
> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with `-profile test` before running the workflow on actual data.
The basic usage of this pipeline is to impute a target dataset based on a phased panel. First, prepare a samplesheet with your input data that looks as follows:
`samplesheet.csv`:

```csv
sample,file,index
SAMPLE_1X,/path/to/.<bam/cram>,/path/to/.<bai,crai>
```
Each row represents a BAM or CRAM file with its index file. All input files must have the same extension. For some tools and steps, you will also need to provide a samplesheet for the reference panel.
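The same-extension requirement can be checked before launching a run. The following is a minimal sketch, not part of the pipeline: the file name `samplesheet_example.csv`, the sample names, and the paths are hypothetical placeholders, and the check only looks at the text of the `file` column, not at the files themselves.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical example samplesheet (placeholder paths, not real files).
cat > samplesheet_example.csv <<'EOF'
sample,file,index
SAMPLE_A,/data/SAMPLE_A.bam,/data/SAMPLE_A.bam.bai
SAMPLE_B,/data/SAMPLE_B.bam,/data/SAMPLE_B.bam.bai
EOF

# Collect the distinct extensions found in the "file" column (column 2),
# skipping the header row.
extensions=$(tail -n +2 samplesheet_example.csv | cut -d, -f2 | sed 's/.*\.//' | sort -u)
n_extensions=$(printf '%s\n' "$extensions" | wc -l | tr -d ' ')

if [ "$n_extensions" -eq 1 ]; then
    echo "OK: all input files use the .${extensions} extension"
else
    echo "ERROR: mixed extensions found in samplesheet_example.csv" >&2
    exit 1
fi
```

If the samplesheet mixed BAM and CRAM rows, the script would exit with a non-zero status instead of printing the confirmation line.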
A samplesheet for the reference panel may look like the one below, shown here for 3 chromosomes.
```csv
panel,chr,vcf,index
Phase3,1,ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz,ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.csi
Phase3,2,ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz,ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.csi
Phase3,3,ALL.chr3.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz,ALL.chr3.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.csi
```
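Because the panel samplesheet has one row per chromosome, it can be generated with a short loop rather than written by hand. The sketch below assumes the per-chromosome files follow the naming pattern of the example above and that each index is the VCF name plus `.csi`; the output name `panel_example.csv` is a hypothetical choice.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Assumption: panel files follow the naming pattern of the example above.
# Adjust the name, prefix, and suffix to match your own reference panel.
panel_name="Phase3"
prefix="ALL.chr"
suffix=".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
panel_csv="panel_example.csv"   # hypothetical output file name

# Header of the panel samplesheet.
echo "panel,chr,vcf,index" > "$panel_csv"

# One row per chromosome: the VCF plus its .csi index.
for chr in 1 2 3; do
    vcf="${prefix}${chr}${suffix}"
    echo "${panel_name},${chr},${vcf},${vcf}.csi" >> "$panel_csv"
done

cat "$panel_csv"
```

Extending the loop to more chromosomes only requires changing the list after `in`.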
Now, you can run the pipeline using:
```bash
nextflow run nf-core/phaseimpute \
   -profile <docker/singularity/.../institute> \
   --input <samplesheet.csv> \
   --genome "GRCh38" \
   --panel <phased_reference_panel.csv> \
   --steps "panelprep,impute" \
   --tools "glimpse1" \
   --outdir <OUTDIR>
```
> [!WARNING]
> Please provide pipeline parameters via the CLI or the Nextflow `-params-file` option. Custom config files, including those provided by the `-c` Nextflow option, can be used to provide any configuration except for parameters; see docs.
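As an illustration of the `-params-file` route, the sketch below writes the parameters from the command above into a YAML file. The keys mirror the CLI flags shown earlier; the file name `params_example.yaml`, the input file names, the `results` output directory, and the `docker` profile are placeholder assumptions. The run command is printed rather than executed so that the sketch works even where Nextflow is not installed.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Write the pipeline parameters to a YAML file.
# Keys mirror the CLI flags; values are placeholders to adapt.
cat > params_example.yaml <<'EOF'
input: "samplesheet.csv"
genome: "GRCh38"
panel: "phased_reference_panel.csv"
steps: "panelprep,impute"
tools: "glimpse1"
outdir: "results"
EOF

# Print (do not run) the matching command. Replace the profile as needed.
echo "nextflow run nf-core/phaseimpute -profile docker -params-file params_example.yaml"
```

Keeping parameters in a file like this makes a run easier to repeat and to share than a long command line.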
For more details and further functionality, please refer to the usage documentation and the parameter documentation.
Here is a short description of the different steps of the pipeline. For more information please refer to the documentation.
- `--panelprep`: The preprocessing mode is responsible for preparing the multiple input files that will be used by the phasing process. The main processes are:
  - Haplotype phasing of the reference panel using Shapeit5.
  - Normalizing the reference panel to select only the necessary variants.
  - Chunking the reference panel into a subset of regions for all the chromosomes.
  - Extracting the positions at which to perform the imputation.
- `--impute`: The imputation mode is the core mode of this pipeline. Its main steps are:
  - Imputation: Impute the target dataset on the reference panel using either:
    - Glimpse1: This requires computing the genotype likelihoods of the target dataset (done using BCFTOOLS_mpileup).
    - Glimpse2
    - Stitch: This tool does not require a reference panel but needs the samples to be merged.
    - Quilt
  - Ligation: All the chunks are merged together, then all chromosomes are combined to output one VCF per sample.
- `--simulate`: The simulation mode is used to create artificial low-information genetic data from high-density data. This allows the imputed result to be compared to a truth set, and therefore the quality of the imputation to be evaluated. For the moment, it is possible to simulate:
  - Low-pass data, by downsampling BAM or CRAM files using SAMTOOLS_VIEW -s at different depths.
- `--validate`: This mode compares two VCF files to compute a summary of the differences between them. This step uses the Glimpse2 concordance process.
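To make the downsampling in the simulate step concrete: `samtools view -s` takes a single number whose integer part is the random seed and whose decimal part is the fraction of reads to keep. The sketch below derives that argument from a known input depth and a target depth. The depth values, the seed, and the file names are hypothetical, and this is only an illustration of the option, not necessarily how the pipeline computes it internally.

```bash
#!/usr/bin/env bash
set -euo pipefail

current_depth=30   # hypothetical mean depth of the high-coverage input
target_depth=1     # hypothetical depth to simulate
seed=1             # integer part of the -s argument (random seed)

# Build SEED.FRACTION: the fraction of reads to keep is target / current.
# awk exits with an error if the target is not below the current depth.
subsample_arg=$(LC_ALL=C awk -v s="$seed" -v t="$target_depth" -v c="$current_depth" \
    'BEGIN { if (t >= c) { exit 1 } printf "%.4f", s + t / c }')

# Print the command instead of running it, so the sketch works without
# samtools or real data. input.bam and the output name are placeholders.
echo "samtools view -b -s ${subsample_arg} -o downsampled_${target_depth}x.bam input.bam"
```

The fraction is rounded to four decimals here, so very deep inputs or very low targets may need more precision.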
To see the results of an example test run with a full-size dataset, refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.
nf-core/phaseimpute was originally written by Louis Le Nézet.
We thank the following people for their extensive assistance in the development of this pipeline:
If you would like to contribute to this pipeline, please see the contributing guidelines.
For further information or help, don't hesitate to get in touch on the Slack `#phaseimpute` channel (you can join with this invite).
You can cite one of the main imputation methods (`QUILT`) as follows:
Rapid genotype imputation from sequence with reference panels.
Davies, R. W., Kucka, M., Su, D., Shi, S., Flanagan, M., Cunniff, C. M., Chan, Y. F., & Myers, S.
Nature genetics 2021 June 03. doi: 10.1038/s41588-021-00877-0
You can cite one of the main imputation methods (`GLIMPSE`) as follows:
Efficient phasing and imputation of low-coverage sequencing data using large reference panels.
Rubinacci, S., Ribeiro, D. M., Hofmeister, R. J., & Delaneau, O.
Nature genetics 2021.
Imputation of low-coverage sequencing data from 150,119 UK Biobank genomes.
Rubinacci, S., Hofmeister, R. J., Sousa da Mota, B., & Delaneau, O.
Nature genetics 2023.
You can cite one of the main imputation methods (`STITCH`) as follows:
Rapid genotype imputation from sequence without reference panels.
Davies, R. W., Flint, J., Myers, S., & Mott, R.
Nature genetics 2016.
An extensive list of references for the tools used by the pipeline can be found in the `CITATIONS.md` file.
You can cite the `nf-core` publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.