convert bam to gvcf - Githubissues

jharri34 commented 7 years ago

HaplotypeCaller Call germline SNPs and indels via local re-assembly of haplotypes Overview The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.

In the so-called GVCF mode used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate genomic gVCF (gVCF), which can then be used for joint genotyping of multiple samples in a very efficient way, which enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes (e.g. the 92K exomes of ExAC).

In addition, HaplotypeCaller is able to handle non-diploid organisms as well as pooled experiment data. Note however that the algorithms used to calculate variant likelihoods is not well suited to extreme allele frequencies (relative to ploidy) so its use is not recommended for somatic (cancer) variant discovery. For that purpose, use MuTect2 instead.

Finally, HaplotypeCaller is also able to correctly handle the splice junctions that make RNAseq a challenge for most variant callers.

Input Input bam file(s) from which to make calls

Output Either a VCF or gVCF file with raw, unfiltered SNP and indel calls. Regular VCFs must be filtered either by variant recalibration (best) or hard-filtering before use in downstream analyses. If using the reference-confidence model workflow for cohort analysis, the output is a GVCF file that must first be run through GenotypeGVCFs and then filtering before further analysis.

Usage examples Single-sample GVCF calling on DNAseq (for -ERC GVCF cohort analysis workflow) java -jar GenomeAnalysisTK.jar \ -R reference.fasta \ -T HaplotypeCaller \ -I sample1.bam \ --emitRefConfidence GVCF \ [--dbsnp dbSNP.vcf] \ [-L targets.interval_list] \ -o output.raw.snps.indels.g.vcf Caveats We have not yet fully tested the interaction between the GVCF-based calling or the multisample calling and the RNAseq-specific functionalities. Use those in combination at your own risk. Many users have reported issues running HaplotypeCaller with the -nct argument, so we recommend using Queue to parallelize HaplotypeCaller instead of multithreading. Special note on ploidy This tool is able to handle almost any ploidy (except very high ploidies in large pooled experiments); the ploidy can be specified using the -ploidy argument for non-diploid organisms.

jharri34 commented 7 years ago

Once you have pre-processed your data according to our recommendations, you are ready to undertake the variant discovery process, i.e. identify the sites where your data displays variation relative to the reference genome, and calculate genotypes for each sample at that site.

Unfortunately some of the variation you might observe is caused by mapping and sequencing artifacts, so the greatest challenge here is to balance the need for sensitivity (to minimize false negatives, i.e. failing to identify real variants) vs. specificity (to minimize false positives, i.e. failing to reject artifacts). We have found that it is very difficult to reconcile these objectives in a single step, so instead we decompose the variant discovery process into two separate steps: variant calling and variant filtering. The first step is designed to maximize sensitivity, while the filtering step aims to deliver a level of specificity that can be customized for each project.

For DNA, the variant calling step is further subdivided into two separate steps (per-sample calling followed by joint genotyping across samples) in order to enable scalable and incremental processing of cohorts comprising many individual samples. This workflow involves running HaplotypeCaller on each sample separately in GVCF mode, to produce an intermediate file format called GVCF (for Genomic VCF), which is described in detail in the documentation. The GVCFs of multiple samples are then run through a joint genotyping step to produce a multi-sample VCF callset, which can then be filtered to balance sensitivity and specificity as desired. In the final analysis, this workflow produces results equivalent to traditional joint calling, in which all samples are given simultaneously to the variant caller, but it scales much better and resolves the so-called N+1 problem. Note that this is equally applicable to small cohorts or even single samples.

The best way to filter the resulting variant callset is to use variant quality score recalibration (VQSR), which uses machine learning to identify annotation profiles of variants that are likely to be real. The drawback of this sophisticated method is that it requires a large callset (minimum 30 exomes, more than 1 whole genome if possible) and highly curated sets of known variants. This makes it difficult to apply to small experiments, RNAseq experiments, and non-model organisms; for those it is typically necessary to develop hard filtering parameters manually.

jharri34 commented 7 years ago

you can use the HaplotypeCaller to call variants individually per-sample in -ERC GVCF mode, followed by a joint genotyping step on all samples in the cohort

jharri34 commented 7 years ago

we call incremental joint discovery, providing you with all the benefits of classic joint calling (as described below) without the drawbacks.

jharri34 commented 7 years ago

https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php

jharri34 commented 7 years ago

https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php

jharri34 commented 7 years ago

need reference.fasta

jharri34 / krueger

convert bam to gvcf #12