molgenis / VaSeBuilder

Validation Set Builder
GNU Lesser General Public License v3.0
1 stars 3 forks source link

VaSeBuilder

Validation Set Builder  

 

Quick Start

Inputs and Outputs

The workflow for VaSeBuilder generally involves as inputs:

At the end of a full run, VaSeBuilder will produce:

Workflows

VaSeBuilder has two overall workflows:

1) Full workflow from inputs to outputs (BuildValidationSet).

Command Line Examples

For a full workflow:

python vase.py BuildValidationSet \  
  -b donor1.bam donor2.bam donor3.bam \
  -v donor1.vcf.gz donor2.vcf.gz donor3.vcf.gz \
  -f my_variants_of_interest.tsv
  -a acceptor.bam \
  -1 acceptor_R1.fastq.gz -2 acceptor_R2.fastq.gz \
  -r reference_genome.fasta

To construct spike-in building blocks for the scalable workflow:

python vase.py BuildSpikeIns \
  --output-mode P \
  -b donor1.bam donor2.bam donor3.bam \
  -v donor1.vcf.gz donor2.vcf.gz donor3.vcf.gz \
  -f my_variants_of_interest.tsv \
  -a acceptor.bam \
  -r reference_genome.fasta

To compile spike-ins into a hybrid FastQ:

python vase.py AssembleValidationSet \
  -kb spike_in_1.bam spike_in_2.bam spike_in_3.bam \
  -kv spike_in_1.vcf.gz spike_in_2.vcf.gz spike_in_3.vcf.gz \
  -1 acceptor_R1.fastq.gz -2 acceptor_R2.fastq.gz \
  -c varcon_file.tsv

About VaSeBuilder

Short introduction

Often when a computational analysis consisting of multiple steps is run routinely, a pipeline is constructed. The pipeline connects the individual analysis steps and automates the execution of each and transition to the next. Pipelines can be evaluated to determine whether each step and transition to the next works correctly. Simultaneously, validation of the pipeline to determine whether it can indeed provide relevant results is equally important. Pipelines can be validated in separate and isolated runs but this requires time. We aim to combine the running of analyses and validation of the pipeline.\ VaSeBuilder (Validation Set Builder) can be used to construct two FastQ files with forward and reverse reads for the validation of the Molgenis NGS_DNA pipeline. To construct these two FastQ files data from samples already processed with the NGS_DNA pipeline are used to modify a template.\ The sample data should consist of a BAM file (containing aligned reads) and a VCF file containing identified variants.\ The template can for example be the NA12878 sample and should first be processed with the NGS_DNA pipeline. VaSeBuilder only requires the two FastQ and the produced BAM file.

What does VaSeBuilder do?

For each sample, VaSeBuilder iterates over the variants within the VCF/BCF file (if a variant list is provided only variants satisfying that list will be used). First reads overlapping directly with the variant are identified in both the acceptor/template and the donor BAM/CRAM file of the same sample. From these reads and their mates the left and right most position is determined which establishes the acceptor context and donor context respectively. From these two contexts the variant context is determined. This context spans the absolute left and right most genomic positions of both contexts and can (quite often) be larger than either. Reads overlapping with this variant context and their mates are then identified and saved.\ After processing all donor samples, the acceptor/template fastq files are used to produce the validation set fastq files. Acceptor reads within a variant contexts are excluded from these new fastq files and are replaced with donor reads that are located in the same variant context. This produces a set of fastq files for with known variants that should be able to be identified and which reads carry those variants.\ Currently VaSeBuilder only works with 'simple' genomic variants such as SNPs and small indels, but this may very well be expanded in the future :)

What are acceptor, donor and variant contexts?

VaSeBuilder creates the validation set by identifying for each variant which acceptor reads to be exchanged with which donor reads. The acceptor context is the window established by the leftmost and rightmost genomic positions of reads directly overlapping with the variants and their read mates (which likely do not overlap). The donor context works the same as the acceptor context but instead uses a donor file.\ The variant context is established by combining the minimum and maximum border of acceptor and donor context and thereby spans both. Since the variant context spans a larger area than both acceptor and donor context individually, acceptor and donor reads and their mates are identified again. Acceptor reads overlapping with the variant context and their mates are excluded and replaced by donor reads overlapping with the variant context.

Required software

Important to know

VaSeBuilder is intended to build a validation set from acceptor and donor data that was sequenced with the same sequencer/sequencing platform and treated with the same preparation and capturing kit. (Please see the documentation later as to why...)  

 

Program output

Aside from the forward and reverse read FastQ files, the program also outputs a set of text files. These files are all related to the variant contexts. Although the names of some files can be set via the optional parameters, below we used the default names to describe each.

Default output files

Debug output files

 

Questions & Answers

General questions

Q: Which python version is required?\ A: VaSeBuilder has been created to run in Python 3. To run the program on Python 2 the code might need to be changed first.

Q: Which python libraries are required?\ A: VaSeBuilder uses pysam and gzip. The easiest way to obtain these libraries is to install them via pip (pip install library). Note that the pysam library itself requires htslib which doesn't run on the Windows platform.

Q: Which pysam version is required?\ A: Pysam 0.14.0 or higher is required as VaSeBuilder uses the get_forward_sequence() and get_forward_qualities() methods that were introduced in pysam 0.14.0.

Q: Can VaSeBuilder be run on Windows?\ A: Some adjustments would need to be made as pysam can't be used on Windows. The python library bamnostic can replace pysam for working with BAM files. Reading and processing VCF would need to be done with the default python file reading/writing functionality or other libraries (if available).

Q: Are sample identifiers required?\ A: Yes, VaSeBuilder uses the sample identifiers in BAM and VCF files to link the two files. You can check if a BAM has sample information by examing the header for a line similar to: @RG ID:Sample SM:Sample You can extract the header (samtools view -H bamFile.bam > bamHeader.txt), add a sample by adding the above line with the correct sample name and then reheader the BAM file (samtools reheader bamHeader.txt bamFile.bam > BamFileWithSample.bam).

Specific questions

Q: Is there a simple way to check whether the donor and acceptor reads are indeed added and removed respectively?\ A: Currently I'm working on a set of scripts, as well as others, that I have called VaSeUtils which can be used to check this.

Q: What is the effect of using data from different sequencers, prep-kits and/or capturing kits?\ A: To still be investigated...  

 

Future things

VaSeBuilder might change into a python project installable via pip.\ VaSeBuilder might very well be updated and changed to work with more complex variants.\ Make VaSeBuilder into a multiprocessor program to speed up te process.\ Let VaSeBuilder work with more complex variants.\ Let VaSeBuilder work with trio data.

Please let me know if this documentation is missing things or which things are unclear.