mpiersonsmela / SeqVerify

Analyzes whole genome sequencing data for gene-editing verification
GNU General Public License v3.0
9 stars 3 forks source link

SeqVerify

SeqVerify is a Python-based command line tool for analysis of whole genome sequencing data for gene-editing verification. It performs insertion site detection, copy number variation (CNV) analysis through CNVPytor, bacterial contamination detection through KRAKEN2 and BRACKEN, and variant calling and filtering aided by SnpEff and SnpSift.

Install

Dependencies (& Dependencies of Dependencies)

Install through bioconda

conda install -c bioconda seqverify

Usage

An example SeqVerify call can be found below:

seqverify --output output_name --reads_1 sample_1.fastq --reads_2 sample_2.fastq --genome genome.fa --untargeted transgenes.fa --targeted commands.txt --database db8gb

Anatomy of a SeqVerify call

SeqVerify has the following standard arguments:

SeqVerify has the following optional arguments:

Performance

This option may be useful for core optimization on clusters: e.g. "beginning" is a single-core operation, "align" is a multi-core operation, so on a job scheduler a user could set a job dependent on the other and only use multiple cores when rquired. It may also be used to avoid having to restart the pipeline from scratch should the hardware or the software fail for any reason, or to perform further analysis on data that SeqVerify has already processed (e.g. add KRAKEN2 analysis when it wasn't initially requested).

KRAKEN2

Output

SeqVerify will output two folders, output_seqverify and output_seqverify_temp where "output" is the variable set in the --output argument.

output_seqverify_temp contains files that can be deleted after the pipeline is run, that are generated and used by the pipeline (SAM headers, .BED files, etc.).

soutput_seqverify contains the following files:

seqverify_output_markers.bam and seqverify_output_markers.bam.bai A BAM file containing the reads given in --reads_1 and --reads_2 realigned to the reference genome augmented with the marker genomes and its corresponding index.

seqverify_readout.txt The human-readable readout resulting from the insertion site detection portion of the pipeline. The outermost indentation represents the marker, the second indentation is the human chromosome it aligns to, and the third layer denotes the in-chromosome coordinate of the insertion site, and how many times the marker was aligned to it (with the minimum value being specified in --min_matches).

output.pytor The binaries output by CNVPytor. Can be used to generate further plots of specific regions if needed (refer to CNVPytor docs)

output.global.0000.png The Manhattan plot of copy number across the sample as output from output.pytor.

fig_marker.png A copy number histogram of the transgene in the genome. Default bin size is 30bp, and one histogram is output for every sequence specified in --markers.

If KRAKEN2 analysis is being performed, the pipeline will also output the following files in seqverify_output:

classified_seqs_output.kreport A human-readable report of the microbial sequences detected by KRAKEN2 in text format. Typically species with just a few reads (<10) can be ignored, but the presence of more than this could indicate contamination.

classified_output.kraken The KRAKEN2 binary output files used to generate the report.

classified_seqs_output_1.fq and classified_seqs_output_2.fq, FASTQ files containing the sequences that were classified in the KRAKEN2 database.

If SNV Analysis is being performed, the pipeline will output the following files:

seqverify_output.ann.vcf is the annotated VCF output of all variants present in the sample reads provided.

seqverify_output_variants.tsv is the filtered readout prepared by SeqVerify containing information about all variants above a set severity level, along with any loss of function or homozgosity data.

Example input and output

Example of WGS data with copy number abnormalities at the beginning of chromosome 12 (CNVPytor Manhattan plot):

S09_CNV

Example of the KRAKEN report showing Mycoplasm contamination in the same sample:

contamination

Example of insertion site detection, with tdTomato being found in chromosome 5 of a different sample, and possibly other transgenes (even though given the low number of matches it is more likely they are false positives due to repetitive DNA):

insertion

Tutorial

After installing SeqVerify from Bioconda, you will need to assemble the following files to run it:

You should then pick a directory to run the pipeline in, from which you'll call the seqverify command. This will create two folders inside of it, the output folder and the temp folder (the latter of which will be deleted by default).

seqverify --output output_name --reads_1 r1.fastq --reads_2 r2.fastq --genome genome.fa --untargeted transgene1.fa transgene2.fa (--targeted commands.txt) (--kraken --database database)

The pipeline will then run: this will usually take a few hours, with the most time-intensive steps being BWA alignment and genome indexing. The output files as described above will be output in the output folder, and everything else in the temp folder will be deleted unless --keep_temp is set.

Command text file generation

To use the --targeted flag for known insertion sites of transposons, SeqVerify requires the construction of a commands file. The commands file should be a text file and must be formatted as follows:

CHR:START-END SEQUENCE

where the two columns are tab-separated. Every line is a "command" and specifies a modification to the genome.

CHR is the name of the chromosome that the modification will happen at (e.g. chr1).

START-END is an interval of coordinates to be deleted, not including START.

SEQUENCE is the sequence to be inserted from START onwards.

For example, the command chr5:56644830-56644850 GGCTCTGGCGAGGGCAGAGGAAGTC deletes bases 56644831 to 56644850 in chromosome 5 and inserts the sequence GGCTCTGGCGAGGGCAGAGGAAGTC in their place.

Pure insertions (i.e. inserting while deleting nothing) can be achieved by putting the same coordinate for both start and end: chr1:1-1 AGCT deletes nothing and inserts AGCT after the first base. Pure deletions (i.e. deletions with no insertions) can be achieved by leaving the sequence field blank: chr2:0-10 deletes the first 10 bases in chromosome 2 and does not replace them with anything.

Multiple commands are allowed in one command file, and seqverify automatically handles interactions between commands on the same chromosome (i.e. a command's coordinates changing because of a previous command), so all the end user needs to do is use the coordinates straight from their source without any adjustment or calculation.

Config File

SeqVerify allows for the use of a configuration file to load all of its settings: a blank/default configuration file is bundled with a SeqVerify download under the name seqverify.config.

Please modify this file without changing the text between square brackets, as modification of the headers in the square brackets will break the file's parsing in the pipeline.

Additionally, in the config file please make sure your untargeted insertion file names do not contain spaces: the config file parser uses spaces to separate file names, so this will cause errors in file name parsing and may result in the pipeline not running correctly.

Spurious Threshold Parameter

SeqVerify allows for a --spurious_filtering_threshold parameter to be set to control how tightly the filtering of extremely high coverage (and therefore likely spurious for actual insertions) sites is performed. Filtered areas will not be printed out in the insertion site readout, but will be printed to stdout as the pipeline is running for logging purposes.

By default, this is set to 0.00001, meaning the pipeline filters out any area with a coverage higher than what we would expect 1-0.00001 = 99.9999% of the areas in the genome to have. This feature can be turned off by setting --spurious_filtering_threshold 0, in which case no filtering will be performed and every insertion site will be reported upon.

Frequently Asked Questions

Why am I getting a BCFTools error?

Certain builds of bcftools may not be compatible with all Linux distributions. If the pipeline raises “bcftools: error while loading shared libraries: libgsl.so.25: cannot open shared object file: No such file or directory”, please try forcing Conda to use the Bioconda channel to install bcftools. You can do this by setting your .condarc file to:

channel_priority: strict
channels:
  - bioconda
  - conda-forge
  - defaults

Can multiple fastq files for the same genome be used in the same command?

No, --reads_1 and --reads_2 can each only take a single fastq file. If you have more than one fastq file for the same genome and want to use SeqVerify on your data, please concatenate all the forward reads together in one file and all the backward reads together in another file through the cat command.

Are insertions of human DNA detected in the insertion site detection portion of the pipeline?

Insertions of mitochondrial human DNA can be detected using the --mitochondrial flag. Insertions of non-mitochondrial human DNA are not currently detectable in SeqVerify, but other tools (such as IGV) can be used to find them manually.

Are insertions of mitochondrial DNA a sign of contamination/degraded data?

Not necessarily: chromosomal insertions of mitochondrial human DNA are relatively common within the human genome, and are usually not signs of contamination unless unexpectedly found at a known insertion site.