koadman / graphdeconvolution

deconvolution of species & strain genome mixtures in assembly graphs
5 stars 0 forks source link

assembly graph deconvolution

A repository for work on deconvoluting assembly graphs containing strain mixtures using time-series abundance information.

Quick start

Assuming you are starting in a directory containing a collection of paired-end Illumina read files, with names ending in the usual R?.fastq.gz, the software can be run as follows:

export BBMAP=/path/to/bbmap
export GDECONHOME=/path/to/this/repo
find `pwd` -maxdepth 1 -name "*R1.fastq.gz" | sort > read1_files.txt
find `pwd` -maxdepth 1 -name "*R2.fastq.gz" | sort > read2_files.txt
$GDECONHOME/btools/gdecon.nf --readlist1=read1_files.txt  --readlist2=read2_files.txt --r1suffix=R1.fastq.gz

If the workflow has completed successfully then the assembled strain genomes will appear in the directory out/.

Dependencies and prerequisites

Major components of the workflow

The workflow first constructs a compacted de Bruijn graph using bcalm, then trims the dBg using btrim. The abundance of each path (unitig) in each sample is then estimated via tigops. Next, the abundance data and graph structure are given to a Bayesian graph deconvolution model. The posterior output from the model is summarised and the number of strain genomes as well as their sequences is then recorded.

Bayesian assembly graph deconvolution implemented in Stan

A model has been implemented in Stan code to carry out assembly graph deconvolution. The input file for the model must contains the following information about the unitig graph:

The MEGAHIT hack

We also created a hacked-up version of megahit, designed to take multiple samples. It uses the same command-line interface as standard megahit. Two additional output files are generated in addition to the usual megahit outputs. The first is intermediate_contigs/k*.unitigs.fa, which contains the unitig sequences. This can optionally be processed with the megahit toolkit contigs2fastg program to generate a graph viewable in BANDAGE. The second output is intermediate_contigs/k*.unitig_depths.Rdata. This file is as described above.