imGLAD is a computational tool for detection of bacterial genomes in metagenomic datasets. For license information, see LICENSE.
The software consists of two parts. The first part creates a series of metagenomic datasets in such a way that the target organism is present in half of them and absent in the other half.
Python 3.4 https://www.python.org/downloads/release/python-370/
ART 2.5.8 or higher https://www.niehs.nih.gov/research/resources/software/biostatistics/art/.
Either BLAST 2.2.28 (or higher) https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download or BLAT (any version) http://hgdownload.soe.ucsc.edu/downloads.html#source_downloads.
numpy, scipy, biopython, gzip, screed, statsmodels (requires Cython installation if using a Python version older than 3.4)
Clone the git repository
$> git clone https://github.com/jccastrog/imGLAD
You can also download the zip file from the GitHub site https://github.com/jccastrog/imGLAD.
Once you have installed imGLAD, you can use fitModel to create a model of the target genome you want to detect.
The automatic training generates reads from a random selection of genomes (200 by default) from RefSeq (Pruitt et al., 2004), and builds in silico datasets of about 1 million reads each. Simulated reads from the target genome(s) are then generated in the same way and added to these datasets at different abundances to create the positive datasets. Reads from the target genome(s) are omitted when constructing the negative datasets. All other genomes used to create the datasets are sampled in equal proportions (i.e., even richness).
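The composition of one training dataset can be sketched as follows. This is only an illustration of the description above, not imGLAD's own code: the function name `plan_dataset`, its parameters, and the way target abundance is expressed (as a fraction of the background read count) are all assumptions made for the example. Read simulation itself (done with ART in imGLAD) is not shown.

```python
import random


def plan_dataset(background_ids, target_id=None, target_fraction=0.0,
                 n_genomes=200, total_reads=1_000_000, seed=0):
    """Return a hypothetical {genome_id: read_count} plan for one dataset.

    Background genomes are drawn at random and share the reads equally.
    If target_id is given, target reads are added on top (positive dataset);
    otherwise the target is left out (negative dataset).
    """
    rng = random.Random(seed)
    n = min(n_genomes, len(background_ids))
    chosen = rng.sample(background_ids, n)

    # Equal proportions for every background genome.
    per_genome = total_reads // n
    plan = {genome: per_genome for genome in chosen}

    # Assumption: abundance is a fraction of the background read count.
    if target_id is not None and target_fraction > 0:
        plan[target_id] = round(total_reads * target_fraction)
    return plan


# Placeholder identifiers standing in for RefSeq genomes.
refseq_ids = [f"genome_{i}" for i in range(500)]
negative = plan_dataset(refseq_ids)
positive = plan_dataset(refseq_ids, target_id="target", target_fraction=0.01)
```

With the same seed, the positive and negative plans share the same background genomes and differ only in the presence of the target reads.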
Once the logistic model has been built, sequencing breadth can be used to reliably predict the probability of presence of the target genome in any number of query metagenomic datasets, using probEstimate.
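To make the idea concrete, here is a minimal sketch of a one-variable logistic model that maps sequencing breadth (taken here as the fraction of the target genome covered by at least one read) to a probability of presence. It is not the implementation used by fitModel or probEstimate: the function names, the training values, and the plain gradient-ascent fit are assumptions chosen to keep the example self-contained.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(breadths, labels, lr=0.5, epochs=5000):
    """Fit p(present) = sigmoid(b0 + b1 * breadth) by gradient ascent
    on the log-likelihood. Returns the intercept b0 and slope b1."""
    b0 = b1 = 0.0
    n = len(breadths)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(breadths, labels):
            err = y - sigmoid(b0 + b1 * x)
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def prob_present(breadth, b0, b1):
    """Probability that the target is present, given observed breadth."""
    return sigmoid(b0 + b1 * breadth)


# Made-up breadth values for illustration only:
# negative datasets (label 0) and positive datasets (label 1).
train_breadth = [0.00, 0.01, 0.02, 0.03, 0.05, 0.04, 0.10, 0.20, 0.40, 0.60]
train_label = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

b0, b1 = fit_logistic(train_breadth, train_label)
for query in (0.0, 0.05, 0.3):
    print(f"breadth={query:.2f} -> p(present)={prob_present(query, b0, b1):.3f}")
```

Once the two coefficients are fitted on the training datasets, scoring a new metagenome only requires its observed breadth for the target genome.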