janka2012 / digIS

Pipeline detecting distant and putative novel insertion sequences in prokaryotic genomes
MIT License
9 stars 0 forks source link

digIS

Table of content

Overview

digIS is a command-line tool developed in Python. It utilizes several external tools such as BLAST, HMMER, and Biopython library. As an input, digIS accepts contigs in FASTA format. Optionally, the user can provide a GenBank annotation file for a given input sequence(s). This annotation is later used to improve the classification of identified IS elements.

The digIS search pipeline operates in the following steps:

  1. The whole input nucleic acid sequence is translated into amino acid sequences (all six frames).
  2. The translated sequences are searched using manually curated pHMMs.
  3. Found hits, referred to as seeds, are filtered by domain bit score and e-value. Those that overlap or follow one another within a certain distance are merged.
  4. Each seed is matched against the database of known IS elements (ISfinder) and its genomic positions are extended according to the best hit.
  5. Extended seeds are filtered by noise cutoff score and length. Duplicates, corresponding to the same IS element, are removed.
  6. Remaining extended seeds are classified based on sequence similarity and GenBank annotation (if available) to assess their quality.
  7. Finally, the classified outputs are reported in the CSV and GFF3 format.

Requirements

Installation

Install dependencies using package manager (for Ubuntu)

sudo apt-get update
sudo apt-get install hmmer
sudo apt-get install ncbi-blast+

# install python3  and pip3
sudo apt-get install python3.7
sudo apt install python3-pip

# install Biopython, bcbio-gff, and numpy packages
pip3 install biopython
pip3 install bcbio-gff
pip3 install numpy

Download digIS version from github repository

# download the latest version
git clone https://github.com/janka2012/digIS.git

# or download specific release from https://github.com/janka2012/digIS/releases
wget https://github.com/janka2012/digIS/archive/v1.0.tar.gz
tar -xvzf v1.0.tar.gz

# go to digIS directory
cd digIS-v1.0

Usage

Mode with GenBank annotation

export PYTHONPATH=$PYTHONPATH:/path/to/digis/
python3 digIS_search.py -i data/test_data/NC_002608.fasta -g data/test_data/NC_002608.gb -o digis_genbank

Mode without GenBank annotation

export PYTHONPATH=$PYTHONPATH:/path/to/digis/
python3 digIS_search.py -i data/test_data/NC_002608.fasta -o digis_without_genbank

Run digIS in docker container

Install docker

# update software repositories
sudo apt-get update

# uninstall older versions of docker
sudo apt-get remove docker docker-engine docker.io

# install docker
sudo apt install docker.io

# start and automate docker
sudo systemctl start docker
sudo systemctl enable docker

# check docker version (optional)
docker --version

Pull the docker image from Dockerhub

docker pull janka2012/digis

# List created docker images. You should see image with name digis listed.
docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
janka2012/digis     latest              1f09fc937ee1        14 minutes ago      765MB

Run digIS using digis_docker_wrapper.sh

Instead of typing overwhelmingly long docker commands we are providing digis_docker_wrapper.sh script which allows you to use digIS docker image in really convinient way. This script takes same arguments as standard digIS.py script.

sh digis_docker_wrapper.sh -i data/test_data/NC_002608.fasta -g data/test_data/NC_002608.gb -o digis_genbank

Understanding Outputs

digIS output directory structure

digIS stores all results in output directory you specify by using -o option. The default output directory name is set to digIS_output. The output directory has following structure:

digIS output files in results subfolder

For a given input (multi)fasta file digIS generates three files with results: CSV file, GFF3 file and file with summary statistics.

CSV output

Example CSV output

id;level;qid;qstart;qend;sid;sstart;send;strand;acc;score;evalue;ORF_sim;IS_sim;GenBank_class
NC_002608.1_000_is;is;IS200_IS605;1;113;NC_002608.1;181742;183592;-;0.98;105.7;4.9e-34;0.8923076923076924;0.8803465078505684;is_related
NC_002608.1_001_is;is;IS200_IS605;1;113;NC_002608.1;154295;156130;-;0.98;117.5;1.1e-37;0.9922480620155039;1.0;is_related

GFF output

The GFF3 output file has the same content as the CSV output file, but in GFF3 format. Detailed description of GFF3 format can be found here.

Example GFF3 output record

NC_002608.1     digIS   transposable_element    309812  311213  1.0     -       .       id=NC_002608.1_13_is;level=is;qid=IS4Sa_ISH8_IS231_IS4;qstart=1;qend=226;class_sim_orf=strong;class_sim_is=strong;class_sim_all=strong;class_genebank=None;class_level=None

Summary statistics

This file is in tab-delimited format and contains six coloumns:

Getting FASTA file using GFF file

The GFF is a standard format for storing of genome features. This file can be used as an input for other tools to process or visualize appropriate genomic features.

For instance, FASTA sequences of IS elements (their catalytic domains) can be obtained using Bedtools and command getfasta as follows:

bedtools getfasta -fi <input.fasta> -bed <input.gff> -fo <output.fasta>

where input.fasta represents FASTA file used for searching, input.gff is the digIS output GFF file and output.fasta is required output file.

Getting Flank Regions using GFF file

bedtools flank -i <input.gff> -g <genome.size> -l <left flank size>  -r <right flank size> 

where genome.size is a text file containing information about chromosomes and their sizes in form: chromosome_name<TAB>chromosome_size. For more information about genome.size file format please see Bedtools documentation.

As the output a new GFF file with positions of flank regions is generated. Then, the appropriate FASTA file with flank sequences can be obtained using bedtools getfasta command described above.