nigyta / dfast_core

DDBJ Fast Annotation and Submission Tool

DFAST - DDBJ Fast Annotation and Submission Tool

DFAST is a flexible and customizable pipeline for prokaryotic genome annotation as well as data submission to the INSDC. It was originally developed as the background engine for the DFAST web service and is also available as a stand-alone command-line tool. The stand-alone version of DFAST is also referred to as DFAST-core to differentiate it from the online version.
For inquiries and requests, please contact us at dfast @ nig.ac.jp.

Installation

If you use Anaconda/Miniconda, see "Installation via conda" below to install with conda.

Prerequisites

Source code

Available from the GitHub repository nigyta/dfast_core.

For your convenience, create links to DFAST executables in a directory specified by the PATH environment variable. For example,

ln -s $DFAST_APP_ROOT/dfast /usr/local/bin/
ln -s $DFAST_APP_ROOT/scripts/dfast_file_downloader.py /usr/local/bin/
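
To confirm that the links are picked up, you can look the executables up on PATH. The following is a minimal sketch using only the Python standard library; the helper name missing_executables is made up for illustration and is not part of DFAST.

```python
import shutil

# Executable names taken from the link commands above.
DFAST_EXECUTABLES = ("dfast", "dfast_file_downloader.py")


def missing_executables(names=DFAST_EXECUTABLES):
    """Return the names that cannot be found on PATH (hypothetical helper)."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_executables()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All DFAST executables are on PATH.")
```

An empty result means both commands can be called without typing their full paths.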

Reference databases

After downloading the source code, prepare reference databases using the bundled utility script.
By default, database files are generated under $DFAST_APP_ROOT/db/. You can change this location by specifying either the --dbroot option or the DFAST_DB_ROOT environment variable.

  1. Default protein database
    dfast_file_downloader.py --protein dfast

    File downloading and database indexing for GHOSTX and BLASTP will be performed.

  2. HMMER and RPS-BLAST databases (this may take time)
    dfast_file_downloader.py --cdd Cog --hmm TIGR

    The default DFAST workflow requires the COG database for RPS-BLAST and the TIGRFAM database for hmmscan.

    • See help for more information.
      dfast_file_downloader.py -h
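
If you prepare databases from a script, the two download steps above can be combined into one call. The sketch below only assembles the argument list from the options shown in this section; the function name downloader_command and the path /data/dfast_db are hypothetical.

```python
import os
import shlex


def downloader_command(dbroot=None):
    """Build the dfast_file_downloader.py call for the default workflow databases."""
    cmd = ["dfast_file_downloader.py",
           "--protein", "dfast",
           "--cdd", "Cog",
           "--hmm", "TIGR"]
    if dbroot is not None:
        # Optional: write the databases somewhere other than the default location.
        cmd += ["--dbroot", os.fspath(dbroot)]
    return cmd


if __name__ == "__main__":
    # /data/dfast_db is a placeholder path, not a DFAST default.
    print(shlex.join(downloader_command("/data/dfast_db")))
```

To actually run the download, pass the returned list to subprocess.run(cmd, check=True).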

Installation via conda

DFAST is also available from Bioconda. Install with:

conda install -c bioconda -c conda-forge dfast

We recommend specifying the latest version. See the Bioconda package page for available versions.

conda install -c bioconda -c conda-forge dfast=1.X.XX

If this does not work, try installing DFAST into a fresh conda environment.

DFAST executables are added to the PATH environment variable, and the software package is installed in the opt directory under the Anaconda/Miniconda root directory (e.g. /home/USER/miniconda3/opt/dfast-X.X.X/).

After installing DFAST, download the reference databases:

dfast_file_downloader.py --protein dfast --cdd Cog --hmm TIGR

How to run

  1. Help

    dfast -h

    or by specifying the Python interpreter,

    python $DFAST_APP_ROOT/dfast -h
  2. Test run

    dfast --config $DFAST_APP_ROOT/example/test_config.py

    This minimal workflow includes CDS prediction and a database search against the default protein database using the GHOSTX aligner. The result will be generated in the RESULT_TEST directory.
    Normally, it finishes within a minute. If it does not work properly, please check whether the default database is installed.

  3. Basic usage

    dfast --genome path/to/your_genome.fna(.gz)

    This invokes the DFAST pipeline with the default workflow defined in $DFAST_APP_ROOT/dfc/default_config.py. DFAST accepts a FASTA-formatted genome sequence file as a query.

  4. Advanced usage
    By providing command line options, you can override the default settings described in the configuration file.

    dfast --genome your_genome.fna(.gz) --organism "Escherichia coli" --strain "str. xxx" \
    --locus_tag_prefix ECXXX --minimum_length 200 --references EC_ref_genome.gbk \
    --aligner blastp --out OUT_ECXXX

    A locus tag prefix is required if you want to submit your genome to the INSDC (use the --locus_tag_prefix option). DFAST generates DDBJ submission files. For more information, please refer to INSDC submission. If you set the --references option, OrthoSearch (orthologous gene assignment) is enabled, which conducts all-against-all protein alignments between the given reference genomes to infer orthologous genes.
    --aligner blastp makes DFAST use BLASTP for protein alignments instead of the default GHOSTX.

    These optional values can be specified in a configuration file, saving you from providing them as command line options. See the following step.

  5. More advanced usage: Creating your own workflow
    An easy way to do this is to copy and edit the default configuration file, which is located in $DFAST_APP_ROOT/dfc/default_config.py. The configuration file is a self-explanatory Python script, in which the workflow is defined using basic Python objects like lists and dictionaries.

    You can call your original configuration file with the --config option.

    dfast --genome your_genome.fna(.gz) --config your_config.py
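
When DFAST is called from another program, the command lines above can be assembled as an argument list instead of a shell string. This is a minimal sketch, not part of DFAST: the helper dfast_command is hypothetical, and it only handles options that take a value (flag-only options such as --force would need separate handling).

```python
import shlex


def dfast_command(genome, config=None, **options):
    """Build a dfast argument list (hypothetical helper).

    Keyword names map directly to long options,
    e.g. locus_tag_prefix="ECXXX" becomes --locus_tag_prefix ECXXX.
    """
    cmd = ["dfast", "--genome", str(genome)]
    if config is not None:
        cmd += ["--config", str(config)]
    for name, value in options.items():
        cmd += ["--" + name, str(value)]
    return cmd


if __name__ == "__main__":
    # Same options as the advanced usage example above.
    cmd = dfast_command(
        "your_genome.fna",
        organism="Escherichia coli",
        strain="str. xxx",
        locus_tag_prefix="ECXXX",
        minimum_length=200,
        references="EC_ref_genome.gbk",
        aligner="blastp",
        out="OUT_ECXXX",
    )
    print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) avoids shell quoting problems with values that contain spaces, such as the organism name.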

Default workflow

The default DFAST annotation workflow accepts a genomic FASTA file (draft or complete) as input and includes the following processes. Read Workflow to learn more.

Structural annotation

The following tools are run in parallel to predict biological features (e.g. CDSs and RNAs). After that, partial and overlapping features will be cleaned up.

Optionally, you can choose Prodigal/GeneMarkS2, RNAmmer, and tRNAscan-SE to predict CDSs, rRNAs, and tRNAs, respectively. See FAQ. (You need to install them manually.)

Functional annotation

  1. OrthoSearch (Optional. Set --references option to enable this.)
  2. DBsearch using the GHOSTX aligner against the DFAST default database
  3. PseudoGeneDetection (internal stop codons and frameshifts)
  4. HMMscan against the profile HMM database of TIGRFAM
  5. CDDsearch against COG database from NCBI Conserved Domain Database

By default, GHOSTX is used to align protein sequences. Diamond/BLASTP can be used optionally. See FAQ. (Diamond needs to be installed manually.)

Output

Options

Basic usage:
  usage: dfast -g your_genome.fna [options]

Basic options:
  -g PATH, --genome PATH
                        Genomic FASTA file for input. Can be gzipped.
  -o PATH, --out PATH   Output directory (default:OUT)
  -c PATH, --config PATH
                        Configuration file (default config will be used if not specified)
  --organism STR        Organism name
  --strain STR          Strain name

Genome settings:
  --complete BOOL       Treat the query as a complete genome. Not required unless you need INSDC submission files. [t|f(=default)]
  --use_original_name BOOL
                        Use original sequence names in a query FASTA file [t|f(=default)]
  --sort_sequence BOOL  Sort sequences by length [t(=default)|f]
  --minimum_length INT  Minimum sequence length (default:200)
  --fix_origin          Rotate/flip the chromosome so that the dnaA gene comes first. (ONLY FOR A FINISHED GENOME)
  --offset INT          Offset from the start codon of the dnaA gene. (for --fix_origin option, default=0)

Locus_tag settings:
  --locus_tag_prefix STR
                        Locus tag prefix (default:LOCUS)
  --step INT            Increment step of locus tag (default:10)
  --use_separate_tags BOOL
                        Use separate tags according to feature types [t(=default)|f]

Workflow options:
  --threshold STR       Thresholds for default database search (format: "pident,q_cov,s_cov,e_value", default: "0,75,75,1e-6")
  --database PATH       Additional reference database to be searched against prior to the default database. (format: db_path[,db_name[,pident,q_cov,s_cov,e_value]])
  --references PATH     Reference file(s) for OrthoSearch. Use semicolons for multiple files, e.g. 'genome1.faa;genome2.gbk'
  --aligner STR         Aligner to use [ghostx(=default)|blastp|diamond]
  --use_prodigal        Use Prodigal to predict CDS instead of MGA
  --use_genemarks2 STR  Use GeneMarkS2 to predict CDS instead of MGA. [auto|bact|arch]
  --use_trnascan STR    Use tRNAscan-SE to predict tRNA instead of Aragorn. [bact|arch]
  --use_rnammer STR     Use RNAmmer to predict rRNA instead of Barrnap. [bact|arch]
  --gcode INT           Genetic code [11(=default),4(=Mycoplasma)]
  --no_func_anno        Disable all functional annotation steps
  --no_hmm              Disable HMMscan
  --no_cdd              Disable CDDsearch
  --no_cds              Disable CDS prediction
  --no_rrna             Disable rRNA prediction
  --no_trna             Disable tRNA prediction
  --no_crispr           Disable CRISPR prediction
  --metagenome          Set options of MGA/Prodigal for metagenome contigs
  --amr                 [Preliminary implementation] Enable AMR/VFG annotation and identification of plasmid-derived contigs
  --gff GFF             [Preliminary implementation] Read GFF to import structural annotation. Ignores --use_original_name, --sort_sequence, --fix_origin.

Genome source modifiers and metadata [advanced]:
  These values are only used to create INSDC submission files and do not affect the annotation result. See documents for more detail.

  --seq_names STR       Sequence names for each sequence (for complete genome)
  --seq_types STR       Sequence types for each sequence (chromosome/plasmid, for complete genome)
  --seq_topologies STR  Sequence topologies for each sequence (linear/circular, for complete genome)
  --additional_modifiers STR
                        Additional modifiers for source features
  --metadata_file PATH  Path to a metadata file (optional for DDBJ submission file)

Run options:
  --cpu INT             Number of CPUs to use
  --use_locustag_as_gene_id
                        Use locustag as gene ID for FASTA and GFF. (Useful when providing DFAST results to other tools such as Roary)
  --dbroot PATH         DB root directory (default:APP_ROOT/db)
  --force               Force overwriting output
  --debug               Run in debug mode (Extra logging and retaining temporary files)
  --show_config         Show pipeline configuration and exit
  --version             Show program version
  -h, --help            Show this help message
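
The --threshold and --references options take several values packed into one string. The sketch below builds and checks those strings in the formats given in the help text above; the function names are hypothetical, and how DFAST applies each threshold internally is not covered here.

```python
THRESHOLD_FIELDS = ("pident", "q_cov", "s_cov", "e_value")


def format_threshold(pident=0, q_cov=75, s_cov=75, e_value="1e-6"):
    """Return a --threshold value in "pident,q_cov,s_cov,e_value" order."""
    return f"{pident},{q_cov},{s_cov},{e_value}"


def parse_threshold(text):
    """Split a --threshold string into named numeric fields."""
    fields = text.split(",")
    if len(fields) != len(THRESHOLD_FIELDS):
        raise ValueError(f"expected 4 comma-separated values, got {len(fields)}")
    return {name: float(value) for name, value in zip(THRESHOLD_FIELDS, fields)}


def format_references(paths):
    """Join reference files with semicolons, as --references expects."""
    return ";".join(str(path) for path in paths)


if __name__ == "__main__":
    print(format_threshold())
    print(format_references(["genome1.faa", "genome2.gbk"]))
```

On a shell command line, quote the --references value, because an unquoted semicolon ends the command.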

Software distribution

DFAST is freely available as open-source under the GPLv3 license (See LICENSE).

This distribution contains the following external programs.

Troubleshooting

How to run DFAST within a Docker container.

The Docker container image is available from Docker Hub (nigyta/dfast_core) and quay.io (biocontainers/dfast).
Use --dbroot to specify the location of the reference data. Download the reference data:

docker run --rm -v PATH/TO/DB:/dfast_db nigyta/dfast_core:latest dfast_file_downloader.py --protein dfast --cdd Cog --hmm TIGR --dbroot /dfast_db

Invoke DFAST:

docker run --rm -v PATH/TO/DB:/dfast_db -v PATH/TO/YOUR/DATA:/data nigyta/dfast_core:latest dfast --genome /data/your_genome.fa --out /data/your_result --dbroot /dfast_db
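
The docker run call above can also be assembled programmatically, which makes it easy to turn the host directories into absolute paths before mounting them. This is a minimal sketch under that assumption; the helper docker_dfast_command and the example directory names are hypothetical.

```python
import os
import shlex

IMAGE = "nigyta/dfast_core:latest"  # image name as used in the commands above


def docker_dfast_command(db_dir, data_dir, genome_name, out_name):
    """Build the docker run call that annotates data_dir/genome_name (hypothetical helper)."""
    return [
        "docker", "run", "--rm",
        # Host directories are made absolute before being mounted.
        "-v", f"{os.path.abspath(db_dir)}:/dfast_db",
        "-v", f"{os.path.abspath(data_dir)}:/data",
        IMAGE,
        "dfast",
        "--genome", f"/data/{genome_name}",
        "--out", f"/data/{out_name}",
        "--dbroot", "/dfast_db",
    ]


if __name__ == "__main__":
    # dfast_db and my_data are placeholder directory names.
    cmd = docker_dfast_command("dfast_db", "my_data", "your_genome.fa", "your_result")
    print(shlex.join(cmd))
```

Because the output directory is inside the mounted /data volume, the results appear in the host data directory after the container exits.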

Experimental work

Annotation of antibiotic resistance genes and virulence factors

Usage

  1. Prepare CARD, VFDB, and PlasmidFinder reference data.

    scripts/dfast_file_downloader.py --plasmidfinder
    scripts/reference_util_for_nucl.py --card --card_version 3.2.9 --vfdb --vfdb_update_date 2024-05-10

    Since VFDB provides only the latest version, the value specified with --vfdb_update_date is used only as a timestamp for the reference data.
    See CARD/download for the latest version of CARD and VFDB/download for the update date of VFDB.

  2. Invoke DFAST with --amr to enable NuclSearch for CARD/VFDB and ContigAnnotation using PlasmidFinder

    dfast -g example/pOXA-48.fa --amr

Citation