padlocbio / padloc

Locate antiviral defence systems in prokaryotic genomes

PADLOC: Prokaryotic Antiviral Defence LOCator

[!IMPORTANT] PADLOC >v2.0.0 is only compatible with PADLOC-DB >v2.0.0 and vice-versa. After you update PADLOC, make sure to update your database by running: padloc --db-update.

About

PADLOC is a software tool for identifying antiviral defence systems in prokaryotic genomes. PADLOC screens genomes against a database of HMMs and system classifications to find and annotate defence systems based on sequence homology and genetic architecture.

Citation

If you use PADLOC or PADLOC-DB please cite:

Payne, L. J., Todeschini, T. C., Wu, Y., Perry, B. J., Ronson, C. W., Fineran, P. C., Nobrega, F. L., Jackson, S. A. (2021) Identification and classification of antiviral defence systems in bacteria and archaea with PADLOC reveals new system types. Nucleic Acids Research, 49, 10868-10878. doi: https://doi.org/10.1093/nar/gkab883

If you use the PADLOC web server please additionally cite:

Payne, L. J., Meaden, S., Mestre, M. R., Palmer, C., Toro, N., Fineran, P. C., Jackson, S. A. (2022) PADLOC: a web server for the identification of antiviral defence systems in microbial genomes. Nucleic Acids Research, 50, W541-W550. doi: https://doi.org/10.1093/nar/gkac400

The HMMs and system models in PADLOC-DB were built and curated using the data and conclusions from many different sources. We encourage you to also give credit to these groups by reading their work and citing them where appropriate. References to relevant literature can be found at the PADLOC-DB repository.

Installation

Conda

It is recommended that PADLOC be installed via conda.

# Install PADLOC into a new conda environment
conda create -n padloc -c conda-forge -c bioconda -c padlocbio padloc=2.0.0
# Activate the environment
conda activate padloc
# Download the latest database
padloc --db-update
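After installing, it can help to confirm that the executable, its dependencies, and the database are all in place before running a search. The sketch below chains three flags listed under Options. The function name padloc_check is hypothetical, and the sketch assumes each command exits with a non-zero status on failure, which has not been verified here.

```shell
# Hypothetical helper: run PADLOC's own self-checks in sequence.
# Stops at the first command that returns a non-zero exit status
# (assumed, not confirmed, to signal a problem).
padloc_check() {
  if ! command -v padloc >/dev/null 2>&1; then
    echo "padloc not found on PATH; is the conda environment active?" >&2
    return 1
  fi
  padloc --version &&
    padloc --check-deps &&
    padloc --db-version
}
```

Call padloc_check after conda activate padloc; a non-zero return value suggests that one of the three steps needs attention.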

If you're having installation issues, refer to Issue #35.

Examples

# BASIC: Search an amino acid fasta file with accompanying GFF annotations
padloc --faa genome.faa --gff features.gff
# INTERMEDIATE: Use multiple cpus and save output to a different directory
padloc --faa genome.faa --gff features.gff --outdir path_to_output --cpu 4
# ADVANCED: Supply ncRNA and CRISPR array data
padloc --faa genome.faa --gff features.gff --ncrna genome.ncrna --crispr genome.crispr

[!NOTE] Refer to padloc/etc/README.md for instructions on pre-computing ncRNA and CRISPR array data.
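To screen many genomes, the basic command can be wrapped in a loop. This is a minimal sketch and not part of PADLOC: the function name padloc_batch, the name.faa / name.gff pairing convention, and the one-subdirectory-per-genome output layout are assumptions made for illustration. Only flags shown in the examples above are used.

```shell
# Hypothetical helper: run PADLOC on every <name>.faa / <name>.gff pair
# found in a directory, writing each genome's results to its own folder.
# Usage: padloc_batch <input_dir> <output_dir> [cpus]
padloc_batch() {
  local indir="$1"
  local outdir="$2"
  local cpus="${3:-1}"
  local faa gff name
  for faa in "$indir"/*.faa; do
    [ -e "$faa" ] || continue            # no .faa files: skip the literal glob
    name=$(basename "$faa" .faa)
    gff="$indir/$name.gff"
    if [ ! -f "$gff" ]; then
      echo "skipping $name: no matching GFF file" >&2
      continue
    fi
    mkdir -p "$outdir/$name"             # harmless if the directory exists
    padloc --faa "$faa" --gff "$gff" --outdir "$outdir/$name" --cpu "$cpus"
  done
}
```

For example, padloc_batch genomes results 4 would process each pair under genomes/ with 4 CPUs. Genomes without a matching GFF file are reported on stderr and skipped rather than stopping the run.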

Test

# Try running PADLOC on the test data provided
padloc --faa padloc/test/GCF_001688665.2.faa --gff padloc/test/GCF_001688665.2.gff
padloc --fna padloc/test/GCF_004358345.1.fna

Options

General:
  --help            Print this help message
  --version         Print version information
  --citation        Print citation information
  --check-deps      Check that dependencies are installed
  --debug           Run with debug messages
Database:
  --db-list         List all PADLOC-DB releases
  --db-install [n]  Install specific PADLOC-DB release [n]
  --db-update       Install latest PADLOC-DB release
  --db-version      Print database version information
Input:
  --faa [f]         Amino acid FASTA file (only valid with [--gff])
  --gff [f]         GFF file (only valid with [--faa])
  --fna [f]         Nucleic acid FASTA file
  --crispr [f]      CRISPRDetect output file containing array data
  --ncrna [f]       Infernal output file containing ncRNA data
Output:
  --outdir [d]      Output directory
Optional:
  --data [d]        Data directory
  --cpu [n]         Use [n] CPUs (default '1')
  --fix-prodigal    Set this flag when providing an FAA and GFF file
                    generated with prodigal to force fixing of sequence IDs

Output

Extension      Description
.domtblout     Domain table file generated by HMMER.
_prodigal.faa  Amino acid FASTA file generated by prodigal.
_prodigal.gff  GFF annotation file generated by prodigal.
_padloc.csv    PADLOC output file for identified defence systems.
_padloc.gff    GFF annotation file for identified defence systems.
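When several genomes have been processed, the per-genome _padloc.csv files can be combined into one table. The sketch below is an illustration rather than a PADLOC feature: the function name merge_padloc_csv and the added source.file column are made up here, and it assumes every CSV has the same header row and contains no line breaks inside quoted fields.

```shell
# Hypothetical helper: print all *_padloc.csv files under a directory as a
# single CSV, keeping one header row and prefixing each data row with the
# name of the file it came from.
# Usage: merge_padloc_csv <output_dir> > combined.csv
merge_padloc_csv() {
  find "$1" -type f -name '*_padloc.csv' | sort | {
    header_done=0
    while IFS= read -r f; do
      awk -v src="$(basename "$f")" -v skip="$header_done" '
        NR == 1 { if (skip + 0 == 0) print "source.file," $0; next }
        { print src "," $0 }
      ' "$f"
      header_done=1
    done
  }
}
```

The header is taken from the first file in sorted path order; files with a different column layout would be merged silently, so check the inputs if they come from different PADLOC versions.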

Interpreting Output

Column              Description
system.number       Distinct system number.
seqid               Sequence ID of the contig.
system              Name of the system identified.
target.name         Protein ID.
hmm.accession       PADLOC HMM accession number.
hmm.name            PADLOC HMM name.
protein.name        Defence system protein name.
full.seq.E.value    Full sequence E-value. From the HMMER Documentation: "The E-value is a measure of statistical significance. The lower the E-value, the more significant the hit."
domain.iE.value     Domain E-value. From the HMMER Documentation: "If the full sequence E-value is significant but the single best domain E-value is not, the target sequence is probably a multidomain remote homolog".
target.coverage     Fraction of the target sequence aligning to the HMM.
hmm.coverage        Fraction of the HMM aligning to the target sequence.
start               Start position of the target sequence in the contig.
end                 End position of the target sequence in the contig.
strand              Strand; forward (+) or reverse (-).
target.description  Target sequence description taken from the input file.
relative.position   Relative position of the target sequence in the contig.
contig.end          Relative position of the last sequence in the contig.
all.domains         Concatenated list of all domains identified with HMMER.
best.hits           Top 5 hits identified with HMMER.
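Because each row of the CSV describes one protein, a system with several proteins spans several rows that share a system.number. A quick per-type tally can be sketched with awk as below. The function name count_systems is hypothetical, and the sketch assumes that the header uses the unquoted column names listed above, that system.number is unique within a file, and that no field before the system column contains a comma.

```shell
# Hypothetical helper: count distinct systems of each type in a PADLOC CSV.
# Columns are located by header name, not position.
# Usage: count_systems genome_padloc.csv
count_systems() {
  awk -F',' '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("system" in col) || !("system.number" in col)) {
        print "expected columns not found in header" > "/dev/stderr"
        exit 1
      }
      next
    }
    # Count each system.number once, under its system name
    !seen[$(col["system.number"])]++ { count[$(col["system"])]++ }
    END { for (s in count) print s "\t" count[s] }
  ' "$1" | sort
}
```

The output is one tab-separated line per system type, sorted by name. Quoted fields containing commas (possible in target.description) come after the columns used here, so they should not affect the tally under the assumptions stated.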

PADLOC-DB

The HMMs and defence system models used by PADLOC are available from the PADLOC-DB repository. The latest version of the database can be downloaded by running padloc --db-update. Alternatively, a custom database can be specified with --data. Refer to PADLOC-DB for more information about the database.

FAQ

Issues

Bugs and feature requests can be submitted to the Issues tab (see Sample bug report).

Dependencies

All dependencies are installed automatically when PADLOC is installed via conda.

License

This software and data are available as open source under the terms of the MIT License.