phglab / ALFATClust

GNU General Public License v3.0

ALFATClust - ALignment-Free Adaptive Threshold Clustering

Overview

Biological sequence clustering tool with dynamic threshold for individual clusters. Suitable for clustering multiple groups of homologous sequences.

Citation

Chiu, J.K.H., Ong, R.TH. Clustering biological sequences with dynamic sequence similarity threshold. BMC Bioinformatics 23, 108 (2022). https://doi.org/10.1186/s12859-022-04643-9

Release update

Sequence file requirements

The input sequence file must:

  1. Contain either DNA or protein sequences
  2. Be in FASTA format
  3. Use FASTA sequence headers that contain at most one whitespace character and have no leading or trailing whitespace
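
The header requirement can be checked before running the tool. The following is a minimal sketch, not part of ALFATClust: the function names are hypothetical, and it assumes "at most one whitespace" means at most one whitespace character in the header text after the ">" symbol.

```python
import re


def header_problems(header: str) -> list:
    """Return the requirement violations for one FASTA header (text after '>')."""
    problems = []
    # Leading or trailing whitespace is not allowed.
    if header != header.strip():
        problems.append("flanking whitespace")
    # Count whitespace characters inside the header, ignoring the flanks.
    if len(re.findall(r"\s", header.strip())) > 1:
        problems.append("more than one whitespace character")
    return problems


def check_fasta_headers(lines):
    """Yield (line_number, header, problems) for each non-conforming header."""
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            problems = header_problems(line[1:])
            if problems:
                yield line_no, line[1:], problems
```

To scan a file, pass an open file handle, e.g. `list(check_fasta_headers(open("input.fa")))`, where `input.fa` is a placeholder path.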

Pre-processing of sequence file

A pre-processing workflow consisting of the following three utilities is provided here to ensure the input FASTA sequences conform to the requirements above.

  1. Sequence filtering (filter_seqs.py):
    Scan the input sequence file to identify and filter sequences for the following issues:

    a. Unidentifiable amino acids/DNA bases (e.g. U as an amino acid or X as a DNA base)
    b. Over 5% of the amino acids/DNA bases are ambiguous (e.g. R or N for DNA)
    c. Sequence length less than the Mash k-mer size used

  2. Sequence header whitespace replacement (replace_seq_header_spaces.py):
    Replace every whitespace character in the FASTA sequence header with an underscore (_)

  3. RNA to DNA conversion (rna_to_dna.py):
    Convert RNA sequences into DNA sequences for clustering
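
The three steps can be illustrated for DNA input as below. This is a sketch under stated assumptions, not the code of the provided utilities: the function names are hypothetical, the ambiguous set is assumed to be the IUPAC nucleotide ambiguity codes, and the k-mer size defaults to the DNA value of 17 listed under the optional arguments.

```python
import re

# Assumed alphabets for illustration; the actual utilities may use different sets.
DNA_BASES = set("ACGT")
DNA_AMBIGUOUS = set("RYSWKMBDHVN")  # IUPAC ambiguity codes


def keep_dna_sequence(seq: str, kmer_size: int = 17, max_ambiguous: float = 0.05) -> bool:
    """Return True when a DNA sequence passes filtering criteria (a) to (c)."""
    seq = seq.upper()
    # (c) Shorter than the Mash k-mer size.
    if len(seq) < kmer_size:
        return False
    # (a) Contains a symbol that is neither a base nor an ambiguity code (e.g. X).
    if any(ch not in DNA_BASES and ch not in DNA_AMBIGUOUS for ch in seq):
        return False
    # (b) More than 5% of the bases are ambiguous.
    n_ambiguous = sum(ch in DNA_AMBIGUOUS for ch in seq)
    return n_ambiguous / len(seq) <= max_ambiguous


def replace_header_spaces(header: str) -> str:
    """Replace every whitespace character in a header with an underscore."""
    return re.sub(r"\s", "_", header)


def rna_to_dna(seq: str) -> str:
    """Convert an RNA sequence to DNA by replacing uracil with thymine."""
    return seq.replace("U", "T").replace("u", "t")
```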

Installation

Mash [1] can be installed using apt on Ubuntu; alternatively, build it from its source code (requires compilation) or use its prebuilt binaries. MMseqs2 [2] is used for pre-clustering only. Make sure both executables are included in the system path.
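
A quick way to confirm that both dependencies are reachable is to look them up on the system path. The sketch below is not part of ALFATClust; it assumes the executables carry their usual names, `mash` and `mmseqs`, and the function name is hypothetical.

```python
import shutil


def find_missing_tools(tools=("mash", "mmseqs")):
    """Return the names in `tools` that cannot be found on the system path."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Mash and MMseqs2 executables found on PATH")
```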

Usage

Command

Mandatory arguments

| Argument name | Description |
| --- | --- |
| -i/--input <input> | (full/relative) input DNA/protein sequence FASTA file path |
| -o/--output <output> | (full/relative) output sequence cluster file path |

Optional arguments

| Argument name | Description [default value] |
| --- | --- |
| -e/--evaluate <evaluate> | evaluate the clusters and export the evaluation results to the (full/relative) file path <evaluate> |
| -b/--target [aa/dna/auto] | specify input sequences as protein (aa) / DNA (dna) sequences, or let the tool determine the type (auto) [auto] |
| -l/--lower <lower> | set the lower bound of the sequence distance estimate (resolution parameter) to <lower> [0.75] |
| -d/--step <step> | set the step size of the sequence distance estimate range to <step> [0.025] |
| -p/--precluster | always run pre-clustering |
| -k/--kmer <kmer> | set the Mash k-mer size parameter to <kmer> [DNA: 17; protein: 9] |
| -s/--sketch <sketch> | set the Mash sketch size parameter to <sketch> [2000] |
| -m/--margin <margin> | ignore any Mash distance above 1 - max(<lower> - <margin>, 0) [0.2] |
| -f/--filter <filter> | discard a Mash distance when its shared hash ratio is below <filter>, NOT recommended |
| -n/--no-reverse | disable reverse complement for DNA sequences during Mash distance estimation |
| -t/--thread <thread> | set the number of threads to <thread> (for Mash and cluster evaluation only) [all available CPU cores] |
| -S/--seed <seed> | set the seed value to <seed> |
| -h/--help | show help message and exit |
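
As an illustration of how the arguments combine, the sketch below assembles an argument list from the short options listed above and works through the -m/--margin cutoff. It makes several assumptions: the helper names are hypothetical, the way the main script "alfatclust.py" is launched may differ on your system, and the two operands in the margin formula are read as the -l and -m values.

```python
def build_command(input_path, output_path, evaluate_path=None, lower=None,
                  threads=None, script="alfatclust.py"):
    """Assemble an argument list suitable for subprocess.run()."""
    cmd = [script, "-i", str(input_path), "-o", str(output_path)]
    if evaluate_path is not None:
        cmd += ["-e", str(evaluate_path)]  # export the evaluation report
    if lower is not None:
        cmd += ["-l", str(lower)]          # lower bound (resolution parameter)
    if threads is not None:
        cmd += ["-t", str(threads)]        # threads for Mash and evaluation
    return cmd


def mash_distance_cutoff(lower=0.75, margin=0.2):
    """Mash distances above this value are ignored (assumed reading of -m)."""
    return 1 - max(lower - margin, 0)
```

Under this reading, the default values (0.75 and 0.2) give a cutoff of about 0.45, and the `max(..., 0)` term keeps the cutoff from exceeding 1 when the margin is larger than the lower bound.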

Evaluation report

The evaluation report consists of the following columns:

| Column name | Description |
| --- | --- |
| Cluster Id | Cluster Id for the non-singleton cluster |
| No. of sequences | Number of sequences in the cluster |
| Average sequence identity | Cluster average pairwise sequence identity with respect to the center sequence* |
| Min. sequence identity | Cluster minimum pairwise sequence identity with respect to the center sequence* |
| Center sequence | Representative center sequence selected for cluster |
| Sequence for min. sequence identity | Sequence showing the lowest pairwise sequence identity with the center sequence |

*Sequence identity = number of matched nucleotides or amino acids / (alignment length - terminal gaps)
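
The identity formula can be made concrete with a small function that operates on an existing pairwise alignment. This is a sketch of one reading of the formula, not the tool's evaluation code: the function name is hypothetical, gaps are assumed to be written as "-", and a terminal gap is taken to be any leading or trailing alignment column in which either sequence has a gap.

```python
def sequence_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Matches / (alignment length - terminal gaps) for two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    length = len(aligned_a)
    # Skip leading columns that contain a gap in either sequence.
    start = 0
    while start < length and gap in (aligned_a[start], aligned_b[start]):
        start += 1
    # Skip trailing columns that contain a gap in either sequence.
    end = length
    while end > start and gap in (aligned_a[end - 1], aligned_b[end - 1]):
        end -= 1
    if end == start:
        return 0.0  # nothing left once terminal gaps are removed
    # Internal gap columns stay in the denominator but never count as matches.
    matches = sum(a == b and a != gap
                  for a, b in zip(aligned_a[start:end], aligned_b[start:end]))
    return matches / (end - start)
```

For example, aligning "--ACGTAC--" against "TTACGAACGG" leaves six columns after the terminal gaps are dropped, five of which match.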

Configuration file

The configuration file "settings.cfg" is located in the directory "/usr/local/bin/phglab/alfatclust" inside the Docker container, or in the same host directory as the main Python script "alfatclust.py". It holds the default parameter values, organized into categories.

Sample datasets

The sample datasets are available in folder sample_datasets, which includes:

  1. Antimicrobial resistance (AMR) gene datasets (data sources: ARG-ANNOT [3], CARD [4-6], and ResFinder [7]):
    argdit_nt_06feb2020_full.fa (DNA)
    argdit_aa_06feb2020_full.fa (protein)

  2. Non-AMR plasmid gene dataset (data source: PLSDB [8]):
    plasmid_genes_20191017.fa (DNA)

References

[1] Ondov, B. D., et al. (2016). Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1), 132.
[2] Steinegger, M. and J. Söding. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35, 1026.
[3] Gupta, S. K., et al. (2014). ARG-ANNOT, a New Bioinformatic Tool To Discover Antibiotic Resistance Genes in Bacterial Genomes. Antimicrobial Agents and Chemotherapy, 58(1), 212-220.
[4] Alcock, B. P., et al. (2019). CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Research, 48(D1), D517-D525.
[5] Jia, B., et al. (2017). CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Research, 45(D1), D566-D573.
[6] McArthur, A. G., et al. (2013). The Comprehensive Antibiotic Resistance Database. Antimicrobial Agents and Chemotherapy, 57(7), 3348-3357.
[7] Zankari, E., et al. (2012). Identification of acquired antimicrobial resistance genes. Journal of Antimicrobial Chemotherapy, 67(11), 2640-2644.
[8] Galata, V., et al. (2018). PLSDB: a resource of complete bacterial plasmids. Nucleic Acids Research, 47(D1), D195-D202.