Biological sequence clustering tool with dynamic threshold for individual clusters. Suitable for clustering multiple groups of homologous sequences.
Chiu, J.K.H., Ong, R.TH. Clustering biological sequences with dynamic sequence similarity threshold. BMC Bioinformatics 23, 108 (2022). https://doi.org/10.1186/s12859-022-04643-9
ModuleNotFoundError: No module named 'Constants' when running in a Docker container
-b option to specify the type of input sequences (DNA/protein), or leave it to the tool to determine
-n option to disable reverse complement for DNA sequences when estimating their pairwise distances
The input sequence file must be:
A pre-processing workflow consisting of the following three utilities is provided to ensure that the input FASTA sequences conform to the requirements above.
Sequence filtering (_filterseqs.py):
Scan the input sequence file to identify and filter sequences for the following issues:
a. Unidentifiable amino acids/DNA bases (e.g. U as an amino acid or X as a DNA base)
b. Over 5% of the amino acids/DNA bases are ambiguous (e.g. R or N for DNA)
c. Sequence length less than the Mash k-mer size used
Sequence header whitespace replacement (_replace_seq_headerspaces.py):
Replace every whitespace in the FASTA sequence header by an underscore (_) character
RNA to DNA conversion (_rna_todna.py):
Convert RNA sequences into DNA sequences for clustering
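The three pre-processing steps above can be sketched in Python as follows. This is a minimal illustration, not the repository's own scripts: the function names, the DNA alphabets, and the k-mer size passed by the caller are assumptions made for this example, and only the DNA case is shown.

```python
import re

# Assumed alphabets for this sketch: the four standard bases plus the
# IUPAC ambiguity codes. Any other character (e.g. X) is unidentifiable.
DNA_BASES = set("ACGT")
DNA_AMBIGUOUS = set("RYSWKMBDHVN")


def replace_header_spaces(header):
    """Replace every whitespace character in a FASTA header with '_'."""
    return re.sub(r"\s", "_", header)


def rna_to_dna(seq):
    """Convert an RNA sequence to DNA by replacing U with T."""
    return seq.replace("U", "T").replace("u", "t")


def should_filter(seq, kmer_size, max_ambiguous_ratio=0.05):
    """Return True if a DNA sequence fails any of the checks a to c."""
    seq = seq.upper()
    # (c) Empty, or shorter than the Mash k-mer size in use.
    if len(seq) == 0 or len(seq) < kmer_size:
        return True
    # (a) Contains a character that is not a recognised DNA base.
    if any(base not in DNA_BASES | DNA_AMBIGUOUS for base in seq):
        return True
    # (b) More than 5% of the bases are ambiguous.
    n_ambiguous = sum(base in DNA_AMBIGUOUS for base in seq)
    return n_ambiguous / len(seq) > max_ambiguous_ratio
```

For RNA input, `rna_to_dna` would be applied before `should_filter`, since U is not in the assumed DNA alphabet.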
Method 1: Docker
ALFATClust is available as a Docker package, from which a Docker image can be built and then used to create a Docker container as a virtual environment. Details of Docker and its installation can be found in the official Docker documentation.
Step 1: The Docker image can be built via either the repository URL or the local directory:
Option 1: Build with repository URL
The following command builds a Docker image managed by the host Docker engine:
docker build -t <image name> github.com/phglab/ALFATClust
The Docker image built will be named as <image name>.
Option 2: Build locally
The image can also be built after cloning or downloading the ALFATClust repository to a local directory:
docker build -t <image name> <path to local ALFATClust directory>
Step 2: Once the image is built, a Docker container can be created from it:
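docker run -it --mount type=bind,src=<host directory>,dst=<container directory> --name <container name> <image name>
A directory on the host (specified in <host directory>) is bind-mounted into the container at <container directory>, so that files can be shared between the host and the container.
Refer to the Docker documentation for more mounting options.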
Step 3: To start an existing container:
docker start -ai <container name>
Step 4: To exit the container, run the following command in the terminal running it:
exit
Method 2: conda
Machines with Miniconda or Anaconda installed can set up a conda environment to run ALFATClust.
Step 1: Update conda to the latest version, e.g. for Miniconda it can be updated via terminal:
conda update conda
Step 2: Create a new conda environment for ALFATClust execution using the environment file "alfatclust-conda.yml":
conda env create -n <environment name> --file <path to alfatclust-conda.yml>
The conda environment created will be named as <environment name>.
Step 3: Activate the created conda environment by:
conda activate <environment name>
Step 4: To deactivate (exit) the conda environment, run the following command:
conda deactivate
Method 3: Direct execution in host
The source code of ALFATClust is under the directory "main". Simply copy the contents of the "main" folder to a local folder. Users may consider adding the path of this local folder to the PATH variable. Also, make sure the following tools and libraries are properly installed and can be invoked by ALFATClust. The version tested is indicated in parentheses.
Python runtime:
Python packages:
Third-party tool:
Mash [1] can be installed using apt in Ubuntu; an alternative is to download its source code (requires compilation) or binaries. MMseqs2 [2] is used for pre-clustering only. Make sure both tools are included in the system path.
Command
Docker:
alfatclust [optional arguments] -i <sequence file path> -o <cluster file path>
Note: Both full and relative file paths are accepted.
conda/Direct execution (assuming the current working directory is the root directory of ALFATClust):
./alfatclust.py [optional arguments] -i <sequence file path> -o <cluster file path>
Note: When the current working directory is somewhere else, locate "alfatclust.py" using a (full/relative) path instead.
Mandatory arguments
Argument name | Description |
---|---|
-i/--input <sequence file path> | (full/relative) input DNA/protein sequence FASTA file path |
-o/--output <cluster file path> | (full/relative) output sequence cluster file path |
Optional arguments
Argument name | Description [default value] |
---|---|
-e/--evaluate <evaluation file path> | evaluate the clusters and export the evaluation results to (full/relative) <evaluation file path> |
-b/--target [aa/dna/auto] | specify input sequences as protein (aa) / DNA (dna) sequences, or let the tool determine (auto) [auto] |
-l/--lower <lower bound> | set the lower bound of the sequence distance estimate (resolution parameter) to <lower bound> |
-d/--step <step size> | set the step size of the sequence distance estimate range to <step size> |
-p/--precluster | always run pre-clustering |
-k/--kmer <k-mer size> | set the Mash k-mer size parameter to <k-mer size> |
-s/--sketch <sketch size> | set the Mash sketch size parameter to <sketch size> |
-m/--margin <margin> | ignore any Mash distance above 1 - max(\ |
-f/--filter <threshold> | discard a Mash distance when its shared hash ratio is below <threshold> |
-n/--no-reverse | disable reverse complement for DNA sequences during Mash distance estimation |
-t/--thread <no. of threads> | set the number of threads to <no. of threads> |
-S/--seed <seed value> | set the seed value to <seed value> |
-h/--help | show help message and exit |
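When ALFATClust is called from another Python script, the command line can be assembled from the arguments listed above. The following is a minimal sketch under stated assumptions: the helper names `build_alfatclust_command` and `run_alfatclust` are hypothetical, the default script path assumes the conda/direct execution layout, and only a few of the documented options are covered.

```python
import subprocess


def build_alfatclust_command(input_path, output_path, eval_path=None,
                             target=None, threads=None, no_reverse=False,
                             script="./alfatclust.py"):
    """Assemble the argument list for an ALFATClust run (hypothetical helper)."""
    cmd = [script]
    if eval_path is not None:
        cmd += ["-e", eval_path]       # export the cluster evaluation report
    if target is not None:
        if target not in ("aa", "dna", "auto"):
            raise ValueError("target must be 'aa', 'dna' or 'auto'")
        cmd += ["-b", target]          # sequence type
    if threads is not None:
        cmd += ["-t", str(threads)]    # number of threads
    if no_reverse:
        cmd.append("-n")               # disable reverse complement for DNA
    # Mandatory arguments go last, as in the usage lines above.
    cmd += ["-i", input_path, "-o", output_path]
    return cmd


def run_alfatclust(*args, **kwargs):
    """Run ALFATClust and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_alfatclust_command(*args, **kwargs), check=True)
```

For example, `build_alfatclust_command("seqs.fa", "clusters.txt", target="dna", threads=4)` yields the argument list for `./alfatclust.py -b dna -t 4 -i seqs.fa -o clusters.txt`; the file names are placeholders.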
Evaluation report
The evaluation report consists of the following columns:
Column name | Description |
---|---|
Cluster Id | Cluster Id for the non-singleton cluster |
No. of sequences | Number of sequences in the cluster |
Average sequence identity | Cluster average pairwise sequence identity with respect to the center sequence* |
Min. sequence identity | Cluster minimum pairwise sequence identity with respect to the center sequence* |
Center sequence | Representative center sequence selected for cluster |
Sequence for min. sequence identity | Sequence showing the lowest pairwise sequence identity with the center sequence |
*Sequence identity = number of matched nucleotides or amino acids / (alignment length - terminal gaps)
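The sequence identity formula above can be illustrated with a short Python sketch. It assumes the pairwise alignment is given as two gapped strings of equal length with "-" as the gap character, and it treats "terminal gaps" as the leading and trailing alignment columns that contain a gap; both are assumptions for this example rather than a description of how ALFATClust computes the value internally.

```python
def sequence_identity(aligned_a, aligned_b, gap="-"):
    """Matches / (alignment length - terminal gap columns) for one alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = list(zip(aligned_a.upper(), aligned_b.upper()))

    # Skip leading columns in which either sequence has a gap.
    start = 0
    while start < len(columns) and gap in columns[start]:
        start += 1
    # Skip trailing columns in which either sequence has a gap.
    end = len(columns)
    while end > start and gap in columns[end - 1]:
        end -= 1

    core = columns[start:end]
    if not core:
        return 0.0
    # Internal gap columns stay in the denominator but never count as matches.
    matches = sum(a == b and a != gap for a, b in core)
    return matches / len(core)
```

For the alignment `--ACGTAC-` against `TTACG-ACG`, three terminal gap columns are excluded, leaving six columns of which five match.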
Configuration file
The configuration file "settings.cfg" is located under directory "/usr/local/bin/phglab/alfatclust" inside the Docker container, or under the same host directory as the main Python script "alfatclust.py". It consists of the default values for the following parameters organized into various categories:
EstimatedSimilarity
Threshold
DNAMash
ProteinMash
NoiseFilter
DNAEvaluation (for sequence alignment during DNA cluster evaluation)
ProteinEvaluation (for sequence alignment during protein cluster evaluation)
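To inspect the current defaults, the configuration file can be loaded programmatically. The sketch below assumes "settings.cfg" uses INI-style syntax with one `[section]` per category listed above, which is not confirmed here; the function name `load_settings` is hypothetical.

```python
import configparser


def load_settings(path):
    """Read an INI-style settings file into {section: {parameter: value}}."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep parameter names case-sensitive
    # read_file raises an error if the file is missing, unlike parser.read().
    with open(path) as handle:
        parser.read_file(handle)
    return {section: dict(parser[section]) for section in parser.sections()}
```

All values are returned as strings, so numeric parameters need to be converted by the caller, for example with `float()`.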
The sample datasets are available in the folder sample_datasets, which includes:
Antimicrobial resistance (AMR) gene datasets (data sources: ARG-ANNOT [3], CARD [4-6], and ResFinder [7])
argdit_nt_06feb2020_full.fa (DNA)
argdit_aa_06feb2020_full.fa (protein)
Non-AMR plasmid gene dataset (data source: PLSDB [8])
plasmid_genes_20191017.fa (DNA)
[1] Ondov, B. D., et al. (2016). Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1), 132.
[2] Steinegger, M. and J. Söding. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35, 1026.
[3] Gupta, S. K., et al. (2014). ARG-ANNOT, a New Bioinformatic Tool To Discover Antibiotic Resistance Genes in Bacterial Genomes. Antimicrobial Agents and Chemotherapy, 58(1), 212-220.
[4] Alcock, B. P., et al. (2019). CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Research, 48(D1), D517-D525.
[5] Jia, B., et al. (2017). CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Research, 45(D1), D566-D573.
[6] McArthur, A. G., et al. (2013). The Comprehensive Antibiotic Resistance Database. Antimicrobial Agents and Chemotherapy, 57(7), 3348-3357.
[7] Zankari, E., et al. (2012). Identification of acquired antimicrobial resistance genes. Journal of Antimicrobial Chemotherapy, 67(11), 2640-2644.
[8] Galata, V., et al. (2018). PLSDB: a resource of complete bacterial plasmids. Nucleic Acids Research, 47(D1), D195-D202.