Progressive algorithm for large-scale multiple sequence alignments.
-medoidtree
switch) for ultra-scale alignments:
-gt nj
option). NOTE: Neighbour joining trees are calculated with a use of original O(N3) algorithm, thus their applicability on large sets is limited (unless they are used as subtrees with Medoid Tree heuristic).-gz
switch).-keep-duplicates
switch.git clone https://github.com/refresh-bio/FAMSA --recursive
cd FAMSA && make
# align sequences with default parameters (single linkage tree)
./famsa ./test/adeno_fiber/adeno_fiber sl.aln
# align sequences using UPGMA tree with 8 computing threads, store the result in a gzip archive
./famsa -gt upgma -t 8 -gz ./test/adeno_fiber/adeno_fiber upgma.aln.gz
# export a neighbour joining guide tree to the Newick format
./famsa -gt nj -gt_export ./test/adeno_fiber/adeno_fiber nj.dnd
# align sequences with the previously generated guide tree
./famsa -gt import nj.dnd ./test/adeno_fiber/adeno_fiber nj.aln
# align sequences with an approximated medoid guide tree and UPGMA subtrees
./famsa -medoidtree -gt upgma ./test/hemopexin/hemopexin upgma.medoid.aln
# export a distance matrix to the CSV format (lower triangular)
./famsa -dist_export ./test/adeno_fiber/adeno_fiber dist.csv
# export a pairwise identity (PID) matrix to the CSV format (square)
./famsa -dist_export -pid -square_matrix ./test/adeno_fiber/adeno_fiber pid.csv
# profile-profile alignment without refining output
./famsa -refine_mode off ./test/adeno_fiber/upgma.no_refine.part1.fasta ./test/adeno_fiber/upgma.no_refine.part2.fasta pp.fasta
FAMSA comes with a set of precompiled binaries for Windows, Linux, and macOS. They can be found under Releases tab. The software is also available on Bioconda:
conda install -c bioconda famsa
For detailed instructions how to set up Bioconda, please refer to the Bioconda manual. A user-friendly PyFAMSA module authored by Martin Larralde allows running analyzes directly from Python. Finally, FAMSA can be built from the sources distributed as:
FAMSA can be built for x86-64 and ARM64 8 architectures (including Apple M1 based on ARM64 8.4 core) and takes advantage of AVX2 (x86-64) and NEON (ARM) CPU extensions. The default target platform is x86-64 with AVX2 extensions. This, however, can be changed by setting PLATFORM
variable for make
:
make PLATFORM=none # unspecified platform, no extensions
make PLATFORM=sse4 # x86-64 with SSE4.1
make PLATFORM=avx # x86-64 with AVX
make PLATFORM=avx2 # x86-64 with AVX2 (default)
make PLATFORM=native # x86-64 with AVX2 and native architecture
make PLATFORM=arm8 # ARM64 8 with NEON
make PLATFORM=m1 # ARM64 8.4 (especially Apple M1) with NEON
Note, that x86-64 binaries determine the supported extensions at runtime, which makes them backwards-compatible. For instance, the AVX executable will also work on SSE-only platform, but with limited performance. An additional make
option can be used to force static linking (may be helpful when binary portability is desired): make STATIC_LINK=true
The latest speed improvements in FAMSA limited the usefullness of the GPU mode. Thus, starting from the 1.5.0 version, there is no support of GPU in FAMSA. If maximum throughput is required, we encourage using new medoid trees feature (-medoidtree
switch) which allows processing gigantic data sets in short time (e.g., the familiy of 3 million ABC transporters was analyzed in five minutes).
famsa [options] <input_file> [<input_file_2>] <output_file>
Positional parameters:
input_file
, input_file_2
- input files in FASTA format (first input can be replaced with STDIN string to read from standard input); action depends on the number of input files:
output_file
- output file (pass STDOUT when writing to standard output); available outputs:
-gt_export
option specified),-dist_export
option specified).Options:
-help
- show advanced options
-t <value>
- no. of threads, 0 means all available (default: 0)
-v
- verbose mode, show timing information (default: disabled)
-gt <sl | upgma | nj | import <file>>
- the guide tree method (default: sl):
sl
- single linkage,upgma
- UPGMA,nj
- neighbour joining,import <file>
- import from a Newick file.-medoidtree
- use MedoidTree heuristic for speeding up tree construction (default: disabled)
-medoid_threshold <n_seqs>
- if specified, medoid trees are used only for sets with n_seqs
or more
-gt_export
- export a guide tree to output file in the Newick format
-dist_export
- export a distance matrix to output file in CSV format
-square_matrix
- generate a square distance matrix instead of a default triangle
-pid
- calculate percent identity (the number of matching residues divided by the shorter sequence length) instead of distance
-keep-duplicates
- keep duplicated sequences during alignment (default: disabled - duplicates are removed prior and restored after the alignment)
-gz
- enable gzipped output (default: disabled)
-gz-lev <value>
- gzip compression level [0-9] (default: 7)
-refine_mode <on | off | auto>
- refinement mode (default: auto
- the refinement is enabled for sets <= 1000 seq.)
FAMSA has the ability to import/export alignment guide trees in Newick format. E.g., in order to generate a UPGMA tree from the input.fasta file and store it in the tree.dnd file, run:
famsa -gt upgma -gt_export input.fasta tree.dnd
To align the sequences from input.fasta using the tree from tree.dnd and store the result in out.fasta, run:
famsa -gt import tree.dnd input.fasta out.fasta
Below one can find example guide tree file for sequences A, B, and C:
(A:0.1,(B:0.2,C:0.3):0.4);
Note, that when importing the tree, the branch lengths are not taken into account, though they have to be specified in a file for successful parsing. When exporting the tree, all the branches are assigned with length 1, thus only the structure of the tree can be restored (we plan to output real lengths in the future release):
(A:1.0,(B:1.0,C:1.0):1.0);
The major algorithmic features in FAMSA are:
-medoidtree
switch and allow aligning millions of sequences in minutes.The analysis was performed on our extHomFam 2 benchmark produced by combining Homstrad (March 2020) references with Pfam 33.1 families (NCBI variant). The data set was deposited at Zenodo: https://zenodo.org/record/6524237. The following algorithms were investigated:
Name | Version | Command line |
---|---|---|
ClustalΩ | 1.2.4 | clustalo --threads=32 -i <input> -o <output> |
ClustalΩ iter2 | 1.2.4 | clustalo --threads=32 --iter 2 -i <input> -o <output> |
MAFFT PartTree | 7.453 | mafft --thread 32 --anysymbol --quiet --parttree <input> -o <output> |
MAFFT DPPartTree | 7.453 | mafft --thread 32 --anysymbol --quiet --dpparttree <input> -o <output> |
Kalign3 | 3.3.2 | kalign -i <input> -o <output> |
FAMSA | 1.6.2 | famsa -t 32 <input> <output> |
FAMSA 2 | 2.0.1 | famsa -t 32 -gz <input> <output> |
FAMSA 2 Medoid | 2.0.1 | famsa -t 32 -medoidtree -gt upgma -gz <input> <output> |
The tests were performed with 32 computing threads on a machine with AMD Ryzen Threadripper 3990X CPU and 256 GB of RAM. For each extHomFam 2 subset we measured a fraction of properly aligned columns (TC score) as well as a total running time and a maximum memory requirements. The results are presented in the figure below. Notches at boxplots indicate 95% confidence interval for median, triangle represent means. The missing series for some algorithm-set pairs indicate that the running times exceeded a week. Kalign3 failed to process 10 families (5 in second, 3 in fourth, and 2 in the largest subset). FAMSA 2 alignments were stored in gzip format (-gz
switch).
The most important observations are as follows:
Benchmark data sets developed and used in the FAMSA study: