CentroidFold is one of the most accurate tools for predicting RNA secondary structures. It is based on the gamma-centroid estimator (Hamada et.al., 2009) for high-dimensional discrete spaces. Generally, a gamma-centroid estimator is slightly more accurate than an MEA estimator (Do et.al., 2005) under the same probability distribution.
CentroidAlifold can predict common secondary structures for multiple alignments of RNA sequences by using an averaged gamma-centroid estimator (Hamada et al., 2010).
CentroidFold can employ various probability distributions. Currently,
According to our benchmark, CentroidFold with the McCaskill model using Boltzmann likelihood parameters (Andronescue et al., 2010) will predict the most accurate RNA secondary structures among currently available prediction tools at this time.
./configure && make && make install
If the Vienna RNA package has been installed at the non-standard directory, specify the location like:
./configure --with-vienna-rna=/path/to/vienna-rna
mkdir build && cd build
cmake .. && make
docker build . -t centroid_fold
docker run --rm -it -v $(pwd):/workspaces centroid_fold centroid_fold -g -1 RF00008_B.fa
centroid_fold
can take the FASTA format as an input sequence,
then predict its secondary structure.
centroid_alifold
can take the CLUSTAL format as an input
aligment, the predict its common secondary structure.
centroid_fold
is the main program of this package. It calculates
a base-pairing probability matrix for a given sequence with one of
several algorithms including the CONTRAfold model and the McCaskill
algorithm using pf_fold
routine in Vienna RNA package. Then,
centroid_fold
estimates a gamma-centroid estimator for the
base-pairing probability matrix.
centroid_fold [options] seq
Options:
-e [ --engine ] arg specify the inference engine (default: "McCaskill")
-g [ --gamma ] arg weight of base pairs
--noncanonical allow non-canonical base-pairs
-C [ --constraints ] use structure constraints
--postscript arg draw predicted secondary structures with the
postscript (PS) format
By -e
or --engine
option, you can select the inference engine for
calculating base-pairing probabilities from the CONTRAfold model
("CONTRAfold"), the McCaskill model ("McCaskill" if available) and
the pfold model ("pfold" if available). If you use the pfold model,
several environment variables should be set: PFOLD_BIN_DIR
to the
directory which contains pfold binaries, and AWK_BIN
and SED_BIN
to awk and sed program available on your system, respectively.
If a negative value is given for the option --gamma
,
centroid_fold
calculates secondary structures for several values
of gamma at the same time ({2^k | -5 <= k <= 10} and 6).
For long sequences, you can use -d
options to restrict the maximum
distance of base-pairs. This can reduce the computation time and the
memory requirement: O(L^3) to O(LW^2), and O(L^2) to O(LW),
respectively, where L is length of a sequence and W is the specified
maximum distance of base-pairs. This option is available only for the
CONTRAfold model.
If a part of secondary structure for a given sequence is known, you can specify it by a modified FASTA format like:
> RF00008_B
CAAAAGUCUGGGCUAAGCCCACUGAUGAGCCGCUGAAAUGCGGCGAAACUUUUG
.(((((........???????????????????????...........))))).
and run centroid_fold
with -C
options. The positions at '('
and ')' are restricted to base-pairs, the position at '.' is
restricted to unpaired bases, and the position at '?' is
unrestricted.
If --sampling
option is given, centroid_fold
uses the
stochastic traceback algorithm instead of the McCaskill's base-pairing
probability matrix. Like Sfold (Ding et al., 2005), build clusters of
secondary structures, and then compute their centroids. The number of
clusters can be specified by --max-clusters
option.
Example:
% centroid_fold -g -1 RF00008_B.fa
> RF00008_B
CAAAAGUCUGGGCUAAGCCCACUGAUGAGCCGCUGAAAUGCGGCGAAACUUUUG
.(((((...(((.....)))........(((((......)))))....))))). (g=0.03125,th=0.969697)
.(((((...(((.....)))........(((((......)))))....))))). (g=0.0625,th=0.941176)
((((((...(((.....)))........(((((......)))))....)))))) (g=0.125,th=0.888889)
((((((...((((...))))........(((((......)))))....)))))) (g=0.25,th=0.8)
(((((((..((((...))))........(((((......)))))...))))))) (g=0.5,th=0.666667)
(((((((.(((((...))))).......(((((......)))))...))))))) (g=1,th=0.5)
(((((((.(((((...))))).......(((((......)))))...))))))) (g=2,th=0.333333)
(((((((.(((((...))))).......(((((......)))))...))))))) (g=4,th=0.2)
(((((((.(((((...))))).......(((((......)))))...))))))) (g=6,th=0.142857)
(((((((.(((((...))))).......(((((......)))))...))))))) (g=8,th=0.111111)
(((((((((((((...))))).......(((((......))))))..))))))) (g=16,th=0.0588235)
(((((((((((((...)))))..).(..(((((......)))))..)))))))) (g=32,th=0.030303)
(((((((((((((...)))))..).(..(((((......)))))..)))))))) (g=64,th=0.0153846)
(((((((((((((...)))))..).(..(((((......)))))..)))))))) (g=128,th=0.00775194)
(((((((((((((...)))))..).(..(((((......)))))..)))))))) (g=256,th=0.00389105)
(((((((((((((...)))))..).(..(((((......)))))..)))))))) (g=512,th=0.00194932)
(((((((((((((...)))))..).(..(((((......)))))..)))))))) (g=1024,th=0.00097561)
For the CLUSTAL format, centroid_alifold
predicts common
secondary structures for the given multiple alignments.
centroid_alifold [options] seq
Options:
-h [ --help ] show this message
-e [ --engine ] arg specify the inference engine (default: "McCaskill &
Alifold")
-w [ --mixture ] arg mixture weights of inference engines
-g [ --gamma ] arg weight of base pairs
--noncanonical allow non-canonical base-pairs
-C [ --constraints ] use structure constraints
--postscript arg draw predicted secondary structures with the
postscript (PS) format
By -e
or --engine
option, you can select the inference engine for
calculating base-pairing probabilities from the CONTRAfold model
("CONTRAfold"), the McCaskill model ("McCaskill" if available),
the RNAalifold model ("Alifold" if available) and the pfold model
("pfold" if avilable).
If you specify the inference engines multiply, centroid_alifold
employs a mixtured baes-pairing probability matrix. The mixture
weight can be set by -w
or --mixture
option. The default
setting of centroid_alifold
is -e McCaskill -w 1.0 -e Alifold -w 1.0
. See more detail in (Hamada et al., 2010).
Example:
% centroid_alifold -g -1 RF00436.aln
>AB029447-1/1210-1265
--BCAHuUGYAVgUCGCUUUGGAYAaaAG--CGUCUGCUAAAUGM-VURwrukKAAAUDu-
............................................................. (g=0.03125,th=0.969697)
............................................................. (g=0.0625,th=0.941176)
............................................................. (g=0.125,th=0.888889)
............................................................. (g=0.25,th=0.8)
...............(((.........))..)............................. (g=0.5,th=0.666667)
..............(((((.......)))..))............................ (g=1,th=0.5)
............(.((((((.....))))..)).).......................... (g=2,th=0.333333)
........((..(.((((((.....))))..)).).))....................... (g=4,th=0.2)
...((((.((.((.((((((.....))))..)).))))..))))................. (g=6,th=0.142857)
...(((((((.((.((((((.....))))..)).)))).)))))................. (g=8,th=0.111111)
..((((((((.((.((((((.....))))..)).)))).))))))................ (g=16,th=0.0588235)
..((((((((.((.((((((.....))))..)).)))).))))))((.....))....... (g=32,th=0.030303)
..((((((((.((.((((((.....))))..)).)))).))))))((.....))....... (g=64,th=0.0153846)
..((((((((.((.((((((.....))))..)).)))).))))))((.....))....... (g=128,th=0.00775194)
..((((((((.((.((((((.....))))..)).)))).))))))(((...)))....... (g=256,th=0.00389105)
.(((((((((.((.((((((.....))))..)).)))).)))))).).((((....)))). (g=512,th=0.00194932)
.(((((((((.((.((((((.....))))..)).)))).)))))).)(((((....))))) (g=1024,th=0.00097561)
The first line of the result is the description of the first sequence in the given alignment. The second is the "most informative sequence" (Freyhult et al., 2005), which is similar to IUPAC ambiguity characters, produced by a library routine of the Vienna Package.
If homologous sequences to the target sequence are available,
centroid_homfold
can predict secondary structures for the target
sequence more accurately than centroid_fold
using homologous
sequence information with the probabilistic consistency transformation
for base-pairing probabilities (Hamada et al, 2009).
centroid_homfold [options] seq
Options:
-H [ --homologous ] arg fasta file containing homologous sequences (REQUIRED)
--engine_s arg specify the inference engine for secondary structures
(default: "McCaskill")
--engine_a arg specify the inference engine for pairwise alignments
(default: "CONTRAlign")
-g [ --gamma ] arg weight of base pairs
--postscript arg draw predicted secondary structures with the
postscript (PS) format
You should give homologous sequences to the target sequence in the
FASTA format by -H
option.
Example:
% centroid_homfold -g -1 -H RF00005.fa seq.fa
>X12857.1/421-494
GCGGAUGUAGCCAAGUGGAUCAAGGCAGUGGAUUGUGAAUCCACCAUGCGCGGGUUCAAUUCCCGUCAUUCGCC
.......................................................................... (g=0.03125,th=0.969697,e=0)
.......................................................................... (g=0.0625,th=0.941176,e=0)
.......................................................................... (g=0.125,th=0.888889,e=0)
.......................................................................... (g=0.25,th=0.8,e=0)
.......................................................................... (g=0.5,th=0.666667,e=0)
.......................................................................... (g=1,th=0.5,e=0)
.(((((.............................................(((.......)))...))))).. (g=2,th=0.333333,e=-1.21)
(((((((...........................................((((.......)))).))))))). (g=4,th=0.2,e=-10.8)
(((((((.....................((((.......))))......(((((.......)))))))))))). (g=6,th=0.142857,e=-17.9)
(((((((...(.............)..(((((.......))))).....(((((.......)))))))))))). (g=8,th=0.111111,e=-17.22)
(((((((..(((...........))).(((((.......))))).....(((((.......)))))))))))). (g=16,th=0.0588235,e=-23.8)
(((((((..((((.........))))(((((((.....))))))..)..((((((....).)))))))))))). (g=32,th=0.030303,e=-13.3)
(((((((..((((.(....)..))))((((((((...)))))))..)..((((((....).)))))))))))). (g=64,th=0.0153846,e=-8.9)
(((((((..((((((....)).))))((((((((...)))))))..)..((((((....).)))))))))))). (g=128,th=0.00775194,e=-10.6)
(((((((..((((((....)).))))((((((((...)))))))..)..((((((....).)))))))))))). (g=256,th=0.00389105,e=-10.6)
(((((((.(((((((....)).))))((((((((...)))))))..).)((((((....).)))))))))))). (g=512,th=0.00194932,e=-4.9)
(((((((.(((((((....)).))))((((((((...)))))))..).)((((((....).)))))))))))). (g=1024,th=0.00097561,e=-4.9)