satoken / centroid-rna-package

Centroid RNA package
GNU General Public License v2.0
19 stars 6 forks source link

CentroidFold for predicting RNA secondary structures

CentroidFold is one of the most accurate tools for predicting RNA secondary structures. It is based on the gamma-centroid estimator (Hamada et.al., 2009) for high-dimensional discrete spaces. Generally, a gamma-centroid estimator is slightly more accurate than an MEA estimator (Do et.al., 2005) under the same probability distribution.

CentroidAlifold can predict common secondary structures for multiple alignments of RNA sequences by using an averaged gamma-centroid estimator (Hamada et al., 2010).

CentroidFold can employ various probability distributions. Currently,

According to our benchmark, CentroidFold with the McCaskill model using Boltzmann likelihood parameters (Andronescue et al., 2010) will predict the most accurate RNA secondary structures among currently available prediction tools at this time.

Requirements

Install

Using configure

./configure && make && make install

If the Vienna RNA package has been installed at the non-standard directory, specify the location like:

./configure --with-vienna-rna=/path/to/vienna-rna

Using cmake

mkdir build && cd build
cmake .. && make

Using Docker

docker build . -t centroid_fold
docker run --rm -it -v $(pwd):/workspaces centroid_fold centroid_fold -g -1 RF00008_B.fa 

Usage

centroid_fold can take the FASTA format as an input sequence, then predict its secondary structure. centroid_alifold can take the CLUSTAL format as an input aligment, the predict its common secondary structure.

Secondary structure prediction for single sequences

centroid_fold is the main program of this package. It calculates a base-pairing probability matrix for a given sequence with one of several algorithms including the CONTRAfold model and the McCaskill algorithm using pf_fold routine in Vienna RNA package. Then, centroid_fold estimates a gamma-centroid estimator for the base-pairing probability matrix.

centroid_fold [options] seq

Options:
  -e [ --engine ] arg     specify the inference engine (default: "McCaskill")
  -g [ --gamma ] arg      weight of base pairs
  --noncanonical          allow non-canonical base-pairs
  -C [ --constraints ]    use structure constraints
  --postscript arg        draw predicted secondary structures with the 
                          postscript (PS) format

By -e or --engine option, you can select the inference engine for calculating base-pairing probabilities from the CONTRAfold model ("CONTRAfold"), the McCaskill model ("McCaskill" if available) and the pfold model ("pfold" if available). If you use the pfold model, several environment variables should be set: PFOLD_BIN_DIR to the directory which contains pfold binaries, and AWK_BIN and SED_BIN to awk and sed program available on your system, respectively.

If a negative value is given for the option --gamma, centroid_fold calculates secondary structures for several values of gamma at the same time ({2^k | -5 <= k <= 10} and 6).

For long sequences, you can use -d options to restrict the maximum distance of base-pairs. This can reduce the computation time and the memory requirement: O(L^3) to O(LW^2), and O(L^2) to O(LW), respectively, where L is length of a sequence and W is the specified maximum distance of base-pairs. This option is available only for the CONTRAfold model.

If a part of secondary structure for a given sequence is known, you can specify it by a modified FASTA format like:

> RF00008_B
CAAAAGUCUGGGCUAAGCCCACUGAUGAGCCGCUGAAAUGCGGCGAAACUUUUG
.(((((........???????????????????????...........))))).

and run centroid_fold with -C options. The positions at '(' and ')' are restricted to base-pairs, the position at '.' is restricted to unpaired bases, and the position at '?' is unrestricted.

If --sampling option is given, centroid_fold uses the stochastic traceback algorithm instead of the McCaskill's base-pairing probability matrix. Like Sfold (Ding et al., 2005), build clusters of secondary structures, and then compute their centroids. The number of clusters can be specified by --max-clusters option.

Example:

% centroid_fold -g -1 RF00008_B.fa
> RF00008_B
CAAAAGUCUGGGCUAAGCCCACUGAUGAGCCGCUGAAAUGCGGCGAAACUUUUG
.(((((...(((.....)))........(((((......)))))....))))). (g=0.03125,th=0.969697)
.(((((...(((.....)))........(((((......)))))....))))). (g=0.0625,th=0.941176)
((((((...(((.....)))........(((((......)))))....)))))) (g=0.125,th=0.888889)
((((((...((((...))))........(((((......)))))....)))))) (g=0.25,th=0.8)
(((((((..((((...))))........(((((......)))))...))))))) (g=0.5,th=0.666667)
(((((((.(((((...))))).......(((((......)))))...))))))) (g=1,th=0.5)
(((((((.(((((...))))).......(((((......)))))...))))))) (g=2,th=0.333333)
(((((((.(((((...))))).......(((((......)))))...))))))) (g=4,th=0.2)
(((((((.(((((...))))).......(((((......)))))...))))))) (g=6,th=0.142857)
(((((((.(((((...))))).......(((((......)))))...))))))) (g=8,th=0.111111)
(((((((((((((...))))).......(((((......))))))..))))))) (g=16,th=0.0588235)
(((((((((((((...)))))..).(..(((((......)))))..)))))))) (g=32,th=0.030303)
(((((((((((((...)))))..).(..(((((......)))))..)))))))) (g=64,th=0.0153846)
(((((((((((((...)))))..).(..(((((......)))))..)))))))) (g=128,th=0.00775194)
(((((((((((((...)))))..).(..(((((......)))))..)))))))) (g=256,th=0.00389105)
(((((((((((((...)))))..).(..(((((......)))))..)))))))) (g=512,th=0.00194932)
(((((((((((((...)))))..).(..(((((......)))))..)))))))) (g=1024,th=0.00097561)

Common secondary structure prediction for multiple alignments

For the CLUSTAL format, centroid_alifold predicts common secondary structures for the given multiple alignments.

centroid_alifold [options] seq

Options:
  -h [ --help ]           show this message
  -e [ --engine ] arg     specify the inference engine (default: "McCaskill & 
                          Alifold")
  -w [ --mixture ] arg    mixture weights of inference engines
  -g [ --gamma ] arg      weight of base pairs
  --noncanonical          allow non-canonical base-pairs
  -C [ --constraints ]    use structure constraints
  --postscript arg        draw predicted secondary structures with the 
                          postscript (PS) format

By -e or --engine option, you can select the inference engine for calculating base-pairing probabilities from the CONTRAfold model ("CONTRAfold"), the McCaskill model ("McCaskill" if available), the RNAalifold model ("Alifold" if available) and the pfold model ("pfold" if avilable).

If you specify the inference engines multiply, centroid_alifold employs a mixtured baes-pairing probability matrix. The mixture weight can be set by -w or --mixture option. The default setting of centroid_alifold is -e McCaskill -w 1.0 -e Alifold -w 1.0. See more detail in (Hamada et al., 2010).

Example:

% centroid_alifold -g -1 RF00436.aln
>AB029447-1/1210-1265
--BCAHuUGYAVgUCGCUUUGGAYAaaAG--CGUCUGCUAAAUGM-VURwrukKAAAUDu-
............................................................. (g=0.03125,th=0.969697)
............................................................. (g=0.0625,th=0.941176)
............................................................. (g=0.125,th=0.888889)
............................................................. (g=0.25,th=0.8)
...............(((.........))..)............................. (g=0.5,th=0.666667)
..............(((((.......)))..))............................ (g=1,th=0.5)
............(.((((((.....))))..)).).......................... (g=2,th=0.333333)
........((..(.((((((.....))))..)).).))....................... (g=4,th=0.2)
...((((.((.((.((((((.....))))..)).))))..))))................. (g=6,th=0.142857)
...(((((((.((.((((((.....))))..)).)))).)))))................. (g=8,th=0.111111)
..((((((((.((.((((((.....))))..)).)))).))))))................ (g=16,th=0.0588235)
..((((((((.((.((((((.....))))..)).)))).))))))((.....))....... (g=32,th=0.030303)
..((((((((.((.((((((.....))))..)).)))).))))))((.....))....... (g=64,th=0.0153846)
..((((((((.((.((((((.....))))..)).)))).))))))((.....))....... (g=128,th=0.00775194)
..((((((((.((.((((((.....))))..)).)))).))))))(((...)))....... (g=256,th=0.00389105)
.(((((((((.((.((((((.....))))..)).)))).)))))).).((((....)))). (g=512,th=0.00194932)
.(((((((((.((.((((((.....))))..)).)))).)))))).)(((((....))))) (g=1024,th=0.00097561)

The first line of the result is the description of the first sequence in the given alignment. The second is the "most informative sequence" (Freyhult et al., 2005), which is similar to IUPAC ambiguity characters, produced by a library routine of the Vienna Package.

Secondary structure prediction using homologous sequence information

If homologous sequences to the target sequence are available, centroid_homfold can predict secondary structures for the target sequence more accurately than centroid_fold using homologous sequence information with the probabilistic consistency transformation for base-pairing probabilities (Hamada et al, 2009).

centroid_homfold [options] seq

Options:
  -H [ --homologous ] arg fasta file containing homologous sequences (REQUIRED)
  --engine_s arg          specify the inference engine for secondary structures
                          (default: "McCaskill")
  --engine_a arg          specify the inference engine for pairwise alignments 
                          (default: "CONTRAlign")
  -g [ --gamma ] arg      weight of base pairs
  --postscript arg        draw predicted secondary structures with the 
                          postscript (PS) format

You should give homologous sequences to the target sequence in the FASTA format by -H option.

Example:

% centroid_homfold -g -1 -H RF00005.fa seq.fa 
>X12857.1/421-494
GCGGAUGUAGCCAAGUGGAUCAAGGCAGUGGAUUGUGAAUCCACCAUGCGCGGGUUCAAUUCCCGUCAUUCGCC
.......................................................................... (g=0.03125,th=0.969697,e=0)
.......................................................................... (g=0.0625,th=0.941176,e=0)
.......................................................................... (g=0.125,th=0.888889,e=0)
.......................................................................... (g=0.25,th=0.8,e=0)
.......................................................................... (g=0.5,th=0.666667,e=0)
.......................................................................... (g=1,th=0.5,e=0)
.(((((.............................................(((.......)))...))))).. (g=2,th=0.333333,e=-1.21)
(((((((...........................................((((.......)))).))))))). (g=4,th=0.2,e=-10.8)
(((((((.....................((((.......))))......(((((.......)))))))))))). (g=6,th=0.142857,e=-17.9)
(((((((...(.............)..(((((.......))))).....(((((.......)))))))))))). (g=8,th=0.111111,e=-17.22)
(((((((..(((...........))).(((((.......))))).....(((((.......)))))))))))). (g=16,th=0.0588235,e=-23.8)
(((((((..((((.........))))(((((((.....))))))..)..((((((....).)))))))))))). (g=32,th=0.030303,e=-13.3)
(((((((..((((.(....)..))))((((((((...)))))))..)..((((((....).)))))))))))). (g=64,th=0.0153846,e=-8.9)
(((((((..((((((....)).))))((((((((...)))))))..)..((((((....).)))))))))))). (g=128,th=0.00775194,e=-10.6)
(((((((..((((((....)).))))((((((((...)))))))..)..((((((....).)))))))))))). (g=256,th=0.00389105,e=-10.6)
(((((((.(((((((....)).))))((((((((...)))))))..).)((((((....).)))))))))))). (g=512,th=0.00194932,e=-4.9)
(((((((.(((((((....)).))))((((((((...)))))))..).)((((((....).)))))))))))). (g=1024,th=0.00097561,e=-4.9)

References