MixtureModelHMM implements a Hidden Markov Model (HMM), a probabilistic model used to analyse sequential data. It is used to post-process the output of phylogenetic mixture models from IQ-TREE. The package implements the Baum-Welch algorithm to train the HMM on the given input files. The Baum-Welch algorithm is a dynamic programming approach and a special case of the expectation-maximization (EM) algorithm. Its purpose is to tune the parameters of the HMM, namely the state transition matrix, the emission matrix, and the initial state distribution, so that the model fits the observed data as well as possible. The input files are a site likelihood or site probability file, plus an alignment information file. Once the HMM is trained, it can be used to determine the final class associated with each site using a dynamic programming algorithm; the default algorithm for predicting class boundaries is Viterbi.
The function returns an object containing the class assigned to each site as a vector, along with a prediction plot and an HMM transition table.
First install the devtools package if you don't already have it:
install.packages("devtools")
Then you can install MixtureModelHMM from this repository like this:
library(devtools)
install_github("roblanf/MixtureModelHMM")
If you want to put this package into a Conda environment, here's a way that works for me. Please drop an issue if you know of a simpler way!
First we make a conda environment and fire it up like this
conda create --name hmm
conda activate hmm
Now we install the devtools R package like this:
conda install -c conda-forge r-devtools
And when that's done, I'm going to make a 2-line R script with the commands above. I use nano because it's on a lot of Linux environments, but obviously you can use any editor to do this. All you need is a way to end up with a script called installhmm.R which has the two final lines from the R installation above.
First start nano editing the file we want:
nano installhmm.R
Now you're in nano, copy and paste these two lines into your editor:
library(devtools)
install_github("roblanf/MixtureModelHMM")
Now type CTRL-x (i.e. hold down CTRL and press the 'x' key), and nano will ask if you want to save the file. Type 'y'.
Now if you type ls -lh you should see a file called installhmm.R, and if you type more installhmm.R you should see the two lines above. If that all looks OK, you can then run that R script like this:
Rscript installhmm.R
This will fire up R and run those two lines. If that all completes successfully, you have the package installed.
If you already have an IQ-TREE analysis where you've produced the following files:
mydata.siteprob or mydata.sitelh
mydata.alninfo
You can run the HMM as follows:
library("MixtureModelHMM")
hmm_result <- run_HMM(site_info = "mydata.sitelh", aln_info = "mydata.alninfo")
Then you can view the key plot like this:
hmm_result$alignment_plot
and write a report on the HMM with a lot more information like this:
save_report(hmm_result = hmm_result, output_filename = "hmm_report.Rmd")
If you have your files and just want the answer without using R, you can use the Google Colab notebook here: MixtureModelHMM_notebook
Let's start with an alignment of 32 species and 10,000 sites. This is the example.phy alignment in this repository, which you can download at this link.
The MixtureModelHMM package will work with the output of any mixture model from IQ-TREE. For this example, we'll focus on the GHOST model, which fits mixtures of branch lengths to phylogenetic datasets. To keep this worked example simple, I simulated the example.phy dataset above under a GHOST model with 4 classes and a GTR substitution model.
NB: when you are running mixture models on your own data (be that GHOST models or any other mixture model) you should pay close attention to selecting a good model. I used a GTR+H4 model here because I simulated the data under that model. Usually you wouldn't know the best model, and would want to spend some time figuring out the best available model for your data. I keep it simple here so that we can focus on running the HMM.
The MixtureModelHMM package requires a couple of additional files from IQ-TREE that provide information on the likelihood of each site under each model class (the .sitelh file) and on some additional features of each site (the .alninfo file). So, to use the MixtureModelHMM package you'll need to add a couple of things to your usual IQ-TREE command line:
iqtree -s example.phy -m GTR+H4 -wslm -wspm -alninfo
This analysis will take around 10 minutes (depending on your computer). If you'd like to skip ahead, you can download the output files from this analysis from the worked_example folder.
The options we used above are:
- -s points to the input alignment file
- -m specifies the model. Here we've chosen a GTR model with a 4-class GHOST model (+H4), because I know that's the best model
- -wslm writes out the site likelihoods under each class to the .sitelh file
- -wspm writes out the site probabilities under each class to the .siteprob file
- -alninfo writes out additional information about sites to the .alninfo file, for example which sites are constant, constant but ambiguous, uninformative, or have equal parsimony scores

Now we have all the files we need for our HMM analysis:
- example.sitelh: contains site likelihoods
- example.siteprob: contains site probabilities
- example.alninfo: contains all the additional information on alignment sites that the HMM needs

plot_scatter()
The point of the HMM is to take site likelihoods or probabilities, and leverage the fact that neighbouring sites in the alignment are likely to belong to the same class to determine tracts of an alignment that belong to the same class.
Before running the HMM, it's a really good idea to just look at the raw data from IQ-TREE, to see for yourself if neighbouring sites really do have similar class assignments. For this you can use the plot_scatter() function as follows:
library("MixtureModelHMM")
plot_scatter("example.phy.siteprob")
NB: you can also plot the site likelihoods in the same way, but you'll see if you try it that these are far less informative!
The function uses geom_smooth() with a default span = 0.03. span can be adjusted as suitable: smaller numbers produce wigglier lines, larger numbers produce smoother lines.
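For example, assuming plot_scatter() exposes the span argument directly (an assumption; check the function's help page to confirm), you could draw smoother trend lines like this:

plot_scatter("example.phy.siteprob", span = 0.1)  # hypothetical: larger span, smoother lines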
The command above will give you the following plot:
Each dot in this plot is a single site in the alignment. The y-axis shows the posterior probability that a site belongs to a given class. This plot shows a few things very clearly. First, neighbouring sites really do seem to share class assignments. For example, sites 1 to about 1250 seem to belong to class p3 (the blue line), although the data are very noisy. Second, there are a small number of clear transitions: if you look at which class has the highest probability in the smoothed lines, you'll see that it transitions from p3 to p2, then back to p3, then to p1, and so on.
This is sufficient to show that it's sensible to run an HMM on these data, so let's go ahead and run it.
run_HMM
Now that we know it's sensible to run an HMM, we can run it in R like this:
hmm_result <- run_HMM(site_info = "example.phy.sitelh", aln_info = "example.phy.alninfo")
The run_HMM function takes two files as input:
- site_info: here you can pass either the .sitelh or the .siteprob file from your IQ-TREE analysis. In almost all of our tests and simulations, we found that the HMM performs better using site likelihoods (the .sitelh file) rather than site posterior probabilities (the .siteprob file), so we recommend using the .sitelh file here.
- aln_info: here you have to point to a .alninfo file from IQ-TREE.

Other optional parameters:
- model: select a model from 1-4. Defaults to model 4.
- iter: the maximum number of EM iterations to carry out before the cycling process is terminated and the partially trained model is returned. Defaults to 100. The maximum change in log likelihood between EM iterations before the cycling procedure is terminated is set to 1E-07.
- algorithm: select the dynamic programming algorithm used to predict class boundaries from the trained HMM parameters (see the example below).

The HMM tries to figure out how to assign every site in the alignment to a class, using the fact that neighbouring sites tend to be in the same class to help it. One way to think about this is that the HMM is a way of cleaning up the noisy signal in the plot above. Along the way the HMM accounts for issues associated with phylogenetic data, such as the fact that constant sites don't contain much useful information.
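As a sketch of the optional parameters (the value of iter here is arbitrary, and "viterbi" is assumed to be a valid algorithm name, matching the summary output below):

hmm_result <- run_HMM(site_info = "example.phy.sitelh",
                      aln_info = "example.phy.alninfo",
                      iter = 200,             # allow up to 200 EM iterations
                      algorithm = "viterbi")  # decode the final path with Viterbi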
You can get a quick summary of the HMM result like this:
summary(hmm_result)
which in this case will show:
[1] "Input files: example.phy.sitelh example.phy.alninfo"
[1] "Number of sites: 10000"
[1] "Number of classes: 4"
[1] "Sites in each class: "
C1 C2 C3 C4
2500 2497 2500 2503
[1] "Number of transitions: 8"
[1] "Algorithm used to determine the sequence: viterbi"
Most people will be primarily interested in three things from the HMM. Here they are.
alignment_plot
hmm_result$alignment_plot
Will show you this plot:
This plot shows the alignment sites along the x-axis. On the top panel it shows the maximum likelihood (or posterior probability, if that's the file you used as input) class for each site in your alignment. This is the input data for the HMM. On the bottom panel it shows the output of the HMM - i.e. which classes the HMM has assigned each site to. You can compare this plot to the scatter plot above, and see how the HMM cleans up the noisy signal in the input data.
hmm_transition_table
The plot above shows that the HMM has a few transitions between classes. If we go from left to right, the classes go: 3, 2, 3, 1, 2, 1, 4, 2, 4. The hmm_transition_table shows you exactly which sites the transitions occurred at.
hmm_result$hmm_transition_table
Will show you this:
site class_from class_to
1 1501 C3 C2
2 2301 C2 C3
3 3301 C3 C1
4 4301 C1 C2
5 5101 C2 C1
6 6601 C1 C4
7 7103 C4 C2
8 8000 C2 C4
classification
classification is a vector where you can look up the HMM classification of every single site in your alignment.
For example, if we wanted to see the transition between class 3 and class 2, then we might want to look at the sites around site 1501 (the location of the transition above). We could do that as follows:
hmm_result$classification[1495:1505]
This will show you the following:
[1] "C3" "C3" "C3" "C3" "C3" "C3" "C2" "C2" "C2" "C2" "C2"
And you can see that at site 1500 the output is C3, and it switches to C2 at site 1501.
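If you want an overview of every tract rather than inspecting sites one by one, base R's run-length encoding works directly on the classification vector (this is plain R, not a package function):

# Summarise contiguous tracts of identical classifications
tracts <- rle(hmm_result$classification)
data.frame(class = tracts$values,
           length = tracts$lengths,
           end_site = cumsum(tracts$lengths))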
Obviously you can output anything you like about the HMM from the hmm_result object (full details below).
The MixtureModelHMM package has a few functions to write out specific files that we hope are helpful.
save_report()
This will save an Rmd file and a pdf file containing a lot of useful information about your results:
save_report(hmm_result = hmm_result, output_filename = "hmm_report")
save_partitioning_scheme()
If you want to output the information from the HMM as a partitioning scheme in NEXUS format, you can do this:
save_partitioning_scheme(hmm_result = hmm_result, output_filename = "hmm_partitions.nex")
The hmm_result object is an object of class 'MixtureModelHMM'. This object contains all the information we think you might ever be interested in.
- $classification: a vector the same length as your alignment, which contains the final classification for each site
- $data: the input data loaded via the site_info argument to run_HMM()
- $trained_hmm: an object of class HMM, which describes the final Hidden Markov Model you trained
- $algorithm: the name of the algorithm used to get the final path through the HMM
- $site_input_file: the name of the site input file loaded via site_info to run_HMM()
- $aln_input_file: the name of the alignment input file loaded via aln_info to run_HMM()
- $alignment_plot: a plot that shows the input data vs. what the HMM inferred
- $transition_plot: a plot that shows the structure of the final HMM
- $hmm_transition_table: a table that describes each inferred transition from the HMM

We can use either the posterior probability for each site (the .siteprob file) or the log-likelihood for each site (the .sitelh file) as the input for run_HMM().
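For example, you can pull out and inspect any of these components directly:

hmm_result$trained_hmm           # the trained HMM parameters
hmm_result$transition_plot       # plot of the final HMM's structure
head(hmm_result$classification)  # first few per-site classifications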
Step 1: We create a sequence of classes with the maximum log-likelihood (or posterior probability) for each site:

$C_{x_1}, C_{x_2}, C_{x_3}, \dots, C_{x_m}$, where each $x_i = \operatorname{argmax}_{j \in \{1,\dots,n\}} Var_j$

Here $Var_j$ is the $LnLW_j$ (log-likelihood) or $p_j$ (posterior probability) value for site $i$ from the site_info file, $n$ is the number of classes, and $m$ is the number of sites.

The above sequence is used to train the HMM with the Baum-Welch algorithm.
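As a minimal sketch of Step 1, assuming site_lnl is a numeric matrix with one row per site and one column per class read from the .sitelh file (the object name is hypothetical):

# Pick the class with the highest log-likelihood at each site
x <- apply(site_lnl, 1, which.max)
class_seq <- paste0("C", x)   # e.g. "C3" "C3" "C2" ...
head(class_seq)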
Step 2: Initialize the HMM with initial probabilities to start training.
A Hidden Markov Model consists of transition probabilities and emission probabilities. We define the states as the classes, and the emissions as the classes along with additional states.
Initialize transition probabilities.
Initialize emission probabilities.
Step 3: We train the parameters and obtain the final transition and emission probabilities.
Step 4: Use the final parameters to predict class boundaries using a dynamic programming algorithm: Viterbi or posterior decoding (forward-backward).
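To make Steps 2-4 concrete, here is a minimal, self-contained sketch using the CRAN HMM package, which provides initHMM(), baumWelch(), and viterbi(). The two-class setup and the starting probabilities are illustrative assumptions, not MixtureModelHMM's actual initialization:

library(HMM)

# Toy class sequence from Step 1 (two classes for brevity)
class_seq <- c("C1", "C1", "C1", "C2", "C2", "C1", "C1")

# Step 2: initialize transition and emission probabilities
hmm <- initHMM(States = c("C1", "C2"),
               Symbols = c("C1", "C2"),
               transProbs = matrix(c(0.9, 0.1,
                                     0.1, 0.9), nrow = 2, byrow = TRUE),
               emissionProbs = matrix(c(0.8, 0.2,
                                        0.2, 0.8), nrow = 2, byrow = TRUE))

# Step 3: train the parameters with Baum-Welch
trained <- baumWelch(hmm, class_seq, maxIterations = 100)

# Step 4: decode the most likely state path with Viterbi
viterbi(trained$hmm, class_seq)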