roblanf / MixtureModelHMM

Post-processes the output of phylogenetic mixture models using HMMs
GNU General Public License v2.0
0 stars 1 forks source link

Table of contents

What is MixtureModelHMM

MixtureModelHMM implements Hidden Markov Model a probabilistic model used to analyze sequencial data. It is used to post-process output of phylogenetic mixture models from IQ-TREE. The package implements Baum-Welch Algorithm to train the HMM model on given input file. The Baum-Welch algorithm is a dynamic programming approach and a special case of the expectation-maximization algorithm (EM algorithm). Its purpose is to tune the parameters of the HMM, namely the state transition matrix, the emission matrix, and the initial state distribution, such that the model is maximally like the observed data. The input files can be either the site likelihood or site probability file and the alignment information file. Once the HMM is trained the model could be used to determine the final class associated with each sites using dynamic programming algorithm. The default algorithm for predicting class boundries is set to veterbi. The output of the function returns an object class consisting of vector for each classes assigned to a given class in a vector form along with a prediction plot and hmm transition table.

Installation

In R or R Studio

First install the devtools package if you don't already have it:

install.packages("devtools")

Then you can install MixtureModelHMM from this repository like this:

library(devtools)
install_github("roblanf/MixtureModelHMM")

Using Conda

If you want to put this package into a Conda environment, here's a way that works for me. Please drop an issue if you know of a simpler way!

First we make a conda environment and fire it up like this

conda create --name hmm
conda activate hmm

Now we install the devtools R package like this

conda install -c conda-forge r-devtools

And when that's done, I'm going to make a 2-line bash script with the commands above. I use nano because it's on a lot of linux environments, but obviously you can use any editor to do this. All you need is a way to end up with a script called installhmm.R which has the two final lines from the R installation above.

First start nano editing the file we want:

nano installhmm.R

Now you're in nano, copy paste these two lines into your editor:

conda create --name hmm
conda activate hmm

Now type CTRL-x (i.e. hold down CTRL and press the 'x' key), and nano will ask if you want to save the file. Type 'y'.

Now if you ls -lh you should see a file called installhmm.R, and if you type more installhmm.R you should see the two lines above. If that all looks OK, you can then run that R script like this:

Rscript installhmm.R

This will fire up R and run those two lines. If that all completes successfully, you have the pacakge installed.

QuickStart

If you already have an IQ-TREE analysis where you've produced the following files:

You can run the HMM as follows:

library("MixtureModelHMM")
hmm_result <- run_HMM(site_info = "mydata.sitelh", aln_info = "mydata.alninfo")

Then you can view the key plot like this:

hmm_result$alignment_plot

and write a report on the HMM with a lot more information like this:

save_report(hmm_result = hmm_result, output_filename = "hmm_report.Rmd")

Google Colab notebook

If you have your files, and just want the answer without using R, you can use the google colab notebook here: MixtureModelHMM_notebook

Worked example

Input data

Let's start with an alignment of 32 species and 10,000 sites. This is the example.phy alignment in this repository, which you can download at this link

Getting the right files from an IQ-TREE mixture model analysis

The MixtureModelHMM package will work with the output of any mixture model from IQ-TREE. For this example, we'll focus on the GHOST model, which is a model that fits mixtures of branch lengths to phylogenetic datasets. To keep this worked example simple, I simulated the example.phy dataset above under a GHOST model with 4 classes, and a GTR substitution model.

NB: when you are running mixture models on your own data (be that GHOST models or any other mixture model) you should pay close attention to selecting a good model. I used at GTR+H4 model here because I simulated the data under that model. Usually you wouldn't know the best model, and would want to spend some time figuring out the best avaialable model from your data. I keep it simple here so that we can focus on running the HMM.

The MixtureModelHMM package requires a couple of additional files from IQ-TREE that provide information on the likelihood of each site under each model class (the .sitelh file) and on some additional features of each site (the .alninfo file). So, to use the MixtureModelHMM package you'll need to add a couple of things to your usual IQ-TREE command line:

iqtree -s example.phy -m GTR+H4 -wslm -wspm -alninfo

This analysis will take around 10 minutes (depending on your computer). If you'd like to skip ahead, you can download the output files from this analysis from the worked_example folder.

The options we used above are:

Now we have all the files we need for our HMM analysis:

Plotting the raw data with plot_scatter()

The point of the HMM is to take site likelihoods or probabilities, and leverage the fact that neighbouring sites in the alignment are likely to belong to the same class to determine tracts of an alignment that belong to the same class.

Before running the HMM, it's a really good idea to just look at the raw data from IQ-TREE, to see for yourself if neighbouring sites really do have similar class assignments. For this you can use the plot_scatter() function as follows:

library("MixtureModelHMM")
plot_scatter("example.phy.siteprob")

NB: you can also plot the site likelihoods in the same way, but you'll see if you try it that these are far less informative!

The functions implements geom_smooth() with default span=0.03.\ span can be adjusted as suitable where smaller numbers produce wigglier lines, larger numbers produce smoother lines.

The command above will give you the following plot:

a scatter plot showing a clear pattern

Each dot in this plot is a single site in the alignment. The y-axis shows the posterior probability that a site belongs to a given class. This plot shows a few things very clearly. First, neighbouring sites really do seem to share class assignments. For example, sites 1 to about 1250 seem to belong to class p3 (the blue line). Although the data are very noisy. Second, there are a small number of clear transitions - if you look at which class has the highest probability in the smoothed lines, you'll see that it transitions from p3 to p2, then back to p3, then to p1, and so on.

This is sufficient to show that it's sensible to run an HMM on these data, so let's go ahead and run it.

Running the HMM with run_HMM

Now we know it's sensible to run an HMM, we can run it in R like this:

hmm_result <- run_HMM(site_info = "example.phy.sitelh", aln_info = "example.phy.alninfo")

the run_HMM function takes two files as input:

Other optional parameters:

The HMM trys to figure out how to assign every site in the alignment to a class, using the fact that neighbouring tend to be in the same class to help it. One way to think about this is that the HMM is a way of cleaning up the noisy signal in the plot above. Along the way the HMM accounts for issues associated with phylogenetic data, such as the fact that constant sites don't contain much useful information.

You can get a quick summary of the hmm result like this:

summary(hmm_result)

which in this case will show:

[1] "Input files:  example.phy.sitelh example.phy.alninfo"
[1] "Number of sites:  10000"
[1] "Number of classes:  4"
[1] "Sites in each class: "

  C1   C2   C3   C4 
2500 2497 2500 2503 
[1] "Number of transitions:  8"
[1] "Algorithm used to determine the sequence:  viterbi"

The key output from the HMM

Most people will be primarily interested in three things from the HMM. Here they are.

alignment_plot

hmm_result$alignment_plot

Will show you this plot:

a plot of the hmm output along the alignment

This plot shows the alignment sites along the x-axis. On the top panel it shows the maximum likelihood (or posterior probability, if that's the file you used as input) class for each site in your alignment. This is the input data for the HMM. On the bottom panel it shows the output of the HMM - i.e. which classes the HMM has assigned each site to. You can compare this plot to the scatter plot above, and see how the HMM cleans up the noisy signal in the input data.

hmm_transition_table

The plot above shows that the HMM has a few transitions between classes. If we go from right to left, the classes go: 3, 2, 3, 1, 2, 1, 4, 2, 4. The hmm_transition_table shows you exactly which sites the transitions occurred at.

hmm_result$hmm_transition_table

Will show you this:

  site class_from class_to
1 1501         C3       C2
2 2301         C2       C3
3 3301         C3       C1
4 4301         C1       C2
5 5101         C2       C1
6 6601         C1       C4
7 7103         C4       C2
8 8000         C2       C4

classification

classification is a vector where you can look up the HMM classification of every single site in your alignment.

For example, if we wanted to see the transition between class 3 and class 2, then we might want to look at the sites around site 1501 (the location of the transition above). We could do that as follows:

hmm_result$classification[1495:1505]

This will show you the following:

 [1] "C3" "C3" "C3" "C3" "C3" "C3" "C2" "C2" "C2" "C2" "C2"

And you can see that at site 1500 the output is C3, and it switches to C2 at site 1501.

Output options

Obviously you can output anything you like about the HMM from the hmm_results object (full details below).

The MixtureModelHMM package has a few functions to write out specific files that we hope are helpful.

save_report()

This will save a Rmd and pdf file that contains a lot of useful information about your results:

save_report(hmm_result = hmm_result, output_filename = "hmm_report")

save_partitioning_scheme()

If you want to output the information from the HMM as a partitioning scheme in NEXUS format, you can do this:

save_partitioning_scheme(hmm_result = hmm_result, output_filename = "hmm_partitions.nex")

Everything that's in the output

The hmm_result object is an object of class 'MixtureModelHMM'. This object contains all the information we think you might ever be interested in.

Description of the approach

Flowchart

Flowchart

We can either use posterior probability for each site(.siteprob file) or log-likelihood for each site(.sitelh file) as an input for run_HMM().

Hidden Markov Model consists of transition probability and emission probability.\ We define states as classes and emissions as classes along with additional states.