lpdf for Hidden Markov models with discrete latent states

stan-dev / math

The Stan Math Library is a C++ template library for automatic differentiation of any order using forward, reverse, and mixed modes. It includes a range of built-in functions for probabilistic modeling, linear algebra, and equation solving.

https://mc-stan.org

BSD 3-Clause "New" or "Revised" License

744 stars 187 forks source link

lpdf for Hidden Markov models with discrete latent states #1648

Closed charlesm93 closed 3 years ago

charlesm93 commented 4 years ago

Description

Consider a hidden Markov model with observations y, a discrete latent state x and model parameters psi. To fit this model in Stan, we need to integrate x out, and then compute gradients for the marginal density p(y | psi). Doing this by hand is lengthy and doesn't guarantee optimal computation of the gradients. Deriving an algorithm for the optimal derivatives is not trivial: it can be done as a byproduct of the EM algorithm or by constructing an adjoint method, a discretized version of what we do with ODEs.

I propose to implement an hmm_marginal_density_lpdf() function with optimal gradient calculations. In addition there will be two complementary functions: hmm_hidden_state_probs() which computes the marginal probabilities of the hidden states and hmm_hidden_state_rng() which simulates these hidden states.

Example

x is discrete and takes values over a finite set with K elements. Suppose we have N observations y, corresponding to N transitions. The user then defines

the K x K transition matrix Gamma, with (i, j)th entry being the probability that x(n) = i given x(n - 1) = j
the K x N observational density omega, with for each observation the probability density p(y | x) over each possible value of x.
The K-vector initial state, rho.

The user then calls

target += hmm_marginal_lpdf(log_omegas, Gamma, rho);

Matrix[K, N + 1] alphas = hmm_hidden_state_prob(log_omegas, Gamma, rho);

int x[N] = hmm_hidden_state_rng(log_omegas, Gamma, rho);

Expected Output

This feature will facilitate and optimize the use of hidden markov models in Stan.

Current Version:

v3.1.0

charlesm93 commented 4 years ago

Tagging @betanalpha.

charlesm93 commented 4 years ago

What is the best way of testing we get the right density? I can work it out analytically, but that would be for a simple case or we can benchmark against another software.

Checks to implement:

~the elements of log_omegas are all positive.~
the elements of Gamma are all positive
Gamma is a square matrix
The columns of Gamma sum to 1 [is this too expensive to check?]
the elements of rho are positive
n_states = log_omega.rows() matches rho.size() and Gamma.cols().

bob-carpenter commented 4 years ago

Why is omega on the log scale but not Gamma (stochastic matrix) or rho (initial distribution?).
Will there be a way to set the initial distribution to be the stationary distribution of Gamma?
No, it won't be too expensive to check that Gamma is a stochastic matrix and rho is a simplex; that time will be dwarfed by the rest of the computation of the HMM.
I've always seen the rows of Gamma be the simplexes, but it appears that there are two conventions for stochastic matrices in the literature (right stochastic and left stochastic). Conventionally HMMs have used the one where rows are simplexes so that Gamma[i, j] is the probability of transitioning from state i to state j. If you're going to do this the other way you'll have to make it very clear that's the convention. And you'll definitely need tests either way.

bob-carpenter commented 4 years ago

What is the best way of testing we get the right density?

SBC? Or an approximate form thereof?

You could also write a second implementation of the forward algorithm out for double valus to make sure you get the right answer. The problem with external software is that we won't be able to turn it into a unit test other than for fixed HMMs.

betanalpha commented 4 years ago

Why is omega on the log scale but not Gamma (stochastic matrix) or rho (initial distribution?).

Because the Stan language exposes log density functions where as Gamma and rho will be simplices on the linear scale for simpler models.

Will there be a way to set the initial distribution to be the stationary distribution of Gamma?

Because Gamma and rho are customizable the user can compute the eigenvectors of Gamma and set rho accordingly external to the HMM density.

No, it won't be too expensive to check that Gamma is a stochastic matrix and rho is a simplex; that time will be dwarfed by the rest of the computation of the HMM. rho will be cheap, but for time-dependent Gamma the checks could become substantial. For an initial implementation, however, we should be safe and worry about optimizations later.

I've always seen the rows of Gamma be the simplexes, but it appears that there are two conventions for stochastic matrices in the literature (right stochastic and left stochastic). Conventionally HMMs have used the one where rows are simplexes so that Gamma[i, j] is the probability of transitioning from state i to state j. If you're going to do this the other way you'll have to make it very clear that's the convention. And you'll definitely need tests either way.

We were working off of stats/machine learning textbooks that defined Gamma with simplex columns mirroring the linear algebraic action of the transition matrix on the previous state probabilities.

bob-carpenter commented 4 years ago

On Jan 30, 2020, at 4:49 PM, Michael Betancourt notifications@github.com wrote:

Why is omega on the log scale but not Gamma (stochastic matrix) or rho (initial distribution?).

Because the Stan language exposes log density functions where as Gamma and rho will be simplices on the linear scale for simpler models.

OK. That makes sense. I made the same decision with log_mix, but have been uncomfortable with that ever since since the implementation for log_mix(alpha, lp1, lp2) is just log_sum_exp(log(alpha) + lp1, log1m(alpha) + lp2), so you always wind up taking the logs anyway. The problem with linear is that it doesn't give you as much dynamic range. But then that's probably not such a big deal if there are vanishingly small mixture components.

Will there be a way to set the initial distribution to be the stationary distribution of Gamma?

Because Gamma and rho are customizable the user can compute the eigenvectors of Gamma and set rho accordingly external to the HMM density.

I think eventually it'd be nice to have a version with one fewer argument that does that automatically. I can do that once the first one's built.

No, it won't be too expensive to check that Gamma is a stochastic matrix and rho is a simplex; that time will be dwarfed by the rest of the computation of the HMM. rho will be cheap, but for time-dependent Gamma the checks could become substantial. For an initial implementation, however, we should be safe and worry about optimizations later.

I completely agree we should check. The plan's to rewrite checks so that they can be turned off wholesale.

I've always seen the rows of Gamma be the simplexes, but it appears that there are two conventions for stochastic matrices in the literature (right stochastic and left stochastic). Conventionally HMMs have used the one where rows are simplexes so that Gamma[i, j] is the probability of transitioning from state i to state j. If you're going to do this the other way you'll have to make it very clear that's the convention. And you'll definitely need tests either way.

We were working off of stats/machine learning textbooks that defined Gamma with simplex columns mirroring the linear algebraic action of the transition matrix on the previous state probabilities.

I'm not sure which texts you're looking at, but Bishop sets them up with rows being simplexes (p. 610). The Qin et al. paper on chemistry I really liked also made the rows simplexes. The Langrock et al. paper "Flexible and practical modeling of animal telemetry data" also sets up the stochastic matrix row-wise (that paper's the basis of the popular moveHMM package in R). It's also the way I set it up in the manual, although that says more about me than the external world. Who sets up the transition matrix with columns as simplexes?

charlesm93 commented 4 years ago

I'm fine changing the conventions for the gammas, and having the rows as simplexes.

For the unit test for the log density evaluation, I'll construct a simple case where there are only two states. It's straightforward to do this calculation analytically.

bob-carpenter commented 4 years ago

The more general thing to do would be to test 1-state, 2-state, and 3-state examples to make sure all the boundary conditions on the loops work.

Then tests inputs of size 0, 1, 2, and 3.

You can at least test that finite diffs match autodiff without any analytic work.

On Feb 28, 2020, at 10:54 AM, Charles Margossian notifications@github.com wrote:

I'm fine changing the conventions for the gammas, and having the rows as simplexes.

For the unit test for the log density evaluation, I'll construct a simple case where there are only two states. It's straightforward to do this calculation analytically.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

vianeylb commented 4 years ago

I agree with Bob. Having the rows as simplexes for the transition probability matrix is definitely how I'd also prefer to have it set up.

rok-cesnovar commented 3 years ago

Closed in #1778