mattjj / pyhsmm


Questions #64

Closed jmgo closed 8 years ago

jmgo commented 8 years ago

Hi!

I've been looking at your examples and code, and I was hoping you could answer a few questions.

Right now I'm trying to implement a very simple model: a 1D signal with 2 states, where the observation model of each state is a univariate Gaussian. I want to train the model and then use the Viterbi algorithm to determine the states.

So I did the following:

```python
import numpy as np
import pyhsmm

# create model
trans_matrix = transition_model()  # 2x2 matrix
mu_1, sigma_1, mu_2, sigma_2 = observational_model_pyhsmm()

N = 2  # number of states

obs_distns = [pyhsmm.distributions.ScalarGaussianNIX(mu_1, sigma_1**2),
              pyhsmm.distributions.ScalarGaussianNIX(mu_2, sigma_2**2)]

pi_0 = np.array([0.8, 0.2])

# build the HMM model
model = pyhsmm.models.HMM(trans_matrix=trans_matrix, pi_0=pi_0, obs_distns=obs_distns)

# fit model
for sample in all_samples:
    model.add_data(sample)

model.EM_fit(tol=1e-9, maxiter=1e8)

# test model
pred = model.heldout_viterbi(distance_frame)
```

My questions are:

1) Am I passing the transition and initial distributions (in the form of arrays) correctly to the model?
2) There are other Gaussian models, like ScalarGaussianNonconjNIX, ScalarGaussianNonconjNIG, and ScalarGaussianFixedvar; what are the differences between these?
3) I have several signals for training; do I have to pass each signal one by one, or is there a method to pass all the signals at once?
4) To use the Viterbi algorithm, is heldout_viterbi the correct method?

5) For an observation model of a mixture of multivariate Gaussians, the examples present the following parameters:

```python
obs_hypparams = {'mu_0': np.zeros(obs_dim),
                 'sigma_0': np.eye(obs_dim),
                 'kappa_0': 0.25,
                 'nu_0': obs_dim + 2}
```

a) Is kappa_0 the weight of each component of the GMM?
b) Is nu_0 the number of components of the GMM?
c) What is the difference between mu, sigma and mu_0, sigma_0?

Sorry for the long post. Regards, jmgo

danstowell commented 8 years ago

I don't know this code very well, but I might be able to answer your final question 5.

There's no GMM in the code you quote, I think - instead it's about a single Gaussian observation model.

mu_0, sigma_0, kappa_0, and nu_0 are the hyperparameters of the prior on the Gaussian's parameters - i.e. they define the conjugate prior. mu_0 and kappa_0 are about the location of the mean, and sigma_0 and nu_0 are about the covariance.

Then, if you take the _0 off the name, you're probably talking about the posterior values. So mu and sigma tell you what we believe the actual mean and covariance are.

mattjj commented 8 years ago

Sorry for the delay. You've probably figured out most of these by now, but the answers might serve as documentation for anyone else who's interested.

1) Am I passing the transition and initial distributions (in the form of array) correctly to the model?

Yes. There are several ways to do it, and passing in arrays like that works.

2) There are other Gaussian models, like 'ScalarGaussianNonconjNIX', 'ScalarGaussianNonconjNIG', 'ScalarGaussianFixedvar', what are the differences between these?

The difference between the first two is the parameterization of the prior. ScalarGaussianNonconjNIX uses a normal-inverse-chi-squared parameterization of the prior on variances (hence the "NIX"), while ScalarGaussianNonconjNIG uses a normal-inverse-gamma parameterization. Having different parameterizations means the distribution is the same but the calculations are different; it's working in a different coordinate system. I probably wrote the NIX version first because it may have been easier for sampling (and maybe it's the version that Gelman et al. follow in Bayesian Data Analysis), but the NIG version is the 'natural parameterization' and hence is easier for mean field methods. It looks like the NIX class supports Gibbs sampling but not variational methods, while the NIG class supports variational methods but not sampling (you can tell from the classes they inherit from and the methods they define).
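For reference, the standard relationship between the two variance-prior parameterizations is (this is textbook material, e.g. Gelman et al., not something read off the pyhsmm source):

sigma^2 ~ Scaled-Inv-chi^2(nu_0, sigma_0^2)    is the same distribution as    sigma^2 ~ Inv-Gamma(alpha_0, beta_0)
with alpha_0 = nu_0 / 2 and beta_0 = nu_0 * sigma_0^2 / 2,

so the two classes describe the same prior family in different coordinates, which is exactly why only the calculations differ.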

ScalarGaussianFixedvar is for scalar Gaussian observations but where the variance is fixed. Both of the other classes, ScalarGaussianNonconjNIX and ScalarGaussianNonconjNIG, allow you to learn the variance as well as the mean, while with ScalarGaussianFixedvar you fix the variance and only learn the mean.

3) I have several signals for training, do I have to pass each signal one by one, or is there a method to pass all the signal at once?

You'd have to call add_data in a loop, adding them one-by-one. I can't remember if I had a good reason for doing that; maybe we should add a wrapper method.
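For example, a tiny wrapper along these lines would do it (just a sketch of the kind of convenience method meant above, not part of pyhsmm's API):

```python
def add_all_data(model, sequences):
    """Add every sequence to the model, one at a time (hypothetical helper)."""
    for seq in sequences:
        model.add_data(seq)
```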

4) To use the Viterbi algorithm, is 'heldout_viterbi' the correct method?

It depends on what you want to use the Viterbi algorithm for. If you want to use it for making predictions on new data sequences, then heldout_viterbi is a decent method to use; it fixes the parameters (e.g. at their current sampled values). If you want to use Viterbi for training, like Viterbi EM (which also corresponds to some small-variance-asymptotics-based algorithms), then look at the Viterbi_EM_fit method in the _HMMViterbiEM base class. If you want to use Viterbi to decode the training sequences (as opposed to new held-out data sequences), then you can use the _Viterbi_E_step method, which does the same thing heldout_viterbi would, but you don't have to pass in the data sequences.
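As a hedged sketch of the two decoding cases (assuming a fitted model like the one in the question, and the states_list/stateseq attributes as I remember them; new_sequence is just an illustrative array):

```python
# Decode a new, held-out sequence with the model parameters held fixed:
pred_states = model.heldout_viterbi(new_sequence)

# Viterbi-decode the sequences already attached via add_data (no data argument);
# the decoded state sequences then live on the model's states objects:
model._Viterbi_E_step()
train_states = [s.stateseq for s in model.states_list]
```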

5) For a observation model of mixture of multivariate Gaussians the examples present the following parameters: [...]

@danstowell is right on this one. I suggest looking at Wikipedia and Gelman et al.'s Bayesian Data Analysis for a detailed discussion (plus Ch. 2 of my PhD thesis, specifically Examples 2.2.5 and 2.2.9), but briefly, those hyperparameters are just for the normal-inverse-Wishart (NIW) prior on the Gaussian mean and covariance parameters (and there's no GMM being used).

The _0 naming comes from a Bayesian statistics convention. Let's say you have some multivariate Gaussian data and you want to infer the mean and covariance parameters in a Bayesian way. Before observing any data, you might start with a NIW prior on the mean and covariance,

(mu, Sigma) ~ NIW(nu_0, S_0, mu_0, kappa_0),

so the parameters of your prior are (nu_0, S_0, mu_0, kappa_0) and the _0 subscript means they aren't informed by any data yet. After observing n observations x_1, x_2, ..., x_n that are generated by

x_i | mu, Sigma ~ Normal(mu, Sigma)

independently for i=1,2,...,n, then you might write the posterior random variable as

(mu, Sigma) | x_1, x_2, ..., x_n ~ NIW(nu_n, S_n, mu_n, kappa_n).

That is, the posterior is still in the NIW family (that's what it means to be a conjugate prior), and to distinguish the new parameters after seeing n datapoints we've written them with _n subscripts.

One more thing to keep in mind is that you can write formulas for the posterior hyperparameters in terms of the prior hyperparameters and some statistics of the data x_1, ..., x_n. Looking at those update formulas can help you interpret the meaning of the prior hyperparameters; they often take the form of "pseudo-statistics" or "pseudo-counts", meaning you can interpret the prior as corresponding to some pseudo-dataset you're pretending to have observed.
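For the record, the standard NIW updates (as in Gelman et al.; written here for reference, not copied from the pyhsmm source) are

kappa_n = kappa_0 + n
nu_n = nu_0 + n
mu_n = (kappa_0 * mu_0 + n * xbar) / (kappa_0 + n)
S_n = S_0 + S + (kappa_0 * n / (kappa_0 + n)) * (xbar - mu_0)(xbar - mu_0)^T

where xbar is the sample mean and S = sum_i (x_i - xbar)(x_i - xbar)^T is the scatter matrix around it. You can read off the pseudo-data interpretation directly: kappa_0 and nu_0 act like counts of prior pseudo-observations, mu_n is a weighted average of the prior mean and the sample mean, and S_0 acts like a prior sum of squares.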

Finally, there's a funny naming convention in most of pybasicbayes that seems unnatural to me now. Basically, distribution classes actually model prior-likelihood pairs (usually conjugate pairs), but they're often named only after the likelihood part. For example, the Gaussian class really models a Gaussian likelihood along with a conjugate normal-inverse-Wishart prior, and so the parameters of both are collected into that one class. A better name for that class would be GaussianNormalInverseWishart. (Some friends and I have long planned a rewrite, which will have not only better names but also a better organization and more powerful abstractions, but we haven't gotten to it.)
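To make that concrete with the hyperparameters quoted in question 5 (this mirrors the usage in the pyhsmm examples; obs_dim and the number of states here are just illustrative):

```python
import numpy as np
import pyhsmm

obs_dim = 2   # illustrative data dimension
N = 4         # illustrative number of HMM states

obs_hypparams = {'mu_0': np.zeros(obs_dim),
                 'sigma_0': np.eye(obs_dim),
                 'kappa_0': 0.25,
                 'nu_0': obs_dim + 2}

# Each Gaussian object bundles the NIW prior hyperparameters (the _0 names)
# together with the likelihood parameters mu and sigma that get sampled or fitted.
obs_distns = [pyhsmm.distributions.Gaussian(**obs_hypparams) for _ in range(N)]
```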