jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License

[Question] What is the difference between predict_proba and log_probability methods for HMMs #1089

Open ko62147 opened 3 months ago

ko62147 commented 3 months ago

Hello,

I fitted an HMM to a set of observation sequences, but I get positive log probability values (i.e., probability values greater than 1) when I call the log_probability method on some test observation sequences. What do positive log probability values mean in the context of HMM inference, and how does the log_probability method differ from the predict_proba method?

jmschrei commented 2 months ago

predict_proba gives you the posterior probability that each observation aligns with each hidden state in the model, given all of the other observations in the sequence. This is computed with the forward-backward algorithm.

log_probability can be positive when you have continuous observations and a distribution with a very small variance. For instance, if you have a normal distribution with a mean of 0 and a std of 0.0001, a value of 0 will have a probability density above 1.
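
Roughly, as a sketch with the v1.x API (DenseHMM, Normal); the exact constructor formats below are illustrative and may need adjusting for your version:

```python
import torch

from pomegranate.distributions import Normal
from pomegranate.hmm import DenseHMM

# Two hidden states; the first mimics the small-variance example above
# (mean 0, std 0.0001, i.e. variance 1e-8).
d1 = Normal([0.0], [[1e-8]])
d2 = Normal([1.0], [[0.25]])
model = DenseHMM([d1, d2], edges=[[0.9, 0.1], [0.1, 0.9]], starts=[0.5, 0.5])

# One sequence of length 5 with a single continuous feature: shape (1, 5, 1).
X = torch.tensor([[[0.0], [0.0001], [0.9], [1.1], [0.0]]])

# predict_proba: posterior P(state | whole sequence) for every observation,
# via the forward-backward algorithm. Shape (1, 5, 2); each row sums to 1.
print(model.predict_proba(X))

# log_probability: one value per sequence, the log of the *density* of the
# whole sequence under the model. Densities can exceed 1, so this can be > 0.
print(model.log_probability(X))
```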

ko62147 commented 2 months ago

Thanks for the reply. I am trying to understand the physical meaning of the results from the log_probability method. When continuous observations return a positive log probability (i.e., a probability greater than 1), does that mean there is complete certainty (100% probability) that the observations/data were generated by the distribution/model?

jmschrei commented 2 months ago

I think you're entering one of the confusing areas of probability theory. Basically, just because a point estimate of the density is above 1 doesn't mean that the event is guaranteed to happen. For instance, in my example above, P(0.0001) would be above 1, but so would P(0.00011), and both can't be guaranteed to happen. Instead, people usually look at the probabilities of events happening within ranges of a probability distribution and then make those ranges very small, e.g., (P(x+e) - P(x-e)) / 2e, where P here is the cumulative distribution function.

In my experience, the most practical interpretation of probabilities greater than 1 is that your model has overfit to something.
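
To make that concrete with the Normal(mean=0, std=0.0001) example, here is a quick sketch using scipy.stats (not pomegranate-specific):

```python
from scipy.stats import norm

d = norm(loc=0.0, scale=0.0001)   # the Normal(mean=0, std=0.0001) from above
e = 1e-6

# The density at a single point can be huge...
print(d.pdf(0.0))                        # ~3989.4, so its log is ~ +8.3

# ...but the probability of landing in a small interval around that point,
# P(-e < x < e) = CDF(e) - CDF(-e), is still well below 1.
print(d.cdf(e) - d.cdf(-e))              # ~0.008

# Dividing that interval probability by the interval width 2e recovers the
# density; this is the (P(x+e) - P(x-e)) / 2e expression above, with P as the CDF.
print((d.cdf(e) - d.cdf(-e)) / (2 * e))  # ~3989 again
```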

ko62147 commented 2 months ago

Understood. Thanks for the clarification. What do you recommend to reduce overfitting for HMMs?

ko62147 commented 2 months ago

I am fitting/training HMMs on time series (datetime) data transformed into radial basis functions or sine/cosine vectors and scaled with a min-max scaler. However, I keep obtaining positive log_probability values for some of the test observation sequences with these transformations. Based on your experience:

  1. What would you recommend to address the positive log_probability values returned for the test observation sequences?
  2. What time series (datetime) transformation would you recommend for datetime observations to fit an HMM?
  3. What do you recommend to eliminate overfitting in HMMs trained on these (continuous) observations?
  4. Is it viable/reasonable to combine (transformed/preprocessed) datetime and binary features as observation sequences to fit/train an HMM?

jmschrei commented 2 months ago

  1. Having positive log probability values isn't a problem that needs fixing. The math is still valid; you just need to know what it means and why.
  2. If you're going to use values explicitly scaled to the 0-1 range, you might want to use a distribution that is explicitly defined on that range, like a Beta (you'd have to implement your own). If you want negative log probabilities and are using a Normal distribution, you might try mean/std scaling instead (see the sketch after this list).
  3. It depends on the model parameters. What does the transition matrix look like? What are the distributions and what do their parameters look like?
  4. Sure, just use https://github.com/jmschrei/pomegranate/blob/master/pomegranate/distributions/independent_components.py. This class lets you pass in one univariate distribution per feature, and each can be a completely different distribution type; see the sketch below. The one catch is that it doesn't learn covariance across any of the features.
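
As a rough sketch of 2 and 4 together, assuming the v1.x API (Normal, Bernoulli, IndependentComponents, DenseHMM) with made-up feature values; treat the exact constructor and attribute names as illustrative:

```python
import torch

from pomegranate.distributions import Bernoulli, IndependentComponents, Normal
from pomegranate.hmm import DenseHMM

# Made-up data: 10 sequences of 24 time steps with 3 features each.
# Features 0-1: sine/cosine encoding of hour of day; feature 2: a binary flag.
hours = torch.arange(24).float().repeat(10, 1)
sin_t = torch.sin(2 * torch.pi * hours / 24)
cos_t = torch.cos(2 * torch.pi * hours / 24)
flag = (torch.rand(10, 24) > 0.5).float()
X = torch.stack([sin_t, cos_t, flag], dim=-1)        # shape (10, 24, 3)

# Standardize the continuous columns (mean 0, std 1) rather than min-max
# scaling them, so a Normal emission is less prone to a tiny fitted std.
mu = X[:, :, :2].mean(dim=(0, 1))
sd = X[:, :, :2].std(dim=(0, 1))
X[:, :, :2] = (X[:, :, :2] - mu) / sd

# One IndependentComponents emission per hidden state: a Normal for each
# continuous feature and a Bernoulli for the binary one. As noted above,
# no covariance across features is learned.
states = [IndependentComponents([Normal(), Normal(), Bernoulli()])
          for _ in range(2)]
model = DenseHMM(states)
model.fit(X)

# Inspecting the learned parameters (point 3): the transition structure and
# the fitted per-state emission distributions.
print(model.edges)
print(model.log_probability(X[:1]))
```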