jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License

How to improve hmm prediction accuracy #472

Closed B4marc closed 6 years ago

B4marc commented 6 years ago

For two months now I have been testing HMMs on my data, but I am running out of ideas, so I hope one of you can help me.

My continuous data have 4 dimensions, and I have 4 samples of 500 observations each, with labels between 0 and 2 (so 3 labels in total -> 3 states). I defined my states with 3 MultivariateGaussianDistributions, even though it is not the best distribution for these data, as I figured out by testing and visualising (#190). Although

"it's not likely to work well if you just throw your data at a randomly initialized dense HMM" (#141)

I tried out the "from_samples" method. First as unsupervised learning, which, according to other issues, in most cases results in a better fit than supervised learning. After that I applied supervised learning, and then I built up the structure manually, as you can see in the following code:

# correct labels
state_name = ['s0', 's1', 's2']

# Label_array = list of int32, shape [2000, 1]
# Emission    = list of 4 ndarrays, each of shape [500, 4]
# Label_1     = shape [500, 1], values in {0, 1, 2}
# Emission_1  = shape [500, 4]
# prediction  = model.predict(Emission_1)
# accuracy    = sklearn.metrics.accuracy_score(label_1, prediction)

lab = [[state_name[Label_array[i][0]]] for i in range(len(Label_array))]

model_0 = HiddenMarkovModel.from_samples(MultivariateGaussianDistribution, n_components=3, X=Emission)

# after matching the right states with np.vectorize and e.g. my_dict = {0: 1, 1: 2, 2: 0}
# accuracy = 0.7489421720733427
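For clarity, the matching step looks roughly like this (the mapping dict is just an example and has to be read off from the learned states):

import numpy as np
from sklearn.metrics import accuracy_score

my_dict = {0: 1, 1: 2, 2: 0}                     # example: learned state index -> true label
remap = np.vectorize(my_dict.get)
prediction = remap(model_0.predict(Emission_1))  # predict returns state indices
print(accuracy_score(label_1, prediction))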

model_1 = HiddenMarkovModel.from_samples(MultivariateGaussianDistribution, n_components=3, X=Emission, labels=lab)
# accuracy = 0.6572637517630465

model_2 = HiddenMarkovModel.from_samples(MultivariateGaussianDistribution, n_components=3, X=Emission, labels=lab, algorithm='labeled')
# accuracy = 0.7418899858956276

model_3 = HiddenMarkovModel.from_samples(MultivariateGaussianDistribution, n_components=3, state_names=['s0', 's1', 's2'], X=Emission, labels=lab, algorithm='labeled')
# accuracy = 0.7418899858956276

Since the transition matrices of model_2 and model_3 turned out to be identical, I tried out the fit method:

model_3.fit(Emission, algorithm='baum-welch', verbose=True)
model_3.dense_transition_matrix()
# transition_matrix changed
prediction_part1 = model_3.predict(Emission_1)
# accuracy gets worse: 0.6572637517630465
# model_1 (from_samples with 'baum-welch') and then fitting with 'labeled' -> no accuracy change: 0.6572637517630465

# defining the structure by using a confusion matrix of label_1 for the transition matrix and optimizing the transition matrix parameters manually

# reusing the states (emission distributions) from the earlier fit
s0 = model_part1_0.states[0]
s1 = model_part1_0.states[1]
s2 = model_part1_0.states[2]

model_test = HiddenMarkovModel()
model_test.add_states(s0, s1, s2)
model_test.add_transition(model_test.start, s2, 1.0)
model_test.add_transition(s0, s0, 0.004)
model_test.add_transition(s0, s2, 0.417499777799998135)
model_test.add_transition(s1, s1, 0.004)
model_test.add_transition(s1, s2, 0.66500000000000023)
model_test.add_transition(s2, s0, 0.002)
model_test.add_transition(s2, s1, 0.002)
model_test.add_transition(s2, s2, 0.7470000000000005)
model_test.add_transition(s2, model_test.end, 0.1)
model_test.bake()
model_test.dense_transition_matrix()
# transition_matrix different comparing to the other ones before
# accuracy 0.9548660084626234
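For reference, deriving the transition matrix from the label sequences (as mentioned in the comment above) could look roughly like this (a sketch; Label_sequences is a hypothetical list with one integer label sequence per sample):

import numpy as np

counts = np.zeros((3, 3))
for seq in Label_sequences:                          # hypothetical: one int label sequence per sample
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
trans = counts / counts.sum(axis=1, keepdims=True)   # row-normalized transition counts
print(trans)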

# plain assignment would alias the same model object, so copy instead
import copy
model_test1 = copy.deepcopy(model_test)
model_test2 = copy.deepcopy(model_test)

model_test1.fit(Emission, algorithm='baum-welch', verbose=True)
prediction_test = model_test1.predict(Emission_1)
print(accuracy_score(label_1, prediction_test))
# Total Training Improvement: 18844.366279767004
# accuracy 0.8039492242595204

model_test2.fit(Emission, labels=lab, algorithm='labeled', verbose=True)
prediction_test = model_test2.predict(Emission_1)
print(accuracy_score(label_1, prediction_test))
# Total Training Improvement: 0.0
# accuracy 0.9548660084626234

Don't misunderstand me, I am totally fine with this last result, but with more states this procedure is not applicable, or only with a lot of effort. Thus, the following questions come up: 1) Why is the transition matrix not learned "optimally" by the from_samples method? Is there a simple/general reason? 2) How can the baum-welch fitting reduce the accuracy?

3) As described in #190: "I tried out the Naive Bayes Classifier and it is working out better than my current hmm. Regarding the distribution, UniformDistribution fitted best with the Naive Bayes Classifier."

So, does it make sense to use a different distribution? Using an IndependentComponentsDistribution with 4 UniformDistributions? Is this goal reachable by detecting the most significant emission dimension for each state and defining each MultivariateGaussianDistribution with just this state-specific dimension? (This currently produces an error.) Or do I have to use an IndependentComponentsDistribution with 1 non-zeroed (out of 4, where 3 of them are zeroed) NormalDistribution/UniformDistribution?

4) Looking at the results of model.predict_proba(Emission_1), one can see that the predicted probability for the true state is too low, but switches after one observation. Since the observations always follow a certain sequence, does this mean that the distributions should be defined differently? If yes, could it be improved by using a different statistical distance measure in the k-means initialization (like Taxicab geometry instead of L2, because of the dimensionality)? Or by using a GMM instead of an MVG?

I am truly grateful for any hint! :)

jmschrei commented 6 years ago

Howdy

It's hard for me to give good feedback as to how you could improve your model when I know nothing about your data set.

  1. Presumably one of the reasons that you are getting good performance when you manually tuned the model is that you used the testing data set to define your transition probabilities. You should be using a transition matrix defined from lab, not Label_1, or else you risk having your test set bleed into your training set.

I'm not sure where you got the values for your transition matrix but the probabilities in all the transitions leaving a state should sum to 1. pomegranate will auto-normalize these values if that's not the case, but you should be aware of the issue in the future. For example:

model_test.add_transition(s1, s1, 0.004)
model_test.add_transition(s1, s2, 0.66500000000000023)

the sum of the out-edges here is ~0.669 when it needs to be 1.0.
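As a sketch, the auto-normalization is equivalent to dividing each out-edge by the row sum, so these two edges effectively become:

total = 0.004 + 0.66500000000000023   # ~0.669
print(0.004 / total)                  # s1 -> s1 becomes ~0.006
print(0.66500000000000023 / total)    # s1 -> s2 becomes ~0.994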

A reason it may not be "optimal" is that it is fitting well to the training set, but this may be overfitting and so inference may not generalize well to other data sets. This is where the concept of regularization comes in, either in the form of smoothing parameters, or in graphical models, totally eliminating edges. In the model you hand-wrote, you've eliminated the edges from s0 to s1 and from s1 to s0. This might help the model generalize better.

  2. It can reduce accuracy when your labels don't correspond to clusters in the training set. You might consider projecting your data down into two dimensions and seeing if the labels correspond with clusters or if they are interspersed, e.g. as in the sketch below.
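For example, a quick 2D projection (a sketch using scikit-learn and matplotlib; Emission and Label_array are the variables from your post):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X = np.vstack(Emission)                      # (2000, 4) stacked observations
y = np.asarray(Label_array).ravel()          # (2000,) labels
X2 = PCA(n_components=2).fit_transform(X)
plt.scatter(X2[:, 0], X2[:, 1], c=y, s=5)    # color by label to check cluster/label overlap
plt.show()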

  3. I really can't say without knowing more about the data. I think the best approach for you is to understand your data better and find the appropriate distribution, rather than trying many things and hoping that something works out. If a naive Bayes classifier works well, then perhaps neither a transition matrix nor a covariance matrix across features is needed to model your data well. An easy way to understand your data is to plot histograms of each feature (see the sketch below). If you're seeing a bell curve, then a uniform distribution is probably not appropriate. If your data is categorical, then neither a normal nor a uniform distribution is appropriate.
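For example (a sketch; Emission is the list of arrays from your post):

import numpy as np
import matplotlib.pyplot as plt

X = np.vstack(Emission)
fig, axes = plt.subplots(1, X.shape[1], figsize=(16, 3))
for i, ax in enumerate(axes):
    ax.hist(X[:, i], bins=50)                # the histogram shape hints at the right distribution
    ax.set_title('feature {}'.format(i))
plt.show()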

  4. I don't understand what you mean. If you're saying that the labels always go in the sequence "A A B A A B", then you can force this in your model structure, as sketched below. This would be a form of regularization that can give you improved performance.
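As a sketch of what that structure could look like: for a fixed cycle s0 -> s1 -> s2 -> s0 you would only add the allowed edges, so every other transition has probability zero (d0, d1, d2 stand in for your fitted distributions; the probabilities are placeholders):

from pomegranate import HiddenMarkovModel, State

s0 = State(d0, name='s0')
s1 = State(d1, name='s1')
s2 = State(d2, name='s2')

model = HiddenMarkovModel()
model.add_states(s0, s1, s2)
model.add_transition(model.start, s0, 1.0)
model.add_transition(s0, s0, 0.9)   # stay in s0
model.add_transition(s0, s1, 0.1)   # the only way out of s0
model.add_transition(s1, s1, 0.9)
model.add_transition(s1, s2, 0.1)
model.add_transition(s2, s2, 0.9)
model.add_transition(s2, s0, 0.1)
model.bake()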

B4marc commented 6 years ago

Hi jmschrei, thank you very much for your quick and detailed answer! I am still on it, but I will not be able to work on my problem before Friday.

B4marc commented 6 years ago

Hi jmschrei,

I thought it might be better to see if I can handle the problem in a different way, but I have not been successful so far. Coming back to the questions:

1) It was actually my intention to overfit my model, because I thought that way I could find the mistakes I made more easily, knowing that I should be able to reach nearly 100% accuracy (independently of the data), as the Naive Bayes Classifier was approaching. I have not been able to reach that until now. I tested the model with different data as well, and the accuracy stays at nearly 93%, which is fine and which I take to indicate a sufficient transition matrix. (I generated a transition matrix with lab and got the same transition matrix, since the label sequence is the same in each sample. Regarding the out-edges not summing up to 1: I trusted the auto-normalization to handle this and was a bit lazy.)

2) This is a good hint, since the Naive Bayes Classifier results were good; I didn't think about that. I am going to look into it.

3) Concerning the data: even though a LogNormalDistribution would match the histograms better, the hmm with MultivariateGaussianDistribution produces better results than an IndependentComponentsDistribution of LogNormalDistributions. That is kind of strange to me.

4) Yes, that's what I meant. How can I force this into the structure? Is there another way besides the transition matrix, or a transition matrix of higher order?

However, my feature dimensions were already a reduced form. If I use the unreduced features with a dimensionality of nearly 100, the model's behaviour gets very interesting.

5) Clustering the emissions of the samples according to their labeled state and fitting a MultivariateGaussianDistribution to each sample cluster returns 3 MultivariateGaussianDistributions. I plotted the results of MultivariateGaussianDistribution[i].log_probability(clustered_samples) together with the spectrum (from the clustered emissions). Each activity (shown in the spectrum) is perfectly recognized by the state distribution belonging to that activity, which returns the highest log probability during this activity (in this case log_probability returns the likelihood, #491, right?). But if I use these fitted MultivariateGaussianDistributions in the hmm, the hmm randomly jumps between mostly two states. How is this possible if I haven't changed the model except the MultivariateGaussianDistributions (because of the changed feature dimension)? The MultivariateGaussianDistributions are perfectly describing the states' properties, looking at the plots described above.

6) In my understanding, the only thing I need to change for the new dimensionality is the distributions, which should generate accurate likelihoods for each emission; thus the most likely state can be chosen. The transition matrix and structure of the model should not need to change. But using MultivariateGaussianDistributions I get the best prediction with a different transition matrix. Can this be correct?
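For reference, the clustering/fitting step from 5) looks roughly like this (a sketch; Emission and Label_array are my variables from the first post):

import numpy as np
from pomegranate import MultivariateGaussianDistribution

X = np.vstack(Emission)                      # (2000, 4) stacked observations
y = np.asarray(Label_array).ravel()          # (2000,) labels
dists = [MultivariateGaussianDistribution.from_samples(X[y == k]) for k in range(3)]

logp = np.array([d.log_probability(X) for d in dists])   # per-state log likelihoods
print(logp.argmax(axis=0))                   # most likely state per observation, ignoring transitions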

B4marc commented 6 years ago

To my 5th question: could this be the reason?

For continuous distributions like the normal, the forward algorithm calculates the joint probability of each state j with the history of evidence over the support of the entire normal, and then evaluates the joint distribution at the point i. Since these are probability density functions, evaluating them at any single point can yield "probabilities" above 1.

High values in the forward matrix either mean that the observation is really likely given the model, or that the normal is constrained due to constraints on the data (which applies here). So the surprisingly high log probabilities don't necessarily indicate a good model, just a constrained space of observations.
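A quick numeric illustration of the density point (a sketch; the values are made up): a narrow normal evaluated at its mean already exceeds 1.

from scipy.stats import norm

print(norm(loc=0.0, scale=0.01).pdf(0.0))   # ~39.9: a density value, not a probability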