jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License
3.29k stars 590 forks

[help wanted] model.viterbi does not always return a result #1035

Open Niloufar375 opened 1 year ago

Niloufar375 commented 1 year ago

Hello,

I am working on a project in which I have sequences of port visits for a number of ships. I have around 1500 of these sequences. My objective is to use an HMM to predict the next port visit given a new sequence of ports. First, I use the Baum-Welch algorithm to learn both the transition and emission probabilities from the data. Then, given a new test sequence, I use the Viterbi algorithm to find the most probable state sequence. Finally, with the help of the learned transition matrix, I retrieve the most probable next port.
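To make the last step concrete, here is a minimal sketch of how a next-port distribution could be read off once Viterbi has produced a last hidden state. It is plain Python and does not use pomegranate; the state names, port names, probabilities, and the helper `next_port_distribution` are all hypothetical. Note that it combines the transition row with the emission distributions, since the transition matrix alone gives the next state rather than the next port.

```python
def next_port_distribution(last_state, transition, emission):
    """Return P(next port | last decoded hidden state).

    Sums over every possible next state: P(next state | last state)
    times P(port | next state).
    """
    scores = {}
    for next_state, p_trans in transition[last_state].items():
        for port, p_emit in emission[next_state].items():
            scores[port] = scores.get(port, 0.0) + p_trans * p_emit
    return scores


# Toy parameters (hypothetical, not learned from any real data).
transition = {
    "s0": {"s0": 0.2, "s1": 0.8},
    "s1": {"s0": 0.5, "s1": 0.5},
}
emission = {
    "s0": {"Rotterdam": 0.9, "Antwerp": 0.1},
    "s1": {"Antwerp": 0.25, "Hamburg": 0.75},
}

# Suppose Viterbi ended in state "s0" for the test sequence.
dist = next_port_distribution("s0", transition, emission)
best_port = max(dist, key=dist.get)
print(best_port, dist)
```

With a trained model, the two dictionaries would be filled from the learned transition matrix and the per-state emission distributions.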

I am facing some problems and would really appreciate feedback on them.

First off, there are around 350 different ports in total and I have 1500 rows of data. Obviously I cannot run an HMM on all the data, so I made clusters of 'similar' sequences (my clustering method is K-means on the major ports of each sequence). Each cluster has between 20 and 60 sequences, and around 100 different ports appear in the sequences of each cluster.

As I said, the very first step is to learn the parameters in each cluster:

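from pomegranate import HiddenMarkovModel, DiscreteDistribution

# k is the number of hidden states; X_train is the list of port sequences in one cluster
model2 = HiddenMarkovModel.from_samples(
    distribution=DiscreteDistribution,
    n_components=k,
    X=X_train,
    algorithm='baum-welch',
    verbose=True,
    n_jobs=4,
    stop_threshold=1e-9,
)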

Here are the major problems I am facing:

Problem 1. model.viterbi or model.predict rarely gives an answer. 99 percent of the time the sequence is reported as impossible.

I read in earlier issues that this was a common problem for other people as well, and that it was fixed by adding a start node. However, I am not defining the states myself in my model; it learns everything from the samples. How can I add a start node when I know nothing about the structure?
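For readers trying to reproduce this, one possible way a sequence becomes "impossible" is sketched below: a test sequence contains a port that no state ever emitted during training, so every path has probability zero. Whether this is what happens here is an assumption, not something confirmed in this thread. The sketch is a plain-Python log-space Viterbi with hypothetical toy parameters and a hypothetical `smooth` helper that adds a small pseudocount to every emission; it does not call pomegranate.

```python
import math

NEG_INF = float("-inf")


def log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi(obs, states, start, trans, emit):
    """Return (best log probability, best path); path is [] if impossible."""
    score = {s: log(start[s]) + log(emit[s].get(obs[0], 0.0)) for s in states}
    back = []
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev[r] + log(trans[r][s]))
            score[s] = (prev[best_prev] + log(trans[best_prev][s])
                        + log(emit[s].get(symbol, 0.0)))
            pointers[s] = best_prev
        back.append(pointers)
    last = max(states, key=lambda s: score[s])
    if score[last] == NEG_INF:
        return NEG_INF, []
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return score[last], path[::-1]


def smooth(emit, vocabulary, pseudocount=1e-3):
    """Add a pseudocount to every symbol and renormalise each state."""
    smoothed = {}
    for s, dist in emit.items():
        total = sum(dist.get(v, 0.0) + pseudocount for v in vocabulary)
        smoothed[s] = {v: (dist.get(v, 0.0) + pseudocount) / total
                       for v in vocabulary}
    return smoothed


# Toy model (hypothetical): port "D" never appeared in training.
states = ["s0", "s1"]
start = {"s0": 0.5, "s1": 0.5}
trans = {"s0": {"s0": 0.5, "s1": 0.5}, "s1": {"s0": 0.5, "s1": 0.5}}
emit = {"s0": {"A": 0.5, "B": 0.5}, "s1": {"B": 0.5, "C": 0.5}}

test_seq = ["A", "B", "D"]
logp_raw, path_raw = viterbi(test_seq, states, start, trans, emit)
logp_smooth, path_smooth = viterbi(
    test_seq, states, start, trans, smooth(emit, ["A", "B", "C", "D"]))
print(path_raw, path_smooth)
```

In this toy case the unsmoothed model returns an empty path, while the smoothed one returns a three-state path. With about 100 distinct ports and only 20 to 60 sequences per cluster, unseen ports in test data seem plausible, but that should be checked against the actual data.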

Problem 2 (falls under the same category as Problem 1). When I try to use the model on the test data set, it seems the model does not recognize the test sequence. When I give it part of a test sequence (or even the whole sequence) and call model.viterbi(seq), it returns an empty list.

I realized that sometimes it matters a lot which part of the test sequence I give the model. For example, if I give the model port a, port b, port c, port d, model.viterbi works and I can get the next port estimate from the transition matrix. But when I give it port b, port c, port d, model.viterbi returns an empty list again. Why is that?
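One hypothetical explanation for this behaviour, offered as an assumption rather than a diagnosis, is that the learned start probabilities put all their mass on states that emit only the ports that begin the training sequences. A subsequence that begins mid-route then has zero probability even though the full sequence does not. The sketch below shows this with a plain-Python forward algorithm on a toy model; all names and numbers are made up for illustration.

```python
def sequence_likelihood(obs, states, start, trans, emit):
    """Forward algorithm: total probability of obs under the model."""
    alpha = {s: start[s] * emit[s].get(obs[0], 0.0) for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: sum(alpha[r] * trans[r][s] for r in states)
               * emit[s].get(symbol, 0.0)
            for s in states
        }
    return sum(alpha.values())


# Toy model (hypothetical): every training sequence started in s0,
# and s0 only ever emits port "a".
states = ["s0", "s1"]
start = {"s0": 1.0, "s1": 0.0}
trans = {"s0": {"s0": 0.0, "s1": 1.0}, "s1": {"s0": 0.0, "s1": 1.0}}
emit = {"s0": {"a": 1.0}, "s1": {"b": 0.5, "c": 0.25, "d": 0.25}}

full = sequence_likelihood(["a", "b", "c", "d"], states, start, trans, emit)
partial = sequence_likelihood(["b", "c", "d"], states, start, trans, emit)
print(full, partial)
```

Here the full sequence has positive likelihood, while the one starting at port b has likelihood zero because no state with nonzero start probability can emit b. If the trained model looks like this, inspecting its start probabilities would show which ports a sequence is allowed to begin with.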

Thank you in advance for your help!