jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License

HMM labeled fit zero improvement #250

Closed himat closed 7 years ago

himat commented 7 years ago

I used hmm.fit with the labeled fitting, but I got 0 improvement.

Here's my code: https://gist.github.com/himat/7c07190af11a1b16917b53d7a872c167

Am I doing something wrong with the list of lists that the fit function expects for the input?

jmschrei commented 7 years ago

Howdy. To determine if this is a bug with the labeled fitting algorithm or a formatting issue, try the default fitting technique. If you see an improvement, it might be an issue on my end. I know there have been a few issues raised recently when trying to use it, and it's on my queue of things to look at. If possible, it would be good if you can email the data (or a subset of it) to me so that I can easily take a look.

himat commented 7 years ago

Based on your comment on #237 about how both the data and labels should be a list of lists, I changed my labels to label_states = [[model.states[i] for i in label_indices]] so that it's just a list containing a single list, since I am only giving it one continuous sequence of observations. Previously, it was a list of multiple lists (each observation was its own list).

But now I get the error

Accuracy:  0.663120567376
python(50640,0x7fffbbf133c0) malloc: *** error for object 0x7f864b70b9c8: incorrect checksum for freed object - object was probably modified after being freed.
*** set a breakpoint in malloc_error_break to debug
Abort trap: 6

Here's my data https://drive.google.com/open?id=0B30NQVrTKZ_GN1pwaEVELURVUnc

jmschrei commented 7 years ago

Sorry for the delay in getting back to you, I've had back-to-back conferences to stress over.

The first thing I noticed about your code is that you have a major error when formatting the data. You write on L47:

head_yaw_data = [np.array([x]) for x in head_yaw_data]

This basically turns every observation into its own sample. Instead of seeing one sequence of length 282, the model sees 282 sequences of length 1. You should change it to just be:

head_yaw_data = [np.array(head_yaw_data)]
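To make the difference concrete, here is a quick shape check (a minimal sketch; the values are made up stand-ins for the real head-yaw data):

```python
import numpy as np

# Hypothetical stand-in for the real data: a flat list of 282 observations.
head_yaw_data = [0.1 * i for i in range(282)]

wrong = [np.array([x]) for x in head_yaw_data]  # 282 sequences of length 1
right = [np.array(head_yaw_data)]               # 1 sequence of length 282

print(len(wrong), wrong[0].shape)  # 282 sequences, each shaped (1,)
print(len(right), right[0].shape)  # 1 sequence, shaped (282,)
```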

This probably explains why you see the accuracy go down. When I just change that and run Baum-Welch, printing the log probability before and after through model.log_probability(head_yaw_data[0]), I get the following:

-0.408109182792
Improvement:  347.168762583
0.13173712484

If you propagate that change down to the labels by modifying L51 to the following:

label_states = [[model.states[i] for i in label_indices]]

then running with labels works properly. However, the before/after is now:

-0.408109182792
Improvement:  -204.840939168
-1.04338103841

This is not super surprising, because you only have a single sequence, and labeled training can often produce worse results than unsupervised. What you might consider doing is initializing your model based on the labels, then refining it using unsupervised training. When I say initializing, I mean do something like:

d_forward = NormalDistribution.from_samples(data[data['label'] == 'looking forward'])
s_forward = State(d_forward, name="looking forward")

That way you start off with a supervised estimate, but get the benefits of unsupervised training as well.
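That supervised initialization step could be sketched like this in pure NumPy (the data and label values here are made up; the per-label mean/std is the estimate NormalDistribution.from_samples computes for each state):

```python
import numpy as np

# Hypothetical labeled data standing in for the real head-yaw observations.
obs = np.array([0.02, -0.05, 0.01, 0.98, 1.05, 1.02])
labels = np.array(["looking forward"] * 3 + ["looking left"] * 3)

def supervised_init(obs, labels, state):
    """Per-state mean/std from the labels -- the same estimate
    NormalDistribution.from_samples would produce."""
    x = obs[labels == state]
    return x.mean(), x.std()

mu_f, sd_f = supervised_init(obs, labels, "looking forward")
mu_l, sd_l = supervised_init(obs, labels, "looking left")
```

Each (mean, std) pair would then seed a State(NormalDistribution(mu, sd), name=...) before a final round of Baum-Welch refines everything.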

Also note that to get the results I did above you have to change the distributions a bit to the following:

s_left = State(NormalDistribution(1.0, .2), name="looking left")
s_transit_left_forward = State(NormalDistribution(0.2, .1), name="-looking left to forward-")
s_transit_forward_left = State(NormalDistribution(0.6, .1), name="-looking forward to left-")
s_forward = State(NormalDistribution(0, .2), name="looking forward")
s_transit_right_forward = State(NormalDistribution(-0.2, .1), name="-looking right to forward-")
s_transit_forward_right = State(NormalDistribution(-0.6, .1), name="-looking forward to right-")
s_right = State(NormalDistribution(-1.0, .2), name="looking right")

This is because if you're initializing both the transitions to/from these states and the emissions of these states to the same values then you can't cluster properly when you run Baum-Welch.
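The symmetry problem is easy to demonstrate outside of HMMs: EM on a two-component Gaussian mixture with identical initial means never separates the components, while a small jitter lets them diverge (a minimal sketch with made-up data and fixed unit variances):

```python
import numpy as np

def em_step(x, mus, sds):
    """One EM step for a 2-component Gaussian mixture with equal weights,
    updating only the means."""
    def pdf(x, mu, sd):
        return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    # E-step: responsibility of each component for each sample.
    p0 = pdf(x, mus[0], sds[0])
    p1 = pdf(x, mus[1], sds[1])
    r0 = p0 / (p0 + p1)
    r1 = 1.0 - r0
    # M-step: responsibility-weighted means.
    return np.array([np.sum(r0 * x) / np.sum(r0),
                     np.sum(r1 * x) / np.sum(r1)])

# Two clear clusters at -2 and +2.
x = np.concatenate([np.full(50, -2.0), np.full(50, 2.0)])

mus_same = np.array([0.0, 0.0])   # identical initialization
mus_jit = np.array([-0.1, 0.1])   # tiny jitter
for _ in range(30):
    mus_same = em_step(x, mus_same, [1.0, 1.0])
    mus_jit = em_step(x, mus_jit, [1.0, 1.0])

# Identical init: both responsibilities stay 50/50, so both means stay
# at the overall mean (0) forever. Jittered init: the means pull apart
# toward the two clusters.
```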

Let me know if you have any other questions!

himat commented 7 years ago

Amazing information! Thank you so much!

  1. Do you know if there's a major difference in learning performance when using a single long observation sequence vs multiple smaller ones? In my code above, I'm using a single observation sequence after your edits. But would it perform better if I had broken it into smaller observations? I realize in some cases, it's not possible to have a single very long sequence of observations such as in a left-right HMM where the observations have to end eventually. But in my case, my data can go through any of the states at any time (just modeling a head moving left and right continuously) so I can just make one huge single observation sequence. Any insights about that?

  2. Also, why would labeled training be worse than unsupervised (Baum-Welch)?

  3. Can you explain your last statement more? Specifically, why you had to change the transition states to use these values instead of 0.4 for both

    s_transit_left_forward = State(NormalDistribution(0.2, .1), name="-looking left to forward-")
    s_transit_forward_left = State(NormalDistribution(0.6, .1), name="-looking forward to left-")
  4. So you recommended using .from_samples() for initializing my states, and that seems pretty good. But would it be possible to do the same thing for the transition states? Right now, I just initialized them to be on the border between the two looking states (like between forward and left), so I don't think it could be estimated from data?

Thank you!

jmschrei commented 7 years ago
  1. You should train your model on whatever type of sequence you'll be making predictions on. Training on one long sequence prevents you from being able to adequately train the probabilities of starting in each state. But you shouldn't randomly split a sequence into smaller ones for no good reason, because then you end up with unfaithful starting probabilities. If your 'small' sequences are one symbol long then you can't train your transition matrix at all.

  2. It can be worse if you don't have enough data because you can't properly train the transition matrix and end up with 0s that exist because certain transitions just don't happen to occur in the training set (though you can add pseudocounts to smooth this over).

  3. Sure. Imagine you're doing simple expectation-maximization with two normal distributions. If they have the same parameters, then in the E step all samples will be 50% one distribution and 50% the other, and in the M step you'll update the distributions to have the same new parameters because you're doing the same math. If you add in some jitter, then the two will diverge.

  4. It would likely be sufficient to just set the transitions to be a uniform over however many out edges there are. If you have a corner edge with two out transitions, then 0.5 for each should be fine, and inner nodes with 4 out transitions could get 0.25. Generally, setting the emissions to be different is enough.
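Point 2 above can be made concrete: estimating a transition matrix by counting labeled transitions leaves hard zeros wherever a transition never happens to occur, and a pseudocount smooths those away (a minimal sketch with a made-up label sequence; the counting code is illustrative, not pomegranate's internals):

```python
import numpy as np

states = ["forward", "left", "right"]
idx = {s: i for i, s in enumerate(states)}

# Hypothetical label sequence: "left" is never followed by "right".
seq = ["forward", "left", "forward", "right", "forward", "left", "forward"]

def transition_matrix(seq, pseudocount=0.0):
    # Count observed transitions, starting every cell at the pseudocount.
    counts = np.full((3, 3), float(pseudocount))
    for a, b in zip(seq, seq[1:]):
        counts[idx[a], idx[b]] += 1
    # Normalize each row into a probability distribution.
    return counts / counts.sum(axis=1, keepdims=True)

raw = transition_matrix(seq)                       # left->right is a hard 0
smoothed = transition_matrix(seq, pseudocount=0.5) # every entry is positive
```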

himat commented 7 years ago
  1. I didn't mean the actual transition probabilities, but the intermediate states I have that are called s_transit_forward_right and such in my code. I'm using those since I want to specifically capture when the head is moving. So my question was how to initialize those - I just made a Normal distribution with a mean on the border between s_forward and s_right for example.

Also, since I'm really only using the transition states as transitions that I want to capture happening, does using a Normal distribution for them make sense, or is there another distribution that would be better?

jmschrei commented 7 years ago

That seems like a reasonable idea. Particularly if you have a transition structure such that you need to go through the transition state, you'll likely get decent results. I can't say what distribution would match your data well, but normal distributions generally do well, and if the rest of your distributions are normal then intuitively this one should be as well.

jmschrei commented 7 years ago

Has this issue been resolved?