jmschrei / yahmm

Yet Another Hidden Markov Model repository.
MIT License

Training fails on 1000 samples from sunny-rainy example #10

Closed: nipunbatra closed this issue 10 years ago

nipunbatra commented 10 years ago

Notebook here

NB: I had modified sample as per #8 to generate sequences of length 1000.

jmschrei commented 10 years ago

Your sequence was thrown out for not being likely enough (i.e., it scored -inf on the forward algorithm); this is an underflow issue. I tried replacing the pair_lse function with the scipy one, and it also calculated those values as nan, because they are far too small. I am looking into writing a new implementation that will fix the underflow problem.
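
To make the underflow concrete: over a long sequence, the running product of per-step probabilities drops below the smallest representable double, so the only robust option is to stay in log space. A minimal illustration (the 0.05 per-step probability is just an assumed, illustrative value):

```python
import numpy as np

# A 1000-step sequence with a typical per-step probability of ~0.05
# (transition times emission) underflows float64 long before the end.
p_step = 0.05
prob = 1.0
for _ in range(1000):
    prob *= p_step
print(prob)                     # 0.0 -- underflowed, so log(prob) is -inf

# Summing log-probabilities instead keeps the quantity representable.
log_prob = 1000 * np.log(p_step)
print(log_prob)                 # about -2995.7, perfectly finite
```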

You also uncovered a bug in forward-backward: it normally returns two matrices, but if the sequence is deemed impossible it just returns -np.inf, which breaks downstream calls. Maybe the best solution is to do the calculation anyway, so callers always get matrices back.

Do your use cases require 1000-observation sequences, or are you just testing the limits?

nipunbatra commented 10 years ago

Some tricks for avoiding underflow:

  1. http://www.johndcook.com/blog/2012/07/26/avoiding-underflow-in-bayesian-computations/
  2. http://www.codeproject.com/Articles/25294/Avoiding-Overflow-Underflow-and-Loss-of-Precision

> Do your use cases require 1000-observation sequences, or are you just testing the limits?

I am sure I have handled far more samples than this with the scikit-learn implementation. How do they avoid this underflow?

nipunbatra commented 10 years ago

Also, are you normalizing at each iteration as suggested by Rabiner?

This Stack Overflow question raises the same issue: http://stackoverflow.com/questions/13391625/underflow-in-forward-algorithm-for-hmms
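
For reference, the normalization Rabiner describes amounts to rescaling the forward variables at every time step so they sum to one, and recovering the log-likelihood from the accumulated scaling factors. A rough NumPy sketch of the idea, not yahmm code, with the matrix layout assumed:

```python
import numpy as np

def scaled_forward(pi, A, B_obs):
    """Forward pass with per-step rescaling (Rabiner-style).

    pi    : (n,) initial state distribution
    A     : (n, n) transitions, A[i, j] = P(state j | state i)
    B_obs : (T, n) emission likelihoods, B_obs[t, j] = P(o_t | state j)
    Assumes every observation has nonzero probability under some state.
    """
    T, n = B_obs.shape
    alpha = np.empty((T, n))
    c = np.empty(T)                       # per-step scaling factors

    alpha[0] = pi * B_obs[0]
    c[0] = 1.0 / alpha[0].sum()
    alpha[0] *= c[0]

    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B_obs[t]
        c[t] = 1.0 / alpha[t].sum()
        alpha[t] *= c[t]                  # each row sums to 1, so no underflow

    log_likelihood = -np.log(c).sum()     # log P(O) = -sum_t log c_t
    return alpha, log_likelihood
```

The same scaling factors can be reused in the backward pass, which is what makes Baum-Welch on long sequences feasible.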

jmschrei commented 10 years ago

I am not normalizing at each iteration. It seems like a good idea, so I'll look into it.

I'm currently rewriting the logsumexp function to avoid underflow and to handle additional special cases for speed. It should be up by tomorrow. Just using that trick isn't enough; I implemented it as:

cdef inline double pair_lse( double x, double y ):
    # Computes log( exp(x) + exp(y) ) for two log-probabilities.
    if x + y == INF:
        return INF
    # Factor out the larger argument so exp() never sees a positive value.
    if x > y:
        return x + clog( cexp(y-x) + 1 )
    return y + clog( cexp( x-y ) + 1 )

As for sklearn, I tried both the sklearn implementation of logsumexp and the scipy implementation, and both underflowed. I am guessing sklearn might do normalization inside its HMM algorithm to ensure underflow doesn't happen.

jmschrei commented 10 years ago

Also, about sklearn: were you training on 1000+ observation sequences, or just computing the Viterbi path? The Viterbi path currently works in yahmm even at a million observations, since it never needs to compute exp(x).
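
The reason the Viterbi path is immune is that its recursion only takes maxima over sums of log-probabilities, so nothing ever has to be exponentiated. A minimal log-space sketch (not the yahmm implementation; the array layout is an assumption):

```python
import numpy as np

def log_viterbi(log_pi, log_A, log_B_obs):
    """Viterbi decoding done entirely with sums of log-probabilities.

    Because the recursion only takes max() over sums of logs, there is
    no exp() anywhere and therefore no underflow, however long the
    sequence is.
    """
    T, n = log_B_obs.shape
    delta = np.empty((T, n))
    back = np.zeros((T, n), dtype=int)

    delta[0] = log_pi + log_B_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A        # (from, to)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B_obs[t]

    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path, delta[-1].max()
```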

nipunbatra commented 10 years ago

Training with Baum-Welch. An example is here: http://nbviewer.ipython.org/github/nilmtk/nilmtk/blob/master/notebooks/fhmm.ipynb

jmschrei commented 10 years ago

  1. By weighting the rows as described in http://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf, I got a forward algorithm that does not suffer from underflow, tested on 1 million observations! I will test it further and extend it to the other algorithms tomorrow.
  2. I will also implement a new version of logsumexp, as described at http://gasstationwithoutpumps.wordpress.com/2014/05/06/sum-of-probabilities-in-log-prob-space/, to ensure that we never have underflow errors (see the sketch after this list).
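
For item 2, the standard way to make logsumexp robust is to factor out the maximum before exponentiating and to treat the all-(-inf) case explicitly. A short sketch of that trick (an assumption about the approach, not necessarily the exact code that lands in yahmm):

```python
import numpy as np

def logsumexp(log_values):
    """Compute log(sum(exp(x_i))) without leaving log space.

    Shifting by the maximum keeps every exp() argument <= 0, so the
    exponentials cannot overflow, and the shifted sum is always >= 1,
    so its log never underflows to -inf unless every term is -inf.
    """
    log_values = np.asarray(log_values, dtype=float)
    m = log_values.max()
    if not np.isfinite(m):        # all -inf (impossible event), or an inf term
        return m
    return m + np.log(np.exp(log_values - m).sum())
```
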
jmschrei commented 10 years ago

With the two changes described above, the underflow issue should be solved. Training on large datasets will still take time, though! Re-open this issue if it still doesn't work after the v0.1.3 release.