This is an issue because your sequence was thrown out for not being likely enough (i.e. it scored -INF on the forward algorithm). This is an underflow problem. I tried replacing the pair_lse function with the scipy one, and it also came out as nan for those values, because they are far too small. I am looking into writing a new implementation which will fix the underflow problem.
You also uncovered a bug with forward-backward: it normally yields two matrices, but if the sequence is deemed impossible it just returns -np.inf, which breaks downstream calls. Maybe the best solution is to do the calculation anyway, so callers always get a matrix back.
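For context, the underflow comes from multiplying hundreds of per-step probabilities together; here is a minimal numpy sketch (hypothetical values, not yahmm code) of why the score collapses to -inf:

```python
import numpy as np

# Hypothetical per-step probabilities for a 1000-step observation sequence.
step_probs = np.full(1000, 0.05)

# The naive product underflows to exactly 0.0 in double precision...
naive = np.prod(step_probs)          # 0.0
print(np.log(naive))                 # -inf -> the sequence looks "impossible"

# ...while summing logs keeps the same information without underflow.
print(np.sum(np.log(step_probs)))    # ~ -2995.7
```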
Do your use cases require 1000 observation sequences, or are you just testing the limits?
Some tricks
> Do your use cases require 1000 observation sequences, or are you just testing the limits?
I am sure I have handled many more samples before with the scikit-learn implementation. How do they avoid this underflow?
Also, are you normalizing at each iteration as suggested by Rabiner?
This question on Stack Overflow also mentions the same issue: http://stackoverflow.com/questions/13391625/underflow-in-forward-algorithm-for-hmms
I am not normalizing at each iteration. It seems like a good idea, so I'll look into that.
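For reference, Rabiner's trick is to rescale the forward variables at every time step and accumulate the log of the scaling factors; a minimal numpy sketch (illustrative names, not yahmm's API):

```python
import numpy as np

def forward_scaled(pi, A, B, obs):
    """Scaled forward pass: returns log P(obs | model) without underflow.

    pi : (n,) initial state probabilities
    A  : (n, n) transition matrix, A[i, j] = P(j | i)
    B  : (n, m) emission matrix, B[i, k] = P(symbol k | state i)
    obs: sequence of symbol indices
    """
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()              # scaling factor for t = 0
    alpha /= c
    log_prob = np.log(c)

    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]
        c = alpha.sum()
        alpha /= c               # keep alpha summing to 1 at every step
        log_prob += np.log(c)    # log-likelihood is the sum of log scale factors

    return log_prob
```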
I'm currently rewriting the logsumexp function to avoid underflow and to include other cases for speedups. It should be up by tomorrow. Just using that trick isn't enough; I implemented it as:
```cython
# Pairwise log-sum-exp: computes log( exp(x) + exp(y) ) stably.
# (cexp/clog are the C exp/log, e.g. cimported from libc.math; INF = float('inf'))
cdef inline double pair_lse( double x, double y ):
    if x + y == INF:
        return INF
    if x > y:
        return x + clog( cexp( y-x ) + 1 )
    return y + clog( cexp( x-y ) + 1 )
```
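For comparison, the standard array form of the same trick (shift by the maximum before exponentiating) looks roughly like the following; a minimal Python sketch, not the rewrite mentioned above:

```python
import numpy as np

def logsumexp(x):
    """Stable log(sum(exp(x))): shift by the max so exp never overflows."""
    x = np.asarray(x, dtype=float)
    m = x.max()
    if np.isinf(m):              # all -inf (impossible) or any +inf
        return m
    return m + np.log(np.sum(np.exp(x - m)))
```

Because the shifted exponents are never positive, exp can only underflow toward 0, so the result degrades gracefully to the largest term instead of overflowing.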
As for sklearn, I tried the sklearn implementation of logsumexp as well as the scipy implementation, and both still underflowed. I am guessing sklearn does normalization inside their HMM algorithm to ensure underflow doesn't happen.
Also about sklearn: were you doing training with 1000+ observations, or just computing the Viterbi path? The Viterbi path currently works in yahmm even at a million observations, since it never needs to compute exp(x).
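To illustrate why the Viterbi path is immune to this problem: in log space it only needs additions and maxima, never exp. A minimal numpy sketch (hypothetical names, not yahmm's implementation):

```python
import numpy as np

def viterbi_log(log_pi, log_A, log_B, obs):
    """Most likely state path, using only log-space additions and max."""
    n = len(log_pi)
    T = len(obs)
    v = log_pi + log_B[:, obs[0]]          # log score of best path ending in each state
    back = np.zeros((T, n), dtype=int)

    for t in range(1, T):
        scores = v[:, None] + log_A        # scores[i, j]: best path to i, then i -> j
        back[t] = scores.argmax(axis=0)
        v = scores.max(axis=0) + log_B[:, obs[t]]

    # Trace the best path backwards from the best final state.
    path = [int(v.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```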
Using Baum-Welch. An example is here: http://nbviewer.ipython.org/github/nilmtk/nilmtk/blob/master/notebooks/fhmm.ipynb
By implementing the two changes described above, the underflow issue should be solved. Training on large datasets will still take time, though! Re-open if it still doesn't work after the v0.1.3 release.
Notebook here
NB: I had modified sample as per #8 to generate sequences of length 1000.