jmschrei / pomegranate

Fast, flexible and easy to use probabilistic modelling in Python.
http://pomegranate.readthedocs.org/en/latest/
MIT License

Exception in _train_once_baum_welch with MultivariateGaussianDistribution #6

Closed matthiasplappert closed 9 years ago

matthiasplappert commented 9 years ago

I'm having a problem when using MultivariateGaussianDistribution in an HMM. My HMM has 10 states and is trained on sequences with 43 features, so each covariance matrix is 43x43 and diagonal. I do pass in the diagonal=True flag when training, but the problem described below also occurs when it is set to False. When training with Baum-Welch, the library logs the following exception:

Exception TypeError: "object of type 'NoneType' has no len()" in 'pomegranate.hmm.HiddenMarkovModel._train_once_baum_welch' ignored

This happens in the first iteration, and the training improvement is then 0.0. Naturally, the algorithm terminates immediately since the stop threshold is reached.

I've tried to work out exactly what is going on, but I don't have much experience with Cython and wasn't able to determine where the exception is raised.

It would be great if someone has ideas about what could cause this. If anyone has pointers on how to find the source of the exception, I'm more than glad to do some debugging myself, too.
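
For reference, here is a minimal sketch of the setup being described (the initialization values and transition structure are illustrative assumptions, and the train flags assume the 0.2-era API):

from pomegranate import *
import numpy as np

n_states, n_features = 10, 43

# Hypothetical reconstruction: 10 states with 43-dimensional Gaussian
# emissions; the means and covariances here are placeholders
states = [ State( MultivariateGaussianDistribution( np.random.randn( n_features ),
                                                    np.eye( n_features ) ),
                  name="s%d" % i ) for i in range( n_states ) ]

model = HiddenMarkovModel( "repro" )
model.add_states( states )
for s in states:
    model.add_transition( model.start, s, 1.0 / n_states )
    for t in states:
        model.add_transition( s, t, 1.0 / n_states )
model.bake()

# 5 sequences of 20 observations each
sequences = np.random.randn( 5, 20, n_features )
model.train( sequences, algorithm='baum-welch', diagonal=True, verbose=True )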

matthiasplappert commented 9 years ago

Also note that this does not happen if I use IndependentComponentsDistribution with 43 instances of NormalDistribution. I therefore believe this is a problem with the MultivariateGaussianDistribution implementation.
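
For completeness, the workaround looks roughly like this (a sketch; the initialization values are placeholders):

from pomegranate import *

# 43 independent univariate Gaussians instead of one 43-dimensional Gaussian
d = IndependentComponentsDistribution( [ NormalDistribution( 0, 1 ) for i in range( 43 ) ] )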

jmschrei commented 9 years ago

I'm guessing it is; I only tested MultivariateGaussianDistributions on GeneralMixtureModels, not HMMs. I'll look into it tonight and hopefully have it fixed by tomorrow. This is kind of a crunch time for me, though, so it may be put off until after Thursday, when my deadline is.

jmschrei commented 9 years ago

I have not had time to do this tonight. If you're okay with waiting until the end of the week, it's the first thing I'll do after my deadline. If not, I'd suggest going through the Baum-Welch code and, wherever I take the length of something, printing out its datatype to see where the mistake is. You can write pure Python syntax such as "print x" and "print len(x)" if you want to debug through printing.
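
For example, something along these lines dropped into hmm.pyx just before a suspect len() call (x stands for whatever object is being measured; purely illustrative):

print type(x)
print x is None
if x is not None:
    print len(x)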

jmschrei commented 9 years ago

There was a minor bug in the assumptions of the GaussianMixtureModel train method: it assumed the incoming weights were a numpy array, while the hidden Markov model internally optimizes arrays by using memoryview slices, which have no sum method. You also hit a second minor issue: GaussianMixtureModels don't have a "summarize" training method, so they default to storing all the data for summaries and passing it to "from_samples" when the "from_summaries" method is called, but the summaries variable was never initialized.
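
A pure-Python illustration of the first failure mode, using the builtin memoryview as a stand-in for the Cython memoryview slices the HMM passes around internally:

import numpy as np

weights = np.ones(5)
view = memoryview(weights)      # stand-in for a Cython memoryview slice
print hasattr(view, 'sum')      # False: memoryviews have no sum method
print np.asarray(view).sum()    # fix: wrap it back into an ndarray first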

I wrote up a quick test to make sure it worked.

from pomegranate import *
import numpy as np

# One sequence of 10 observations, each 41-dimensional
test = np.abs( np.random.randn( 1, 10, 41 ) )

x = np.abs( np.random.randn( 41, 41 ) )
y = np.abs( np.random.randn( 41, 41 ) )

# x + x.T symmetrizes each matrix; the large positive diagonal pushes the
# covariance toward positive definiteness
a = State( MultivariateGaussianDistribution( np.random.randn( 41 ), x + x.T + np.diag( np.random.randn(41)+15 ) ), name="a" )
b = State( MultivariateGaussianDistribution( np.random.randn( 41 ), y + y.T + np.diag( np.random.randn(41)+21 ) ), name="b" )

model = HiddenMarkovModel( "debugger" )
model.add_states([a,b])
model.add_transition( model.start, a, 0.5 )
model.add_transition( a, b, 0.5 )
model.add_transition( b, a, 0.5 )
model.add_transition( a, a, 0.5 )
model.add_transition( b, b, 0.5 )
model.bake()

model.train( test, verbose=True )

This resulted in the following:

Training improvement: 513.443028297
Training improvement: 0.372183677725
Training improvement: 89.319524598
Training improvement: 36.8543996521
Training improvement: 42.1239230414
Training improvement: -142.472310953
Total Training Improvement:  539.640748314

Let me know if this fixes the problem. You should be able to get the latest version either from pip or by cloning the repo.

jmschrei commented 9 years ago

Also, let me know whether this is comparable in speed with hmmlearn! If not, I'll work on making it faster next.

matthiasplappert commented 9 years ago

Hey, thank you so much for looking into this. Things seem better now, but I still run into issues when training the model on my data. Everything works fine if I use hmmlearn.

To make it a bit easier for you to see what's going on, I've put together a little script that is essentially copy-pasted from the larger project that I'm working on (so some things might seem a bit odd and/or unused). The ZIP archive also includes a sample of my data set that I use to train my HMM, and that reproducibly causes problems with your implementation. You can find the script and the data set in this ZIP archive: http://cl.ly/1Q3K2w1D3P22

Important: hmmlearn also had a bug that caused problems, but it has since been fixed. However, the last time I checked, the fix wasn't yet published on pip, so you might have to install hmmlearn directly from GitHub: https://github.com/hmmlearn/hmmlearn

For reference, here's the output that I get:

Training hmmlearn model ...
loglikelihood: 14579.097412
loglikelihood: 10411.386627
loglikelihood: 26672.080924
loglikelihood: 30683.017496
loglikelihood: 69031.034677
loglikelihood: 17760.469276
loglikelihood: 21207.962457
loglikelihood: 13702.046580
loglikelihood: 10091.132388
loglikelihood: 15298.439315

Training pomegranate model ...
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Training improvement: 62461.0910771
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._train_once_baum_welch' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Training improvement: 0.0
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
Total Training Improvement:  62461.0910771
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
loglikelihood: 0.000000
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
loglikelihood: 0.000000
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
loglikelihood: 0.000000
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
loglikelihood: 0.000000
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
loglikelihood: 0.000000
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
loglikelihood: 0.000000
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
loglikelihood: 0.000000
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
loglikelihood: 0.000000
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
loglikelihood: 0.000000
Exception numpy.linalg.linalg.LinAlgError: LinAlgError('singular matrix',) in 'pomegranate.hmm.HiddenMarkovModel._log_probability' ignored
loglikelihood: 0.000000

Hope that helps!

matthiasplappert commented 9 years ago

What I forgot to mention: the requirements for my test script are the latest version of hmmlearn (as mentioned above, it must be installed from GitHub, not via pip), scikit-learn, and, of course, pomegranate.

jmschrei commented 9 years ago

I'm not getting the same thing as you. I get the following:

Training hmmlearn model ...
loglikelihood: 13721.856551
loglikelihood: 11666.539108
loglikelihood: 27198.520132
loglikelihood: 32308.013482
loglikelihood: 43613.059225
loglikelihood: 18941.384830
loglikelihood: 20587.008101
loglikelihood: 15182.771377
loglikelihood: 10704.209738
loglikelihood: 18294.296116

Training pomegranate model ...
Training improvement: 138772.24459
Training improvement: 12096.4336755
Training improvement: 34892.0896727
Training improvement: 6679.37727834
Training improvement: 3557.55051992
Training improvement: 6281.61574595
Training improvement: -2010.74357056
Total Training Improvement:  200268.567912
loglikelihood: 13677.468512
loglikelihood: 7297.197466
loglikelihood: 22481.024339
loglikelihood: 25919.550080
loglikelihood: 29049.612756
loglikelihood: 13330.648552
loglikelihood: 18442.658388
loglikelihood: 9121.962793
loglikelihood: 1518.310274
loglikelihood: 10483.032939

pomegranate did take a long time compared to hmmlearn, with 570 of the 593 total seconds spent running script.py going to training the pomegranate model. Here is the first bit of my profile of the script:

         213726272 function calls (213289879 primitive calls) in 593.800 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.006    0.006  593.812  593.812 script.py:1(<module>)
        1    0.000    0.000  592.814  592.814 script.py:117(main)
        1    0.000    0.000  589.322  589.322 script.py:70(test_pomegranate)
        1   16.919   16.919  570.986  570.986 {method 'train' of 'pomegranate.hmm.HiddenMarkovModel' objects}
  1980480    8.927    0.000  566.178    0.000 _multivariate.py:332(logpdf)
  1980480   58.755    0.000  466.615    0.000 _multivariate.py:120(_psd_pinv_decomposed_log_pdet)
  1980482  126.097    0.000  202.166    0.000 decomp.py:205(eigh)
  1980480  112.225    0.000  136.380    0.000 _multivariate.py:100(_pinv_1d)
  9910928   64.521    0.000   64.521    0.000 {method 'reduce' of 'numpy.ufunc' objects}
  1980480   36.933    0.000   61.343    0.000 _multivariate.py:306(_logpdf)
  1980484   18.248    0.000   37.581    0.000 function_base.py:532(asarray_chkfinite)
  3962468    6.106    0.000   37.069    0.000 fromnumeric.py:1621(sum)
  1980482    1.774    0.000   31.429    0.000 lapack.py:239(get_lapack_funcs)
  1980482   11.636    0.000   29.654    0.000 blas.py:216(_get_funcs)
  3967015    3.983    0.000   28.342    0.000 _methods.py:23(_sum)
 13869313   24.023    0.000   24.023    0.000 {numpy.core.multiarray.array}
  1980482    2.517    0.000   20.649    0.000 fromnumeric.py:2048(amax)
  1980762    2.083    0.000   18.149    0.000 _methods.py:15(_amax)

Looks like _multivariate.py, the scipy module that calculates log-PDFs of multivariate Gaussians, is taking the lion's share of the time. I went to hmmlearn, plugged their log probability calculator into pomegranate, and got the following:

Training hmmlearn model ...
loglikelihood: 12912.085772
loglikelihood: 11433.259495
loglikelihood: 26335.342846
loglikelihood: 30861.138924
loglikelihood: 68270.810578
loglikelihood: 15707.781124
loglikelihood: 18395.247364
loglikelihood: 14779.575617
loglikelihood: 10526.716313
loglikelihood: 17578.801065

Training pomegranate model ...
Training improvement: 166318.04045
Training improvement: 34518.1532227
Training improvement: 22388.6208797
Training improvement: 21691.2851604
Training improvement: 13596.7419428
Training improvement: 1645.28423637
Training improvement: 1337.95677641
Training improvement: 1132.69345059
Training improvement: 1598.62529381
Training improvement: -505.380292208
Total Training Improvement:  263722.021121
loglikelihood: 20023.680156
loglikelihood: 13980.874607
loglikelihood: 25805.456011
loglikelihood: 35768.815885
loglikelihood: 30362.047245
loglikelihood: 20943.091238
loglikelihood: 22767.658304
loglikelihood: 17833.570320
loglikelihood: 18028.952644
loglikelihood: 18969.528247

Looks like pomegranate is fitting the data better; I'm not immediately sure whether that's a plus or a minus. As for time, the improvement is modest.

         235851992 function calls (234930067 primitive calls) in 455.600 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.005    0.005  455.614  455.614 script.py:1(<module>)
        1    0.000    0.000  431.544  431.544 script.py:117(main)
        1    0.000    0.000  428.240  428.240 script.py:70(test_pomegranate)
        1   76.787   76.787  418.333  418.333 {method 'train' of 'pomegranate.hmm.HiddenMarkovModel' objects}
  2723160   28.375    0.000  158.283    0.000 basic.py:104(solve_triangular)
  2723160    2.098    0.000  125.092    0.000 decomp_cholesky.py:37(cholesky)
  2723160   44.537    0.000  122.994    0.000 decomp_cholesky.py:15(_cholesky)
  8169484   49.303    0.000  109.555    0.000 function_base.py:532(asarray_chkfinite)
  5446322    3.730    0.000   91.957    0.000 lapack.py:239(get_lapack_funcs)
  5446322   24.857    0.000   88.227    0.000 blas.py:216(_get_funcs)
2723286/2723259    4.431    0.000   71.319    0.000 {map}
 13624091   70.756    0.000   70.756    0.000 {method 'reduce' of 'numpy.ufunc' objects}
  5446322   15.211    0.000   54.676    0.000 blas.py:172(find_best_blas_type)
  8169535    4.748    0.000   49.715    0.000 {method 'all' of 'numpy.ndarray' objects}
  5447802    7.205    0.000   46.739    0.000 fromnumeric.py:1621(sum)
  8169535    6.138    0.000   44.967    0.000 _methods.py:35(_all)

Looks like that got rid of ~20% of the time, indicating that the log probability calculation is not the primary reason pomegranate is so slow. I'm guessing the remaining gap comes from the vectorized implementation, where hmmlearn calculates log probabilities across whole arrays of samples instead of one sample at a time. I'll take a look soon into whether I can do that easily in pomegranate.
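
For the curious, here is a minimal sketch of a vectorized, Cholesky-based log-PDF routine in that spirit (illustrative code, not either library's actual internals):

import numpy as np
from scipy import linalg

def log_multivariate_gaussian_pdf(X, mean, cov):
    # Log-PDF of every row of X under N(mean, cov) in one call, rather
    # than one scipy logpdf call per sample
    n, d = X.shape
    L = linalg.cholesky(cov, lower=True)                      # cov = L L^T
    z = linalg.solve_triangular(L, (X - mean).T, lower=True)  # z^T z = Mahalanobis
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + np.sum(z ** 2, axis=0))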

matthiasplappert commented 9 years ago

Interesting that you don't run into the numerical problems that I do. Could you post some details about your environment? I'm running the test script on OS X with Python 2.7.9 and the following packages:

Cython==0.22
decorator==3.4.2
ghmmwrapper==0.8
hmmlearn==0.1.1
matplotlib==1.4.3
mock==1.0.1
networkx==1.9.1
nose==1.3.6
numpy==1.9.2
pomegranate==0.2.2
pybasicbayes==0.1.2
pyhsmm==0.1.3
pyparsing==2.0.3
python-dateutil==2.4.2
pytz==2015.2
scikit-learn==0.16.1
scipy==0.15.1
six==1.9.0
yahmm==1.1.3

I'll try to run the test script on an Ubuntu machine on Monday; maybe this is a problem with the vector math library (Accelerate) that is used on OS X.

jmschrei commented 9 years ago

This is what I have on Python 2.7.5:

# packages in environment at C:\Anaconda:
#
cython                    0.22                     py27_0  
hmmlearn                  0.1.1                     <pip>
matplotlib                1.3.1                np18py27_2  
mock                      1.0.1                    py27_0  
networkx                  1.8.1                    py27_0  
nose                      1.3.6                    py27_0  
numpy                     1.9.2                    py27_0  
python-dateutil           1.5                       <pip>
pytz                      2014.3                   py27_0  
scipy                     0.15.1               np19py27_0  
six                       1.6.1                    py27_0  
yahmm                     1.1.2                     <pip>

matthiasplappert commented 9 years ago

I tried this today on my lab machine, which runs Ubuntu, and I ran into the same problems described above. Not sure what to do about it.

jmschrei commented 9 years ago

I'm going to assume this is an issue on my end. I'll push an update that does vectorized log probabilities and look into it; hopefully that will fix the problem.

jmschrei commented 9 years ago

I have not forgotten about this; the end of the quarter is bringing a surge of work. I should have time by this weekend to take another look.

jmschrei commented 9 years ago

I'm going to update pomegranate soon to use sklearn's mixture module to determine log probabilities with full covariance matrices (see the usage sketch after the results below). These are the results I get. pomegranate is still slow, but this is a fundamental issue: pomegranate uses nested loops to fill out the responsibility matrix during training, whereas hmmlearn uses vectorized operations, because in hmmlearn the emission distribution is defined by the HMM class you choose. I'm going to open a new issue about speeding it up. If this works on your machine, please close this issue.

Training hmmlearn model ...
loglikelihood: 14378.826304
loglikelihood: 7436.041577
loglikelihood: 25527.992772
loglikelihood: 30680.158505
loglikelihood: 69017.608199
loglikelihood: 17633.716075
loglikelihood: 20481.085973
loglikelihood: 11173.908924
loglikelihood: 11421.345206
loglikelihood: 12271.200100

Training pomegranate model ...
Training improvement: 178572.691115
Training improvement: 59278.5564942
Training improvement: 17177.3112394
Training improvement: 17826.8669639
Training improvement: 12260.8237182
Training improvement: 3744.23513601
Training improvement: 4454.34300532
Training improvement: 271.079902405
Training improvement: 2488.98215595
Training improvement: 492.431898427
Total Training Improvement:  296567.321629
loglikelihood: 13276.204876
loglikelihood: 7820.492071
loglikelihood: 23310.844539
loglikelihood: 27623.989639
loglikelihood: 86289.825961
loglikelihood: 13624.032716
loglikelihood: 18092.748728
loglikelihood: 11937.756392
loglikelihood: 26996.503636
loglikelihood: 12731.536545
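
For reference, the sklearn helper mentioned above is presumably used along these lines (a usage sketch with placeholder shapes, not pomegranate's actual call site):

import numpy as np
from sklearn.mixture import log_multivariate_normal_density

X = np.random.randn(100, 43)             # 100 samples, 43 features
means = np.random.randn(2, 43)           # one mean per state
covars = np.array([np.eye(43)] * 2)      # full covariance matrices
# One vectorized call returns an (n_samples, n_states) array of log-densities
logp = log_multivariate_normal_density(X, means, covars, covariance_type='full')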

matthiasplappert commented 9 years ago

Hey Jacob,

Sorry for the late response; I've also been busy. I just updated pomegranate, and the issue seems to be resolved. So far, I've tested it with the example that I provided earlier and with a small subset of my actual data. I'm currently training a couple of pomegranate HMMs on a much larger dataset, which is going to take a while. I'm confident that everything works, so I'll close this issue now, but I'll reopen it if anything goes wrong.

Thanks for your time and help!

jmschrei commented 9 years ago

Hi @matthiasplappert

I'm unsure whether you're still using pomegranate, but I've been speeding things up. Training multivariate Gaussians is now significantly faster: 50 batches of 100,000 points (5 dimensions each) now take 11 seconds to train instead of 75 seconds. I should have this branch merged within a week. Hopefully this makes pomegranate a bit more attractive compared to hmmlearn =)