mattjj / pyhsmm

MIT License
547 stars 173 forks source link

Small change so that resampling doesn't choke when using a Multinomial distribution #1

Closed alexbw closed 11 years ago

alexbw commented 11 years ago

If a state hadn't yet been observed, then indexing "data" fails with an empty indexing array. Just added a quick check to make sure it doesn't happen.

mattjj commented 11 years ago

Hrm, can you give me a way to reproduce the problem?

Indexing with empty arrays should work fine:

In [1]: stateseq = np.array([1,2,3])

In [2]: data = randn(3)

In [3]: data[stateseq == 4]
Out[3]: array([], dtype=float64)

and calling a Distribution's resample method with the data argument set to an empty array like that should work. It works for the Distributions I usually plug in there (e.g. Gaussians), but my current guess is that for the Distribution class you're using for obs_distns has a bug where it doesn't like getting empty arrays.

alexbw commented 11 years ago

This fails for me:

from pyhsmm.util.text import progprint_xrange

stateseq = [1, 2, 3] \* 100
num_possible_states = len(np.unique(stateseq))
Nmax = 50

obs_hypparams = {'alpha_0':3, 
                 'K':num_possible_states}

obs_distns = [pyhsmm.distributions.Multinomial(**obs_hypparams) for state in range(Nmax)]

posteriormodel = pyhsmm.models.StickyHMM(obs_distns=obs_distns, init_state_concentration=6.0,
                                         kappa=50, alpha=0.1, gamma = 0.01)

posteriormodel.add_data(stateseq)

for index in progprint_xrange(100):
    posteriormodel.resample_model()

error message:


TypeError Traceback (most recent call last)

in () 17 18 for index in progprint_xrange(100): ---> 19 posteriormodel.resample_model() /Users/Alex/Dropbox/Science/Datta lab/Playground/HDP-HSMM/pyhsmm/models.pyc in resample_model(self) 60 # resample obsparams 61 for state, distn in enumerate(self.obs_distns): ---> 62 distn.resample([s.data[s.stateseq == state] for s in self.states_list]) 63 64 # resample transitions TypeError: only integer arrays with one element can be converted to an index On Nov 10, 2012, at 12:57 PM, Matthew Johnson wrote: > Hrm, can you give me a way to reproduce the problem? > > Indexing with empty arrays should work fine: > > In [1]: stateseq = np.array([1,2,3]) > > In [2]: data = randn(3) > > In [3]: data[stateseq == 4] > Out[3]: array([], dtype=float64) > and calling a Distribution's resample method with the data argument set to an empty array like that /should/ work. It works for the Distributions I usually plug in there (e.g. Gaussians), but my current guess is that for the Distribution class you're using for obs_distns has a bug where it doesn't like getting empty arrays. > > — > Reply to this email directly or view it on GitHub. > > ```
alexbw commented 11 years ago

Additionally, wrapping "stateseq" with np.asarray() results in an different error stemming from the same line:


ValueError Traceback (most recent call last)

in () 17 18 for index in progprint_xrange(100): ---> 19 posteriormodel.resample_model() /Users/Alex/Dropbox/Science/Datta lab/Playground/HDP-HSMM/pyhsmm/models.pyc in resample_model(self) 60 # resample obsparams 61 for state, distn in enumerate(self.obs_distns): ---> 62 distn.resample([s.data[s.stateseq == state] for s in self.states_list]) 63 64 # resample transitions /Users/Alex/Dropbox/Science/Datta lab/Playground/HDP-HSMM/pyhsmm/basic/pybasicbayes/distributions.pyc in resample(self, data) 659 660 def resample(self,data=[]): --> 661 hypparams = self._posterior_hypparams(*self._get_statistics(data)) 662 self.weights = np.random.dirichlet(np.where(hypparams>1e-2,hypparams,1e-2)) 663 /Users/Alex/Dropbox/Science/Datta lab/Playground/HDP-HSMM/pyhsmm/basic/pybasicbayes/distributions.pyc in _get_statistics(self, data) 670 counts = np.bincount(data,minlength=K) 671 else: --> 672 counts = sum(np.bincount(d,minlength=K) for d in data) 673 return counts, 674 /Users/Alex/Dropbox/Science/Datta lab/Playground/HDP-HSMM/pyhsmm/basic/pybasicbayes/distributions.pyc in ((d,)) 670 counts = np.bincount(data,minlength=K) 671 else: --> 672 counts = sum(np.bincount(d,minlength=K) for d in data) 673 return counts, 674 ValueError: The first argument cannot be empty.
mattjj commented 11 years ago

I was about to suggest that you put a np.array around the construction of your 'statseq' variable; the first error you saw is due to the fact that Python lists don't support logical or empty indexing, which the code always assumes. In fact, data should always be either an np.ndarray (or possibly a maskedarray) or a list of np.ndarrays (or masked arrays). So applying np.array as you did will fix the first problem. It may be a decent improvement to add some more "type-checking asserts" to the code to catch this kind of thing and provide more informative errors.

As for the second error, does your np.bincount fail on this?

In [1]: np.version.version
Out[1]: '1.6.2'

In [2]: np.bincount(np.array([],dtype=int),minlength=4)
Out[2]: array([0, 0, 0, 0])

IIRC, older versions of bincount wouldn't take an empty first argument, but it got fixed at some point so I removed the check I had for it. Not very good software engineering, especially since I probably didn't update the README's version information.

To fix that, you can either update your numpy or add a check for an empty array on that line.

There's a third error that comes up even when that is fixed! It is due to the fact that the Multinomial distribution isn't symbol-agnostic, in that it assumes its data will be in {0,1,...,K-1}, and in fact that's how the K parameter is defined. In other words, an ndarray of data should satisfy (data < K).all(). However, your sample code initializes the object with K=3 and then passes it data with a maximum index of 3, which is out of the range it expects. Since the code doesn't check against that sort of thing, you get a cryptic error like this:

ValueError: operands could not be broadcast together with shapes (3) (4) 

To avoid that, you should map your observation symbols to {0,1,...,something}. In other words, do [0,1,2]_100 or range(3)_100 instead.

That's probably bad behavior to leave as a secret. I can either make the requirement clear in the docstring for Multinomial and add some asserts OR I can add some feature like (internal to the Multinomial) mapping any data symbol set to the canonical range {0,1,...,K-1} (which can be done nicely with a defaultdict and itertools.count). The latter might be convenient for symbol sets like {'A','C','T','G'} or something. Do you have a suggestion as to which path is better?

One last suggestion: I think using the variable name stateseq in your example code is a bit confusing, since it's really a data sequence and not a state sequence according to the pyhsmm names. Maybe rename it to data or obs_sequence.

Summary

Let me know if I've misunderstood something, or if more surprises come up!

alexbw commented 11 years ago

Right on all (bin)counts! Upgraded NumPy and shifted all labels to live in [0, 1, ..., n], now works properly. Thanks!

On Nov 10, 2012, at 1:36 PM, Matthew Johnson notifications@github.com wrote:

I was about to suggest that you put a np.array around the construction of your 'statseq' variable; the first error you saw is due to the fact that Python lists don't support logical or empty indexing, which the code always assumes. In fact, data should always be either an np.ndarray (or possibly a maskedarray) or a list of np.ndarrays (or masked arrays). So applying np.array as you did will fix the first problem. It may be a decent improvement to add some more "type-checking asserts" to the code to catch this kind of thing and provide more informative errors.

As for the second error, does your np.bincount fail on this?

In [1]: np.version.version Out[1]: '1.6.2'

In [2]: np.bincount(np.array([],dtype=int),minlength=4) Out[2]: array([0, 0, 0, 0]) IIRC, older versions of bincount wouldn't take an empty first argument, but it got fixed at some point so I removed the check I had for it. Not very good software engineering, especially since I probably didn't update the README's version information.

To fix that, you can either update your numpy or add a check for an empty array on that line.

There's a third error that comes up even when that is fixed! It is due to the fact that the Multinomial distribution isn't symbol-agnostic, in that it assumes its data will be in {0,1,...,K-1}, and in fact that's how the K parameter is defined. In other words, an ndarray of data should satisfy (data < K).all(). However, your sample code initializes the object with K=3 and then passes it data with a maximum index of 3, which is out of the range it expects. Since the code doesn't check against that sort of thing, you get a cryptic error like this:

ValueError: operands could not be broadcast together with shapes (3) (4) To avoid that, you should map your observation symbols to {0,1,...,something}. In other words, do [0,1,2]_100 or range(3)_100 instead.

That's probably bad behavior to leave as a secret. I can either make the requirement clear in the docstring for Multinomial and add some asserts OR I can add some feature like transparently mapping any symbol set to the canonical range {0,1,...,K-1} (which can be done nicely with a defaultdict and itertools.count). Do you have a suggestion as to which path is better?

One last suggestion: I think using the variable name stateseq in your example code is a bit confusing, since it's really a data sequence and not a state sequence according to the pyhsmm names. Maybe rename it to data or obs_sequence.

Summary

pass in data as an np.ndarray (or list of np.ndarrays) update numpy so that this doesn't cause an exception: np.bincount(np.array([],dtype=int),minlength=4) give Multinomial data in the range {0,1,...,K-1} Let me know if I've misunderstood something, or if more surprises come up!

— Reply to this email directly or view it on GitHub.