mattjj / pyhsmm

MIT License

Instability with HSMMPossibleChangepoints, when allowing every frame to be a changepoint #20

Closed alexbw closed 10 years ago

alexbw commented 10 years ago

I've saved an HSMMStatesPossibleChangepoints state which, when s.messages_backwards() is called, first produces this warning —

/home/alexbw/Code/pyhsmm/internals/states.py:1211: RuntimeWarning: invalid value encountered in subtract
  + aDl[possible_durations-1] - normalizer,axis=0,out=betastarl[tblock])

and then the following traceback —

In [3]: with open("/scratch/bad_state.pkl", "r") as f: s = pickle.load(f)
s
In [4]: s.messages_backwards()
/home/alexbw/Code/pyhsmm/internals/states.py:1211: RuntimeWarning: invalid value encountered in subtract
  + aDl[possible_durations-1] - normalizer,axis=0,out=betastarl[tblock])
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-4-d12a0729327b> in <module>()
----> 1 s.messages_backwards()

/home/alexbw/Code/pyhsmm/internals/states.pyc in messages_backwards(self)
   1214         betal[-1] = 0.
   1215
-> 1216         assert not np.isnan(betal).any()
   1217         assert not np.isnan(betastarl).any()
   1218

AssertionError:

The state is saved and available on Jefferson.

Observations

betal and betastarl are almost entirely NaN; this is obviously the symptom we're working backwards from.

Starting with the "invalid value encountered in subtract" warning, the arrays involved are aDl[possible_durations-1] and normalizer. The values in normalizer are all very small —

array([  9.16634958e-16,  -2.84332720e-16,   8.35406888e-16,
        -1.16364025e-16,  -6.25881334e-16,  -1.14210176e-16,
        -8.92706352e-16,   2.06959016e-16,  -5.26356327e-16,
         2.42891155e-17,   1.39646320e-16,  -2.85753174e-16,
        -1.84506467e-15,  -5.88668758e-16,  -5.38410241e-16,
        -3.14779159e-16,  -4.72816793e-16,  -8.54616895e-16,
         4.20530006e-16,   6.07096070e-16,   2.74158877e-16,
        -9.29066769e-16,  -1.66557099e-15,   6.49373531e-16,
        -2.19957675e-17,   9.41834127e-16,  -8.53563725e-16,
        -4.33727380e-17,   1.78639797e-16,  -1.08130825e-15,
         3.45001062e-16,  -1.33058494e-15,  -9.29004361e-16,
         1.73516054e-16,  -1.98350807e-16,  -3.82846601e-18,
         2.94222657e-16,  -1.06017490e-16,   4.57774639e-16,
         5.29904454e-16])

The values in aDl are fine, except for the first several rows, which contain negative infinities —

>>> print np.isinf(aDl).sum(1)[:100]
array([40, 40, 39, 36, 35, 34, 26, 24,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0])

I suppose this means that having a changepoint at the very beginning of the data sequence is impossible under the model, which would make sense for the first frame, but it is puzzling that this continues into further frames.

Interestingly, spacing changepoints every 9, 10, or 100 frames works. However, spacing by 8 or fewer frames does not, and results in the same first 7 frames containing -inf.

Offsetting the first changepoint by providing changepoints as zip(range(10,len(data),1), range(11,len(data)-1,1)) results in the same error.

@mattjj, if I want to compare between possiblechangepoint and non-possiblechangepoint models, I need to be able to provide all frames as changepoints to the model. Any ideas as to why this might be happening?
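For concreteness, passing every frame as a changepoint can be sketched like this (the sequence length and the (start, end) block-tuple convention here are illustrative assumptions, not pyhsmm documentation):

```python
# Build per-frame changepoints as (start, end) block boundaries, following
# the zip(...) pattern used above. T is a hypothetical sequence length.
T = 100
changepoints = list(zip(range(0, T), range(1, T + 1)))

# every frame is its own block: (0, 1), (1, 2), ..., (T-1, T)
print(changepoints[0])   # (0, 1)
print(changepoints[-1])  # (99, 100)
```

With this construction every block has length 1, so the only possible duration at each block boundary is 1.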

mattjj commented 10 years ago

Here are some preliminary thoughts.

Don't think of aDl as indexed by time; think of it as indexed by duration. The (i,j)th entry of aDl is log p( duration = i+1 | state = j). So those -infs aren't only relevant to the start of the state sequence.
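A minimal sketch of that layout, using plain scipy negative binomials as stand-in duration distributions (the parameters and construction here are illustrative, not pyhsmm's actual code):

```python
import numpy as np
from scipy.stats import nbinom

# aDl[i, j] holds log p(duration = i+1 | state = j): rows index duration,
# columns index state. Parameters (r, p) per state are arbitrary examples.
T, num_states = 10, 3
durations = np.arange(1, T + 1)  # durations are 1-indexed

params = [(2, 0.5), (3, 0.4), (5, 0.3)]
aDl = np.column_stack([
    nbinom.logpmf(durations - 1, r, p)  # support starts at duration 1
    for (r, p) in params
])

print(aDl.shape)  # (10, 3)
```

So a -inf at aDl[i, j] means duration i+1 is impossible in state j, regardless of where in the sequence that duration is being considered.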

I think you're on the right track: it's an instability in the possiblechangepoints code when confronted with -inf probabilities in the durations. Maybe that code can be made stable. I'll try loading that state and take a look.

Those -infs happen because you're using the IntNegBin "variant" duration distribution. That distribution's support starts at r, meaning it has -inf log pmf values for durations below r. I think that's bad and we shouldn't use it, and getting rid of the -infs would probably fix this issue (but cause others?). Options:

  1. The IntNegBin class (without the "variant" name) doesn't have the start-at-r problem.
  2. Since the possiblechangepoints model doesn't use the intnegbin structure to make the messages any faster, maybe you can just try the NegativeBinomialDuration class. That puts a gamma prior on r (which doesn't have to be an integer, though otherwise it has the same interpretation) so you'd need to learn and play with those new hyperparameters (no more r_discrete_distn stuff). (You could also use any other duration distribution, but negative binomials are still great.)
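The start-at-r behavior can be illustrated with a shifted negative binomial (this is an illustrative reimplementation of the "variant" support, not pyhsmm's class):

```python
import numpy as np
from scipy.stats import nbinom

# A "variant"-style negative binomial shifts the support to {r, r+1, ...},
# so durations 1..r-1 get log pmf -inf. r and p are arbitrary examples;
# r=8 mirrors the observation that spacing by 8 or fewer frames fails.
r, p = 8, 0.5
durations = np.arange(1, 21)

variant_logpmf = nbinom.logpmf(durations - r, r, p)

print(np.isinf(variant_logpmf[:r - 1]).all())   # durations 1..7 impossible
print(np.isfinite(variant_logpmf[r - 1:]).all())  # durations >= 8 fine
```

That matches the earlier observation of exactly the first 7 durations being impossible.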
mattjj commented 10 years ago

Here's what I see with the error:

/home/alexbw/Code/pyhsmm/internals/states.pyc in messages_backwards(self)
   1209             np.logaddexp.reduce(betal[tblock:tblock+truncblock]
   1210                     + self.block_cumulative_likelihoods(tblock,tblock+truncblock,possible_durations)
-> 1211                     + aDl[possible_durations-1] - normalizer,axis=0,out=betastarl[tblock])
   1212             # TODO TODO put censoring here, must implement likelihood_block
   1213             np.logaddexp.reduce(betastarl[tblock] + Al, axis=1, out=betal[tblock-1])

FloatingPointError: invalid value encountered in subtract

In [2]: debug
> /home/alexbw/Code/pyhsmm/internals/states.py(1211)messages_backwards()
   1210                     + self.block_cumulative_likelihoods(tblock,tblock+truncblock,possible_durations)
-> 1211                     + aDl[possible_durations-1] - normalizer,axis=0,out=betastarl[tblock])
   1212             # TODO TODO put censoring here, must implement likelihood_block

ipdb> print normalizer
[-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf
 -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf
 -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf]

It's because the only possible duration is 1 (possible_durations is array([1])) and all the durations put zero probability on that.
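The blowup mechanism reduces to a two-line numpy fact, sketched here with made-up values:

```python
import numpy as np

# When every allowed duration has zero probability, the log potentials are
# all -inf, so the log-normalizer is -inf; subtracting it then produces NaN
# (the "invalid value encountered in subtract" warning), which later trips
# the np.isnan assertions.
log_potentials = np.array([-np.inf, -np.inf])     # only duration 1, p = 0
normalizer = np.logaddexp.reduce(log_potentials)  # -inf

with np.errstate(invalid='ignore'):
    result = log_potentials - normalizer          # -inf - (-inf) -> nan

print(normalizer)             # -inf
print(np.isnan(result).all())  # True
```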

The fix is to add in right_censoring, which I've never added to the PossibleChangepoints models. I'll figure out how to do that now.

alexbw commented 10 years ago

Using the non-variant IntNegBin duration, pyhsmm.distributions.NegativeBinomialIntegerRDuration, has solved the problem. What do you think is a good way to communicate this type of run-time incompatibility between model and duration distribution to the user?

Also, looking at the code for the variant and non-variant negative binomial integer durations, it seems like they shouldn't behave radically differently in a possiblechangepoint model, so this will hopefully be an easy drop-in replacement. Anything else I should know about in terms of expected behavior differences in HSMMPossibleChangepoint models?

mattjj commented 10 years ago

What do you think is a good way to communicate this type of run-time incompatibility between model and duration distribution to the user?

I think those assertions did their job in this case.

The issue with the variant durations is that they put -infs in their duration log pmfs. That interacted badly with the fact that I never implemented right-censoring for possiblechangepoints models. When right-censoring is implemented this particular blowup won't happen anymore.

As for other potential weirdness, the best thing to do is to read through the code and look out for TODO lines or TODO TODO lines (which I use to mean I really should do the thing in the comment).

mattjj commented 10 years ago

Sorry, I take that back: this error should have resulted in a "zero probability under the model in this iteration" warning and there should have been some hack to forge ahead with the sampling process anyway (probably replacing the duration potentials with uniform potentials). Maybe some more info about what caused the zero probability issue (duration support intersected with possible changepoints was empty) would be better for the user, too.
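A rough sketch of that fallback hack, with illustrative function and variable names (this is not pyhsmm's API, just the shape of the idea):

```python
import warnings
import numpy as np

def patch_impossible_durations(aDl):
    """If any state's duration log potentials are entirely -inf (zero
    probability under the model), warn and substitute uniform log
    potentials so sampling can forge ahead."""
    aDl = aDl.copy()
    impossible = np.isinf(aDl).all(axis=0)  # states with no possible duration
    if impossible.any():
        warnings.warn("zero probability under the model in this iteration; "
                      "substituting uniform duration potentials")
        aDl[:, impossible] = -np.log(aDl.shape[0])  # uniform over durations
    return aDl

# state 0 has a valid duration pmf; state 1 is entirely impossible
aDl = np.full((5, 2), -np.inf)
aDl[:, 0] = np.log([0.2, 0.2, 0.2, 0.2, 0.2])
patched = patch_impossible_durations(aDl)
print(np.isfinite(patched).all())  # True
```

A more informative message, as suggested above, would also report why the probability was zero (e.g. that the duration support intersected with the possible changepoints was empty).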

mattjj commented 10 years ago

That fix should take care of it, but I didn't test it with an intnegbinvariant duration to make sure the blowup goes away. If it's feasible to test this change on your end, could you reopen the issue if the problem persists with intnegbinvariants and possiblechangepoint models?

mattjj commented 10 years ago

Sorry, not fixed yet.

alexbw commented 10 years ago

Fixed and verified in 14bbd3c