pymc-devs / pymc

Bayesian Modeling and Probabilistic Programming in Python
https://docs.pymc.io/

Sampler converging on "wrong" answer: Incorrectly evaluating logp? #295

Closed keflavich closed 11 years ago

keflavich commented 11 years ago

I'm dealing with a bug that has me completely perplexed; I honestly can't tell if it's something in pymc or something weird that I've done, but the end result is that the MCMC sampler converges on a result with much lower probability than the start point and completely rejects the "correct" values.

To debug this, I've patched my code with the following extra debug statements (which might be of general interest?): https://github.com/keflavich/pymc/commit/27732ed5e23b87df3961684f583a4bfbb2f929cd

My sampling looks like this:

>>> mc40.sample(1,verbose=5) #ok, verbose=4 is adequate....

Burn-in interval complete
Step method AdaptiveMetropolis_TEMPERATURE0_COLUMN0_ORTHOPARA0_DENSITY0 stepping
        AdaptiveMetropolis_TEMPERATURE0_COLUMN0_ORTHOPARA0_DENSITY0 Current log-likelihood plus current log-probability 371.132206505
Current value:  [ 32.15067948  12.48718019   0.30510997   4.21776136]
Current likelihood:  371.132206505
Jump : [-0.64332924  0.21059998 -0.04582936  0.02066177]
        AdaptiveMetropolis_TEMPERATURE0_COLUMN0_ORTHOPARA0_DENSITY0 Current log-likelihood plus current log-probability -1522.83027611
Current value:  [ 31.50735024  12.69778017   0.25928061   4.23842313]
Current likelihood:  -1522.83027611
Rejected
Step method Metropolis_WIDTH0 stepping

Metropolis_WIDTH0 getting initial logp.
        Metropolis_WIDTH0 Current log-likelihood plus current log-probability -565073.087919
Metropolis_WIDTH0 value before proposing:  0.372648880284
Metropolis_WIDTH0 logp before proposing:  3.72083402348
        Metropolis_WIDTH0 Current log-likelihood  -565076.808753
Total loglike before proposing:  -565076.808753
Metropolis_WIDTH0 proposing.
Metropolis_WIDTH0 value after proposing:  0.36808465319
Metropolis_WIDTH0 logp after proposing:  3.00373650608
        Metropolis_WIDTH0 Current log-likelihood  -555949.785666
Total loglike after proposing:  -555949.785666
        Metropolis_WIDTH0 Current log-likelihood  -555949.785666
Total loglike after-before:  9123.30225314
        Metropolis_WIDTH0 Current log-likelihood plus current log-probability -555946.78193
logp_p - logp:  9126.30598965
Metropolis_WIDTH0 accepting
Metropolis_WIDTH0 returning.
Step method Metropolis_CENTER0 stepping

Metropolis_CENTER0 getting initial logp.
        Metropolis_CENTER0 Current log-likelihood plus current log-probability -555946.233049
Metropolis_CENTER0 value before proposing:  39.5413860111
Metropolis_CENTER0 logp before proposing:  3.55261688359
        Metropolis_CENTER0 Current log-likelihood  -555949.785666
Total loglike before proposing:  -555949.785666
Metropolis_CENTER0 proposing.
Metropolis_CENTER0 value after proposing:  37.9412012151
Metropolis_CENTER0 rejecting due to ZeroProbability.
Metropolis_CENTER0 returning.
MCMC tallying.
MCMC done tallying.

Sampling finished normally.

The problem is that the step has accepted a value with very low total likelihood:

In [2]: mc40.WIDTH0.value
Out[2]: array(0.3680846531895133)

In [3]: mc40.logp
Out[3]: 375.7484261522879

In [6]: mc40.WIDTH0.value =   0.378232

In [7]: mc40.logp
Out[7]: 379.53196249724726

The latter is pretty close to optimal. That's not so bad, but if I repeat with ~10-15 samples, I rapidly hit something exceedingly unlikely:

In [19]: mc40.WIDTH0.value
Out[19]: array(0.319464682867294)

In [20]: mc40.logp
Out[20]: 246.58753561908654

Can you provide any tips on what's causing these issues? Am I perhaps running into floating underflows that are making the log-probabilities come out incorrectly?

I'm happy to provide the full data etc. that goes into this, but frankly it's fairly complicated and difficult to set up, so it's probably best if there are some specific questions I should be asking myself.

fonnesbeck commented 11 years ago

Let me see if I am reading your output correctly. In the first step, you get a rejection for trying to move from 371.132206505 to -1522.83027611, which sounds good. Then in the next (WIDTH0), there is a difference of 9123.30225314 going to the proposed value, which is accepted. Finally, in CENTER0 there is a zero-probability log-likelihood, which is rejected. These all seem like sensible outcomes to me. Can you provide an example of something that is proposed that is very unlikely relative to the current value but ends up being accepted? It's a good idea to monitor the alpha values (i.e. the acceptance probabilities) and see where things go wrong, if at all. For instance, are you getting lots of zero-probability issues for steps that ought to be accepted? It's hard to say without seeing your model.
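For reference, the symmetric-proposal Metropolis acceptance test that these step methods apply can be sketched in a few lines of plain Python. This is a minimal illustration, not PyMC's actual implementation; the function name is made up:

```python
import math
import random

def metropolis_accept(logp_current, logp_proposed, rng=random):
    """Symmetric-proposal Metropolis acceptance check in log space.

    Accepts with probability alpha = min(1, exp(logp_proposed - logp_current)),
    so a proposal with higher log-probability is always accepted.
    """
    log_alpha = logp_proposed - logp_current
    if log_alpha >= 0:
        return True            # uphill moves are always accepted
    # downhill moves accepted with probability alpha; math.exp underflows
    # harmlessly to 0.0 for very negative log_alpha
    return rng.random() < math.exp(log_alpha)
```

On this rule, a jump from logp 371.13 to -1522.83 (the first step in the trace above) has alpha = exp(-1893.96), which underflows to zero and is essentially never accepted, which is what the log shows.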

keflavich commented 11 years ago

The problem is with WIDTH0. I expected the CENTER0 rejection; that looked good to me (it clearly needs to be tuned to sample within the allowed range).

The problem is that, for WIDTH0, the difference is 9123... going from 0.372648880284 to 0.36808465319, but the sign is wrong: 0.368 is less likely than 0.372 (though I admit I did not demonstrate that in the example). That's why I'm concerned that there must be some sort of underflow or something horrible going on.

fonnesbeck commented 11 years ago

Are you using entirely existing distributions, or do you have custom stochastics with hand-coded log-likelihoods? You should be able to hand-verify the log-likelihoods by taking the values in the trace and plugging them back into the likelihood functions, and see if the output is reasonable.

Also, do you get the same behavior if you switch to using other step methods? I notice you are using both AdaptiveMetropolis and Metropolis.

Finally, where did you get the PyMC that you are using? Did you install a binary from PyPI, or did you build from GitHub? I recommend the latter if you can do it.

keflavich commented 11 years ago
  1. I'm using entirely existing distributions for the stochastics:

{<pymc.distributions.Uniform 'DENSITY0' at 0x12b2c2dd0>,
 <pymc.distributions.Uniform 'TEMPERATURE0' at 0x12b2c2990>,
 <pymc.distributions.Uniform 'COLUMN0' at 0x12b2c2d90>,
 <pymc.distributions.TruncatedNormal 'CENTER0' at 0x12b2bfb50>,
 <pymc.distributions.TruncatedNormal 'WIDTH0' at 0x12b2bfc90>,
 <pymc.distributions.Uniform 'ORTHOPARA0' at 0x12b2c2a50>}

The tricky bit is that I'm fitting a spectrum, so I also have 78 data points in a Normal distribution:

pymc.distributions.Normal('data', mu=funcdet, tau=1/np.asarray(both40.error)**2, observed=True, value=np.asarray(both40.data))

which depends on a fairly complicated deterministic model:

funcdet = pymc.Deterministic(name='f', eval=modelfunc, parents=funcdict, doc="The model function")

  2. Yes, I tried switching everything to Metropolis and got the same behavior.
  3. Installed from GitHub.
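The observed 'data' node's contribution above is just a sum of Gaussian log-densities over the 78 channels, in PyMC 2's tau = 1/sigma**2 parameterization. A standalone sketch of that term, with made-up stand-ins for the real spectrum (`model_vals`, `data_vals`, `error` are illustrative, not the actual arrays):

```python
import numpy as np

def normal_loglike(value, mu, tau):
    """Sum of Normal log-densities; tau = 1/sigma**2 (PyMC 2's convention)."""
    return np.sum(0.5 * np.log(tau / (2 * np.pi)) - 0.5 * tau * (value - mu) ** 2)

# Stand-ins for the real spectrum: 78 channels of model, data, and error.
rng = np.random.default_rng(0)
model_vals = np.zeros(78)
error = np.full(78, 0.1)
data_vals = model_vals + error * rng.standard_normal(78)

ll = normal_loglike(data_vals, model_vals, 1 / error**2)
```

Because every channel contributes, a model evaluated against the wrong data array produces a wildly wrong total, which is relevant to the resolution below.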
keflavich commented 11 years ago

Here's a complete example using Metropolis that I think illustrates all the bad behavior I'm seeing:

In[1]: mc40.sample(1,verbose=5)

Burn-in interval complete
Step method Metropolis_DENSITY0 stepping

Metropolis_DENSITY0 getting initial logp.
        Metropolis_DENSITY0 Current log-likelihood plus current log-probability 378.791461915
Metropolis_DENSITY0 value before proposing:  4.21776135726
Metropolis_DENSITY0 logp before proposing:  -1.60943791243
        Metropolis_DENSITY0 Current log-likelihood  380.400899827
Total loglike before proposing:  380.400899827
Metropolis_DENSITY0 proposing.
Metropolis_DENSITY0 value after proposing:  9.57083515185
Metropolis_DENSITY0 rejecting due to ZeroProbability.
Metropolis_DENSITY0 returning.
Step method Metropolis_COLUMN0 stepping

Metropolis_COLUMN0 getting initial logp.
        Metropolis_COLUMN0 Current log-likelihood plus current log-probability 378.791461915
Metropolis_COLUMN0 value before proposing:  12.4871801942
Metropolis_COLUMN0 logp before proposing:  -1.60943791243
        Metropolis_COLUMN0 Current log-likelihood  380.400899827
Total loglike before proposing:  380.400899827
Metropolis_COLUMN0 proposing.
Metropolis_COLUMN0 value after proposing:  21.7932972121
Metropolis_COLUMN0 rejecting due to ZeroProbability.
Metropolis_COLUMN0 returning.
Step method Metropolis_ORTHOPARA0 stepping

Metropolis_ORTHOPARA0 getting initial logp.
        Metropolis_ORTHOPARA0 Current log-likelihood plus current log-probability 379.154695101
Metropolis_ORTHOPARA0 value before proposing:  0.305109966286
Metropolis_ORTHOPARA0 logp before proposing:  -1.24620472579
        Metropolis_ORTHOPARA0 Current log-likelihood  380.400899827
Total loglike before proposing:  380.400899827
Metropolis_ORTHOPARA0 proposing.
Metropolis_ORTHOPARA0 value after proposing:  0.273640565615
Metropolis_ORTHOPARA0 logp after proposing:  -1.24620472579
        Metropolis_ORTHOPARA0 Current log-likelihood  378.261981021
Total loglike after proposing:  378.261981021
        Metropolis_ORTHOPARA0 Current log-likelihood  378.261981021
Total loglike after-before:  -0.892714080603
        Metropolis_ORTHOPARA0 Current log-likelihood plus current log-probability 377.015776295
logp_p - logp:  -2.13891880639
Metropolis_ORTHOPARA0 rejecting
Metropolis_ORTHOPARA0 returning.
Step method Metropolis_WIDTH0 stepping

Metropolis_WIDTH0 getting initial logp.
        Metropolis_WIDTH0 Current log-likelihood plus current log-probability -48991308.5807
Metropolis_WIDTH0 value before proposing:  0.377511277242
Metropolis_WIDTH0 logp before proposing:  4.02698990453
        Metropolis_WIDTH0 Current log-likelihood  -48991312.6077
Total loglike before proposing:  -48991312.6077
Metropolis_WIDTH0 proposing.
Metropolis_WIDTH0 value after proposing:  0.365065266389
Metropolis_WIDTH0 logp after proposing:  2.30065995595
        Metropolis_WIDTH0 Current log-likelihood  -46759660.2874
Total loglike after proposing:  -46759660.2874
        Metropolis_WIDTH0 Current log-likelihood  -46759660.2874
Total loglike after-before:  2231648.29333
        Metropolis_WIDTH0 Current log-likelihood plus current log-probability -46759657.9867
logp_p - logp:  2231650.59399
Metropolis_WIDTH0 accepting
Metropolis_WIDTH0 returning.
Step method Metropolis_CENTER0 stepping

Metropolis_CENTER0 getting initial logp.
        Metropolis_CENTER0 Current log-likelihood plus current log-probability -46759656.7037
Metropolis_CENTER0 value before proposing:  39.5380208982
Metropolis_CENTER0 logp before proposing:  3.58363953456
        Metropolis_CENTER0 Current log-likelihood  -46759660.2874
Total loglike before proposing:  -46759660.2874
Metropolis_CENTER0 proposing.
Metropolis_CENTER0 value after proposing:  40.1198703896
Metropolis_CENTER0 rejecting due to ZeroProbability.
Metropolis_CENTER0 returning.
Step method Metropolis_TEMPERATURE0 stepping

Metropolis_TEMPERATURE0 getting initial logp.
        Metropolis_TEMPERATURE0 Current log-likelihood plus current log-probability 371.729392447
Metropolis_TEMPERATURE0 value before proposing:  32.1506794847
Metropolis_TEMPERATURE0 logp before proposing:  -3.91202300543
        Metropolis_TEMPERATURE0 Current log-likelihood  375.641415452
Total loglike before proposing:  375.641415452
Metropolis_TEMPERATURE0 proposing.
Metropolis_TEMPERATURE0 value after proposing:  -35.1149960262
Metropolis_TEMPERATURE0 rejecting due to ZeroProbability.
Metropolis_TEMPERATURE0 returning.
MCMC tallying.
MCMC done tallying.

Sampling finished normally.

In [2]: mc40.WIDTH0.value = 0.377511277242; print mc40.logp
379.63442571

In [3]: mc40.WIDTH0.value = 0.365065266389; print mc40.logp
373.148611387

Since WIDTH0 was the only step accepted, I think it's safe to say that the logp evaluations above only reflect the change in WIDTH0. Yet WIDTH0's step is being accepted because of its obscenely high (and wrong) delta-log-likelihood of 2231648.29333.

fonnesbeck commented 11 years ago

So, WIDTH0 is truncated normal, and you have the two values of that variable before and after proposal. If you can get the values of the parent nodes of WIDTH0 (its mean, variance and truncation point(s)) you will be able to calculate the log-likelihood in each case and verify if the truncated_normal_like function is working as it should.

keflavich commented 11 years ago

All good there:

In [23]: mc40.WIDTH0.parents
Out[23]:
{'a': 0.30747737071703191,
 'b': 0.4489870525060044,
 'mu': 0.37823221161151815,
 'tau': 19975.041146121614}

In [29]: pymc.distributions.normal_like(0.377511277242,0.37823221161151815,19975.04)
Out[29]: 4.026989876146239

In [30]: pymc.distributions.truncated_normal_like(0.377511277242,**mc40.WIDTH0.parents)
Out[30]: 4.0269899045372357

I think the problem is hiding somewhere else...I can't imagine any possible reason to get log likelihoods in the 10,000's when the likelihoods of each individual variable are quite reasonable:

In [32]: for obj in (mc40.stochastics | mc40.observed_stochastics | mc40.potentials):
    print obj.logp
   ....:
-1.24620472579
3.58363953456
380.400899827
-1.60943791243
4.02698990454
-1.60943791243
-3.91202300543
keflavich commented 11 years ago

Ooook, I discovered the problem... wow. This was a stupid and exceedingly complicated one:

I had attempted to create an MCMC model from another MCMC model by extracting the individual stochastics from the other MCMC model, i.e. by doing this:

funcdict = {k:tempmc40.__dict__[k] for k in parinfo.parnames}

I then proceeded to overwrite some of the stochastics in the dictionary funcdict, but not all of them - some were preserved. The end result is that I have a markov_blanket for each stochastic that includes some stochastics that are not explicitly included in the MCMC fitter! This is a terrible situation, and explains a LOAD of problems I was having!
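Stripped of the PyMC specifics, this is the usual shared-mutable-reference trap: building the new parent dict by reference means any entries you forget to overwrite stay wired to the old model's nodes. A toy illustration (the names and use of bare object() sentinels are hypothetical, standing in for stochastic nodes):

```python
# Toy stand-ins for stochastic nodes from a previous MCMC model.
old_model = {'WIDTH0': object(), 'CENTER0': object(), 'data': object()}

# Build the new parent dict by reference, as in the bug:
funcdict = {k: old_model[k] for k in ('WIDTH0', 'CENTER0', 'data')}

# ...then overwrite only SOME entries with fresh nodes:
funcdict['WIDTH0'] = object()

# Anything not overwritten is still the OLD model's node, so the new
# model's Markov blankets silently pull in the old (junk) data node.
stale = {k for k, v in funcdict.items() if v is old_model[k]}
print(sorted(stale))  # → ['CENTER0', 'data']
```

A simple identity check like this, run after rebuilding the model, would have caught the leftover nodes immediately.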

I have no idea if there are better checks to put in place for this, but @fonnesbeck I greatly appreciate your assistance - you helped me rule out everything that could possibly have been wrong and I found the thoroughly impossible stuff...

fonnesbeck commented 11 years ago

Cool, so perhaps if you put a pdb trace somewhere inside MCMC.sample it would be possible to identify the node that is responsible for the huge jump in the log-likelihood.

fonnesbeck commented 11 years ago

OK, ignore my last comment then. If you have suggestions as to how to better debug these situations, feel free to add them to the issue tracker with the enhancement tag attached. I have not run across this before, so I hope it is not a common problem, but I always welcome ways of making the modeling process more foolproof. Keep in mind that PyMC 3 will be quite different, at least in terms of the backend, from PyMC 2, even though the syntax is very similar, so this may not be relevant in the future.

keflavich commented 11 years ago

The bad node is one of my data nodes:

In [34]: mc40.WIDTH0.markov_blanket
Out[34]:
{<pymc.distributions.TruncatedNormal 'ORTHOPARA0' at 0x12b4c9850>,
 <pymc.distributions.TruncatedNormal 'COLUMN0' at 0x125898590>,
 <pymc.distributions.Uniform 'COLUMN0' at 0x12c494890>,
 <pymc.distributions.Normal 'data' at 0x12a5b3890>,
 <pymc.distributions.TruncatedNormal 'CENTER0' at 0x12c491b50>,
 <pymc.distributions.Uniform 'ORTHOPARA0' at 0x12c494b10>,
 <pymc.distributions.Uniform 'DENSITY0' at 0x12c494cd0>,
 <pymc.distributions.TruncatedNormal 'TEMPERATURE0' at 0x12b4c98d0>,
 <pymc.distributions.TruncatedNormal 'WIDTH0' at 0x12c491c90>,
 <pymc.distributions.Normal 'data' at 0x12c494e50>,
 <pymc.distributions.Uniform 'TEMPERATURE0' at 0x12c494c50>,
 <pymc.distributions.TruncatedNormal 'DENSITY0' at 0x12b4b1210>}

One of those data nodes is the "right" one, the other is junk data from the previous MCMC sampler I had made that I had intended to replace. Since it's junk data... no matter what value I put in, it's going to be extremely unlikely, and smaller widths were favored because they tended to push the whole model to zero.

keflavich commented 11 years ago

Regarding your last comment: I really like being able to see the proposed values that are being accepted/rejected, so if you want to consider a patch like https://github.com/keflavich/pymc/commit/27732ed5e23b87df3961684f583a4bfbb2f929cd I'll submit it as a PR.

I'm looking forward to pymc3!