joshspeagle / dynesty

Dynamic Nested Sampling package for computing Bayesian posteriors and evidences
https://dynesty.readthedocs.io/
MIT License

Params overflowing to inf #209

Closed CalumGabbutt closed 3 years ago

CalumGabbutt commented 3 years ago

I'm currently using dynesty to perform static nested sampling on a custom likelihood function with 'multi' bounds and 'rwalk' sampling (1500 nlive points). The sampler runs perfectly 99% of the time; however, for some of the data samples, the runs fail with the error:

Exception while calling loglikelihood function: params: [ inf 2.86408389e-02 2.22182571e-02 4.04356563e-02 9.23900098e-01 1.17080224e+02 5.64710836e+01 2.27116477e+02 1.33147622e+02 1.88771848e+02 1.46173472e+02 1.30782723e+02 7.62809350e+01 1.34494380e+02 5.20331440e+00 9.39797359e+01 1.34037074e+02 1.10568351e+02 1.17584204e+02 1.15930547e+02 1.23936813e+02 1.29542150e+02 1.02526209e+02 5.47408794e+01 8.79609015e+01 1.21207516e+02 9.86264815e+01 9.85764264e+01 1.25517338e+02 1.77819488e+02 7.18671872e+01 1.05594272e+02 2.60180178e+01 1.91192474e+01 4.99152480e+01 1.79858705e+01 5.50634020e+01 1.46721796e+02 8.48983385e+01 1.74806408e+02 5.72236037e+01 1.71734397e+02] args: [] kwargs: {}

Based on the values of the other runs, the "true" value of params[0] is ~1, so the sampler overflowing is a bit surprising. I know this many parameters is pushing what 'rwalk' can sample effectively, but it does seem to work for the majority of samples. Slice sampling seems to take so long as to be unusable. Also, this error occurs very late in the sampling run. Do you have any ideas on how to fix this issue?
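[Editor's note: independent of the root cause, a common defensive pattern for errors like this is to wrap the log-likelihood so that non-finite parameter vectors are rejected with a very low value instead of raising an exception. A minimal sketch follows; the wrapper name and the `-1e300` floor are illustrative choices, not dynesty API.]

```python
import math

def safe_loglike(loglike):
    """Wrap a log-likelihood so non-finite parameter vectors are
    rejected (treated as effectively zero likelihood) instead of
    raising an exception. Illustrative pattern; names hypothetical."""
    def wrapped(params):
        if not all(math.isfinite(p) for p in params):
            return -1e300  # effectively log(0) for the sampler
        return loglike(params)
    return wrapped

# Example with a toy Gaussian log-likelihood:
toy = safe_loglike(lambda p: -0.5 * sum(x * x for x in p))
print(toy([0.0, 1.0]))           # -0.5
print(toy([float("inf"), 1.0]))  # -1e+300
```

Note this only masks the symptom (the run keeps going rather than crashing); it does not explain why the proposal raced off to infinity in the first place.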

joshspeagle commented 3 years ago

I have no idea what could be causing such behavior. What do the samples look like prior to this error?

CalumGabbutt commented 3 years ago

I'm running the sampling on a HPC, so unsuccessful sampling runs aren't saved. However, here is a copy of the output log of the samples; does that help? chain4_O7.o2412976.17.txt

joshspeagle commented 3 years ago

That does help a bit. The run actually looks like it's sampling just fine, except that it's clearly reaching some peak in log-likelihood by the end. There are also some warning messages implying it might be near an edge. This implies that the first parameter might be poorly-behaved prior to failure, but it's hard to say for sure without some printouts of prior parameter values. Is there possibly a weird solution where that first parameter can race off to infinity?

CalumGabbutt commented 3 years ago

There is a degree of collinearity among the first, second and third parameters (each of them is a rate parameter in a set of differential equations, which is solved using matrix exponentiation along with a known time parameter). Could that lead to the odd behaviour? Here is a pickled (using joblib) sample for the same data but with a slightly different model that did finish successfully ( https://drive.google.com/file/d/1U3RcRUkBk3OAPahUKmR1Ebe9fCs4mtuS/view?usp=sharing )

joshspeagle commented 3 years ago

That could be part of the issue. I don't have time today to dig into the pickle file, but I'll try and take a look at some point soon-ish and see if I can give some additional feedback if anything pops out.

CalumGabbutt commented 3 years ago

Thank you. If you can explain how to save the incomplete samples, I can have a go at generating them from a run that hits the above error.

joshspeagle commented 3 years ago

One solution I like is to save the output in batches. If you specify something like maxiter, then you can easily do something like:

    import pickle

    for i in range(1, n):
        sampler.run_nested(maxiter=i*batch, ...)
        res = sampler.results
        with open('res_{}.pkl'.format(i), 'wb') as f:
            pickle.dump(res, f)

Since the sampler restarts where it left off, this should generally work unless you're doing something unusual with memory. You can be more sophisticated, but that basic structure is a good place to start.
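[Editor's note: since the failure here happens late in the run, the loop above can be made crash-tolerant with a try/finally so that a checkpoint is still written when a batch raises. A sketch under stated assumptions: the sampler is assumed to expose `run_nested(maxiter=...)` and `.results` as dynesty's static sampler does, and the function and filename prefix are hypothetical.]

```python
import pickle

def run_in_batches(sampler, batch=1000, n=10, prefix="res"):
    """Run a dynesty-style sampler in fixed-size iteration batches,
    checkpointing after each batch so that a late crash still leaves
    the most recent results on disk. Sketch; names hypothetical."""
    for i in range(1, n + 1):
        try:
            sampler.run_nested(maxiter=i * batch)
        finally:
            # Save whatever we have, even if this batch raised.
            with open("{}_{}.pkl".format(prefix, i), "wb") as f:
                pickle.dump(sampler.results, f)
```

If a batch crashes, the checkpoints from all completed batches (plus a final dump of whatever state the sampler holds) can then be inspected to see where the run went wrong.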

CalumGabbutt commented 3 years ago

Here are the samples generated before the sampler crashed (only nlive=200, but the ValueError occurs at a similar logz value) https://drive.google.com/file/d/19IFEMGTiDo_IrAfOUlmyz5xVK5MEBW39/view?usp=sharing

joshspeagle commented 3 years ago

Thanks. I’ll try to take a look over the next few days and get back to you.

joshspeagle commented 3 years ago

A very late follow-up to this, but I believe the recent improvements to the stability and behaviour of the bounding distributions (#219 and others) should now resolve this and other similar issues, so I'm tentatively closing this.