joshspeagle / dynesty

Dynamic Nested Sampling package for computing Bayesian posteriors and evidences
https://dynesty.readthedocs.io/
MIT License

Cannot compute a bounding ellipsoid to a single point if `pointvol` is not specified #302

Closed doublestrong closed 3 years ago

doublestrong commented 3 years ago

https://github.com/joshspeagle/dynesty/blob/243025cbfc5f1941e9dc1b00348c54f8b877bd6c/py/dynesty/bounding.py#L1316

Hi, I encountered this error when using DynamicNestedSampler with hslice and my own grad_u function. The error shows up immediately when the sampler reaches dlogz_init, which is 50.0 in my setting. I think hslice works, as my grad_u function was called without errors. The ndim of the problem is around 90.

Error message:

create run dir: /home/chad/codebase/project/example/xx/_/res/res_100_/seed0/case0/dyn28
  using dynamic nested sampling
iter: 2681 | batch: 0 | bound: 1 | nc: 1 | ncall: 865688 | eff(%):  0.310 | loglstar:   -inf < -26.582 <    inf | logz: -36.999 +/-  0.914 | dlogz:  0.617 > 50.000
Traceback (most recent call last):
  File "/home/chad/codebase/project/example/xx/_/dynamic_ns_incremental.py", line 15, in <module>
    dynesty_run_batch(500, case_dir, data_file, data_format, parallel_config = {'queue_size': 64},
  File "/home/chad/codebase/project/src/xx/RunBatch.py", line 289, in dynesty_run_batch
    sample_arr = solver.sample(live_points=live_points, pool=pool, queue_size=parallel_config['queue_size'],
  File "/home/chad/codebase/project/src/sampler/NestedSampling.py", line 97, in sample
    sampler.run_nested(dlogz_init=dlogz, nlive_init=seed_num,
  File "/home/chad/codebase/project/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1647, in run_nested
  File "/home/chad/codebase/project/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1767, in add_batch
  File "/home/chad/codebase/project/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1128, in sample_batch
  File "/home/chad/codebase/project/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/nestedsamplers.py", line 684, in update
  File "/home/chad/codebase/project/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/bounding.py", line 584, in update
  File "/home/chad/codebase/project/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/bounding.py", line 1316, in bounding_ellipsoid
ValueError: Cannot compute a bounding ellipsoid to a single point if `pointvol` is not specified.

I am using the current development version of dynesty with my own loglikelihood and prior_transform functions. I extracted the relevant parts of my code, so a snippet looks like this:

dns_params = {'wt_kwargs': {'pfrac': 1.0},
              'stop_kwargs': {'post_thresh': .5},
              'nlive_batch': 300,
              'maxiter_init': 10000,
              'maxiter_batch': 1000,
              'maxbatch': 10}
sampler = DynamicNestedSampler(
    loglikelihood=loglike,
    prior_transform=ptform,
    periodic=periodic,
    gradient=grad_u,
    ndim=90, **kwargs)
sampler.run_nested(dlogz_init=50, nlive_init=500,
                   **dns_params)

I really want to see the sampling results ASAP. Is there a hacky way to get around this error?

joshspeagle commented 3 years ago

Sorry for the delay in responding to this. I hadn't realized there was still an edge-case like this lying around; I thought we had cleaned up all uses of pointvol. If you want to define some global variable that has a small non-negative value, that might help, but to be honest the ellipsoid decomposition should never really be hitting a single live point if things are working properly (which might mean more things to look at on dynesty's end). Thanks for bringing this to my attention.

segasai commented 3 years ago

I think we should bail out early here https://github.com/joshspeagle/dynesty/blob/243025cbfc5f1941e9dc1b00348c54f8b877bd6c/py/dynesty/bounding.py#L580 if we get npoints==1
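A minimal sketch of that guard (illustrative names and structure, not dynesty's actual update code) could look like:

```python
import numpy as np

def update_bound(points, prev_bound=None):
    """Hypothetical simplification of a bounding update that bails out
    early when only a single live point remains, reusing the previous
    bound instead of fitting a degenerate ellipsoid to one point."""
    points = np.atleast_2d(points)
    npoints = points.shape[0]
    if npoints <= 1:
        # One point gives a singular covariance and zero volume, so
        # keep whatever bound we had before.
        return prev_bound
    return {'center': points.mean(axis=0),
            'cov': np.cov(points, rowvar=False)}
```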

joshspeagle commented 3 years ago

Good suggestion! We can probably add in a condition that npoints needs to be above some baseline threshold and just have the ellipsoids default to the previous ones once that condition is violated.

segasai commented 3 years ago

For the moment, I'd just bail out on npoints==1, because I think the bounding code is capable of dealing with everything else. Some covariance eigenvalues will be zero, but they'll be padded.

segasai commented 3 years ago

Actually, I've checked the backtrace of the original error, and it looks like the root cause is this: https://github.com/joshspeagle/dynesty/blob/243025cbfc5f1941e9dc1b00348c54f8b877bd6c/py/dynesty/dynamicsampler.py#L1110 That leads to selecting a single live point. We may need to update the code to be more robust (or add some diagnostic RuntimeException). If I had to guess, we are probably getting weights like [0, 0, 0, 0, 1], so we select one single point.
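The suspected failure mode can be illustrated with a toy example (assumed mechanics, not the actual sample_batch code): when the importance weights collapse onto one sample, the logl interval chosen for the new batch degenerates to a single point.

```python
import numpy as np

# Five saved samples, with all of the posterior weight on the last one
# (the [0, 0, 0, 0, 1] situation guessed at above).
logl = np.array([-90.0, -70.0, -55.0, -48.0, -41.0])
weights = np.array([0.0, 0.0, 0.0, 0.0, 1.0])

# Keep only samples carrying a non-negligible fraction of the max weight.
sel = weights > 0.5 * weights.max()
logl_min, logl_max = logl[sel].min(), logl[sel].max()
n_selected = int(sel.sum())
# n_selected == 1 and logl_min == logl_max: a degenerate interval that
# contains a single live point, triggering the error above.
```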

doublestrong commented 3 years ago

Sorry for the delay in responding to this. I hadn't realized there was still an edge-case like this lying around; I thought we had cleaned up all uses of pointvol. If you want to define some global variable that has a small non-negative value, that might help, but to be honest the ellipsoid decomposition should never really be hitting a single live point if things are working properly (which might mean more things to look at on dynesty's end). Thanks for bringing this to my attention.

Thanks for the suggestion. I guess I used too few live points.

I feel the in-development version is much slower (in terms of efficiency) than the last released version 1.1 when using rslice and hslice. Is it mainly due to this fix? If so, is the slower speed inevitable?

address #289

With the in-development version, I sometimes even observed dlogz increasing! Is this normal, or can it happen for some tricky problems? When I use hslice and rslice to sample a problem with ndim=90, the dlogz in the printout often pauses for a long time (say 20 sec.) and then changes a little (say 400 -> 397). Sometimes it increases.

Sorry for the very vague description. I could not provide a visualization of the results since my function based on dynesty didn't return a reasonable solution. But it worked very well for smaller problems with ndim=50.

Thanks for this wonderful package and great discussion across different issues!

segasai commented 3 years ago

I feel the in-development version is much slower (in terms of efficiency) than the last released version 1.1 when using rslice and hslice. Is it mainly due to this fix? If so, is the slower speed inevitable?

address #289

The change of the default is expected to lead to more correct evidences for intermediate ndims (several tens). I wouldn't vouch for correct answers at ndim=100, though. Also, you can lower the number of slices yourself if you want (rather than using the default, which scales with ndim), but I wouldn't recommend it.

Also, if you can consistently reproduce the error you showed above, it'd be good to try the branch from here: https://github.com/joshspeagle/dynesty/pull/305 Then I should be able to fix the underlying issue (the small number of live points is not really the main cause here).

doublestrong commented 3 years ago

I feel the in-development version is much slower (in terms of efficiency) than the last released version 1.1 when using rslice and hslice. Is it mainly due to this fix? If so, is the slower speed inevitable?

address #289

The change of the default is expected to lead to more correct evidences for intermediate ndims (several tens). I wouldn't vouch for correct answers at ndim=100, though. Also, you can lower the number of slices yourself if you want (rather than using the default, which scales with ndim), but I wouldn't recommend it.

Thanks for the comments and the suggestion. Yeah, I feel my multi-modal problem with 100 dimensions is just too difficult. I decided to use smaller problems for my project.

Also, if you can consistently reproduce the error you showed above, it'd be good to try the branch from here: #305 Then I should be able to fix the underlying issue (the small number of live points is not really the main cause here).

Thanks. I installed your bound_fix branch and ran an experiment with the same settings as above. This run used hslice as the sampling method. I hope the error message is what you expected:

/home/chad/codebase/project/venv/bin/python /home/chad/codebase/project/example/xxx/mww/dynamic_ns_incremental_cp.py
create run dir: /home/chad/codebase/project/example/xxx/mww/res/res_100_/seed0/case0/dyn17
saving config of sampling
 Dim of problem: 126
 Number of seeds: 500
 maxcall: 5000000
 maxiter: 50000
  using dynamic nested sampling
{'wt_kwargs': {'pfrac': 1.0}, 'stop_kwargs': {'post_thresh': 0.5}, 'nlive_batch': 300, 'maxiter_init': 50000, 'maxiter_batch': 1000, 'maxbatch': 10, 'maxiter': 100000, 'use_stop': False, 'dlogz_init': 50}
iter: 4628 | batch: 0 | bound: 9 | nc: 1 | ncall: 5620446 | eff(%):  0.082 | loglstar:   -inf < -46.886 <    inf | logz: -61.618 +/-  1.258 | dlogz:  0.836 > 50.000
Traceback (most recent call last):
  File "/home/chad/codebase/project/example/xxx/mww/dynamic_ns_incremental_cp.py", line 17, in <module>
    dynesty_run_batch(500, case_dir, data_file, data_format, parallel_config = {'queue_size': 64},
  File "/home/chad/codebase/project/src/sampler/NestedSampling.py", line 213, in dynesty_run_batch
    sample_arr = solver.sample(live_points=live_points, pool=pool, queue_size=parallel_config['queue_size'],
  File "/home/chad/codebase/project/src/sampler/NestedSampling.py", line 127, in sample
    sampler.run_nested(**dns_params)
  File "/home/chad/codebase/project/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1662, in run_nested
  File "/home/chad/codebase/project/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1782, in add_batch
  File "/home/chad/codebase/project/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1089, in sample_batch
RuntimeError: Could only find a single live point in the required logl interval

Process finished with exit code 1

Another experiment with rslice:

/home/chad/codebase/_/venv/bin/python /home/chad/codebase/_/example/xx/_/dynamic_ns_incremental_cp.py
create run dir: /home/chad/codebase/_/example/xx/_/res/res_100_/seed0/case0/dyn19
saving config of sampling
 Dim of problem: 126
 Number of seeds: 500
 maxcall: 5000000
 maxiter: 50000
sampler kwargs:
{'pool': <multiprocessing.pool.Pool state=RUN pool_size=32>, 'queue_size': 64, 'sample': 'rslice'}
  using dynamic nested sampling
{'wt_kwargs': {'pfrac': 1.0}, 'stop_kwargs': {'post_thresh': 0.5}, 'nlive_batch': 300, 'maxiter_init': 50000, 'maxiter_batch': 1000, 'maxbatch': 10, 'maxiter': 100000, 'use_stop': False, 'dlogz_init': 50}
iter: 4545 | batch: 0 | bound: 16 | nc: 1 | ncall: 1882486 | eff(%):  0.241 | loglstar:   -inf < -45.071 <    inf | logz: -59.749 +/-  1.204 | dlogz:  0.901 > 50.000
Traceback (most recent call last):
  File "/home/chad/codebase/_/example/xx/_/dynamic_ns_incremental_cp.py", line 17, in <module>
    dynesty_run_batch(500, case_dir, data_file, data_format, parallel_config = {'queue_size': 64},
  File "/home/chad/codebase/_/src/sampler/NestedSampling.py", line 215, in dynesty_run_batch
    sample_arr = solver.sample(live_points=live_points, pool=pool, queue_size=parallel_config['queue_size'],
  File "/home/chad/codebase/_/src/sampler/NestedSampling.py", line 129, in sample
    sampler.run_nested(**dns_params)
  File "/home/chad/codebase/_/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1662, in run_nested
  File "/home/chad/codebase/_/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1782, in add_batch
  File "/home/chad/codebase/_/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1089, in sample_batch
RuntimeError: Could only find a single live point in the required logl interval

Process finished with exit code 1
segasai commented 3 years ago

As a temporary fix, I suggest adding pad=2 (or some integer >1) to wt_kwargs. AFAIU it should fix the issue. (There is a better fix, I think, but I wonder whether my suggestion works.)
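For example (a sketch only; loglike, ptform, and ndim stand in for your own problem, and pad is the parameter suggested above):

```python
from dynesty import DynamicNestedSampler

# Temporary workaround: pass pad through wt_kwargs so the weight
# function widens the chosen logl interval instead of letting it
# collapse onto a single sample.
sampler = DynamicNestedSampler(loglike, ptform, ndim)
sampler.run_nested(dlogz_init=50, nlive_init=500,
                   wt_kwargs={'pfrac': 1.0, 'pad': 2})
```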

doublestrong commented 3 years ago

As a temporary fix, I suggest adding pad=2 (or some integer >1) to wt_kwargs. AFAIU it should fix the issue. (There is a better fix, I think, but I wonder whether my suggestion works.)

Thanks. Will do it and keep you posted.

segasai commented 3 years ago

(alternatively you can pull again from my bound_fix branch, I think it should resolve the problem as well)

doublestrong commented 3 years ago

(alternatively you can pull again from my bound_fix branch, I think it should resolve the problem as well)

I am using the latest version from your branch. The error this time is different. Let me know if you want more tests.

saving config of sampling
 Dim of problem: 126
 Number of seeds: 500
 maxcall: 5000000
 maxiter: 50000
sampler kwargs:
{'pool': <multiprocessing.pool.Pool state=RUN pool_size=32>, 'queue_size': 64, 'sample': 'rslice'}
  using dynamic nested sampling
{'wt_kwargs': {'pfrac': 1.0}, 'stop_kwargs': {'post_thresh': 0.5}, 'nlive_batch': 300, 'maxiter_init': 50000, 'maxiter_batch': 1000, 'maxbatch': 10, 'maxiter': 100000, 'use_stop': False, 'dlogz_init': 50}
iter: 4666 | batch: 0 | bound: 16 | nc: 1 | ncall: 1861804 | eff(%):  0.251 | loglstar:   -inf < -45.668 <    inf | logz: -60.549 +/-  1.188 | dlogz:  0.878 > 50.000
Traceback (most recent call last):
  File "/home/chad/xx/dynamic_ns_incremental_cp.py", line 17, in <module>
    dynesty_run_batch(500, case_dir, data_file, data_format, parallel_config = {'queue_size': 64},
  File "/home/chad/codebase/xx/src/sampler/NestedSampling.py", line 215, in dynesty_run_batch
    sample_arr = solver.sample(live_points=live_points, pool=pool, queue_size=parallel_config['queue_size'],
  File "/home/chad/codebase/xx/src/sampler/NestedSampling.py", line 129, in sample
    sampler.run_nested(**dns_params)
  File "/home/chad/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1669, in run_nested
  File "/home/chad/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1789, in add_batch
  File "/home/chad/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1093, in sample_batch
RuntimeError: Could not find live points in the required logl interval
segasai commented 3 years ago

Thanks for testing it further. There is something weird going on here that I'm missing, because the logl interval should always be based on the live points and therefore must contain more than zero of them. I've added some additional diagnostic info to the RuntimeError, as well as a few asserts in other places, so I would appreciate it if you could test again. (If you have code that reproduces the failure that you could share (even privately), that'd be even better, as there is only so much one can do with the info in exceptions.)

doublestrong commented 3 years ago

Hi, here is the result based on the latest bound_fix branch. My function is part of a big project, so for now it is more convenient for me to run the tests myself. Sorry about that. I am happy to do more tests.

saving config of sampling
 Dim of problem: 126
 Number of seeds: 500
 maxcall: 5000000
 maxiter: 50000
sampler kwargs:
{'pool': <multiprocessing.pool.Pool state=RUN pool_size=32>, 'queue_size': 64, 'sample': 'rslice'}
  using dynamic nested sampling
{'wt_kwargs': {'pfrac': 1.0}, 'stop_kwargs': {'post_thresh': 0.5}, 'nlive_batch': 300, 'maxiter_init': 50000, 'maxiter_batch': 1000, 'maxbatch': 10, 'maxiter': 100000, 'use_stop': False, 'dlogz_init': 50}
iter: 4584 | batch: 0 | bound: 15 | nc: 1 | ncall: 1906137 | eff(%):  0.240 | loglstar:   -inf < -45.017 <    inf | logz: -59.494 +/-  1.051 | dlogz:  0.744 > 50.000
Traceback (most recent call last):
  File "/home/xx/codebase/xx/example/xx/xx/dynamic_ns_incremental_cp.py", line 17, in <module>
    dynesty_run_batch(500, case_dir, data_file, data_format, parallel_config = {'queue_size': 64},
  File "/home/xx/codebase/xx/src/sampler/NestedSampling.py", line 215, in dynesty_run_batch
    sample_arr = solver.sample(live_points=live_points, pool=pool, queue_size=parallel_config['queue_size'],
  File "/home/xx/codebase/xx/src/sampler/NestedSampling.py", line 129, in sample
    sampler.run_nested(**dns_params)
  File "/home/xx/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1648, in run_nested
  File "/home/xx/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1770, in add_batch
  File "/home/xx/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1097, in sample_batch
RuntimeError: ('Could not find live points in the required logl interval. Please report!\nDiagnostics. logl_min: -45.01722379213999 ', 'logl_bounds: (-45.01722379213999, -45.01722379213999) ', 'saved_loglmax: -45.01722379213999')

Process finished with exit code 1
segasai commented 3 years ago

(I assume that you don't have plateaus in your likelihood?) If not, then I'm not sure this is very satisfying, but based on the info you've provided, I'd say you may need more live points for such a high-D problem. My suspicion about what's happening (I've seen this myself) is that when you sample the posterior, you may uncover one high-likelihood point (one peak of the posterior, which is saved), but then when sampling is continued in a batch, the live points may easily 'escape' into another mode that is actually lower in peak logL. There the algorithm will try to sample many, many times until it reaches the highest likelihood value ever seen (which is impossible within this posterior mode); as it does so, it becomes more and more concentrated at the peak of that mode, with logL values being essentially constant. That's what the diagnostics show: 'logl_bounds: (-45.01722379213999, -45.01722379213999)'. Basically, imagine a very tall, narrow Gaussian with very small posterior volume and high logL versus a broad Gaussian with low logL. Depending on the exact numbers, it's feasible that the batch points will be selected in the broad Gaussian and will thus be stuck there, unable to reach the high logL. That's the speculation on my side.
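The two-Gaussian picture can be made concrete with a toy calculation (all numbers invented for illustration):

```python
import numpy as np

# Two posterior modes: a tall, narrow one and a broad, low one.
log_peak = np.array([-41.0, -55.0])  # peak logL of each mode
log_vol = np.array([-30.0, -10.0])   # log prior volume of each mode

# Crude per-mode evidence: peak likelihood times volume, in log space.
log_evid = log_peak + log_vol
# Probability that a batch live point starts in each mode.
p_mode = np.exp(log_evid - np.logaddexp.reduce(log_evid))
# p_mode[1] dominates: batch points land in the broad mode, yet can
# never reach the saved maximum logL of -41 in the narrow mode.
```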

Also the full output of all the iterations may be useful here: iter: 4584 | batch: 0 | bound: 15 | nc: 1 | ncall: 1906137 | eff(%): 0.240 | loglstar: -inf < -45.017 < inf | logz: -59.494 +/- 1.051 | dlogz: 0.744 > 50.000

doublestrong commented 3 years ago

Thanks for the prompt reply. No, we don't have plateaus in the likelihood.

I think your reasoning makes sense. My problem is indeed a multi-modal posterior, and I am pretty sure that in the posterior a highly peaked mode can be accompanied by a few flat and broad modes. It is entirely possible that when the sampler starts to draw another batch, it cannot find a new live point with a higher log-likelihood.

I will test the same problem with a greater nlive_init if you think that would be helpful.

segasai commented 3 years ago

On further examination, I noticed that the diagnostic line just before the error iter: 4584 | batch: 0 | bound: 15 | nc: 1 | ncall: 1906137 | eff(%): 0.240 | loglstar: -inf < -45.017 < inf | logz: -59.494 +/- 1.051 | dlogz: 0.744 > 50.000

seems to indicate that it just finished the initial sampling, and it's the very first batch that fails to sample. I also noticed that you have use_stop=False, which made me wonder if that's the cause (because the initial run ran for too long...).
So I'd try to get rid of that as well. Also, what version (commit) are you using?

doublestrong commented 3 years ago

seems to indicate that it just finished initial sampling and it's the very first batch that fails to sample.

Yes, that is exactly the case. The previous errors showed up immediately when the sampler reached the dlogz_init target for the initial sampling.

Also I noticed that you have use_stop=False, which made me wonder if that's the cause (because the initial run ran for too long...)

I set it to False because I feel the default stopping criterion is fairly time-consuming in my case. Actually, I have never had the patience to wait for the default stopping-criterion evaluation to finish. I followed the suggestion in the dynesty documentation to disable it.

So I'd try to get rid of that as well. Also, what version(commit) are using ?

I cannot remember exactly, but I think my previous test used fe65dbcf84eca22d701bf634dbc3e1272dc2aec0 at the latest. I have enabled use_stop in my ongoing test. I am using this commit: e21e03c00c768ee2e2855cd58086ecb016adda22. I increased nlive_init to 5000; however, based on my experience with the in-development rslice, this test may take a while (more than 5 hours, I think). I will let you know the result of this test.

Running stdout info:

saving config of sampling
 Dim of problem: 126
 Number of seeds: 5000
 maxcall: 50000000
 maxiter: 500000
sampler kwargs:
{'pool': <multiprocessing.pool.Pool state=RUN pool_size=51>, 'queue_size': 64, 'sample': 'rslice'}
  using dynamic nested sampling
{'wt_kwargs': {'pfrac': 1.0}, 'stop_kwargs': {'post_thresh': 0.5}, 'nlive_batch': 300, 'maxiter_init': 500000, 'maxiter_batch': 1000, 'maxbatch': 10, 'maxiter': 10000000, 'use_stop': True, 'dlogz_init': 50}
iter: 21752 | batch: 0 | bound: 3 | nc: 810 | ncall: 3162018 | eff(%):  0.687 | loglstar:   -inf < -469.074 <    inf | logz: -478.715 +/-  0.041 | dlogz: 424.569 > 50.000 
segasai commented 3 years ago

seems to indicate that it just finished initial sampling and it's the very first batch that fails to sample.

Yes, that is exactly the case. The previous errors showed up immediately when the sampler reached the dlogz_init target for the initial sampling.

It would also be useful if (for your previous configuration) you could save and send the object sampler.saved_run.D, which is a dictionary with all the quantities in the run.

The easiest way is just to put in this:

import pickle
with open('dump.pkl', 'wb') as fp:
    pickle.dump(self.sampler.saved_run.D, fp)

just before the RuntimeError that is being thrown. I think with that in hand things should be pretty clear.

doublestrong commented 3 years ago


The final STDOUT for this task:

saving config of sampling
 Dim of problem: 126
 Number of seeds: 5000
 maxcall: 50000000
 maxiter: 500000
sampler kwargs:
{'pool': <multiprocessing.pool.Pool state=RUN pool_size=51>, 'queue_size': 64, 'sample': 'rslice'}
  using dynamic nested sampling
{'wt_kwargs': {'pfrac': 1.0}, 'stop_kwargs': {'post_thresh': 0.5}, 'nlive_batch': 300, 'maxiter_init': 500000, 'maxiter_batch': 1000, 'maxbatch': 10, 'maxiter': 10000000, 'use_stop': True, 'dlogz_init': 50}
iter: 48130 | batch: 0 | bound: 16 | nc: 1 | ncall: 19624603 | eff(%):  0.245 | loglstar:   -inf < -42.899 <    inf | logz: -59.377 +/-  0.831 | dlogz:  0.415 > 50.000
Traceback (most recent call last):
  File "/home/xx/codebase/xx/example/xx/xx/dynamic_ns_incremental_cp.py", line 17, in <module>
    dynesty_run_batch(5000, case_dir, data_file, data_format, parallel_config = {'cpu_frac':.8, 'queue_size': 64},
  File "/home/xx/codebase/xx/src/sampler/NestedSampling.py", line 215, in dynesty_run_batch
    sample_arr = solver.sample(live_points=live_points, pool=pool, queue_size=parallel_config['queue_size'],
  File "/home/xx/codebase/xx/src/sampler/NestedSampling.py", line 129, in sample
    sampler.run_nested(**dns_params)
  File "/home/xx/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1647, in run_nested
  File "/home/xx/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1769, in add_batch
  File "/home/xx/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1096, in sample_batch
RuntimeError: ('Could not find live points in the required logl interval. Please report!\nDiagnostics. logl_min: -42.89898447129433 ', 'logl_bounds: (-42.89898447129433, -42.89898447129433) ', 'saved_loglmax: -42.89898447129433')

Process finished with exit code 1
doublestrong commented 3 years ago


I did another test with the previous configuration (nlive_init = 500, use_stop = False). I will send the dump.pkl file to you by email. @segasai But the dict is much shorter than what I expected:

{'id': [], 'u': [], 'v': [], 'logl': [-87.9897314947458], 'logvol': [-8.565437414879442], 'logwt': [], 'logz': [-99.58990562536727], 'logzvar': [0.021000296711388648], 'h': [10.510645007496194], 'nc': [], 'boundidx': [], 'it': [], 'n': [], 'bounditer': [], 'scale': []}

The corresponding STDOUT is

saving config of sampling
 Dim of problem: 126
 Number of seeds: 500
 maxcall: 5000000
 maxiter: 50000
sampler kwargs:
{'pool': <multiprocessing.pool.Pool state=RUN pool_size=51>, 'queue_size': 64, 'sample': 'rslice'}
  using dynamic nested sampling
{'wt_kwargs': {'pfrac': 1.0}, 'stop_kwargs': {'post_thresh': 0.5}, 'nlive_batch': 300, 'maxiter_init': 500000, 'maxiter_batch': 1000, 'maxbatch': 10, 'maxiter': 10000000, 'use_stop': False, 'dlogz_init': 50.0}
iter: 4787 | batch: 0 | bound: 17 | nc: 1 | ncall: 1948093 | eff(%):  0.246 | loglstar:   -inf < -41.030 <    inf | logz: -56.491 +/-  1.166 | dlogz:  1.090 > 50.000
Traceback (most recent call last):
  File "/home/xx/codebase/xx/example/slam/xx/dynamic_ns_incremental_cp.py", line 17, in <module>
    dynesty_run_batch(500, case_dir, data_file, data_format, parallel_config = {'cpu_frac':.8, 'queue_size': 64},
  File "/home/xx/codebase/xx/src/sampler/NestedSampling.py", line 215, in dynesty_run_batch
    sample_arr = solver.sample(live_points=live_points, pool=pool, queue_size=parallel_config['queue_size'],
  File "/home/xx/codebase/xx/src/sampler/NestedSampling.py", line 129, in sample
    sampler.run_nested(**dns_params)
  File "/home/xx/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1652, in run_nested
  File "/home/xx/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1774, in add_batch
  File "/home/xx/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py", line 1099, in sample_batch
RuntimeError: ('Could not find live points in the required logl interval. Please report!\nDiagnostics. logl_min: -41.02959340104449 ', 'logl_bounds: (-41.02959340104449, -41.02959340104449) ', 'saved_loglmax: -41.02959340104449')

Process finished with exit code 1

I didn't update my local dynesty, so the test was still performed with the deleted bound_fix branch. But I hope the file may still be a bit helpful.

joshspeagle commented 3 years ago

Hi @doublestrong: I've merged in several commits from @segasai into the main branch that we hope should address this issue. We suspect this happens because of the large dlogz_init you are using, which means you terminate sampling extremely early and therefore have almost all the weight at the final sampled point. Then, during the next phase of sampling, when the sampler tries to estimate where it should allocate more points, it picks only the endpoint for both the maximum and minimum value, leading to your output of

logl_bounds: (-41.02959340104449, -41.02959340104449)

We've added several checks to prevent this type of behavior, including redoing how we select the final interval in this edge case (padding away from the edge) as well as how we resample the final set of points to construct the new bounding distributions (for the next phase of sampling). I believe those two changes should help mitigate this behaviour, but please let us know after you update to the latest version if you still find this failing for the same reason (given your setup, I anticipate that sampling in batches will experience this exact same behavior as well).
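The "padding away from the edge" idea can be sketched as follows (a toy illustration under assumed mechanics, not the actual patch; padded_logl_bounds and its numbers are invented):

```python
import numpy as np

def padded_logl_bounds(logl, weights, frac=0.5, pad=2):
    """Pick the logl interval from the high-weight samples, but widen
    the index range by `pad` samples on each side so the interval can
    never collapse onto a single endpoint."""
    idx = np.nonzero(weights > frac * weights.max())[0]
    lo = max(idx.min() - pad, 0)
    hi = min(idx.max() + pad, len(logl) - 1)
    return float(logl[lo]), float(logl[hi])

# All the weight sits on the final sample, as in the failing runs.
logl = np.array([-90.0, -70.0, -55.0, -48.0, -41.0])
weights = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
bounds = padded_logl_bounds(logl, weights)
# bounds is now a non-degenerate interval ending at the best logl.
```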

doublestrong commented 3 years ago

Hi @joshspeagle Thanks!!! After applying commit 24769130d3e3fca20b2b76f8cb7a4311e3a56fe3 made by @segasai, my program does not stop after the initial sampling stage and keeps sampling further batches. This is the running info:

saving config of sampling
 Dim of problem: 126
 Number of seeds: 500
 maxcall: 5000000
 maxiter: 50000
sampler kwargs:
{'pool': <multiprocessing.pool.Pool state=RUN pool_size=51>, 'queue_size': 64, 'sample': 'rslice'}
  using dynamic nested sampling
{'wt_kwargs': {'pfrac': 1.0}, 'stop_kwargs': {'post_thresh': 0.5}, 'nlive_batch': 300, 'maxiter_init': 500000, 'maxiter_batch': 1000, 'maxbatch': 10, 'maxiter': 10000000, 'use_stop': False, 'dlogz_init': 50.0}
iter: 5243 | batch: 1 | bound: 29 | nc: 780 | ncall: 2415632 | eff(%):  0.198 | loglstar: -51.749 < -62.137 < -49.139 | logz: -62.942 +/-  0.657 | stop:    nan
/home/xx/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py:1325: UserWarning: Warning. The maximum likelihood not reached in the batch. You may not have enough livepoints
iter: 5543 | batch: 2 | bound: 30 | nc: 696 | ncall: 2643205 | eff(%):  0.210 | loglstar: -44.970 < -60.084 < -43.332 | logz: -59.231 +/-  2.740 | stop:    nan 

It looks like the bug I raised at the beginning of this issue has been resolved. I will keep you updated with the final results to confirm that the changes mitigate the issue for batches as well. I set the number of batches to 10, so the results should be out tomorrow. Thanks again!!!

doublestrong commented 3 years ago

This is part of the result for that problem (dim=126) using dynamic nested sampling with rslice.

STDOUT:

saving config of sampling
 Dim of problem: 126
 Number of seeds: 500
 maxcall: 5000000
 maxiter: 50000
sampler kwargs:
{'pool': <multiprocessing.pool.Pool state=RUN pool_size=51>, 'queue_size': 64, 'sample': 'rslice'}
  using dynamic nested sampling
{'wt_kwargs': {'pfrac': 1.0}, 'stop_kwargs': {'post_thresh': 0.5}, 'nlive_batch': 300, 'maxiter_init': 500000, 'maxiter_batch': 1000, 'maxbatch': 10, 'maxiter': 10000000, 'use_stop': False, 'dlogz_init': 50.0}
iter: 5243 | batch: 1 | bound: 29 | nc: 780 | ncall: 2415632 | eff(%):  0.198 | loglstar: -51.749 < -62.137 < -49.139 | logz: -62.942 +/-  0.657 | stop:    nan                                       /home/xx/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py:1325: UserWarning: Warning. The maximum likelihood not reached in the batch. You may not have enough livepoints
iter: 6243 | batch: 2 | bound: 43 | nc: 1511 | ncall: 3221757 | eff(%):  0.181 | loglstar: -44.970 < -51.384 < -43.332 | logz: -59.231 +/-  2.740 | stop:    nan                                      /home/xx/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py:1325: UserWarning: Warning. The maximum likelihood not reached in the batch. You may not have enough livepoints
iter: 7243 | batch: 3 | bound: 55 | nc: 733 | ncall: 4029046 | eff(%):  0.170 | loglstar: -43.332 < -46.826 < -42.909 | logz: -59.107 +/-  1.811 | stop:    nan                                       /home/xx/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py:1325: UserWarning: Warning. The maximum likelihood not reached in the batch. You may not have enough livepoints
iter: 8243 | batch: 4 | bound: 67 | nc: 662 | ncall: 4842905 | eff(%):  0.162 | loglstar: -42.461 < -43.858 < -41.704 | logz: -58.874 +/-  1.246 | stop:    nan                                       /home/xx/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py:1325: UserWarning: Warning. The maximum likelihood not reached in the batch. You may not have enough livepoints
iter: 9243 | batch: 5 | bound: 79 | nc: 758 | ncall: 5675938 | eff(%):  0.156 | loglstar: -40.564 < -42.019 < -40.086 | logz: -58.500 +/-  0.924 | stop:    nan                                       /home/xx/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py:1325: UserWarning: Warning. The maximum likelihood not reached in the batch. You may not have enough livepoints
iter: 10243 | batch: 6 | bound: 91 | nc: 1635 | ncall: 6531469 | eff(%):  0.151 | loglstar: -39.539 < -40.759 < -39.169 | logz: -58.403 +/-  0.609 | stop:    nan                                     /home/xx/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py:1325: UserWarning: Warning. The maximum likelihood not reached in the batch. You may not have enough livepoints
iter: 11243 | batch: 7 | bound: 102 | nc: 681 | ncall: 7325910 | eff(%):  0.149 | loglstar: -38.825 < -39.838 < -38.312 | logz: -58.405 +/-  0.364 | stop:    nan                                     /home/xx/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py:1325: UserWarning: Warning. The maximum likelihood not reached in the batch. You may not have enough livepoints
iter: 12243 | batch: 8 | bound: 114 | nc: 756 | ncall: 8138721 | eff(%):  0.146 | loglstar: -38.149 < -39.161 < -37.136 | logz: -58.397 +/-  0.257 | stop:    nan                                     /home/xx/codebase/xx/venv/lib/python3.8/site-packages/dynesty-1.1-py3.8.egg/dynesty/dynamicsampler.py:1325: UserWarning: Warning. The maximum likelihood not reached in the batch. You may not have enough livepoints
iter: 13819 | batch: 10 | bound: 128 | nc: 783 | ncall: 9374294 | eff(%):  0.147 | loglstar: -45.133 < -38.965 < -44.645 | logz: -58.389 +/-  0.182 | stop:    nan                                    
Sampling time: 14694.23705124855 sec

Saved summary:

{"niter": 13819, "ncall": 9374294, "eff": 0.14741376790614846, "logz": -58.284771541599156, "logzerr": 0.1786152156353317}

I have plotted the resulting samples and they look reasonable to me. Thanks for tracking down the bugs to resolve the issue and for all the discussion!!! Thank you very much!!! @segasai @joshspeagle

joshspeagle commented 3 years ago

Fantastic! So glad we could resolve this 😄.