joshspeagle / dynesty

Dynamic Nested Sampling package for computing Bayesian posteriors and evidences
https://dynesty.readthedocs.io/
MIT License
357 stars 77 forks

dynamic sampler bug, incorrect logl interval #244

Closed segasai closed 3 years ago

segasai commented 3 years ago

While running the dynesty dynamic sampler thousands of times, I occasionally hit the error "cannot find live points in the required logl interval". To debug it I added an extra print statement there. See below:

  File "ZZZ/pyenv38/lib/python3.6/site-packages/dynesty/dynamicsampler.py", line 1669, in run_nested
    stop_val=stop_val)
  File "ZZZ/pyenv38/lib/python3.6/site-packages/dynesty/dynamicsampler.py", line 1775, in add_batch
    save_bounds=save_bounds):
  File "ZZZ/pyenv38/lib/python3.6/site-packages/dynesty/dynamicsampler.py", line 1118, in sample_batch
    raise RuntimeError('Could not find live points in the required logl interval %f %f %f %f %f %d %d %d' % (logl_min, live_logl.min(), live_logl.max(), base_logl.min(), base_logl.max(), uidx, r, i))
RuntimeError: Could not find live points in the required logl interval 2596792.346555 2596788.423530 2596790.531368 -50000000000000000000.000000 2596790.531368 435 -501 1

The important point is that logl_min > base_logl.max(). This tells me that sample_batch here https://github.com/joshspeagle/dynesty/blob/1a3cce6f490f275919208e3bdd154139ab20b117/py/dynesty/dynamicsampler.py#L1780 is called with a logl lower boundary that is above the highest logl of the base run. I can't immediately figure out why this happens (and since the error occurs only sporadically on a cluster, I can't really debug it), so I'll leave it here for the moment. I wonder if the weight function is somehow incorrect, since it is what provides the bounds...
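To make the failure mode concrete, here is a minimal sketch of the kind of selection that raises this error (the function name and signature are illustrative, not dynesty's actual internals):

```python
import numpy as np

def select_live_points(base_logl, logl_min):
    """Return indices of base-run points with logl above logl_min.

    Hypothetical stand-in for the live-point selection in
    sample_batch(); only meant to illustrate the failure mode.
    """
    idx = np.nonzero(base_logl > logl_min)[0]
    if len(idx) == 0:
        # This is the reported failure: logl_min sits above every
        # base-run log-likelihood, so no point qualifies.
        raise RuntimeError('Could not find live points in the '
                           'required logl interval')
    return idx

base_logl = np.array([1.0, 2.0, 3.0])
idx = select_live_points(base_logl, 1.5)   # indices [1, 2] qualify
# select_live_points(base_logl, 5.0) would raise,
# because logl_min=5.0 > base_logl.max()=3.0
```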

segasai commented 3 years ago

Okay, now I think I understand what's going on. The weight function called here https://github.com/joshspeagle/dynesty/blob/1a3cce6f490f275919208e3bdd154139ab20b117/py/dynesty/dynamicsampler.py#L1779 uses the results() data, which is based on saved_logl (i.e. including the multiple batch runs); however, sample_batch() relies on base_logl when selecting live points. Therefore it's entirely possible for the weight function to return an interval that sits above the highest likelihood of the base run. I think the solution is probably either to use the saved_* arrays in sample_batch rather than the base run, or to somehow skip intervals that sit above the base run (in some sense that means the base run missed the tip of the posterior).
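A tiny numerical illustration of this mismatch (the arrays and the bound here are made up, not dynesty internals): the combined run extends to higher logl than the base run, so a bound derived from the combined results can exclude every base-run point.

```python
import numpy as np

# Base run stopped at a lower peak logl than the combined run.
base_logl = np.array([-10.0, -5.0, -1.0])            # base run only
saved_logl = np.array([-10.0, -5.0, -1.0, 2.0, 4.0])  # base + earlier batches

# A weight function working from the combined (saved) results can
# legitimately place the lower bound inside the combined run's range...
logl_min = saved_logl[3]   # = 2.0, valid for the combined run

# ...but that bound sits above everything the base run reached, so a
# batch seeded only from base-run points finds no valid live points.
assert logl_min > base_logl.max()
assert np.sum(base_logl > logl_min) == 0   # -> the RuntimeError above
```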

joshspeagle commented 3 years ago

This is a pretty good find -- nice sleuthing. I agree the best solution would probably be to use the saved_* quantities if possible. Did you already start looking into fixes, or should I also take a look?

segasai commented 3 years ago

It'd be better if you look at this, as it'll take me more time to figure things out. (I may look at it in a few days if there's no progress.)

joshspeagle commented 3 years ago

Okay, sounds good. Let me start looking at it tomorrow/Friday. Feel free to ping me anytime if you don't hear any updates before next Monday.

segasai commented 3 years ago

On that topic, I think it would be good to refactor the dynamicsampler with something like this https://github.com/segasai/dynesty/commit/316a2314ccdc957504f74d2f2452bea265cad18c because currently the code is full of

new_u.append(u)
saved_u.append(u)

and it's hard to verify correctness with so much duplication.

Plus, I wasn't 100% sure all the relevant arrays were properly kept in sync.
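A sketch of the kind of refactor being proposed (the class and key names here are hypothetical; the linked commit has its own structure): replace the many parallel lists appended to one by one with a single dict-of-lists that only accepts complete records, so the arrays cannot drift out of sync.

```python
class RunRecord:
    """Holds per-sample arrays for a run as one synchronized record."""

    KEYS = ('u', 'v', 'logl', 'logvol', 'logwt')  # illustrative subset

    def __init__(self):
        self.D = {k: [] for k in self.KEYS}

    def append(self, new):
        """Append one sample; every field must be supplied at once,
        which makes it impossible to update only some of the arrays."""
        assert set(new) == set(self.KEYS), 'missing or extra fields'
        for k, v in new.items():
            self.D[k].append(v)

run = RunRecord()
run.append(dict(u=0.5, v=1.2, logl=-3.4, logvol=-0.1, logwt=-3.5))
assert len(run.D['logl']) == len(run.D['u']) == 1
```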

segasai commented 3 years ago

The original issue is still quite painful. I've tried to change the references from base_run to saved_run in sample_batch, but the regression tests fail after that, I think due to the mismatch between the number of live points in the batches vs. the base run, which sample_batch() cannot deal with. I don't quite know how to fix that.

segasai commented 3 years ago

I tried to tackle it in 6f3587de0faaae3c506a4a316597e9540dee2b53, but I don't know if that's correct (two tests fail; I don't know whether it's just random noise or not).

joshspeagle commented 3 years ago

This should now be resolved via #248, so closing this for now.