joshspeagle / dynesty

Dynamic Nested Sampling package for computing Bayesian posteriors and evidences
https://dynesty.readthedocs.io/
MIT License

How to resume a run if the sampler crashes #194

Closed ruskin23 closed 4 years ago

ruskin23 commented 4 years ago

Hi

I will be using the dynamic nested sampler on an HPC cluster for my project. The problem is that my likelihood function is extremely expensive and the cluster has a runtime limit (4 days), so it is possible that the sampler will stop while updating the live points and I will have to queue it up again. Is there a way to continue the run from the last iteration?

joshspeagle commented 4 years ago

Yes, although you have to dump a copy of the sampler to disk before you time out and lose all your data. You can do so by running the sampler as a generator following the syntax here (which shows that the main nested sampling loop you call externally is just a thin wrapper over an internal generator) and putting in some pickle.dump statements that execute after some number of iterations. When you reload the sampler, you'll need to re-instantiate several modules (which can't be pickled and therefore get deleted). See #188 and #181 for some additional details.
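
To make this concrete, here is a minimal sketch of that pattern for the static sampler, with a toy likelihood and no pool (pools can't be pickled); the `run_with_checkpoints` helper is hypothetical, and re-attaching `rstate` as `numpy.random` reflects the default random state in dynesty 1.0.x:

    import pickle
    import numpy as np
    import dynesty

    def loglikelihood(x):
        # Toy 3-D Gaussian log-likelihood.
        return -0.5 * np.sum(x ** 2)

    def prior_transform(u):
        # Map the unit cube to a flat prior on [-5, 5] in each dimension.
        return 10.0 * u - 5.0

    def run_with_checkpoints(sampler, filename, checkpoint_every=1000):
        # The external nested sampling loop: iterate the internal generator
        # and dump the sampler to disk every `checkpoint_every` iterations.
        for it, results in enumerate(sampler.sample(dlogz=0.01)):
            if it and it % checkpoint_every == 0:
                with open(filename, 'wb') as f:
                    pickle.dump(sampler, f)

    ndim = 3
    sampler = dynesty.NestedSampler(loglikelihood, prior_transform, ndim)
    run_with_checkpoints(sampler, 'sampler.pkl')

    # In a resubmitted job: reload the sampler, re-attach what pickling
    # dropped, and continue the loop from where it left off.
    with open('sampler.pkl', 'rb') as f:
        sampler = pickle.load(f)
    sampler.rstate = np.random  # __getstate__ strips the random state
    run_with_checkpoints(sampler, 'sampler.pkl')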

Hope this helps. Please let me know if you have any additional questions.

ruskin23 commented 4 years ago

Thanks, this was helpful.

ruskin23 commented 4 years ago

Hello again

I resumed my work using dynesty. I am trying to dump the sampler just after the initial run as:

        import dill
        import numpy
        from dynesty import DynamicNestedSampler

        # Set up the dynamic sampler (the self.* attributes come from the
        # surrounding class).
        dsampler = DynamicNestedSampler(self.loglikelihood, self.prior_transform,
                                        self.ndim, pool=self.pool,
                                        queue_size=self.queue_size)

        # Borrow dynesty's progress printer.
        print_progress = True
        pbar, print_func = dsampler._get_print_func(None, print_progress)

        niter = 1
        ncall = 0

        # Run the initial (baseline) nested sampling pass via the generator.
        for results in dsampler.sample_initial():
            (worst, ustar, vstar, loglstar, logvol,
             logwt, logz, logzvar, h, nc, worst_it,
             boundidx, bounditer, eff, delta_logz) = results

            niter += 1
            ncall += nc

            print_func(results, niter, ncall, nbatch=0, dlogz=0.01,
                       logl_max=numpy.inf)

        # Dump the sampler once the initial run finishes.
        with open('initial_samples.dill', 'wb') as f:
            dill.dump(dsampler, f)

But I am getting the error:

8401it [01:43, 157.49it/s, batch: 0 | bound: 23 | nc: 1 | ncall: 34248 | eff(%): 24.486 | loglstar:   -inf < -77.546 <    inf | logz: -88.746 +/-  0.188 | dlogz:  0.001 >  0.010]
Traceback (most recent call last):
  File "test_main.py", line 196, in <module>
    S(status)
  File "test_main.py", line 119, in __call__
    dill.dump(dsampler,f)
  File "/home/ruskin/.local/lib/python3.6/site-packages/dill/_dill.py", line 259, in dump
    Pickler(file, protocol, **_kwds).dump(obj)
  File "/home/ruskin/.local/lib/python3.6/site-packages/dill/_dill.py", line 446, in dump
    StockPickler.dump(self, obj)
  File "/usr/lib/python3.6/pickle.py", line 409, in dump
    self.save(obj)
  File "/usr/lib/python3.6/pickle.py", line 521, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/python3.6/pickle.py", line 634, in save_reduce
    save(state)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/ruskin/.local/lib/python3.6/site-packages/dill/_dill.py", line 933, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/usr/lib/python3.6/pickle.py", line 496, in save
    rv = reduce(self.proto)
  File "/home/ruskin/.local/lib/python3.6/site-packages/dynesty-1.0.2-py3.6.egg/dynesty/sampler.py", line 161, in __getstate__
KeyError: 'rstate'
8459it [01:43, 81.61it/s, batch: 0 | bound: 23 | nc: 1 | ncall: 34306 | eff(%): 24.655 | loglstar:   -inf < -77.454 <    inf | logz: -88.745 +/-  0.189 | dlogz:  0.000 >  0.010] 

This KeyError did not occur before #192 was fixed. I would appreciate any help with this issue.

joshspeagle commented 4 years ago

Looks like this has something to do with the random state not being properly found by __getstate__. If this is happening the first time you try to dump it, you should be able to modify your local version to make sure it gets properly identified, and you can submit a PR to fix it (or just point out the line to me and I'll eventually get around to it). If it's happening after the first attempt, it might be due to issues re-instantiating the random state (as described in earlier comments).

ruskin23 commented 4 years ago

The issue was occurring after the initial sampling. I included a try/except statement and it's working fine. I have created a pull request; let me know if it looks okay.
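
For reference, a sketch of the kind of guard this describes (not necessarily the exact patch; the attribute name and location follow the traceback, which points at __getstate__ in dynesty/sampler.py):

    def __getstate__(self):
        """Get state information for pickling."""
        state = self.__dict__.copy()
        try:
            # 'rstate' may already be absent when dumping after the
            # initial run, which previously raised the KeyError above.
            del state['rstate']
        except KeyError:
            pass
        return state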

kcroker commented 2 years ago

> Yes, although you have to dump a copy of the sampler to disk before you time out and lose all your data. You can do so by running the sampler as a generator following the syntax here (which shows that the main nested sampling loop you call externally is just a thin wrapper over an internal generator) and putting in some pickle.dump statements that execute after some number of iterations. When you reload the sampler, you'll need to re-instantiate several modules (which can't be pickled and therefore get deleted). See #188 and #181 for some additional details.
>
> Hope this helps. Please let me know if you have any additional questions.

The link to the example generator code no longer works. Could you provide an updated link?

segasai commented 2 years ago

https://dynesty.readthedocs.io/en/stable/quickstart.html?highlight=generator#running-externally -- that's the updated link.