Hi Timothy, thanks for reaching out! You're right that interrupting nautilus midway can lead to corrupted HDF5 checkpoint files. Implementing a time limit shouldn't be too difficult, but I'm wondering whether the `n_like_max` keyword argument of `run()` would already solve your problem. Once nautilus hits the maximum number of likelihood calls given by `n_like_max`, it'll stop. By default that maximum is infinity, but you can set it to a lower value.

I think the solution you proposed in your code may produce errors. Your use of the `write` function makes sense. However, interrupting the sampler midway may lead to other problems if important function calls are interrupted in the middle. So let me know if `n_like_max` solves your problem.
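For illustration, capping the number of likelihood calls looks roughly like this (a sketch along the lines of the quick-start; the toy prior, likelihood, and file name are placeholders):

```python
import numpy as np
from nautilus import Prior, Sampler

# Toy two-parameter model, purely for illustration.
prior = Prior()
prior.add_parameter('a', dist=(-5, +5))
prior.add_parameter('b', dist=(-5, +5))

def likelihood(param_dict):
    x = np.array([param_dict['a'], param_dict['b']])
    return -0.5 * np.sum((x - 0.5)**2) / 0.01  # log-likelihood of a narrow Gaussian

sampler = Sampler(prior, likelihood, filepath='checkpoint.hdf5')

# Stop after at most 10000 likelihood evaluations instead of running
# until convergence; by default, n_like_max is infinity.
sampler.run(verbose=True, n_like_max=10000)
```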
I'm not sure if `n_like_max` solves my problem — after all, I do want to run nautilus until convergence (albeit over multiple jobs), and I don't know a priori how many likelihood evaluations will be needed for that. Also, as far as I can tell, the `run()` method does not seem to return anything that indicates whether it finished because the sampler actually converged or because the limit was reached?
I have already tried my luck at a first implementation of the `max_runtime` argument, but I might very well have overlooked something or taken an approach that's too simplistic...
Yes, I think it may be good for nautilus to return whether the run finished because the maximum number of likelihood calls (or time) was reached or because it finished normally. In the meantime, I think `n_like_max` would still solve your problem. Let's say you can run 1000 calls for each job. You can then make nautilus always run those 1000 calls per job by specifying `n_like_max=sampler.n_like + 1000`. Additionally, you'd know that nautilus is finished if `sampler.n_like` did not increase after calling `run()` (see the sketch below). Does that make sense?
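Concretely, each job could then look something like this (a sketch; the budget of 1000 calls and the final print are just examples):

```python
# Give this job a budget of at most 1000 additional likelihood calls.
n_before = sampler.n_like
sampler.run(verbose=True, n_like_max=sampler.n_like + 1000)

# If the number of likelihood calls did not increase, the previous job
# had already reached convergence, so no further jobs are needed.
if sampler.n_like == n_before:
    print('Sampling is complete; no need to submit another job.')
```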
That does make sense, and I think I have managed to write a wrapper around the `run()` call that implements this logic. My main concern right now is that I need to estimate the number of likelihood calls up front (because at the end of the day, I'm still bounded by total execution time, not by the number of likelihood calls). I currently do this by evaluating the likelihood once, measuring how long it takes, dividing my target maximum runtime by that time, and multiplying by the size of my `Pool`. This seems like a rather crude estimate for the number of likelihood calls per job, though, because it ignores all the overhead?
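In code, this estimate looks roughly like the following (simplified sketch; `test_parameters`, `max_runtime`, and `pool_size` are placeholders for my actual setup):

```python
import time

# Time a single likelihood evaluation at some representative point.
# This ignores all sampler and checkpointing overhead.
t_start = time.time()
likelihood(test_parameters)
t_per_call = time.time() - t_start

# Rough budget: wall-clock limit divided by the cost per call,
# scaled by the number of parallel workers in the pool.
n_budget = int(max_runtime / t_per_call * pool_size)
sampler.run(n_like_max=sampler.n_like + n_budget)
```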
Another idea I had was to do something like this:
```python
import time

start_time = time.time()  # max_runtime below is the wall-clock budget for this job

finished = False
while time.time() - start_time < max_runtime:
    before = sampler.n_like
    sampler.run(..., n_like_max=sampler.n_like + 100)  # or another small-ish number
    if sampler.n_like == before:  # no new likelihood calls, so the run has converged
        finished = True
        break
```
but I am not sure how much overhead it would introduce to stop and re-start the sampler so often? (I assume that every time `run()` finishes, a checkpoint is created, and, of course, I would rather not spend more time creating checkpoints than actually sampling...)
I think what you wrote will probably work. I'll look into implementing the new features in the next week or so. It shouldn't be too difficult and I can see how this would make things easier in certain situations.
Hi @timothygebhard, I implemented the requested functionality in the `timeout` branch. Please let me know if this fits your needs. I'll then add a couple of unit tests, merge into `main`, and release this as part of version 1.0.3.
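Roughly, usage would be along these lines (a sketch, assuming the new keyword is a `timeout` argument to `run()` given in seconds; 3.5 hours below is just an example value):

```python
# Resume from an existing checkpoint and stop gracefully once the time
# limit is reached, so the checkpoint file stays consistent.
sampler = Sampler(prior, likelihood, filepath='checkpoint.hdf5', resume=True)
sampler.run(verbose=True, timeout=3.5 * 60 * 60)  # assumed limit in seconds
```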
Hi @johannesulf! I just wanted to let you know that I've been running inference with the updated version for the past couple of days now, and the `timeout` feature is working very nicely so far 🥳 Thanks a lot for implementing it so quickly!
Wonderful! Please don't hesitate to reach out if you have any further questions or suggestions.
Hi Johannes!
First things first, thanks a lot for your work — nautilus is undoubtedly my favorite nested sampling library! 🙂
I have one request / suggestion / question: Due to the way the cluster on which I use nautilus works, I need to be able to limit the maximum runtime of `run()` and then possibly restart the job from a checkpoint (see the sketch at the end of this post for the overall pattern). So far, I have been using an approach inspired by this solution, but I noticed that this is a bit too radical: if the timeout is reached while nautilus is writing data to an HDF5 file, I end up with a corrupted checkpoint from which I cannot resume (usually because of some shape mismatch).

Therefore, I was wondering if you could imagine adding a `max_runtime` argument to the `run()` method that would allow a more graceful way of limiting the runtime? I'd be fine with this being a "soft" limit that is only enforced whenever a checkpoint is created.

Alternatively, it would probably help me out already if I could manually create a checkpoint after interrupting the `run()` call. Is it sufficient if I do something like this, or will this lose information / still produce corrupted checkpoints? I noticed that there is also `write_shell_update()`, and I wasn't sure what to do about that because the `shell` information seems not that easily accessible after killing `run()`...

Any other idea would of course be welcome, too!
Thanks a lot in advance,
— Timothy