Thanks for filing the issue. What you're describing is likely the correct behavior. As you enter the sampling phase, the sampler may find new high-likelihood parameter combinations in regions previously sampled at very low density. Such points receive very large weights, since the weight is essentially the ratio of likelihood to sampling density. So it's not necessarily true that `n_eff` will increase monotonically in the sampling phase. Instead, it may suddenly drop to 1 if the sampler finds a new point with a much higher weight than all previous points.
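To make that drop concrete, here is a minimal sketch assuming the Kish estimator `n_eff = (sum of weights)^2 / (sum of squared weights)`, a standard effective-sample-size definition for weighted samples:

```python
import numpy as np

def n_eff(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(weights, dtype=float)
    return w.sum()**2 / np.sum(w**2)

# Many points with comparable weights: n_eff is close to the sample count.
w = np.ones(10_000)
print(n_eff(w))  # 10000.0

# One new point whose importance weight (likelihood over sampling density)
# dwarfs all the others collapses n_eff to roughly 1.
print(n_eff(np.append(w, 1e8)))  # ~1.0
```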
That the sampler suddenly finds high-likelihood points in regions of low sampling density indicates that the initial exploration phase wasn't thorough enough. My number one suggestion would be to increase the number of live points. Could you tell me more about your likelihood problem? For example, what's the number of model parameters?
Thank you for the quick reply, a real life saver when a deadline is breathing down my neck. If I understand correctly, during the sampling phase, sampling is performed within the shells defined during the exploration phase. In a case where `n_live` is low, regions of high likelihood may have been missed; these are then discovered within the shells due to the large number of samples being drawn, which leads to the large drops you mention. Is that correctly understood?
I'm currently working with a hierarchical model that consists of a two-component 3D Gaussian mixture and contains a numerical marginalization over a Gamma-distribution prior on one of the latent parameters. The basic model is described here, although I'm working on extensions that increase the dimensionality. In the runs where I encountered this issue the model dimensionality was `N_dim=20`. I've previously had good results using `n_live=4096`, but ran into cases where sampling would freeze for hours on end when adding a new bound. This happened around `N_bound > 80`, where the enclosed volume was low, around `logV ~ -70`. As far as I could figure out, this may have been due to a mismatch between the ellipsoids and the region of high likelihood learnt by the regressor, leading to extremely few samples being accepted. This is why I started setting `n_live` and `split_threshold` to lower values.
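As a back-of-the-envelope illustration of why such a mismatch can look like a freeze (all numbers below are assumptions for the sketch, not measurements from these runs):

```python
# If almost no proposals land inside the accepted region, the expected
# number of proposals per accepted point is 1 / p_accept.
p_accept = 1e-6             # assumed acceptance fraction after the mismatch
proposals_per_second = 1e4  # assumed raw proposal throughput
points_needed = 1024        # e.g. one batch of new points

seconds = points_needed / (p_accept * proposals_per_second)
print(f"~{seconds / 3600:.1f} hours")  # ~28.4 hours for these toy numbers
```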
Yes, your understanding regarding the drops in `n_eff` is correct. Can you maybe describe the problem with the `n_live=4096` run in more detail? Did it freeze when adding a new bound, i.e., before starting to sample the shell? This step may take longer, but not hours. Could you send me the checkpoint file for such a run when it freezes? I could then check exactly what's going on.
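For reference, a minimal sketch of how such a checkpoint file is produced, assuming nautilus' `filepath` and `resume` keywords and using toy stand-ins for the actual prior and likelihood:

```python
from nautilus import Prior, Sampler

# Toy stand-ins for the actual model, just to make the sketch runnable.
prior = Prior()
prior.add_parameter('x', dist=(-5, 5))

def likelihood(param_dict):
    return -0.5 * param_dict['x']**2

# With `filepath` set, nautilus periodically writes its state to an HDF5
# checkpoint; with `resume=True` (the default), rerunning picks up from
# that file, and the same file can be shared to inspect a frozen run.
sampler = Sampler(prior, likelihood, n_live=4096, n_batch=1024,
                  filepath='checkpoint.hdf5', resume=True)
sampler.run(verbose=True)
```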
Also, do you use Monte Carlo to perform the numerical marginalization over the Gamma distribution? In other words, is your likelihood noisy?
Sorry I wasn't clear enough, but you understood correctly. It seemingly freezes when adding a new bound, before sampling the shell. You can find a checkpoint file here from a run of a simpler version of the model (as far as I know, logging isn't currently up to standard) with `n_live=4096`, `n_batch=1024` and the remaining settings at default values. In this case the sampler froze for > 1 hr while adding the 82nd bound.
> Also, do you use Monte Carlo to perform the numerical marginalization over the Gamma distribution? In other words, is your likelihood noisy?
Missed this comment. No, I use quadpack to numerically marginalize over the Gamma distribution, so the likelihood should not be noisy.
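For illustration, a minimal sketch of this kind of deterministic marginalization using `scipy.integrate.quad` (which wraps QUADPACK); the Gaussian-times-Gamma integrand and its parameters are placeholders, not the actual model:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad  # wraps the QUADPACK routines

# Integrate a likelihood over a Gamma prior on a latent scale parameter s,
# i.e. L(d) = integral of L(d | s) * Gamma(s; a, scale) ds.
def marginal_likelihood(datum, a=2.0, scale=1.0):
    def integrand(s):
        return (stats.norm.pdf(datum, loc=0.0, scale=s)
                * stats.gamma.pdf(s, a, scale=scale))

    # Adaptive quadrature gives a deterministic answer, so the resulting
    # likelihood is noise-free, unlike a Monte Carlo estimate.
    value, abserr = quad(integrand, 0.0, np.inf)
    return value

print(marginal_likelihood(1.3))
```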
@jacob-hjortlund Thanks for all the input. The original issue you described here is not actually a problem, so I'll close this. But I opened a new issue, https://github.com/johannesulf/nautilus/issues/28, that deals with the problem you encountered with `n_live=4096`.
I'm currently using the Nautilus sampler via PyAutoFit (https://github.com/rhayes777/PyAutoFit/tree/feature/nautilus_w_sneakierpool) with a custom pool built around the Schwimmbad MPIPool, and I'm using this setup to rerun some results for my thesis (https://github.com/jacob-hjortlund/BayeSNova/tree/feature/refactor_w_autofit).
I have had pretty good results so far, but have run into a new issue during the sampling phase where the sampler slows down to unreasonable runtimes (~600 hrs for 5000 effective samples). I am currently running `autofit_test.py` with `n_live = n_batch = number_of_cores = 512`, `n_eff=5000` and `split_threshold=10`.
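A minimal sketch of that configuration, assuming the plain nautilus API and a bare Schwimmbad `MPIPool` rather than the custom pool and PyAutoFit wrapper; the prior and likelihood are toy stand-ins for the actual hierarchical model:

```python
import sys

from schwimmbad import MPIPool
from nautilus import Prior, Sampler

# Toy stand-ins for the actual model, just to make the sketch runnable.
prior = Prior()
prior.add_parameter('x', dist=(-5, 5))

def likelihood(param_dict):
    return -0.5 * param_dict['x']**2

with MPIPool() as pool:
    # Worker processes wait for tasks from the master process.
    if not pool.is_master():
        pool.wait()
        sys.exit(0)

    sampler = Sampler(prior, likelihood, n_live=512, n_batch=512,
                      split_threshold=10, pool=pool)
    sampler.run(n_eff=5000, verbose=True)
```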
The exploration phase seems to progress as expected, and once we reach the sampling phase, shells are filled within ~4 minutes. When sampling of the posterior begins, it initially progresses quickly, going from `n_eff=153` to `n_eff=822` within ~2 minutes. After this, however, `n_eff` seems to drop back to 1, and the expected runtime explodes from a few minutes to hundreds of hours.

Looking back, this may not be the clearest explanation. I guess I'm at a bit of a loss as to what is causing this. Is it expected behaviour that `n_eff` suddenly drops like this? As far as I understand the definition in your paper, `n_eff` should be monotonically increasing as we add samples, but that definitely doesn't seem to be the case here. Could it be due to `n_live = n_batch`? I don't necessarily see a reason why that should cause any issues, though.

I've added an excerpt below from the SLURM output file during the sampling phase to illustrate what I tried to describe.