johannesulf / nautilus

Neural Network-Boosted Importance Nested Sampling for Bayesian Statistics
https://nautilus-sampler.readthedocs.io
MIT License

Sampling slows down / resets during sampling phase #26

Closed: jacob-hjortlund closed this issue 1 year ago

jacob-hjortlund commented 1 year ago

I'm currently using the Nautilus sampler via PyAutoFit (https://github.com/rhayes777/PyAutoFit/tree/feature/nautilus_w_sneakierpool), with a custom pool built around the Schwimmbad MPIPool. I'm using this setup to rerun some results for my thesis (https://github.com/jacob-hjortlund/BayeSNova/tree/feature/refactor_w_autofit).

I have had pretty good results so far, but have run into a new issue during the sampling phase, where the sampler slows down to unreasonable runtimes (~600 hrs for 5000 effective samples). I am currently running autofit_test.py with n_live = n_batch = number_of_cores = 512, n_eff=5000, and split_threshold=10. The exploration phase progresses as expected, and once the sampling phase is reached, shells are filled within ~4 minutes. When sampling of the posterior begins, it initially progresses quickly, going from n_eff=153 to n_eff=822 within ~2 minutes. After this, however, n_eff drops back to 1 and the expected runtime explodes from a few minutes to hundreds of hours.
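For reference, the settings above translate roughly into the following nautilus call. This is a minimal sketch: the toy 20-dimensional Gaussian likelihood stands in for the actual BayeSNova model, and `pool=512` is a simplification of the custom Schwimmbad MPI pool.

```python
import numpy as np
from nautilus import Prior, Sampler

# Hypothetical stand-in for the actual 20-dimensional hierarchical model.
prior = Prior()
for i in range(20):
    prior.add_parameter(f'x_{i}')  # uniform on [0, 1] by default

def likelihood(param_dict):
    # Toy Gaussian log-likelihood centered on 0.5 in every dimension.
    x = np.array([param_dict[f'x_{i}'] for i in range(20)])
    return -0.5 * np.sum((x - 0.5) ** 2) / 0.01

sampler = Sampler(
    prior, likelihood,
    n_live=512,          # number of live points
    n_batch=512,         # likelihood evaluations per batch
    split_threshold=10,  # threshold used when splitting bounds
    pool=512,            # stand-in for the custom Schwimmbad MPI pool
)
sampler.run(n_eff=5000, verbose=True)
```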

Looking back, this may not be the clearest explanation. I guess I'm at a bit of a loss as to what is causing this. Is it expected behaviour that n_eff suddenly drops like this? As far as I understand the definition in your paper, n_eff should be monotonically increasing as samples are added, but that definitely doesn't seem to be the case here. Could it be due to n_live = n_batch? I don't necessarily see a reason why that should cause any issues, though.

I've added an excerpt below from the SLURM output file during the sampling phase to illustrate what I tried to describe.

Sampling shells:     done
N_like:            151552
N_eff:                153
log Z:          29252.903

Sampling posterior:   3%|▎         153/5000 [00:00<?, ?it/s]
Sampling posterior:   6%|▋         321/5000 [00:02<01:15, 62.03it/s]
Sampling posterior:   8%|▊         399/5000 [00:05<01:49, 41.86it/s]
Sampling posterior:   3%|▎         169/5000 [00:11<03:27, 23.23it/s]
Sampling posterior:   5%|▍         235/5000 [00:13<03:24, 23.36it/s]
Sampling posterior:   6%|▌         285/5000 [00:16<03:39, 21.44it/s]
Sampling posterior:   6%|▋         324/5000 [00:19<04:00, 19.42it/s]
Sampling posterior:   7%|▋         338/5000 [00:22<05:06, 15.20it/s]
Sampling posterior:   7%|▋         369/5000 [00:24<05:32, 13.91it/s]
Sampling posterior:   8%|▊         387/5000 [00:27<06:31, 11.79it/s]
Sampling posterior:   8%|▊         403/5000 [00:30<07:38, 10.02it/s]
Sampling posterior:   9%|▉         446/5000 [00:33<06:25, 11.80it/s]
Sampling posterior:   9%|▉         462/5000 [00:35<07:35,  9.97it/s]
Sampling posterior:  10%|▉         481/5000 [00:38<08:13,  9.15it/s]
Sampling posterior:  10%|▉         493/5000 [00:41<09:41,  7.75it/s]
Sampling posterior:  11%|█         541/5000 [00:43<06:58, 10.67it/s]
Sampling posterior:  11%|█▏        564/5000 [00:49<08:06,  9.13it/s]
Sampling posterior:  12%|█▏        605/5000 [00:52<06:57, 10.53it/s]
Sampling posterior:  12%|█▏        620/5000 [00:55<07:58,  9.15it/s]
Sampling posterior:  13%|█▎        639/5000 [00:57<08:27,  8.60it/s]
Sampling posterior:  13%|█▎        670/5000 [01:00<07:41,  9.39it/s]
Sampling posterior:  14%|█▍        716/5000 [01:03<06:14, 11.44it/s]
Sampling posterior:  15%|█▍        741/5000 [01:06<06:32, 10.85it/s]
Sampling posterior:  15%|█▌        753/5000 [01:08<07:51,  9.00it/s]
Sampling posterior:  16%|█▌        795/5000 [01:11<06:27, 10.85it/s]
Sampling posterior:  14%|█▍        704/5000 [01:16<06:51, 10.43it/s]
Sampling posterior:  15%|█▌        756/5000 [01:19<05:37, 12.56it/s]
Sampling posterior:  16%|█▌        785/5000 [01:22<05:45, 12.21it/s]
Sampling posterior:  16%|█▋        822/5000 [01:24<05:32, 12.57it/s]
Sampling posterior:   0%|          1/5000 [01:41<06:37, 12.57it/s]  
Sampling posterior:   0%|          2/5000 [02:13<51:58,  1.60it/s]
Sampling posterior:   0%|          2/5000 [02:15<55:30,  1.50it/s]
Sampling posterior:   0%|          3/5000 [03:01<2:20:23,  1.69s/it]
Sampling posterior:   0%|          3/5000 [03:04<2:26:59,  1.77s/it]
Sampling posterior:   0%|          4/5000 [03:25<3:41:18,  2.66s/it]
Sampling posterior:   0%|          4/5000 [03:28<3:53:54,  2.81s/it]
Sampling posterior:   0%|          5/5000 [03:41<5:19:11,  3.83s/it]
Sampling posterior:   0%|          5/5000 [03:44<5:42:50,  4.12s/it]
Sampling posterior:   0%|          5/5000 [03:55<7:45:45,  5.59s/it]
Sampling posterior:   0%|          5/5000 [03:57<8:25:17,  6.07s/it]
...
johannesulf commented 1 year ago

Thanks for filing the issue. What you're describing is likely correct behavior. As you enter the sampling phase, the sampler may find new high-likelihood parameter combinations in regions previously sampled at very low density. Such points receive very large weights, since the weight is essentially the ratio of likelihood to sampling density. So it's not true that n_eff must increase monotonically in the sampling phase. Instead, it may suddenly drop to 1 if the sampler finds a new point with a much higher weight than all previous points.
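For intuition, the effective sample size is commonly computed from the importance weights as n_eff = (Σᵢ wᵢ)² / Σᵢ wᵢ² (a standard definition; see the nautilus paper for the exact one used here). A minimal illustration of how a single point with an enormous weight drags n_eff back toward 1:

```python
import numpy as np

def n_eff(weights):
    """Effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

w = np.ones(1000)      # 1000 roughly equal-weight posterior samples
print(n_eff(w))        # 1000.0

w = np.append(w, 1e6)  # one new point with a vastly larger weight
print(n_eff(w))        # ~1.0: the single point now dominates
```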

The fact that you suddenly find high-likelihood points in regions of low sampling density indicates that the initial exploration phase wasn't thorough enough. My number-one suggestion would be to increase the number of live points. Could you tell me more about your likelihood problem? For example, what's the number of model parameters?

jacob-hjortlund commented 1 year ago

Thank you for the quick reply; a real lifesaver when a deadline is breathing down my neck. If I understand correctly, during the sampling phase, sampling is performed within the shells defined during the exploration phase. In this case, where n_live is low, regions of high likelihood may have been missed; these are then discovered within the shells due to the large number of samples being drawn, which leads to the large drops you mention. Is that correctly understood?

I'm currently working with a hierarchical model consisting of a two-component 3D Gaussian mixture that contains a numerical marginalization over a Gamma-distribution prior on one of the latent parameters. The basic model is described here, although I'm working on extensions that increase the dimensionality. In the runs where I encountered this issue, the model dimensionality was N_dim=20. I've previously had good results using n_live=4096, but ran into cases where sampling would freeze for hours on end when adding a new bound. This happened around N_bound > 80, where the enclosed volume was low, around logV ~ -70. As far as I could tell, that issue may have been due to a mismatch between the ellipsoids and the region of high likelihood learnt by the regressor, leading to extremely few samples being accepted. This is why I started setting n_live and split_threshold to lower values.

johannesulf commented 1 year ago

Yes, your understanding regarding the drops in n_eff is correct. Can you describe the problem with the n_live=4096 run in more detail? Did it freeze when adding a new bound, i.e., before starting to sample the shell? That step may take a while, but it shouldn't take hours. Could you send me the checkpoint file for such a run when it freezes? I could then check exactly what's going on.

johannesulf commented 1 year ago

Also, do you use Monte Carlo to perform the numerical marginalization over the Gamma distribution? In other words, is your likelihood noisy?

jacob-hjortlund commented 1 year ago

Sorry I wasn't clear enough, but you understood correctly. It seemingly freezes when adding a new bound, before sampling the shell. You can find a checkpoint file here from a run of a simpler version of the model with n_live=4096, n_batch=1024, and the remaining settings at their default values (as far as I know; logging isn't currently up to standard). In this case, the sampler froze for > 1 hr while adding the 82nd bound.

jacob-hjortlund commented 1 year ago

> Also, do you use Monte Carlo to perform the numerical marginalization over the Gamma distribution? In other words, is your likelihood noisy?

Missed this comment. No, I use quadpack to numerically marginalize over the Gamma distribution, so the likelihood should not be noisy.
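For concreteness, a deterministic marginalization of this kind might look like the sketch below. `conditional_likelihood` and the Gamma shape/scale values are placeholders, not the actual BayeSNova model; `scipy.integrate.quad` wraps QUADPACK, so repeated evaluations at the same parameters return identical values and the likelihood surface stays smooth, unlike a Monte Carlo marginalization.

```python
import numpy as np
from scipy import integrate, stats

def conditional_likelihood(x, theta):
    # Placeholder for the likelihood conditioned on the latent parameter x.
    return np.exp(-0.5 * (x - theta) ** 2)

def marginal_likelihood(theta, shape=2.0, scale=1.0):
    """L(theta) = integral over x of L(theta | x) * p(x),
    with p(x) a Gamma(shape, scale) prior on (0, inf)."""
    integrand = lambda x: (conditional_likelihood(x, theta)
                           * stats.gamma.pdf(x, a=shape, scale=scale))
    value, _ = integrate.quad(integrand, 0.0, np.inf)
    return value
```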

johannesulf commented 1 year ago

@jacob-hjortlund Thanks for all the input. The original issue you described here is not actually a problem, so I'll close this. But I've opened a new issue, https://github.com/johannesulf/nautilus/issues/28, that deals with the problem you encountered with n_live=4096.