joshspeagle / dynesty

Dynamic Nested Sampling package for computing Bayesian posteriors and evidences
https://dynesty.readthedocs.io/
MIT License

RuntimeError: Random walk sampling appears to be stuck! #304

Closed bjnorfolk closed 3 years ago

bjnorfolk commented 3 years ago

Hi,

I'm attempting to get dynesty to converge "nicely". My model has 19 parameters and I run it with:

ndim = 19
dsampler = dynesty.DynamicNestedSampler(ln_like, prior_transform, ndim=ndim,
                                        bound='single', sample='rwalk', walks=50)

dsampler.run_nested(dlogz_init=1e-10, nlive_init=4000, nlive_batch=4000,
                    wt_kwargs={'pfrac': 0.95})

Is setting dlogz_init=1e-10 an unrealistic value?

I'm not too sure how to get dynesty to always converge. For MCMC I'd just increase the walkers and steps, but with dynesty I've found I need to alter the scale of the log-likelihood function (where logl is a non-normalised chi2); otherwise the code runs for a relatively short time and the convergence isn't great. If I leave logl as the non-normalised chi2, however, it raises the RuntimeError. Any tips would be greatly appreciated.

Here's the full print out (since the code says it may be useful)

RuntimeError: Random walk sampling appears to be stuck! Some useful output quantities:
u: [0.61211578 0.29883424 0.30418985 0.26553322 0.77596979 0.12880671
 0.53166602 0.55272972 0.59784004 0.47026351 0.49768702 0.46909061
 0.0492889  0.6416871  0.47095095 0.39649497 0.56340283 0.58003664
 0.43703319]
drhat: [ 0.51942188  0.20292128 -0.25131422 -0.0462435  -0.04087769 -0.03899258
 -0.28354396 -0.2750504  -0.45671211  0.22560593  0.04020417 -0.13088042
  0.05010286  0.10831015  0.06636851 -0.0941852   0.0129034   0.33036297
 -0.22233308]
dr: [ 0.50421269  0.19697954 -0.24395549 -0.04488944 -0.03968075 -0.03785084
 -0.27524151 -0.26699665 -0.44333913  0.21899996  0.03902695 -0.12704811
  0.0486358   0.10513872  0.06442517 -0.09142736  0.01252557  0.32068962
 -0.21582295]
du: [ 6.34458514e-03  1.41546553e-04 -3.43680742e-04 -2.97290180e-04
 -2.27000337e-03 -1.17278180e-03 -1.36977299e-03 -7.22953104e-05
 -2.04123144e-04  4.20805157e-04 -4.65647849e-05 -1.46280561e-04
 -2.21868830e-03  1.99202809e-03 -7.68809013e-04  1.79377016e-03
  8.00802805e-05  1.15701526e-03 -8.07340036e-04]
u_prop: [0.61211578 0.29883424 0.30418985 0.26553322 0.77596979 0.12880671
 0.53166602 0.55272972 0.59784004 0.47026351 0.49768702 0.46909061
 0.0492889  0.6416871  0.47095095 0.39649497 0.56340283 0.58003664
 0.43703319]
loglstar: -989554.5
logl_prop: -989554.8125
axes: [[ 1.25831524e-02  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [-9.59494844e-06  7.43145437e-04  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 3.71301087e-05 -1.15653354e-03  5.51693936e-04  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 6.94499246e-06 -1.69150355e-04  8.73413246e-04  1.21183982e-03
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 2.73957906e-04 -8.60692885e-03  2.74503412e-03 -8.04795262e-06
   1.09480371e-03  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 1.24898412e-04 -5.24219719e-03  7.02620596e-04 -3.72378230e-04
   5.08216528e-04  7.47482506e-04  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [-3.19867398e-04 -1.66256241e-03 -4.87165649e-04  6.37138555e-04
  -2.21943007e-03  4.31434478e-04  3.78934642e-03  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 6.99011461e-05  7.50939116e-05 -3.75765342e-04 -5.11191945e-04
   8.43199529e-05 -4.94147094e-05 -3.04046425e-04  1.19537144e-03
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 2.09550450e-05  3.67508706e-05 -3.12002971e-05 -1.24150632e-04
   1.43859932e-04  8.81744894e-05 -2.82908663e-04  1.03809279e-04
   6.23040387e-04  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 3.05516274e-05  2.70754057e-04  1.29882321e-04 -1.00638958e-04
   2.85437772e-04  7.13926851e-07 -5.20945605e-04 -1.08674654e-04
  -1.27351404e-04  7.38481547e-04  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [-1.38433571e-04 -3.04582253e-05 -1.36178139e-05 -2.89814475e-05
   7.80309440e-05 -3.20340557e-05  3.34764725e-05 -1.31236383e-05
  -7.06886466e-06 -3.81492237e-05  9.58990041e-04  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [-1.92016931e-05 -7.64026977e-06  6.12720171e-05 -1.42674137e-04
  -2.05555985e-05  3.68160631e-05 -7.12946160e-06  3.92718877e-05
   2.26589176e-05  3.33146619e-06  3.96333729e-04  9.72871403e-04
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [-4.41093062e-03  1.89035515e-05  4.67148659e-06 -3.70515645e-06
  -1.84964955e-05 -3.39352322e-06  1.38457554e-05  3.16455044e-06
   8.71002817e-06  8.37528244e-06 -1.22932324e-05  7.23592550e-08
   1.83369330e-04  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [-3.35641371e-04  6.84401918e-03 -3.36150827e-03  4.37332517e-04
  -8.62379949e-04  1.61894799e-04  3.51631144e-04 -6.95863825e-05
   1.58963842e-04  1.01999644e-04  6.54033393e-05  3.89800472e-05
   1.73473518e-05  1.07010444e-03  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 1.77472187e-04 -4.75235561e-03 -3.20157828e-04 -3.03513420e-04
   1.04439265e-03  1.67462704e-04 -9.77002001e-05 -1.03067467e-05
   4.51667620e-05  1.45626808e-04  2.11614695e-05 -4.69488571e-05
  -6.59153625e-05 -5.98547171e-04  8.02648703e-04  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 9.21585938e-05  1.20520033e-03 -6.63482858e-03 -7.46695507e-03
   1.98472258e-04  5.08495839e-05  3.68956297e-04  3.08721385e-04
  -2.93650482e-04 -3.06530234e-04  5.83505785e-05  1.67152074e-05
   8.41808224e-05 -2.35639096e-04 -8.68709842e-04  2.58861780e-03
   0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [-2.35606757e-04 -1.01496058e-03  2.06209330e-04  3.83653812e-04
  -5.04916312e-04 -2.67252253e-03 -5.29116257e-04 -1.79550685e-04
  -2.60915858e-04 -3.75345682e-04 -1.27468841e-04 -1.57467145e-04
   7.57737881e-05  1.41913802e-04 -1.70889529e-03 -1.76620976e-03
   2.64012072e-03  0.00000000e+00  0.00000000e+00]
 [ 2.60843480e-04 -4.84038713e-05  7.51183672e-04  3.27388573e-04
   5.05505263e-04 -5.59759483e-04 -1.10650854e-03 -6.76308608e-05
   3.86106624e-05 -3.05519116e-04 -9.08987536e-05 -6.56174791e-05
  -1.32180212e-04 -8.39534513e-05 -1.36644947e-04 -1.06686592e-03
   1.91169322e-04  2.84573877e-03  0.00000000e+00]
 [-1.11306400e-04 -2.84578527e-04 -3.70878506e-04 -5.95121814e-05
   7.12390082e-04  8.54350560e-04  1.62642821e-03 -1.43116178e-04
   2.63919121e-04  3.27381262e-04  7.70220025e-05  1.31537510e-04
   1.39954314e-04  1.00476004e-03 -2.31798803e-04  4.33121317e-04
   1.87592818e-04 -1.22857147e-05  1.46214361e-03]]
scale: 6.699929041550063e-13.
joshspeagle commented 3 years ago

Is setting dlogz_init=1e-10 an unrealistic value?

Oh goodness, that's really small. dynesty's default convergence criterion for the initial run (and the use of dlogz in general) is based on an estimate of the remaining fraction of the evidence that still needs to be integrated. The default value of 0.01 means the remainder is <1% (and will in practice be less once the live points are "recycled" at the end of a batch/run), so 1e-10 says you want to keep integrating until only <1e-8% remains before terminating sampling. That leads to runs honing in on such a small region that you end up hitting numerical precision and sampling issues (notice that scale is 6.7e-13 when it should be of order ~0.1). You shouldn't need to set dlogz any smaller than 1e-3 or 1e-4, and even that is only if you want your evidence estimates to be really precise.
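To make that concrete, here's a quick back-of-the-envelope sketch (plain Python, not dynesty code): dlogz is roughly ln(1 + dZ/Z), so the fraction of the evidence left unintegrated at termination is about 1 - exp(-dlogz).

```python
import math

# dlogz is (approximately) ln(1 + dZ/Z): the log-ratio of the total
# evidence to the evidence accumulated so far. The fraction of the
# integral still unaccounted for at termination is therefore:
def remaining_fraction(dlogz):
    """Fraction of the evidence left unintegrated for a given dlogz."""
    return 1.0 - math.exp(-dlogz)

for dlogz in (0.01, 1e-3, 1e-10):
    print(f"dlogz={dlogz:g} -> {100 * remaining_fraction(dlogz):.2e}% remaining")
```

For dlogz=0.01 this is just under 1%; for dlogz=1e-10 it is ~1e-8%, which is far below what the sampler can meaningfully resolve.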

I'm not too sure how to get dynesty to always converge e.g. for mcmc I'd just increase the walkers and steps but I've found for dynesty that I need to alter the scale of the log-likelihood function - where logl = non-normalised chi2 - otherwise the code runs for a relatively minimal time and the convergence isn't great.

So, a couple of things here. "Convergence" in nested sampling means something different from MCMC, and neither is always well-defined by practitioners. Nested sampling aims to estimate the evidence integral (the posterior is a by-product), so "convergence" means estimating that integral to some precision. You can sometimes get excellent evidence estimates without very "thorough" sampling of the posterior, which is often what MCMC users mean by "convergence".

Alternately, if you mean it doesn't find the right solution or converges haphazardly or something else entirely, that might indicate other problems and is worth following up on in more detail (posting some trace plots, etc.). If you just want better-sampled posteriors, you can also play around with the stopping criteria in the Dynamic Nested Sampler (e.g., setting an effective sample size based on your MCMC runs after adjusting for the auto-correlation times).
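As a point of reference for such a target, here is a minimal sketch (plain numpy, not the dynesty API) of the Kish effective sample size computed from posterior importance weights, which you could compare against the ESS of your MCMC runs:

```python
import numpy as np

# Kish effective sample size: ESS = (sum w)^2 / sum(w^2).
# With equal weights this equals the number of samples; with highly
# uneven weights it collapses toward 1.
def effective_sample_size(weights):
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

print(effective_sample_size(np.ones(1000)))     # 1000.0
print(effective_sample_size([1.0, 1e-6, 1e-6])) # collapses toward 1
```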

The second is that if by altering the scale of your log-likelihood function you mean something like:

logl = s * (-0.5 * chi2), where s is a scale factor

this is not the right thing to do. It is equivalent to raising your likelihood function to some power, which can depart substantially from the assumptions you made when computing chi2 (e.g., that the data are normally distributed about your model). If you mean something else, it would be good to hear the details to better understand why/how it could impact the sampler's performance. Thanks!
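A toy numerical illustration of why this matters (illustrative values, not the poster's model): scaling logl by s is likelihood "tempering", and it directly rescales the posterior width, so the reported uncertainties no longer reflect the statistical model.

```python
import numpy as np

# Grid-based posterior for a 1-D unit-variance Gaussian log-likelihood
# under a flat prior; multiplying logl by `scale` tempers the likelihood.
x = np.linspace(-10, 10, 20001)
logl = -0.5 * x**2

def posterior_std(scale):
    """Posterior std when logl is multiplied by `scale` (flat prior)."""
    w = np.exp(scale * logl - (scale * logl).max())
    w /= w.sum()
    mean = np.sum(w * x)
    return np.sqrt(np.sum(w * (x - mean) ** 2))

print(posterior_std(1.0))   # ~1.0: the width the chi2 assumptions imply
print(posterior_std(0.25))  # ~2.0: dividing logl by 4 doubles the width
```

Dividing a chi2-based logl by 1e3 or 1e7, as described later in the thread, inflates the posterior width by the square root of that factor.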

bjnorfolk commented 3 years ago

Thanks for the reply @joshspeagle

I should have elaborated on what I meant by that scaling comment. I've found that with a non-normalised chi2, dynesty hits that RuntimeError. If I normalise the chi2 (which for my data effectively means dividing it by 1e7), dynesty's runtime is very short but the sampler doesn't appear to converge on a minimum for all parameters. If I instead divide by some value between the two extremes (say 1e3), dynesty appears to converge nicely. But the closer I let the chi2 get to its non-normalised magnitude, the longer the code runs and the more likely I am to hit that RuntimeError. This is where I began to (incorrectly) play with dlogz. Is there a specific form of the chi2 that dynesty works best with? Or is there some way I can change the stopping criteria so that dynesty runs for longer? Or is there a fix for the RuntimeError so I can just use the non-normalised chi2?

I've included two result plots: the first is for the normalised chi2 (effectively /1e7), the second for dividing by 1e3 (runtimes 15 min vs 55 min). As you can see, the convergence is better for the run closer to the non-normalised chi2.

For clarification, I'm trying to fit the visibilities from radio observations of protoplanetary discs where I'm calculating the non-normalised chi2 as chi2 = sum(weights*(Vis_data - Vis_model)**2). For the data I look at there's typically 1e6 to 1e7 points.
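A schematic version of that chi2 (all names and values illustrative, not the poster's actual pipeline) shows why the raw log-likelihood lands in the ~1e6 range seen in the error output above:

```python
import numpy as np

# Weighted visibility chi2, chi2 = sum(weights * (Vis_data - Vis_model)**2),
# with weights playing the role of inverse variances (1/sigma^2).
rng = np.random.default_rng(0)
n = 1_000_000                    # ~1e6 visibility points, as in the data
vis_data = rng.normal(0.0, 1.0, n)
vis_model = np.zeros(n)          # a stand-in "perfect" model
weights = np.ones(n)             # unit inverse variances

chi2 = np.sum(weights * (vis_data - vis_model) ** 2)
# With ~1e6 unit-variance residuals, chi2 itself is ~1e6, so the raw
# log-likelihood -0.5*chi2 is of order -5e5: the same magnitude as the
# loglstar values in the RuntimeError printout.
print(chi2 / n)  # ~1.0 per data point
```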

[Image: hd100546_dynasty_tf_nwlaks50_nlive1000_npoints_dynesty_results]

[Image: hd100546_dynasty_tf_nwlaks50_nlive1000_1e3_dynesty_results]

segasai commented 3 years ago

@bjnorfolk Dividing the chi-squares by a number is certainly not the right thing to do (at least if you care about your actual posterior). Regarding the underlying problem, my first suggestion is to use rslice, as it was recently significantly improved and I think it is extremely stable. The reason you hit this particular error is likely https://github.com/joshspeagle/dynesty/pull/258 -- that fix ensures the starting points always satisfy the logL > threshold condition, but it was applied only to slice sampling (where it's mandatory) and not to random walks, which I suspect is why you see an error similar to https://github.com/joshspeagle/dynesty/issues/250

bjnorfolk commented 3 years ago

Hi @segasai thanks for the suggestion. I tried rslice with

dsampler = dynesty.DynamicNestedSampler(ln_like, prior_transform, ndim=ndim,
                                        bound='single', sample='rslice')
dsampler.run_nested(dlogz_init=0.001, nlive_init=n_live, nlive_batch=n_live,
                    wt_kwargs={'pfrac': 0.95})

and got a similar error. (I'm on the latest version of dynesty via cloning and installing from the repo)

RuntimeError: Slice sampler has failed to find a valid point. Some useful output quantities:
u: [0.46943746 0.30621906 0.29782424 0.27017212 0.13160526 0.78594993
 0.57186066 0.53864133 0.57130389 0.47519395 0.13235085 0.4728272
 0.63396445 0.377243   0.56169056 0.56892499 0.44165118]
nstep_left: -1.73e-322
nstep_right: 5e-324
nstep_hat: 1.8e-322
u_prop: [0.46943746 0.30621906 0.29782424 0.27017212 0.13160526 0.78594993
 0.57186066 0.53864133 0.57130389 0.47519395 0.13235085 0.4728272
 0.63396445 0.377243   0.56169056 0.56892499 0.44165118]
loglstar: -10108963.0
logl_prop: -10108965.0
direction: [ 6.91938696e-04  9.45519109e-05 -9.91989602e-06  1.11198671e-04
  1.13267978e-04  4.61693733e-04 -3.04627822e-04  2.53959199e-04
 -4.94087731e-05 -2.16770074e-04 -6.90012501e-04 -5.04920161e-05
 -6.79578316e-04 -1.07719862e-03  1.48433833e-03  3.04994361e-04
 -4.42449724e-04]
segasai commented 3 years ago

Unless there is some bug I'm not seeing, it looks like you have an extremely discontinuous (or noisy) likelihood function: your logl appears to change by 2 (10108963.0 vs 10108965.0) when the position is shifted by roughly machine precision. I would run the code while printing the likelihood values and parameter values and check whether they make sense.
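One way to run that check (a generic sketch with a stand-in likelihood; substitute your own ln_like and working point): perturb one coordinate by amounts down to machine precision and watch how much logl moves.

```python
import numpy as np

# Stand-in for the real likelihood: smooth by construction, with a 1e7
# multiplier mimicking the magnitude of a non-normalised chi2.
def ln_like_demo(theta):
    return -0.5 * np.sum(theta ** 2) * 1e7

u0 = np.full(17, 0.5)  # a nominal 17-dim point, like the stuck sample above
for eps in (1e-6, 1e-9, 1e-12, 1e-15):
    du = np.zeros_like(u0)
    du[0] = eps
    delta = ln_like_demo(u0 + du) - ln_like_demo(u0)
    print(f"eps={eps:g}  dlogl={delta:.3g}")
# For a smooth likelihood, dlogl shrinks proportionally with eps. If it
# stays O(1) (or jumps) for eps ~ 1e-12, the likelihood is effectively
# discontinuous at the sampler's working resolution.
```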

bjnorfolk commented 3 years ago

I'm confident in my log-likelihood function, as it follows identical methods to previous literature. For long-baseline interferometric observations, the visibilities will differ from a model (for unbinned data) over a significant portion of the data. As you can see in the image below (orange is my model, blue is my data), beyond a certain point there will always be some non-zero chi2 contribution from each data point.

[Image: model (orange) vs data (blue) visibilities]

This is why I was curious what form of the logl works best for dynesty. The non-normalised chi2 will always be large, depending on the baseline range of my observations and how good the uv-coverage is at long baselines -- and, as you can see, dynesty does somewhat converge with an altered logl.

segasai commented 3 years ago

I don't do interferometry, so I can't comment on that aspect, but what the error message indicates to me is a likely discontinuity in the logl() function (i.e., a tiny change in parameters leads to a significant change in logl), and that's potentially quite problematic (depending on the details).
I would check the behaviour of logl around the point where dynesty gets stuck.
Regarding the 'form of logl': you should use whatever form is determined by your statistical model.

segasai commented 3 years ago

Okay, I've read your previous posts in detail, and it seems your likelihood function needs to be something like

logl = sum(-0.5*(data - model)**2/err**2 - ln(err))

where err is an extra free parameter for the noise of your data. (That assumes all your measurements are independent and share the same uncertainty.)
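A sketch of that likelihood (illustrative names; assumes independent measurements with a shared, unknown noise level err). The -n*ln(err) normalisation term is what penalises simply inflating err, so the data can constrain it:

```python
import numpy as np

# Gaussian log-likelihood with the per-point uncertainty `err` as a free
# parameter (constant normalisation included for completeness).
def ln_like(model, data, err):
    resid = data - model
    n = data.size
    return (-0.5 * np.sum(resid ** 2) / err ** 2
            - n * np.log(err)
            - 0.5 * n * np.log(2.0 * np.pi))

# Quick check: the likelihood peaks near the true noise level.
rng = np.random.default_rng(1)
data = rng.normal(0.0, 2.0, 10_000)   # true noise level: 2.0
model = np.zeros_like(data)
errs = [0.5, 1.0, 2.0, 4.0]
print(max(errs, key=lambda e: ln_like(model, data, e)))  # 2.0
```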

bjnorfolk commented 3 years ago

@segasai thanks for the suggestion, I went with something similar and the code now nicely converges. Thanks!