iancrossfield opened this issue 4 years ago (Open)
This is a confirmed bug that I have encountered several times myself when running this on our shared physics department servers, but have never encountered on my laptop. It is moderately reproducible (often failing repeatedly for the same analysis, but not after the same amount of time). It isn't a very common bug, at least for me: I only hit it with the occasional decorrelation method and only for some data sets. The bug was introduced when we started using parallel computing within emcee, and it can likely be eliminated by reverting to single-threaded emcee, but that comes with a large speed loss (at least a factor of 2).
Debugging this is tough since it's hard to track the state of the parallel emcee walkers to understand what goes wrong. My suspicion is that this has more to do with the emcee and multiprocessing packages than with SPCA, but it'll take some time to work this out.
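Until the root cause is found, one way to at least see *where* a run stalls is Python's built-in `faulthandler` module, which can dump every thread's stack trace after a timeout. A minimal sketch (the helper names and timeout are illustrative, not SPCA code):

```python
import faulthandler

def watch_for_hang(logfile, timeout_s=3600.0):
    """Arm a watchdog: if the process is still running after timeout_s
    seconds, dump every thread's stack trace to logfile (and repeat),
    showing where the sampler is stuck (e.g. in Pool communication)."""
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=logfile)

def disarm_watchdog():
    """Cancel the pending dump once run_mcmc finishes normally."""
    faulthandler.cancel_dump_traceback_later()
```

Arming this just before `sampler.run_mcmc(...)` and disarming it afterwards would leave a stack dump in the log whenever a run hangs, which should narrow down whether the walkers are stuck inside emcee, multiprocessing, or BLAS.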
For now, you could just run the MCMC replacing the sections that look like this:
```python
with threadpool_limits(limits=1, user_api='blas'):
    with Pool(ncpu) as pool:
        # sampler
        sampler = emcee.EnsembleSampler(nwalkers, ndim, templnprob, a=2, pool=pool)
        pos1, prob, state = sampler.run_mcmc(pos0, np.rint(nBurnInSteps2/nwalkers), progress=True)
```
with code that looks like this:
```python
sampler = emcee.EnsembleSampler(nwalkers, ndim, templnprob, a=2)
pos1, prob, state = sampler.run_mcmc(pos0, np.rint(nBurnInSteps2/nwalkers), progress=True)
```
Make sure to do that for both the second burn-in and the production run if the MCMC fails in both places.
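The trade-off above comes down to how the walkers' log-probabilities get evaluated. A stand-alone sketch of the serial-vs-pool pattern, with a toy `lnprob` standing in for SPCA's `templnprob` (all names here are illustrative):

```python
from multiprocessing import Pool

def lnprob(theta):
    """Toy Gaussian log-probability; stands in for templnprob."""
    return -0.5 * sum(t * t for t in theta)

def evaluate_walkers(positions, ncpu=1):
    """Evaluate lnprob at every walker position.

    ncpu=1 is the hang-free serial fallback; ncpu>1 mirrors what
    emcee does internally when handed pool=Pool(ncpu)."""
    if ncpu == 1:
        return [lnprob(p) for p in positions]
    with Pool(ncpu) as pool:
        return pool.map(lnprob, positions)
```

The serial path gives up the roughly 2x speedup, but it removes inter-process communication entirely, which is why it sidesteps the hang.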
Actually, does it still hang for you if you get rid of just the `with threadpool_limits(limits=1, user_api='blas'):` line? Some tests I just did suggest it doesn't speed up the code appreciably, and it'd be good to rule that part of the code out at least. I can't remember which phasecurves and which decorrelation models would hang for me, so it'd be helpful if you could delete that line (and unindent the few lines of code that follow it) and see if it still crashes for your dataset. If it does, then at least I'll have a better idea of where to look for bugs.
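If threadpoolctl's context manager does turn out to be implicated, the usual environment-variable route achieves the same single-threaded BLAS behaviour without it. A sketch (these would need to be set before numpy is imported):

```python
import os

# Equivalent in effect to threadpool_limits(limits=1, user_api='blas'):
# cap the BLAS/OpenMP thread pools at one thread per process so the
# emcee Pool workers don't oversubscribe the shared server's cores.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"
```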
With two separate data sets, the Poly-BLISS-GP code hangs during the MCMC phase. It gets through burn-in and part of the way through the full MCMC, but then hangs 60-90% of the way through. Maybe there's a validated sample data set that one could test things on first, before moving to untested data?
Here's what I get:
(and the code hung there for ~24 hours or so). One CPU core continued to run at 100%, but the progress bar never advanced again.