iancrossfield opened this issue 4 years ago (Open)
This is a confirmed bug that I have encountered several times myself when running this on our shared physics department servers, but have never encountered on my laptop. It is moderately reproducible (often failing repeatedly for the same analysis, but not after the same amount of time). It isn't a very common bug, at least for me: I only hit it with the occasional decorrelation method and only for some data sets. The bug was introduced when we started using parallel computing within emcee, and it can likely be eliminated by reverting to single-threaded emcee, but that comes with a large speed loss (at least a factor of 2).
Debugging this is tough since it's hard to track the state of the parallel emcee walkers to understand what goes wrong. My suspicion is that this has more to do with the emcee and multiprocessing packages than with SPCA, but it'll take some time to work this out.
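Until the root cause is found, one way to at least see *where* a run stalls is Python's built-in `faulthandler` module, which can dump every thread's stack trace after a timeout. A minimal sketch (the helper names and timeout are illustrative, not SPCA code):

```python
import faulthandler

def watch_for_hang(logfile, timeout_s=3600.0):
    """Arm a watchdog: if the process is still running after timeout_s
    seconds, dump every thread's stack trace to logfile (and repeat),
    showing where the sampler is stuck (e.g. in Pool communication)."""
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=logfile)

def disarm_watchdog():
    """Cancel the pending dump once run_mcmc finishes normally."""
    faulthandler.cancel_dump_traceback_later()
```

Arming this just before `sampler.run_mcmc(...)` and disarming it afterwards would leave a stack dump in the log whenever a run hangs, which should narrow down whether the walkers are stuck inside emcee, multiprocessing, or BLAS.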
For now, you could just run the MCMC replacing the sections that look like this:
```python
with threadpool_limits(limits=1, user_api='blas'):
    with Pool(ncpu) as pool:
        # sampler
        sampler = emcee.EnsembleSampler(nwalkers, ndim, templnprob, a=2, pool=pool)
        pos1, prob, state = sampler.run_mcmc(pos0, np.rint(nBurnInSteps2/nwalkers), progress=True)
```
with code that looks like this:
```python
sampler = emcee.EnsembleSampler(nwalkers, ndim, templnprob, a=2)
pos1, prob, state = sampler.run_mcmc(pos0, np.rint(nBurnInSteps2/nwalkers), progress=True)
```
Make sure to do that for both the second burn-in and the production run if the MCMC fails in both places.
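The trade-off above comes down to how the walkers' log-probabilities get evaluated. A stand-alone sketch of the serial-vs-pool pattern, with a toy `lnprob` standing in for SPCA's `templnprob` (all names here are illustrative):

```python
from multiprocessing import Pool

def lnprob(theta):
    """Toy Gaussian log-probability; stands in for templnprob."""
    return -0.5 * sum(t * t for t in theta)

def evaluate_walkers(positions, ncpu=1):
    """Evaluate lnprob at every walker position.

    ncpu=1 is the hang-free serial fallback; ncpu>1 mirrors what
    emcee does internally when handed pool=Pool(ncpu)."""
    if ncpu == 1:
        return [lnprob(p) for p in positions]
    with Pool(ncpu) as pool:
        return pool.map(lnprob, positions)
```

The serial path gives up the roughly 2x speedup, but it removes inter-process communication entirely, which is why it sidesteps the hang.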
Actually, does it still hang for you if you get rid of just the `with threadpool_limits(limits=1, user_api='blas'):` line? Some tests I just did suggest it doesn't speed up the code appreciably, and it'd be good to rule that part of the code out at least. I can't remember which phasecurves and which decorrelation models would hang for me, so it'd be helpful if you could delete that line (and unindent the few lines of code that follow it) and see if it still crashes for your dataset. If it does, then at least I'll have a better idea of where to look for bugs.
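If threadpoolctl's context manager does turn out to be implicated, the usual environment-variable route achieves the same single-threaded BLAS behaviour without it. A sketch (these would need to be set before numpy is imported):

```python
import os

# Equivalent in effect to threadpool_limits(limits=1, user_api='blas'):
# cap the BLAS/OpenMP thread pools at one thread per process so the
# emcee Pool workers don't oversubscribe the shared server's cores.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"
```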
With two separate data sets, the Poly-BLISS-GP code hangs during the MCMC phase. It gets through burn-in and part of the way through the full MCMC, but then hangs 60-90% of the way through. Maybe there's a validated sample data set that one could test things on first, before moving to untested data?
Here's what I get:
(and the code hung there for ~24 hours or so). One CPU core continued to run at 100%, but the progress bar never advanced again.