EOFError when multi-processing and orbit is variable

eas342 commented 1 year ago

Describe the bug I had an issue with sampling in starry on multiple cores when the orbit is variable. This had me running everything more slowly on one core for a while. The error traceback I got was:

Error traceback

--------------------------------------------------------------------------- EOFError Traceback (most recent call last) Input In [24], in () 1 with model: 2 #trace = pmx.sample( ----> 3 trace = pm.sample( 4 tune=250, 5 draws=500, 6 start=map_soln, 7 chains=4, 8 cores=4, 9 target_accept=0.9, 10 ) File ~/miniconda3/envs/ABATE/lib/python3.9/site-packages/pymc3/sampling.py:559, in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, callback, jitter_max_retries, return_inferencedata, idata_kwargs, mp_ctx, pickle_backend, **kwargs) 557 _print_step_hierarchy(step) 558 try: --> 559 trace = _mp_sample(**sample_args, **parallel_args) 560 except pickle.PickleError: 561 _log.warning("Could not pickle model, sampling singlethreaded.") File ~/miniconda3/envs/ABATE/lib/python3.9/site-packages/pymc3/sampling.py:1477, in _mp_sample(draws, tune, step, chains, cores, chain, random_seed, start, progressbar, trace, model, callback, discard_tuned_samples, mp_ctx, pickle_backend, **kwargs) 1475 try: 1476 with sampler: -> 1477 for draw in sampler: 1478 trace = traces[draw.chain - chain] 1479 if trace.supports_sampler_stats and draw.stats is not None: File ~/miniconda3/envs/ABATE/lib/python3.9/site-packages/pymc3/parallel_sampling.py:479, in ParallelSampler.__iter__(self) 476 self._progress.update(self._total_draws) 478 while self._active: --> 479 draw = ProcessAdapter.recv_draw(self._active) 480 proc, is_last, draw, tuning, stats, warns = draw 481 self._total_draws += 1 File ~/miniconda3/envs/ABATE/lib/python3.9/site-packages/pymc3/parallel_sampling.py:351, in ProcessAdapter.recv_draw(processes, timeout) 349 idxs = {id(proc._msg_pipe): proc for proc in processes} 350 proc = idxs[id(ready[0])] --> 351 msg = ready[0].recv() 353 if msg[0] == "error": 354 warns, old_error = msg[1:] File ~/miniconda3/envs/ABATE/lib/python3.9/multiprocessing/connection.py:255, in _ConnectionBase.recv(self) 253 self._check_closed() 254 self._check_readable() --> 255 buf = self._recv_bytes() 256 return _ForkingPickler.loads(buf.getbuffer()) File ~/miniconda3/envs/ABATE/lib/python3.9/multiprocessing/connection.py:419, in Connection._recv_bytes(self, maxsize) 418 def _recv_bytes(self, maxsize=None): --> 419 buf = self._recv(4) 420 size, = struct.unpack("!i", buf.getvalue()) 421 if size == -1: File ~/miniconda3/envs/ABATE/lib/python3.9/multiprocessing/connection.py:388, in Connection._recv(self, size, read) 386 if n == 0: 387 if remaining == size: --> 388 raise EOFError 389 else: 390 raise OSError("got end of file during message") EOFError:

To Reproduce

Minimal-ish example adapted from the "Hot jupiter phase curve example"

import starry import matplotlib.pyplot as plt import numpy as np import pymc3 as pm import pymc3_ext as pmx import exoplanet starry.config.quiet = True np.random.seed(1) # In[7]: A = starry.Primary(starry.Map(ydeg=0, udeg=2, amp=1.0), m=1.0, r=1.0, prot=1.0) A.map[1] = 0.4 A.map[2] = 0.2 # In[8]: # These are the parameters we're going to try to infer log_amp_true = -3.0 offset_true = 30.0 b = starry.Secondary( starry.Map(ydeg=1, udeg=0, amp=10 ** log_amp_true, inc=90.0, obl=0.0), m=0.0, r=0.1, inc=90.0, prot=1.0, porb=1.0, ) b.map[1, 0] = 0.5 b.theta0 = 180.0 + offset_true # In[9]: sys = starry.System(A, b) # In[10]: t = np.linspace(-0.3, 1.3, 1000) flux_true = sys.flux(t).eval() ferr = 1e-4 flux = flux_true + ferr * np.random.randn(len(t)) plt.figure(figsize=(12, 5)) plt.plot(t, flux, "k.", alpha=0.3, ms=3) plt.plot(t, flux_true) plt.xlabel("Time [days]", fontsize=24) plt.ylabel("Flux [normalized]", fontsize=24); # In[21]: with pm.Model() as model: # These are the variables we're solving for; # here we're placing wide Gaussian priors on them. #offset = pm.Normal("offset", 0.0, 50.0, testval=0.11) offset=offset_true log_amp = pm.Normal("log_amp", -4.0, 2.0, testval=-3.91) porb = pm.Normal('porb',mu=1.0,sigma=0.02) #porb = 1.0 # Instantiate the star; all its parameters are assumed # to be known exactly A = starry.Primary( starry.Map(ydeg=0, udeg=2, amp=1.0, inc=90.0, obl=0.0), m=1.0, r=1.0, prot=1.0 ) A.map[1] = 0.4 A.map[2] = 0.2 # Instantiate the planet. Everything is fixed except for # its luminosity and the hot spot offset. b = starry.Secondary( starry.Map(ydeg=1, udeg=0, amp=10 ** log_amp, inc=90.0, obl=0.0), m=0.0, r=0.1, prot=1.0, porb=porb, ) b.map[1, 0] = 0.5 b.theta0 = 180.0 + offset # Instantiate the system as before sys = starry.System(A, b) # Our model for the flux flux_model = pm.Deterministic("flux_model", sys.flux(t)) # This is how we tell `pymc3` about our observations; # we are assuming they are ampally distributed about # the true model. This line effectively defines our # likelihood function. pm.Normal("obs", flux_model, sd=ferr, observed=flux) # In[22]: with model: map_soln = pmx.optimize() # In[23]: plt.figure(figsize=(12, 5)) plt.plot(t, flux, "k.", alpha=0.3, ms=3) plt.plot(t, map_soln["flux_model"]) plt.xlabel("Time [days]", fontsize=24) plt.ylabel("Flux [normalized]", fontsize=24); # In[24]: with model: trace = pm.sample( tune=250, draws=500, start=map_soln, chains=4, cores=4, target_accept=0.9, )

Expected behavior Should sample the posterior and calculate a trace object. Instead, I get the error.

Your setup (please complete the following information):

Version of starry: 1.2.0
Operating system: Mac OS X 12.0.1
Python version & installation method (pip, conda, etc.): Python 3.9.12, pip 21.2.4, conda 4.10.1, M1pro chip

Additional context This happened to me in a fairly specific set of circumstances:

when the number of cores > 1
when the orbital parameters are variable However, this is the set of circumstances in which I primarily use starry.

I have found a solution/workaround and wanted to share it in case anyone else gets EOFError.

eas342 commented 1 year ago

The error goes away if I change the sampling from pymc3 to pymc3_ext. I wish I'd noticed that sooner!

This gives an EOFError:

import pymc3_ext as pmx
import pymc3 as pm

trace = pm.sample(
    tune=250,
    draws=500,
    start=map_soln,
    chains=4,
    cores=4,
    target_accept=0.9,
)

This works with no problem. Notice the X in pmx!:

import pymc3_ext as pmx
import pymc3 as pm

trace = pmx.sample(
    tune=250,
    draws=500,
    start=map_soln,
    chains=4,
    cores=4,
    target_accept=0.9,
)

dfm commented 1 year ago

Good find! I expect this has something to do with these multiprocessing hacks in pymc3-ext:

https://github.com/exoplanet-dev/pymc-ext/blob/ef9bc12b8b16bd0991af70f23c59f1e32d9944be/src/pymc3_ext/sampling/sampling.py#LL100C1-L107C30

PyMC3 has always had some serious issues with multiprocessing on Macs, and I never loved this "solution", but it typically seems to work!

catrionamurray commented 1 year ago

I get this bug persisting on an old Mac (Intel 2019 OSx13.3.1) even when using pymc3_ext.sample for >1 core. I believe this isn't an issue for my M1 chip Mac.

starry v1.2.0 Python v3.9.7 (default, Sep 16 2021, 08:50:36) [Clang 10.0.0 ] numpy v1.21.6 chromatic v0.4.5 pymc3 v3.11.4 pymc3_ext v0.1.1

dfm commented 1 year ago

@catrionamurray — Bummer! I'm not too sure what to recommend and I don't have an Intel mac to test this locally. One option for the short term might be to just run multiple copies of your script (each with 1 CPU) and then combine the chains after... Unfortunately I'm about to go on leave and working at limited capacity so I can't be super helpful in the short term - sorry!!

catrionamurray commented 1 year ago

Actually it seems I still get this issue on my M1 chip, and changing to pmx.sample doesn't seem to solve it for me...

rodluger / starry

EOFError when multi-processing and orbit is variable #308