pymc-devs / pymc

Bayesian Modeling and Probabilistic Programming in Python

ConnectionResetError from multiprocess sampling #7354

Open fonnesbeck opened 2 weeks ago

fonnesbeck commented 2 weeks ago

### Describe the issue:

This has come up in the past (#6852, #4167) and has now started cropping up again. Multiprocess sampling fails partway through a run with a `ConnectionResetError`. Most recently, it has been happening to me on Linux (Fedora).

A workaround is to change the random seed passed to the sampler; with a different seed the run usually completes.
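The seed-change workaround can be automated with a small retry wrapper. This is only a sketch: `sample_with_seed_retry` is a hypothetical helper, not part of PyMC, and it assumes the failure surfaces as a `ConnectionResetError` in the calling process, as in the traceback in this report.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def sample_with_seed_retry(sample_fn: Callable[[int], T], seeds: Iterable[int]) -> T:
    """Call sample_fn(seed) for each seed in turn until one call succeeds.

    Hypothetical helper: only ConnectionResetError is retried, any other
    exception propagates immediately.
    """
    last_error = None
    for seed in seeds:
        try:
            return sample_fn(seed)
        except ConnectionResetError as err:
            # Remember the failure and try again with the next seed.
            last_error = err
    if last_error is None:
        raise ValueError("no seeds were provided")
    # Every seed failed: re-raise the most recent error.
    raise last_error


# Possible usage with the call from this report (not executed here):
# ptrace = sample_with_seed_retry(
#     lambda seed: pm.sample(100, chains=6, cores=4, random_seed=seed, model=ad_spend_model),
#     seeds=[101, 102, 103],
# )
```

Retrying with a new seed only hides the problem and changes the draws, so it is a stopgap rather than a fix.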

Details below.

### Reproducible code example:

The failure seems to be stochastic, so it is hard to reproduce.

### Error message:

ConnectionResetError                      Traceback (most recent call last)
Cell In[28], line 2
      1 with ad_spend_model:
----> 2     ptrace = pm.sample(100, chains=6, cores=4, random_seed=random_seed)

File ~/repos/pymc/pymc/sampling/, in sample(draws, tune, chains, cores, random_seed, progressbar, progressbar_theme, step, var_names, nuts_sampler, initvals, init, jitter_max_retries, n_init, trace, discard_tuned_samples, compute_convergence_checks, keep_warning_stat, return_inferencedata, idata_kwargs, nuts_sampler_kwargs, callback, mp_ctx, blas_cores, model, **kwargs)
    839 _print_step_hierarchy(step)
    840 try:
--> 841     _mp_sample(**sample_args, **parallel_args)
    842 except pickle.PickleError:
    843     _log.warning("Could not pickle model, sampling singlethreaded.")

File ~/repos/pymc/pymc/sampling/, in _mp_sample(draws, tune, step, chains, cores, random_seed, start, progressbar, progressbar_theme, traces, model, callback, blas_cores, mp_ctx, **kwargs)
   1252 try:
   1253     with sampler:
-> 1254         for draw in sampler:
   1255             strace = traces[draw.chain]
   1256             strace.record(draw.point, draw.stats)

File ~/repos/pymc/pymc/sampling/, in ParallelSampler.__iter__(self)
    464 task = progress.add_task(
    465     self._desc.format(self),
    466     completed=self._completed_draws,
    467     total=self._total_draws,
    468 )
    470 while self._active:
--> 471     draw = ProcessAdapter.recv_draw(self._active)
    472     proc, is_last, draw, tuning, stats = draw
    473     self._completed_draws += 1

File ~/repos/pymc/pymc/sampling/, in ProcessAdapter.recv_draw(processes, timeout)
    326 idxs = {id(proc._msg_pipe): proc for proc in processes}
    327 proc = idxs[id(ready[0])]
--> 328 msg = ready[0].recv()
    330 if msg[0] == "error":
    331     old_error = msg[1]

File ~/miniforge3/envs/pymc_course/lib/python3.12/multiprocessing/, in _ConnectionBase.recv(self)
    248 self._check_closed()
    249 self._check_readable()
--> 250 buf = self._recv_bytes()
    251 return _ForkingPickler.loads(buf.getbuffer())

File ~/miniforge3/envs/pymc_course/lib/python3.12/multiprocessing/, in Connection._recv_bytes(self, maxsize)
    429 def _recv_bytes(self, maxsize=None):
--> 430     buf = self._recv(4)
    431     size, = struct.unpack("!i", buf.getvalue())
    432     if size == -1:

File ~/miniforge3/envs/pymc_course/lib/python3.12/multiprocessing/, in Connection._recv(self, size, read)
    393 remaining = size
    394 while remaining > 0:
--> 395     chunk = read(handle, remaining)
    396     n = len(chunk)
    397     if n == 0:

ConnectionResetError: [Errno 104] Connection reset by peer
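
The traceback ends in `Connection.recv()`, so the error is raised while the main process reads from a pipe to a worker. As an illustration only, and not a diagnosis of this bug, the sketch below shows one way `recv()` on a `multiprocessing` pipe can fail: the other end is closed while a message sent to it is still unread. The function name `recv_after_peer_closed` is made up for this example, and the exact exception type is platform dependent (the `ConnectionResetError` case is what I would expect on Linux, other systems may report `EOFError` instead).

```python
from multiprocessing import Pipe


def recv_after_peer_closed():
    """Return the exception raised by recv() once the peer end is gone.

    Illustrative sketch: both ends live in one process here, whereas in
    the report the peer is a separate sampling worker.
    """
    parent_conn, child_conn = Pipe()  # duplex pipe
    parent_conn.send("ping")  # message stays unread in the peer's buffer
    child_conn.close()        # peer goes away without reading it
    try:
        parent_conn.recv()
    except (EOFError, OSError) as err:
        # ConnectionResetError is a subclass of OSError.
        return err
    finally:
        parent_conn.close()
    return None


if __name__ == "__main__":
    err = recv_after_peer_closed()
    print(type(err).__name__, err)
```

If the real failure follows this pattern, it would point to a worker process exiting unexpectedly, but that is an assumption this report does not confirm.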

### PyMC version information:

Python version       : 3.12.3
pymc      : 5.15.1+17.g508a1341f.dirty
pytensor  : 2.22.1

### Context for the issue:

_No response_