pymc-devs / pymc

Bayesian Modeling and Probabilistic Programming in Python
https://docs.pymc.io/

ConnectionResetError from multiprocess sampling #7354

Open fonnesbeck opened 2 weeks ago

fonnesbeck commented 2 weeks ago

### Describe the issue:

This has come up in the past (#6852, #4167) and has now started cropping up again. Multiprocess sampling fails partway through a run with a ConnectionResetError. Most recently, it has been happening to me on Linux (Fedora).

A workaround is to change the sampler's random seed; with a different seed, the run usually completes.
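The workaround can be automated as a retry loop. This is only a sketch: `sample_with_seed_retry` is a hypothetical helper, not part of PyMC, and `sample_fn` stands for whatever callable wraps your own `pm.sample(..., random_seed=seed)` call. It hides the failure rather than fixing it.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

T = TypeVar("T")


def sample_with_seed_retry(
    sample_fn: Callable[[int], T], seeds: Iterable[int]
) -> Tuple[int, T]:
    """Call sample_fn(seed) for each seed until one run completes.

    Hypothetical helper (not a PyMC API). Returns the seed that worked
    together with whatever sample_fn returned for it.
    """
    last_error: Optional[BaseException] = None
    for seed in seeds:
        try:
            return seed, sample_fn(seed)
        except ConnectionResetError as err:
            # Only the error from this issue is retried; anything else propagates.
            last_error = err
            print(f"seed {seed}: ConnectionResetError, trying the next seed")
    raise RuntimeError("no seed produced a completed run") from last_error


# Intended use with the call from the traceback (needs PyMC and a model):
# with ad_spend_model:
#     seed, ptrace = sample_with_seed_retry(
#         lambda s: pm.sample(100, chains=6, cores=4, random_seed=s),
#         seeds=[1, 2, 3],
#     )
```

Returning the seed alongside the trace keeps the run reproducible, since the seed that finally worked is the one that defines the result.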

Details below.

### Reproducible code example:

The failure seems to be stochastic, so it is hard to reproduce.

### Error message:

---------------------------------------------------------------------------
ConnectionResetError                      Traceback (most recent call last)
Cell In[28], line 2
      1 with ad_spend_model:
----> 2     ptrace = pm.sample(100, chains=6, cores=4, random_seed=random_seed)

File ~/repos/pymc/pymc/sampling/mcmc.py:841, in sample(draws, tune, chains, cores, random_seed, progressbar, progressbar_theme, step, var_names, nuts_sampler, initvals, init, jitter_max_retries, n_init, trace, discard_tuned_samples, compute_convergence_checks, keep_warning_stat, return_inferencedata, idata_kwargs, nuts_sampler_kwargs, callback, mp_ctx, blas_cores, model, **kwargs)
    839 _print_step_hierarchy(step)
    840 try:
--> 841     _mp_sample(**sample_args, **parallel_args)
    842 except pickle.PickleError:
    843     _log.warning("Could not pickle model, sampling singlethreaded.")

File ~/repos/pymc/pymc/sampling/mcmc.py:1254, in _mp_sample(draws, tune, step, chains, cores, random_seed, start, progressbar, progressbar_theme, traces, model, callback, blas_cores, mp_ctx, **kwargs)
   1252 try:
   1253     with sampler:
-> 1254         for draw in sampler:
   1255             strace = traces[draw.chain]
   1256             strace.record(draw.point, draw.stats)

File ~/repos/pymc/pymc/sampling/parallel.py:471, in ParallelSampler.__iter__(self)
    464 task = progress.add_task(
    465     self._desc.format(self),
    466     completed=self._completed_draws,
    467     total=self._total_draws,
    468 )
    470 while self._active:
--> 471     draw = ProcessAdapter.recv_draw(self._active)
    472     proc, is_last, draw, tuning, stats = draw
    473     self._completed_draws += 1

File ~/repos/pymc/pymc/sampling/parallel.py:328, in ProcessAdapter.recv_draw(processes, timeout)
    326 idxs = {id(proc._msg_pipe): proc for proc in processes}
    327 proc = idxs[id(ready[0])]
--> 328 msg = ready[0].recv()
    330 if msg[0] == "error":
    331     old_error = msg[1]

File ~/miniforge3/envs/pymc_course/lib/python3.12/multiprocessing/connection.py:250, in _ConnectionBase.recv(self)
    248 self._check_closed()
    249 self._check_readable()
--> 250 buf = self._recv_bytes()
    251 return _ForkingPickler.loads(buf.getbuffer())

File ~/miniforge3/envs/pymc_course/lib/python3.12/multiprocessing/connection.py:430, in Connection._recv_bytes(self, maxsize)
    429 def _recv_bytes(self, maxsize=None):
--> 430     buf = self._recv(4)
    431     size, = struct.unpack("!i", buf.getvalue())
    432     if size == -1:

File ~/miniforge3/envs/pymc_course/lib/python3.12/multiprocessing/connection.py:395, in Connection._recv(self, size, read)
    393 remaining = size
    394 while remaining > 0:
--> 395     chunk = read(handle, remaining)
    396     n = len(chunk)
    397     if n == 0:

ConnectionResetError: [Errno 104] Connection reset by peer
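The traceback ends in the parent process, inside `recv()` on a worker's message pipe. As a point of comparison, the stdlib-only sketch below shows one way a parent can hit that kind of error: the worker exits while a message sent to it is still unread. This is not PyMC code and is not a confirmed diagnosis of this issue; `_worker_that_dies` and `recv_from_dead_worker` are made-up names, and the `"fork"` start method assumes a POSIX platform.

```python
import multiprocessing as mp
import os


def _worker_that_dies(conn):
    # Simulate a crashed worker: exit immediately, never reading or replying.
    os._exit(1)


def recv_from_dead_worker() -> str:
    """Return the name of the exception the parent sees when it calls recv()."""
    ctx = mp.get_context("fork")  # assumption: POSIX; "fork" is unavailable on Windows
    parent_conn, child_conn = ctx.Pipe()  # duplex pipe between parent and worker
    proc = ctx.Process(target=_worker_that_dies, args=(child_conn,))
    proc.start()
    parent_conn.send("message the worker never reads")
    child_conn.close()  # the parent drops its own copy of the worker's end
    proc.join()  # wait until the worker is gone
    try:
        parent_conn.recv()
    except (EOFError, ConnectionResetError) as err:
        return type(err).__name__
    finally:
        parent_conn.close()
    return "no error"


if __name__ == "__main__":
    print(recv_from_dead_worker())
```

On Linux the unread message is expected to turn the closed pipe into a `ConnectionResetError`, while other platforms may report `EOFError` instead. If something similar is happening here, the worker's exit code or the system log (for example an out-of-memory kill) would likely be more informative than the parent-side traceback.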


### PyMC version information:

Python version       : 3.12.3
pymc      : 5.15.1+17.g508a1341f.dirty
pytensor  : 2.22.1

### Context for the issue:

_No response_