uqfoundation / pathos

parallel graph management and execution in heterogeneous computing
http://pathos.rtfd.io

BrokenPipeError: [Errno 32] Broken pipe (in multiprocess) #206

Closed · 3fon3fonov closed this 3 years ago

3fon3fonov commented 3 years ago

I am running into serious trouble with some really expensive runs, and I think the problem could be in pathos, but I am not 100% sure. I am using the "dynesty" nested-sampling sampler to get posteriors of a really heavy model, like this:

import dynesty
from pathos.pools import ProcessPool as Pool

# partial_func and prior_transform are the log-likelihood and prior-transform
# functions defined elsewhere in my tool

dynesty_samp = "rwalk"
print_progress = True
N_threads = 80
stop_crit = 0.01
Dynamic_nest = True
ns_bound = "multi"
ns_pfrac = 1.0
ns_use_stop = True
ns_maxiter = None  # no iteration limit
ns_maxcall = None  # no call limit
ndim = 30          # for example
nlive = 1000

thread = Pool(ncpus=N_threads)

sampler = dynesty.DynamicNestedSampler(partial_func, prior_transform, ndim, pool=thread,
                                       queue_size=N_threads, sample=dynesty_samp, bound=ns_bound)

sampler.run_nested(print_progress=print_progress, dlogz_init=stop_crit, nlive_init=nlive,
                   maxiter=ns_maxiter, maxcall=ns_maxcall, use_stop=ns_use_stop,
                   wt_kwargs={'pfrac': ns_pfrac})
thread.close()
thread.join()
thread.clear()

And for over a week now, on three different machines, I have been hitting this:

148982it [34:36:44,  4.26s/it, batch: 0 | bound: 3412 | nc: 25 | ncall: 6977684 | eff(%):  2.135 | loglstar:   -inf < 31646.855 <    inf | logz: 31367.278 +/-  1.009 | dlogz:  0.011 >  0.010] Bus error

[trifonov@node NS_2_1_dyn_GP]$ Process ForkPoolWorker-19:
Traceback (most recent call last):
  File "/home/trifonov/.local/lib/python3.6/site-packages/multiprocess/pool.py", line 125, in worker
    put((job, i, result))
  File "/home/trifonov/.local/lib/python3.6/site-packages/multiprocess/queues.py", line 350, in put
    self._writer.send_bytes(obj)
  File "/home/trifonov/.local/lib/python3.6/site-packages/multiprocess/connection.py", line 203, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/trifonov/.local/lib/python3.6/site-packages/multiprocess/connection.py", line 401, in _send_bytes
    self._send(buf)
  File "/home/trifonov/.local/lib/python3.6/site-packages/multiprocess/connection.py", line 371, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/trifonov/.local/lib/python3.6/site-packages/multiprocess/process.py", line 258, in _bootstrap
    self.run()
  File "/home/trifonov/.local/lib/python3.6/site-packages/multiprocess/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/trifonov/.local/lib/python3.6/site-packages/multiprocess/pool.py", line 130, in worker
    put((job, i, (False, wrapped)))
  File "/home/trifonov/.local/lib/python3.6/site-packages/multiprocess/queues.py", line 350, in put
    self._writer.send_bytes(obj)
  File "/home/trifonov/.local/lib/python3.6/site-packages/multiprocess/connection.py", line 203, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/trifonov/.local/lib/python3.6/site-packages/multiprocess/connection.py", line 400, in _send_bytes
    self._send(header)
  File "/home/trifonov/.local/lib/python3.6/site-packages/multiprocess/connection.py", line 371, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

In the above example, over 34.5 hours of running on an 80-CPU machine are lost! Shorter runs usually complete with no problem.

Help is really needed and highly appreciated!

mmckerns commented 3 years ago

I haven't seen a BrokenPipeError yet, and there's nothing I see from the tracebacks that can help me. However, if I were you, I'd google for the error you are seeing: https://discourse.pymc.io/t/multiprocessing-windows-10-brokenpipeerror-errno-32-broken-pipe/2259. It seems that others have seen it in multiprocess and multiprocessing.

The code you posted produces the error... but takes 30+ hours to do so? Yikes, that's not conducive to testing. You don't see the behavior on any shorter runs?

3fon3fonov commented 3 years ago

Hi @mmckerns ! Thanks for the prompt reply!

After I posted, I found this pathos issue https://github.com/uqfoundation/pathos/issues/143 where you had suggested using pool._clear() and/or pool.restart() together with ProcessPool. That's what I did now:

sampler.run_nested(print_progress=print_progress, dlogz_init=stop_crit, nlive_init=nlive,
                   maxiter=ns_maxiter, maxcall=ns_maxcall, use_stop=ns_use_stop,
                   wt_kwargs={'pfrac': ns_pfrac})   # nlive_batch=1
thread.close()
thread.join()
thread._clear()
thread.restart()

I hope I did it right. I started another run and it seems to work; however, I am still waiting for the outcome (it has been running for 10 hours already). Since I narrowed the parameter priors a bit, I expect this run to be shorter.

Generally, my tool works perfectly with pathos (and dill), and shorter runs are not a problem. I suspect it is a memory issue, although probably not a lack of resources, since I am using really expensive servers with plenty of CPUs and RAM.

I will post when the run is done.

3fon3fonov commented 3 years ago

Ok, narrowing down the priors had some effect, and the run seemed to be converging after about 15 h of running....

[15:02:00, 9.19it/s, batch: 0 | bound: 2072 | nc: 25 | ncall: 4264414 | eff(%): 2.585 | loglstar: -inf < 31572.672 < inf | logz: 31365.917 +/- 0.853 | dlogz: 0.111 > 0.100]

However, it is stuck at the last step (I have been waiting for it to progress for over 2 h now)... It should print the results and save a dill session once dlogz drops below 0.100. I suspect it got there, but it cannot progress any further.

"Top" says that my python3 process had used 103 GB of RAM. Quite large, yes, but definitely not a problem for the hardware, I still have a lot of RAM left in this machine...

So it seems that pool._clear() and/or pool.restart() is not a real solution... What do you think?

mmckerns commented 3 years ago

pathos stores the pools in a singleton, which significantly reduces overhead for fast calculations. However, in certain cases like yours, it can lead to memory issues if clear is not used. You could try pathos.pools._ProcessPool, which doesn't use the caching... and has an interface identical to multiprocessing.Pool.
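
Roughly, the swap would look something like this (an untested sketch reusing the names from your snippet; _ProcessPool follows the multiprocessing.Pool API, so it takes processes= rather than ncpus= and has no clear()/restart()):

from pathos.pools import _ProcessPool as Pool  # multiprocessing.Pool interface, no singleton cache

thread = Pool(processes=N_threads)

sampler = dynesty.DynamicNestedSampler(partial_func, prior_transform, ndim, pool=thread,
                                       queue_size=N_threads, sample=dynesty_samp, bound=ns_bound)
sampler.run_nested(print_progress=print_progress, dlogz_init=stop_crit, nlive_init=nlive,
                   maxiter=ns_maxiter, maxcall=ns_maxcall, use_stop=ns_use_stop,
                   wt_kwargs={'pfrac': ns_pfrac})

thread.close()
thread.join()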

Side note, I remember seeing your tool before -- I built an optimization-based sampling technique that has been demonstrated to produce valid models with at least an order of magnitude speed-up (paper is under review now) -- but the base of the code is in this module: https://github.com/uqfoundation/mystic/blob/master/mystic/samplers.py. May be of interest to you.

mmckerns commented 3 years ago

What I'd also suggest is to monitor the memory, if possible, and see if that becomes an issue with your current code. And if _ProcessPool doesn't work, then we keep digging.
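
If it helps, here's a rough sketch of the kind of memory check I mean (untested; uses the standard library's resource module, which reports per-process peak RSS on Linux). You could call it at checkpoints in the driver script, or inside your likelihood function to see what the workers are doing:

import os
import resource

def log_peak_memory(tag=""):
    # ru_maxrss is reported in kilobytes on Linux
    peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.
    print("[pid %d] %s peak RSS: %.1f MB" % (os.getpid(), tag, peak_mb))

log_peak_memory("after run_nested")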

3fon3fonov commented 3 years ago

Thanks a lot, @mmckerns ! Your mystic package looks awesome! I will definitely try it out! I am already using scipy.optimize, which seems to already be included in mystic, so the transition should be smooth!

Yet, let's first see what to do with the reason for this issue.

These days I have been trying all possible _ProcessPool/ProcessPool combinations, and still no success. I do not see an error; the dynesty sampler just seems to get stuck at the end.... I personally think this could be related to dynesty, because sampling via emcee (MCMC) executes with no problem... I wonder what kind of tests I can do with dynesty and pathos to monitor the CPU/memory behavior.

It really sucks, because I really need dynesty-like nested sampling instead of the MCMC. "Lighter" runs are fine; the "heavy" runs, however, fail and I cannot understand why.

mmckerns commented 3 years ago

What do you mean, it gets "stuck" at the end? Like the execution "hangs"...? That can happen in parallel if a queue is expected to be pulled from and that isn't happening... and the execution is in blocking mode. Hard to tell without seeing it.

Might I suggest you also try the ThreadPool, if you are having ProcessPool issues... that would at least let you know where you stand with running in parallel.
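
The swap is roughly a one-liner (untested, reusing the names from your snippet), since the pathos pools share the same map/pipe interface:

from pathos.pools import ThreadPool as Pool  # drop-in for ProcessPool

thread = Pool(nodes=N_threads)  # nodes= is the generic worker-count keyword
# ... the rest of the dynesty setup stays the same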

Also, you might want to include a monitor in the scipy optimization, like this: https://github.com/uqfoundation/mystic/blob/master/examples2/constrained_scipy.py
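
Roughly, the pattern in that example looks like this (an untested sketch; the cost function here is just a stand-in for your own objective):

from mystic.monitors import VerboseMonitor
from mystic.solvers import fmin  # mystic's Nelder-Mead, with a scipy.optimize-like interface

# stand-in objective; replace with your own cost function
def cost(x):
    return (x[0] - 1.0)**2 + (x[1] + 2.0)**2

mon = VerboseMonitor(10)  # print progress every 10 iterations
result = fmin(cost, [0., 0.], itermon=mon)
# mon.x and mon.y keep the visited parameters and costs for later inspection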

3fon3fonov commented 3 years ago

Sorry for not reporting my progress; I ended up busy with other stuff. In the end, I was able to get my run done, but only under special settings of the dynesty sampler, which have nothing to do with pathos. I think my problems are related to dynesty, but I am still debugging. I also opened an issue on the dynesty GitHub. Josh's answer could be informative; you may like to check it:

https://github.com/joshspeagle/dynesty/issues/211

mmckerns commented 3 years ago

Looks like this can be closed. Please reopen if needed.