ipython / ipyparallel

IPython Parallel: Interactive Parallel Computing in Python
https://ipyparallel.readthedocs.io/

Hang after map_sync is called in loop for a number of iterations #373

Open wentaol opened 5 years ago

wentaol commented 5 years ago

The following script hangs after roughly 180 iterations on my machine. The number of iterations it gets through seems to depend on the elapsed time (hence the sleep).

import numpy as np
import ipyparallel as ipp

c = ipp.Client()
dv = c[:]
dv.execute("import time")
dv.execute("import numpy as np")

v = c.load_balanced_view()

def fun(state):
    # Runs on the engines; time and np were imported there via dv.execute above.
    time.sleep(0.5)
    return np.random.normal()

# Generation loop
for it in range(1000):
    inputs = range(12)
    #v = c.load_balanced_view()
    outputs = np.array(v.map_sync(fun, inputs))
    print "Iteration", it

Windows 10, Anaconda installation, ipyparallel 6.2.4 (py27_0 build).

The cluster is started via ipcluster start -n 4 in PowerShell. The same problem is observed with both Python 2.7 and Python 3.7, and with both load-balanced and direct views. There are no error messages, and the PowerShell window running ipcluster becomes unresponsive.

zzpwahaha commented 4 years ago

I have seen exactly the same behavior. Does anyone have an idea?

jkochNU commented 4 years ago

I can also confirm the same unfortunate failure of ipyparallel, using the exact code posted above. I am uncertain how to troubleshoot this. Help?

Confirmed on Windows 10, ipyparallel 6.2.4, Python 3.7.6: the first run freezes on iteration 30, the second run on iteration 174.

yhz0 commented 4 years ago

I am experiencing similar hangs, without error messages, when I call map_sync repeatedly. I'm using Windows 10, Python 3.7.7, ipyparallel 6.2.4.

minrk commented 3 years ago

Hangs are really hard to debug. It is suspicious to me that you have all reported the issue on Windows, which makes me think there is some kind of resource exhaustion or hang that occurs only on Windows and that I can never reproduce.

The best way to figure this out is to enable debug logging on all resources:

ipcluster start --debug

and set client.debug = True before starting. This will produce an enormous amount of output for over 100 iterations, but I don't know how else to debug without being able to reproduce it.
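For reference, a minimal sketch of the client side of that setup (reusing the names from the original script; this only illustrates where to set the flag):

import ipyparallel as ipp

c = ipp.Client()
c.debug = True  # enable verbose client-side debug output before submitting any work

v = c.load_balanced_view()
# ... then run the generation loop from the report above, with the cluster
# started via: ipcluster start --debug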

If you do encounter this hang and can interrupt it in an interactive session (e.g. by running in IPython or a debugger), can you share:

client.queue_status()
client.outstanding
client.history[-24:]
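
As an illustrative sketch (reusing c, v, and fun from the original script), the loop can be wrapped so that interrupting the hang dumps exactly those values:

import numpy as np

try:
    for it in range(1000):
        outputs = np.array(v.map_sync(fun, range(12)))
        print("Iteration", it)
except KeyboardInterrupt:
    # On Ctrl-C during the hang, show what the client still considers
    # pending and how the Hub sees its queues.
    print(c.queue_status())
    print(c.outstanding)
    print(c.history[-24:])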

Has anyone seen this not on Windows?

There is a chance this is #294, in which case it should be fixed by #464. I'm not sure, though.

seekjim20 commented 3 years ago

I have exactly the same problem on Linux. CentOS 7.4.1708, Python 3.7.3, ipyparallel 6.3.0

c.queue_status():
{'unassigned': 0,
 0: {'queue': 0, 'completed': 398, 'tasks': 0},
 1: {'queue': 0, 'completed': 380, 'tasks': 0},
 2: {'queue': 0, 'completed': 380, 'tasks': 0},
 3: {'queue': 0, 'completed': 379, 'tasks': 1},
 4: {'queue': 0, 'completed': 380, 'tasks': 0},
 5: {'queue': 0, 'completed': 380, 'tasks': 0},
 6: {'queue': 0, 'completed': 380, 'tasks': 0},
 7: {'queue': 0, 'completed': 380, 'tasks': 0},
 8: {'queue': 0, 'completed': 380, 'tasks': 0},
 9: {'queue': 0, 'completed': 380, 'tasks': 0},
 10: {'queue': 0, 'completed': 380, 'tasks': 0},
 11: {'queue': 0, 'completed': 380, 'tasks': 0}}

c.outstanding:
{'36a29353-199402148ca89fb53ab4ee6c_627'}

c.history[-24:]:
['36a29353-199402148ca89fb53ab4ee6c_613',
 '36a29353-199402148ca89fb53ab4ee6c_614',
 '36a29353-199402148ca89fb53ab4ee6c_615',
 '36a29353-199402148ca89fb53ab4ee6c_616',
 '36a29353-199402148ca89fb53ab4ee6c_617',
 '36a29353-199402148ca89fb53ab4ee6c_618',
 '36a29353-199402148ca89fb53ab4ee6c_619',
 '36a29353-199402148ca89fb53ab4ee6c_620',
 '36a29353-199402148ca89fb53ab4ee6c_621',
 '36a29353-199402148ca89fb53ab4ee6c_622',
 '36a29353-199402148ca89fb53ab4ee6c_623',
 '36a29353-199402148ca89fb53ab4ee6c_624',
 '36a29353-199402148ca89fb53ab4ee6c_625',
 '36a29353-199402148ca89fb53ab4ee6c_626',
 '36a29353-199402148ca89fb53ab4ee6c_627',
 '36a29353-199402148ca89fb53ab4ee6c_628',
 '36a29353-199402148ca89fb53ab4ee6c_629',
 '36a29353-199402148ca89fb53ab4ee6c_630',
 '36a29353-199402148ca89fb53ab4ee6c_631',
 '36a29353-199402148ca89fb53ab4ee6c_632',
 '36a29353-199402148ca89fb53ab4ee6c_633',
 '36a29353-199402148ca89fb53ab4ee6c_634',
 '36a29353-199402148ca89fb53ab4ee6c_635',
 '36a29353-199402148ca89fb53ab4ee6c_636']

minrk commented 3 years ago

Thanks for that sample! That suggests that it is not fixed by #464, because that was purely a client-side race.

If this is reliably reproducible for you, can you share the controller's log output as well? Can you also test with the latest 7.0.0a5 in case it happens to be fixed already, even if not by #464?

A complete reproducible example is always hugely helpful, but I realize that's often not feasible for bugs like this one.

minrk commented 3 years ago

I've run this sample locally a few times (macOS 11.5.2, Python 3.9.6, ipyparallel 7.0.0b3), and it completes 1000 iterations without any errors. So I'm going to hope that some of the big refactors in 7.0 have fixed this, possibly along with changes in ipykernel 6.

Based on @seekjim20's debug output, the issue is a failure to return one task reply. Since both the client and the Hub agree that the task is not done, it suggests that the message was not delivered to (or not handled properly in) the task scheduler. Checking for the missing msg id (36a29353-199402148ca89fb53ab4ee6c_627) in the task scheduler's debug logs may point to the next step. Or it could have been an error on the engine itself, failing to send the message.
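
As a rough sketch of that cross-check from the client side (this assumes the Hub is running with a real task database rather than NoDB, so db_query is available; the exact contents of the returned records are not guaranteed):

# The stuck task is whatever the client still considers outstanding.
missing = list(c.outstanding)
print("missing:", missing)

# Ask the Hub for its record of each missing task; if the engine actually
# sent a reply, the Hub's record should reflect that, even though the
# client never received the result through the scheduler.
for rec in c.db_query({'msg_id': {'$in': missing}}):
    print(rec)

Searching the task scheduler's --debug log for the same msg id then shows whether the reply ever reached the scheduler.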