ipython / ipyparallel

IPython Parallel: Interactive Parallel Computing in Python
https://ipyparallel.readthedocs.io/

Timeout of parallel magics #435

Open istvan-fodor opened 3 years ago

istvan-fodor commented 3 years ago

I would like to be able to set a timeout for the %px magics when running in blocking mode. Currently a cell can run indefinitely, which is problematic in our use case: we rely on MPI and barriers, and if one worker fails, the barrier holds up all the other workers, making the cell “freeze” indefinitely. The only option is to notice this visually, infer what happened from other sources (log files, etc.), and kill the ipyparallel engines.

minrk commented 3 years ago

Good idea. Unfortunately, ipyparallel still doesn't have the ability to send interrupts in general, though you can often make it work depending on the environment using things like:

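import os
import socket

import ipyparallel as ipp

rc = ipp.Client()  # connect to the running cluster
e_all = rc[:]  # DirectView on all engines
# dicts mapping engine id -> hostname / pid
hosts = e_all.apply_async(socket.gethostname).get_dict()
pids = e_all.apply_async(os.getpid).get_dict()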

It is then up to you to send signals to those processes when needed.
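
As a minimal sketch of that last step, assuming the engines run on the same POSIX host as the code sending the signal (engines on other hosts would need ssh or a similar mechanism), the collected pids can be passed to `os.kill`. The helper name `interrupt_local_pids` is hypothetical, and a sleeping child process stands in for a hung engine so the example runs without a cluster:

```python
import os
import signal
import subprocess
import sys


def interrupt_local_pids(pids, sig=signal.SIGINT):
    """Send `sig` to each local pid; return the pids that could not be signaled."""
    failed = []
    for pid in pids:
        try:
            os.kill(pid, sig)
        except (ProcessLookupError, PermissionError):
            failed.append(pid)
    return failed


# Stand-in for a hung engine (an assumption, not a real ipyparallel engine):
# it installs Python's default SIGINT handler, reports readiness, then blocks.
child_code = (
    "import signal, time\n"
    "signal.signal(signal.SIGINT, signal.default_int_handler)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)
child = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
    text=True,
)
child.stdout.readline()  # wait until the child is ready to be interrupted

failed = interrupt_local_pids([child.pid])
returncode = child.wait(timeout=10)  # the interrupted child exits with a nonzero status
child.stdout.close()
```

With real engines, `pids.values()` from the snippet above would be the argument, filtered by `hosts` to the engines on the local machine.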

An approach that ought to work (on posix) would be to use signal.alarm to interrupt executions if they take too long:

import signal
import time

def interrupt_alarm(signum, frame):
    """Signal handler: raise KeyboardInterrupt on SIGALRM"""
    print("got alarm!")
    raise KeyboardInterrupt()

previous_handle = signal.signal(signal.SIGALRM, interrupt_alarm)
timeout = 2
signal.alarm(timeout)

# here is where your real task goes. If it takes longer than timeout, it will be interrupted.
# this assumes it is interruptible.
time.sleep(timeout + 1)

# got here, we finished. Make sure to clear the alarm
signal.alarm(0)
# may want to clear the alarm handler, but it's also okay to leave it raising interrupts instead of killing the process
signal.signal(signal.SIGALRM, previous_handle)
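
The same pattern can be wrapped in a context manager so the alarm and the previous handler are always restored, even when the task raises. This is only a sketch: the name `time_limit` is hypothetical, it is POSIX-only, and like the code above it must run in the main thread and assumes the task is interruptible.

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def time_limit(seconds):
    """Raise KeyboardInterrupt in the body if it runs longer than `seconds` (an int)."""

    def _on_alarm(signum, frame):
        raise KeyboardInterrupt(f"timed out after {seconds}s")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


interrupted = False
start = time.monotonic()
try:
    with time_limit(1):
        time.sleep(5)  # stands in for the real task
except KeyboardInterrupt:
    interrupted = True
elapsed = time.monotonic() - start
```

On an engine, the body of the `with` block would be the work sent via `%%px`.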

minrk commented 3 years ago

With the signal/restart/streaming features we have now in 8.0, I think there's a simple missing feature: add a client-side timeout to the parallel magics. The situation is much improved, though:

  1. %px streams output and errors immediately as they happen, so if one engine actually raised or produced useful error output, it will show up immediately to give you the hint that it might not finish
  2. There are now APIs for sending signals and restarting engines, so you can get your cluster back

So the only missing feature is really an optional %%px --timeout to automatically stop the client from waiting. Because of streaming, though, it would no longer give more or earlier feedback about the failure; it would only halt the cell.
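
To illustrate what a client-side timeout does and does not do, here is a small stand-in that uses the standard library's `concurrent.futures` instead of ipyparallel, so it runs without a cluster. The waiting side gives up after the timeout, but the stuck task itself keeps running until something else releases it. The names `stuck_task` and `release` are hypothetical. ipyparallel's `AsyncResult.get` accepts a `timeout` argument in a similar way.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

release = threading.Event()  # stands in for the barrier the hung task waits on


def stuck_task():
    release.wait()  # blocks until released, like a worker stuck at a barrier
    return "finished"


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(stuck_task)
    try:
        future.result(timeout=0.5)  # the "client" stops waiting after 0.5s
        timed_out = False
    except FutureTimeout:
        timed_out = True

    # The timeout only ended the wait; the task is still blocked.
    still_running = not future.done()

    # Recovery is a separate step (signals or restarts in the ipyparallel case).
    release.set()
    final = future.result(timeout=5)
```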