nteract / papermill

📚 Parameterize, execute, and analyze notebooks
http://papermill.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

`RuntimeError: Kernel didn't respond in 60 seconds`, when trying to run papermill with python multiprocessing #239

Closed franzoni315 closed 5 years ago

franzoni315 commented 6 years ago

Hello, I am trying to run multiple parameterized notebooks in parallel. Currently I am using papermill inside Jupyter Notebook, and if I use a multiprocessing pool to map a list of parameters and pass them to pm.execute_notebook, I get RuntimeError: Kernel didn't respond in 60 seconds. I am running everything with Python 2.7.

This is the code I use:

import papermill as pm
import multiprocessing as mp

def run_nb(data):
    d1, d2 = data
    # in_nb and out_nb are the input/output notebook paths, defined elsewhere
    pm.execute_notebook(in_nb, out_nb, parameters=dict(d1=d1, d2=d2))

# data1 and data2 are the parameter lists, defined elsewhere
pool = mp.Pool(4)
pool.map(run_nb, zip(data1, data2))
pool.close()
pool.join()

It works correctly using the standard python map.

Btw, is there a known way to produce multiple notebooks in parallel with papermill?

Thanks!

MSeal commented 6 years ago

Hi @franzoni315

So it looks like there are race conditions in the ipython kernel when launching parallel processes. Using a thread pool instead gets it to run more often without hanging, but any higher parallelism still hits the race conditions. I ran it a few times under different conditions and eventually got:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/mseal/.py2local/lib/python2.7/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/mseal/.py2local/local/lib/python2.7/site-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-121>", line 2, in initialize
  File "/home/mseal/.py2local/local/lib/python2.7/site-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/home/mseal/.py2local/local/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 467, in initialize
    self.init_sockets()
  File "/home/mseal/.py2local/local/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 239, in init_sockets
    self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
  File "/home/mseal/.py2local/local/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 181, in _bind_socket
    s.bind("tcp://%s:%i" % (self.ip, port))
  File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
    raise ZMQError(errno)
ZMQError: Address already in use

And

Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/home/mseal/.py2local/local/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 3231, in atexit_operations
    self.history_manager.end_session()
  File "/home/mseal/.py2local/local/lib/python2.7/site-packages/IPython/core/history.py", line 580, in end_session
    self.writeout_cache()
  File "<decorator-gen-23>", line 2, in writeout_cache
  File "/home/mseal/.py2local/local/lib/python2.7/site-packages/IPython/core/history.py", line 60, in needs_sqlite
    return f(self, *a, **kw)
  File "/home/mseal/.py2local/local/lib/python2.7/site-packages/IPython/core/history.py", line 786, in writeout_cache
    self._writeout_input_cache(conn)
  File "/home/mseal/.py2local/local/lib/python2.7/site-packages/IPython/core/history.py", line 770, in _writeout_input_cache
    (self.session_number,)+line)
DatabaseError: database disk image is malformed

I also noticed the race conditions on exit occur every run and cause session saves to fail (not a big deal, but it points to overlapping reuse of session_number).

I can also reproduce this failure with a simple bash for loop over papermill. I'll open up a ticket on the ipython project to figure out what the root cause is and see if there's a change in papermill that would fix this.

atronchi commented 5 years ago

FWIW, this problem also affects papermill on jupyter in python 3.

dsosa17 commented 5 years ago

Wondering: what is the status of this issue? I can confirm that the problem is still present in Python 3 when using Papermill. Will multiprocessing become doable with Papermill?

MSeal commented 5 years ago

It will become doable -- we need releases of multiple upstream libraries, and there is still one pending PR testing an edge case we haven't fixed for one of those releases. Give the community a couple more weeks here; there are a lot of moving parts and this has been unsupported in the upstream projects for a long time. Keep an eye out for the nbconvert 5.5.1 release announcement on Discourse and on the jupyter mailing list; that will be the last release needed to get this resolved.

LukaPitamic commented 5 years ago

Hi guys, I'm having the same problem with a somewhat unpredictable RuntimeError: Kernel didn't respond in 60 seconds.

I updated nbconvert from GitHub according to Disable IPython History in executing preprocessor #1017, but I still bump into the same problem.

I also tried the solution from Papermill on HPC/Dask #364, but in that case some of the packages I use (ta-lib, for example) raise errors when running.

MSeal commented 5 years ago

Are you running papermill in a concurrent setting (this isn't supported in upstream libraries yet)? If not, what kernel are you launching? Maybe it's slow to start up.

LukaPitamic commented 5 years ago

@MSeal kernel: python 3.6, on Linux. I don't even know what a 'concurrent setting' is, sorry. Is there a simple way to check? (Even a link for me to dive into would be more than appreciated.)

Hm, and maybe you're right about it being too slow. Can I increase this 60s limit somewhere?

MSeal commented 5 years ago

By python 3.6 I am assuming you mean your kernel is an ipython kernel in python 3.6? Your kernel and papermill processes don't necessarily share a python version.

--start_timeout <num_seconds_to_wait>
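
If you are calling papermill from Python instead of the CLI, the same knob should be available as the start_timeout argument of execute_notebook (a minimal sketch; the notebook paths are placeholders):

import papermill as pm

# Allow the kernel up to five minutes to respond before the RuntimeError is raised.
# "input.ipynb" and "output.ipynb" are placeholder paths.
pm.execute_notebook("input.ipynb", "output.ipynb", start_timeout=300)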

MSeal commented 5 years ago

Checking for a concurrent setting means: are you launching papermill from inside a thread or multiprocessing setup?

LukaPitamic commented 5 years ago

@MSeal thank you for the clarification. I use an ipython kernel then. The conda environment was created specifying python=3.6, and I use JupyterLab (if that provides any clarity).

Papermill is run by schedule, and the following async code:

import asyncio
import schedule

# Stop is a flag defined elsewhere; the papermill runs are registered as schedule jobs.
async def myCoroutine():
    while not Stop:
        schedule.run_pending()
        await asyncio.sleep(1)

asyncio.run_coroutine_threadsafe(myCoroutine(), asyncio.get_event_loop())

I might have understood it totally wrong, but from my understanding this is a concurrent setting, right? And since you're saying "this isn't supported in upstream libraries yet", if I'm understanding it correctly there is nothing I can do for the time being. Am I correct? I apologize in advance for my lack of coding knowledge/experience.

MSeal commented 5 years ago

Yes. Having notebooks launched from a coroutine means there can be concurrent executions, so it's the same issue described above. The good news is that all of the known issues are PR'd or merged pending a release to fix this problem, so it should be fixed in the next few weeks.

LukaPitamic commented 5 years ago

@MSeal thank you very much for clarification. You've been very helpful.

MSeal commented 5 years ago

This base issue should now be resolved with the nbconvert 5.6.0 release!
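
For anyone landing here with the original question of running multiple notebooks in parallel, a minimal sketch using a process pool, assuming the upgraded nbconvert / jupyter_client stack (the notebook paths and parameter sets are illustrative):

from concurrent.futures import ProcessPoolExecutor

import papermill as pm

# Illustrative parameter sets; each run writes its own output notebook.
param_sets = [{"d1": 1, "d2": 2}, {"d1": 3, "d2": 4}]

def run_nb(job):
    idx, params = job
    # "input.ipynb" is a placeholder template notebook.
    pm.execute_notebook("input.ipynb", "output_%d.ipynb" % idx, parameters=params)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        list(pool.map(run_nb, enumerate(param_sets)))

Giving each run its own output path avoids the workers clobbering each other's results.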

sabjoslo commented 4 years ago

Hi, I might just be overlooking something, but I think I'm still experiencing this issue even after upgrading nbconvert. It seems to be an upstream issue with nbconvert, because I get the same issues when calling the execute API directly. Let me know if I should migrate this question to that repository.

To replicate:

import multiprocessing as mp
import nbconvert
assert "5.6." in nbconvert.__version__
from nbconvert.preprocessors import ExecutePreprocessor
import nbformat
import os
import papermill as pm

def run_pm(fn):
    pm.execute_notebook(fn, fn, request_save_on_cell_execute = False)

def run(fn):
    with open(fn) as f:
        nb = nbformat.read(f, as_version = 4)        
    ep = ExecutePreprocessor(timeout = None, kernel_name = "python3")
    ep.startup_timeout = 300    
    ep.preprocess(nb, {"metadata": {"path": os.getcwd() + "/"}})    
    with open(fn, "w", encoding = "utf-8") as f:
        nbformat.write(nb, f)

fn = "test.ipynb"

test.ipynb has a single cell that prints the word "testing". The following works fine:

run_pm(fn)
run(fn)

But the following two code snippets each break (run one after the other, not concurrently):

pool = mp.Pool(1)
pool.map(run_pm, [fn])
pool.close()
pool.join()

pool = mp.Pool(1)
pool.map(run, [fn])
pool.close()
pool.join()

with the error RuntimeError: Kernel didn't respond in 60 seconds in the first case and RuntimeError: Kernel didn't respond in 300 seconds in the second.

I'm using Python 3.7. I've been able to replicate this with both nbconvert 5.6.0 and 5.6.1.

Thanks!

MSeal commented 4 years ago

What are your versions of ipython, jupyter_client, and jupyter_core in your environment? And how are you running the two Pool snippets at the same time? If you have threads above the mp calls it will break at a C level because of https://www.linuxprogrammingblog.com/threads-and-fork-think-twice-before-using-them.
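
A quick way to check from the environment papermill runs in (a minimal sketch):

# Print the versions of the relevant packages.
import IPython
import jupyter_client
import jupyter_core

print("ipython:", IPython.__version__)
print("jupyter_client:", jupyter_client.__version__)
print("jupyter_core:", jupyter_core.__version__)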

sabjoslo commented 4 years ago

I ran the two Pool snippets in serial, not concurrently. Sorry for the confusion; I've updated my comment to (I hope) reflect that.

My Python 3 package manager lists ipython 7.3.0, jupyter-client 5.2.4 and jupyter-core 4.4.0.

MSeal commented 4 years ago

Sorry for the very late reply, catching up from the holidays.

I believe the issue is that you have old jupyter_client / jupyter_core versions.

Upgrade those to 5.3.4 and 4.6.1 respectively and the error should go away.

sabjoslo commented 4 years ago

Resolved, thank you!