jupyterlite / jupyterlite


Pressing "Restart and run all cells" randomly causes silent pyodide kernel crash #1464

Open ogrisel opened 3 months ago

ogrisel commented 3 months ago

Description

Pressing "Restart and run all cells" randomly causes silent pyodide kernel crashes, as seen in the following screen recording:

jupyterlite_pyodide_crash.webm

Reproduce

  1. Go to https://jupyter.org/try-jupyter/lab/ in Chrome or Firefox.
  2. Create a new notebook with a single cell that imports a package bundled with Pyodide, such as six.
  3. Execute the cell. In general, there is no problem at this point.
  4. Press the "Restart and run all cells" button a few times.
  5. At some point, the pyodide kernel crashes
  6. The browser dev console displays a message such as:
Trying to send message on removed socket for kernel a7355f7a-283a-4cdb-a37c-ee379fb1bfc7

Expected behavior

Context

Browser Output
Kernel: restarting (a7355f7a-283a-4cdb-a37c-ee379fb1bfc7) [default.js:1370:24](webpack://_JUPYTERLAB.CORE_OUTPUT/node_modules/@jupyterlab/services/lib/kernel/default.js)
Pyodide contents will be synced with Jupyter Contents [index.js:60:28](webpack://jupyterlite/pyodide-kernel-extension/lib/index.js)
Connection lost, reconnecting in 0 seconds. [default.js:1325:20](webpack://_JUPYTERLAB.CORE_OUTPUT/node_modules/@jupyterlab/services/lib/kernel/default.js)
Starting WebSocket: wss://jupyter.org/try-jupyter/api/kernels/a7355f7a-283a-4cdb-a37c-ee379fb1bfc7 [default.js:69:20](webpack://_JUPYTERLAB.CORE_OUTPUT/node_modules/@jupyterlab/services/lib/kernel/default.js)
TypeError: this is undefined
    isReady notebooklspadapter.js:187
    y utils.js:22
    y utils.js:20
    onKernelChanged notebooklspadapter.js:149
    c index.es6.js:555
    emit index.es6.js:513
    emit index.es6.js:112
    restartKernel sessioncontext.js:366
    restart sessioncontext.js:882
    execute index.js:1801
    execute index.es6.js:365
    onClick toolbar.js:1043
    o toolbar.js:667
    React 11
[notebooklspadapter.js:163:20](webpack://_JUPYTERLAB.CORE_OUTPUT/node_modules/@jupyterlab/notebook/lib/notebooklspadapter.js)
Loading micropip, packaging [pyodide.asm.js:10:93500](https://cdn.jsdelivr.net/pyodide/v0.26.2/full/pyodide.asm.js)
Loaded micropip, packaging [pyodide.asm.js:10:93796](https://cdn.jsdelivr.net/pyodide/v0.26.2/full/pyodide.asm.js)
History was unable to be retrieved [history.js:260:20](webpack://_JUPYTERLAB.CORE_OUTPUT/node_modules/@jupyterlab/notebook/lib/history.js)
Loading openssl, ssl [pyodide.asm.js:10:93500](https://cdn.jsdelivr.net/pyodide/v0.26.2/full/pyodide.asm.js)
Loaded openssl, ssl [pyodide.asm.js:10:93796](https://cdn.jsdelivr.net/pyodide/v0.26.2/full/pyodide.asm.js)
Loading sqlite3 [pyodide.asm.js:10:93500](https://cdn.jsdelivr.net/pyodide/v0.26.2/full/pyodide.asm.js)
Loaded sqlite3 [pyodide.asm.js:10:93796](https://cdn.jsdelivr.net/pyodide/v0.26.2/full/pyodide.asm.js)
Loading traitlets [pyodide.asm.js:10:93500](https://cdn.jsdelivr.net/pyodide/v0.26.2/full/pyodide.asm.js)
Loaded traitlets [pyodide.asm.js:10:93796](https://cdn.jsdelivr.net/pyodide/v0.26.2/full/pyodide.asm.js)
traitlets already loaded from default channel [pyodide.asm.js:10:93166](https://cdn.jsdelivr.net/pyodide/v0.26.2/full/pyodide.asm.js)
sqlite3 already loaded from default channel [pyodide.asm.js:10:93166](https://cdn.jsdelivr.net/pyodide/v0.26.2/full/pyodide.asm.js)
Loading Pygments, asttokens, decorator, executing, ipython, matplotlib-inline, prompt_toolkit, pure_eval, six, stack_data, wcwidth [pyodide.asm.js:10:93500](https://cdn.jsdelivr.net/pyodide/v0.26.2/full/pyodide.asm.js)
Loaded Pygments, asttokens, decorator, executing, ipython, matplotlib-inline, prompt_toolkit, pure_eval, six, stack_data, wcwidth [pyodide.asm.js:10:93796](https://cdn.jsdelivr.net/pyodide/v0.26.2/full/pyodide.asm.js)
Failed to fetch ipywidgets through the "jupyter.widget.control" comm channel, fallback to fetching individual model state. Reason: Control comm did not respond in time [327.68dbf8491690b3aff1e7.js:1:5783](https://jupyter.org/try-jupyter/extensions/@jupyter-widgets/jupyterlab-manager/static/327.68dbf8491690b3aff1e7.js?v=68dbf8491690b3aff1e7)
Trying to send message on removed socket for kernel a7355f7a-283a-4cdb-a37c-ee379fb1bfc7 3 [kernels.js:108:24](webpack://_JUPYTERLAB.CORE_OUTPUT/packages/kernel/lib/kernels.js)

EDIT: making the restart button work is important, especially since it's not possible to interrupt a long-running cell in JupyterLite for now (see #459).

ogrisel commented 3 months ago

I think I also triggered the same problem by chaining a manual restart (e.g. using the 0-0 keyboard shortcut) with a cell execution (for instance via the shift-enter keyboard shortcut) of the same kind of import statement (import sklearn, import six, or any other package bundled with Pyodide).

It seems that this is caused by a race condition: if you wait long enough between a manual restart and the first cell execution, then there is no problem (I think).
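
To illustrate the kind of race suspected here, below is a toy asyncio model, not JupyterLite code. `ToyKernel`, `restart_and_run`, and the 0.05 s start-up delay are all hypothetical names and values chosen for the sketch. It only shows the general pattern: an execute request sent immediately after a restart fails, while one sent after waiting for the new kernel to report readiness succeeds.

```python
import asyncio


class ToyKernel:
    """Hypothetical stand-in for a browser-side kernel (assumption, not real API)."""

    def __init__(self, startup_delay: float) -> None:
        self.ready = asyncio.Event()
        self._startup_delay = startup_delay

    async def start(self) -> None:
        # Simulates slow start-up, e.g. loading the runtime and its packages.
        await asyncio.sleep(self._startup_delay)
        self.ready.set()

    def send(self, code: str) -> str:
        # A request that arrives before start-up has finished is rejected.
        if not self.ready.is_set():
            raise RuntimeError("kernel not ready")
        return f"executed: {code}"


async def restart_and_run(wait_for_ready: bool) -> str:
    # "Restart": begin starting a fresh kernel in the background.
    kernel = ToyKernel(startup_delay=0.05)
    start_task = asyncio.create_task(kernel.start())
    if wait_for_ready:
        # The safe path waits until the kernel signals readiness.
        await kernel.ready.wait()
    try:
        return kernel.send("import six")
    except RuntimeError as exc:
        return f"failed: {exc}"
    finally:
        # Let the background start-up finish before returning.
        await start_task


async def main() -> tuple:
    racy = await restart_and_run(wait_for_ready=False)
    safe = await restart_and_run(wait_for_ready=True)
    return racy, safe


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In this model the only difference between the two paths is whether the caller waits on the readiness event, which matches the observation that waiting long enough after a restart avoids the problem. Whether the real bug has this exact shape is not established in this thread.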

jtpio commented 2 months ago

Thanks @ogrisel for opening the issue :+1:

> It seems that this is caused by a race condition: if you wait long enough after a manual restart and the first cell execution, then there is no problem (I think).

Right, it looks like a race condition that will need to be fixed at some point.

m-stclair commented 2 months ago

I'm seeing a similar issue, and not finding it random -- it's reproducible very consistently across browsers and deployments, including, as @ogrisel noted, your demo notebook.

What I'm noticing is not a kernel crash though. It seems like the kernel successfully runs, but without correctly talking to the UI -- the notebook runs successfully but (mostly) silently. So, for instance, if you have a block of code like:

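import time

i = 1
# Loop forever, creating one new file per second. Each new file is a
# visible sign that the kernel is still executing.
while True:
    with open(f"{i}.txt", "w") as stream:
        stream.write('hi')
    time.sleep(1)
    i += 1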

then 'restart and run all' will cause the kernel to write files forever without opportunity for interaction from the user.

Interestingly, if the user provides input while the kernel is initializing, JupyterLite will sometimes start a new kernel, so the user will have a functioning notebook environment running on a different kernel, but the kernel started by the "restart and run all" operation will continue silently writing files.

A similar behavior, though not as severe, and not as consistent, happens with "restart kernel" if executed while the kernel is performing a blocking operation -- the kernel will sometimes crash and restart again if it receives input from the user while it's initializing.

Similar but not identical behaviors occur with the xeus kernel -- it never runs silently after "restart and run all", but rather consistently crashes on its first restart attempt, then successfully restarts a second time -- unless it receives input from the user while initializing, in which case it will crash and attempt another restart, and so on, and so on.

Is anything more known about a cause, and are there any known workarounds? Also, I have no expertise with your codebase, but I would be happy to help investigate or test if there's anything in particular you'd like to point me towards.

ogrisel commented 3 weeks ago

It appears that I can sometimes trigger this problem even with a regular "Restart" when trying to interrupt a long-running execution.

Then I am in a bad state where nothing works anymore: creating a new empty notebook, inserting a cell with a print statement at the beginning of the notebook, and executing that cell (directly or after a restart) never completes and never outputs anything.

image

The past (restarted) kernels still show up with "No session connected" in the side panel. I can shut them all down, but that does not fix the problem.

ogrisel commented 3 weeks ago

I tried to delete all the cache / local storage / cookies for that page in the Firefox dev tools and reload the page, but executing the cell with the print statement is still stuck.

The only way to recover is to close the browser tab and open a new one.

ogrisel commented 2 weeks ago

I confirm I can semi-randomly reproduce the behaviors described in https://github.com/jupyterlite/jupyterlite/issues/1464#issuecomment-2378194811 using this notebook uploaded to https://jupyter.org/try-jupyter/lab/:

https://gist.github.com/ogrisel/50f2a29b14b9ebea503bab8a42ddbb9a

Apparently, it's important to use a heavy enough import statement (such as pandas) in the second cell to reproduce the problem by clicking "restart and run all" a few times.

The log file that the notebook writes to the browser's local storage shows that kernels often continue executing even after they become detached from any session.
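
For readers who want to observe this themselves, here is a minimal sketch of a heartbeat cell. It is not the code from the linked gist: `KERNEL_ID`, `heartbeat`, `kernels_seen`, and the file name `heartbeat.log` are hypothetical, and the loop is bounded so that the cell terminates. Each kernel tags its log lines with its own random id, so lines from more than one id in the same file suggest that an older kernel kept running.

```python
import time
import uuid
from pathlib import Path

# A fresh random id for every kernel start (assumption: module-level code
# runs again each time the kernel is restarted).
KERNEL_ID = uuid.uuid4().hex[:8]


def heartbeat(log_path: Path, beats: int = 3, interval: float = 0.1) -> None:
    """Append one timestamped line per beat, tagged with this kernel's id."""
    for i in range(beats):
        with log_path.open("a") as stream:
            stream.write(f"{time.time():.3f} kernel={KERNEL_ID} beat={i}\n")
        time.sleep(interval)


def kernels_seen(log_path: Path) -> set:
    """Return the distinct kernel ids that have written to the log."""
    ids = set()
    for line in log_path.read_text().splitlines():
        for field in line.split():
            if field.startswith("kernel="):
                ids.add(field.split("=", 1)[1])
    return ids


if __name__ == "__main__":
    log = Path("heartbeat.log")  # hypothetical file name
    heartbeat(log)
    print(sorted(kernels_seen(log)))
```

Comparing the timestamps of the most recent lines per kernel id would show which kernels are still writing after a restart.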

I reproduce similar problems both with Chrome and Firefox.

Note that I could even randomly crash the full Chrome tab using a more complex variant of this notebook that would further import sklearn and use pandas to read a CSV file of a few MB from the local storage, but I chose to just link to the simpler variant of the notebook to reproduce the first race condition.

EDIT: I found more minimal reproducers for Firefox and Chrome in:

ogrisel commented 1 week ago

I tried to see if I could reproduce the "No sessions connected" state by clicking the "restart and run all cells" or "restart" buttons in a regular JupyterLab setup, and I can never trigger it.

So this is really a problem with pyodide kernels started in web workers by JupyterLite. JupyterLite needs to make sure that those workers are properly shut down; otherwise, in practice, this rapidly triggers the memory usage problem and crashes described above.
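
As a rough illustration of the leak, here is a toy model that uses Python threads in place of web workers. `ToyWorker` and `ToyKernelManager` are hypothetical names and do not correspond to JupyterLite classes. The sketch contrasts a restart that leaves the previous worker running with one that shuts it down first.

```python
import threading


class ToyWorker:
    """Hypothetical stand-in for a web worker hosting a kernel."""

    def __init__(self) -> None:
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        # Keeps "executing" until it is explicitly told to stop.
        while not self._stop.wait(timeout=0.01):
            pass

    def shutdown(self) -> None:
        # Signal the worker to stop, then wait until it has exited.
        self._stop.set()
        self._thread.join()

    @property
    def alive(self) -> bool:
        return self._thread.is_alive()


class ToyKernelManager:
    """Starts one worker per restart, optionally stopping the previous one."""

    def __init__(self, shutdown_on_restart: bool) -> None:
        self._shutdown_on_restart = shutdown_on_restart
        self.workers = [ToyWorker()]

    def restart(self) -> None:
        if self._shutdown_on_restart:
            self.workers[-1].shutdown()
        self.workers.append(ToyWorker())

    def live_workers(self) -> int:
        return sum(w.alive for w in self.workers)

    def close(self) -> None:
        for worker in self.workers:
            worker.shutdown()


if __name__ == "__main__":
    leaky = ToyKernelManager(shutdown_on_restart=False)
    clean = ToyKernelManager(shutdown_on_restart=True)
    for _ in range(5):
        leaky.restart()
        clean.restart()
    print(leaky.live_workers(), clean.live_workers())
    leaky.close()
    clean.close()
```

In the leaky configuration every restart adds one more live worker, which is the pattern the accumulating "No sessions connected" kernels suggest. In a browser, the analogous clean-up step would involve terminating the old worker, though where that should happen in JupyterLite is not established here.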

ogrisel commented 1 week ago

I also tried with the xeus-python kernel using https://jupyterlite.github.io/xeus-python-demo/lab/index.html, and I can reproduce the same problem: I can also leak many "No sessions connected" kernels by repeatedly clicking the "restart and run all cells" button and make Chrome crash as a result.

So this is not a Pyodide-specific problem, but rather a generic bug in JupyterLite itself.