modal-labs / modal-client

Python client library for Modal
https://modal.com/docs
Apache License 2.0

XSRF Issue with Jupyter -- Stuck Modal Queue #1880

Closed tusharkhot closed 3 weeks ago

tusharkhot commented 1 month ago

I am running Jupyter on T4 GPUs and hitting an XSRF error that appears to block the queue. I have slightly modified the Jupyter notebook example to reproduce it.

# ---
# args: ["--timeout", 10]
# ---

# ## Overview
#
# Quick snippet showing how to connect to a Jupyter notebook server running inside a Modal container,
# especially useful for exploring the contents of Modal Volumes.
# This uses [Modal Tunnels](https://modal.com/docs/guide/tunnels#tunnels-beta)
# to create a tunnel between the running Jupyter instance and the internet.
#
# If you want your Jupyter notebook to run _locally_ and execute remote Modal Functions in certain cells, see the `basic.ipynb` example :)

import os
import subprocess
import time

import modal

app = modal.App(
    image=modal.Image.debian_slim().pip_install(
        "jupyter"
    )  # Note: prior to April 2024, "app" was called "stub"
)

def get_modal_url_queue(rank=0):
    """rank allows running multiple instances of modal concurrently by keeping a unique queue for each instance."""
    return modal.Queue.from_name(f"jupyter-url-queue-{rank}", create_if_missing=True)

CACHE_DIR = "/root/cache"
JUPYTER_TOKEN = "1234"  # Change me to something non-guessable!

# This is all that's needed to create a long-lived Jupyter server process in Modal
# that you can access in your Browser through a secure network tunnel.
# This can be useful when you want to interactively engage with Volume contents
# without having to download it to your host computer.

@app.function(concurrency_limit=1, timeout=1_500, gpu="T4")
def run_jupyter(timeout: int):
    jupyter_port = 8888
    with modal.forward(jupyter_port) as tunnel:
        jupyter_process = subprocess.Popen(
            [
                "jupyter",
                "notebook",
                "--no-browser",
                "--allow-root",
                "--ip=0.0.0.0",
                f"--port={jupyter_port}",
                "--NotebookApp.allow_origin='*'",
                "--NotebookApp.allow_remote_access=1",
            ],
            env={**os.environ, "JUPYTER_TOKEN": JUPYTER_TOKEN},
        )

        print(f"Jupyter available at => {tunnel.url}")
        modal_url_queue = get_modal_url_queue(0)
        modal_url_queue.put(tunnel.url)
        # URL will get added but won't be available locally
        print("Added to Queue")
        try:
            end_time = time.time() + timeout
            while time.time() < end_time:
                time.sleep(5)
            print(f"Reached end of {timeout} second timeout period. Exiting...")
        except KeyboardInterrupt:
            print("Exiting...")
        finally:
            jupyter_process.kill()

@app.local_entrypoint()
def main(timeout: int = 10_000):
    # Run the Jupyter Notebook server
    run_jupyter.remote(timeout=timeout)
    # Wait to get the url
    modal_url_queue = get_modal_url_queue(0)
    print("Lets get host from the queue")
    modal_host = modal_url_queue.get()
    # Code will get stuck here even though URL is added to queue
    print("Found host:" + modal_host)

# Doing `modal run jupyter_inside_modal.py` will run a Modal app which starts
# the Jupyter server at an address like https://u35iiiyqp5klbs.r3.modal.host.

If you run this with modal run, you will see the XSRF error, and the code never reaches the "Found host:" line even though the URL has been added to the queue.

Stack trace:

[I 2024-06-07 21:08:49.195 ServerApp] Jupyter Server 2.14.0 is running at:
[I 2024-06-07 21:08:49.195 ServerApp] http://modal:8888/tree?token=...
[I 2024-06-07 21:08:49.196 ServerApp]     http://127.0.0.1:8888/tree?token=...
[I 2024-06-07 21:08:49.196 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 2024-06-07 21:08:49.233 ServerApp] Skipped non-installed server(s): bash-language-server, dockerfile-language-server-nodejs, javascript-typescript-langserver, jedi-language-server, julia-language-server, pyright, python-language-server, python-lsp-server, r-languageserver, sql-language-server, texlab, typescript-language-server, unified-language-server, vscode-css-languageserver-bin, vscode-html-languageserver-bin, vscode-json-languageserver-bin, yaml-language-server
[I 2024-06-07 21:08:49.984 JupyterNotebookApp] 302 GET /tree (@172.20.136.1) 0.76ms
[W 2024-06-07 21:08:50.361 ServerApp] 403 POST /api/kernels (172.20.136.1): '_xsrf' argument missing from POST
[W 2024-06-07 21:08:50.361 ServerApp] wrote error: "'_xsrf' argument missing from POST"
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/tornado/web.py", line 1769, in _execute
        result = await result  # type: ignore
      File "/usr/local/lib/python3.9/site-packages/jupyter_server/base/handlers.py", line 751, in prepare
        await super().prepare()
      File "/usr/local/lib/python3.9/site-packages/jupyter_server/base/handlers.py", line 633, in prepare
        self.check_xsrf_cookie()
      File "/usr/local/lib/python3.9/site-packages/jupyter_server/base/handlers.py", line 537, in check_xsrf_cookie
        return super().check_xsrf_cookie()
      File "/usr/local/lib/python3.9/site-packages/tornado/web.py", line 1605, in check_xsrf_cookie
        raise HTTPError(403, "'_xsrf' argument missing from POST")
    tornado.web.HTTPError: HTTP 403: Forbidden ('_xsrf' argument missing from POST)
[W 2024-06-07 21:08:50.363 ServerApp] 403 POST /api/kernels (@172.20.136.1) 2.74ms referer=None

tusharkhot commented 4 weeks ago

Looks like this just got fixed! Thanks! FWIW, this was a transient issue last week and then stopped working completely on Friday. Hopefully this was a real fix and not just luck.

PS: I also replaced run_jupyter.remote with run_jupyter.spawn, and that didn't work either.
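For reference, the difference between a call that waits for the function to finish and one that starts it in the background can be sketched with the standard library alone. This is a local analogy under stated assumptions, not Modal code: `fake_server` and `wait_for_url` are hypothetical names, a thread stands in for the remote container, and `queue.Queue` stands in for `modal.Queue`. It only shows why a blocking call delays the read from the queue until the server function has returned.

```python
import queue
import threading
import time


def fake_server(url_queue: queue.Queue, run_for: float) -> None:
    """Hypothetical stand-in for run_jupyter: publish a URL, then stay alive."""
    url_queue.put("https://example.invalid")  # placeholder, not a real tunnel URL
    time.sleep(run_for)


def wait_for_url(blocking: bool, run_for: float = 0.5) -> float:
    """Return how many seconds pass before the caller receives the URL."""
    url_queue: queue.Queue = queue.Queue()
    start = time.monotonic()
    if blocking:
        # Analogous to a call that only returns once the function has finished.
        fake_server(url_queue, run_for)
    else:
        # Analogous to launching the function in the background.
        worker = threading.Thread(
            target=fake_server, args=(url_queue, run_for), daemon=True
        )
        worker.start()
    url_queue.get(timeout=5)  # raises queue.Empty if nothing arrives in time
    return time.monotonic() - start
```

With `blocking=True` the URL is only read after the full `run_for` period; with `blocking=False` it is read almost immediately. Since the spawn variant still hung here, this does not explain the stuck queue by itself.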

mwaskom commented 3 weeks ago

Let us know if it recurs!

tusharkhot commented 2 weeks ago

Looks like this issue is back. See the app:

ap-7ttuFE3RVDSnprmVJZQkQD fc-01J0Y93AW71N46DZVG17E68503 Input ID: in-01J0Y93AWEK55RRAH7KCX8QQV9

I am not able to reproduce it with my simple script, but you can see the errors in the app.

tusharkhot commented 2 weeks ago

@mwaskom Should I open a new issue?

mwaskom commented 2 weeks ago

Hey, googling the error message tells me that this is a fairly common issue with Jupyter notebooks: https://stackoverflow.com/questions/55014094/jupyter-notebook-not-saving-xsrf-argument-missing-from-post

I only did a cursory search, so I'm not sure whether there's an obvious root cause or a way to avoid it. But it does make me suspect this may not be a Modal-specific issue?
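For context, the 403 in the log is raised by Tornado's `check_xsrf_cookie`, which rejects a POST that carries no `_xsrf` token. A simplified sketch of that kind of double-submit check is below. It is not Tornado's actual implementation: `check_xsrf` and `XsrfError` are hypothetical names, the header lookup is simplified to one exact key, and real tokens are versioned and masked rather than compared as raw strings.

```python
import hmac
from typing import Mapping, Optional


class XsrfError(Exception):
    """Raised when a request fails the simplified XSRF check."""


def check_xsrf(
    cookies: Mapping[str, str],
    form: Mapping[str, str],
    headers: Mapping[str, str],
) -> None:
    """Reject the request unless the submitted token matches the cookie."""
    # The token may arrive as a form argument or as a request header.
    token: Optional[str] = form.get("_xsrf") or headers.get("X-XSRFToken")
    if not token:
        raise XsrfError("'_xsrf' argument missing from POST")
    expected = cookies.get("_xsrf", "")
    # Constant-time comparison to avoid leaking token contents via timing.
    if not hmac.compare_digest(token.encode(), expected.encode()):
        raise XsrfError("XSRF cookie does not match POST argument")
```

A client that POSTs to /api/kernels without first loading a page that sets the `_xsrf` cookie, or without echoing it back, would fail the first branch, which matches the message in the log above.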

tusharkhot commented 2 weeks ago

So it looks like it's an issue with the Modal queue. The hostname gets pushed to jupyter-url-queue-0, but my code keeps waiting for it. I thought it was because of the _xsrf error message, but it's this specific queue. Once I change the queue name to jupyter-url-queue-1, things work fine.
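Since switching to a new queue name works, one possible workaround is to generate a fresh name for every run, so that whatever state is attached to an old name cannot interfere. A minimal sketch, assuming a hypothetical helper `unique_queue_name`; the resulting string would be passed both to the remote function and to `modal.Queue.from_name(name, create_if_missing=True)` as in the script above. This sidesteps the stuck queue rather than explaining it.

```python
import uuid


def unique_queue_name(prefix: str = "jupyter-url-queue") -> str:
    """Hypothetical helper: build a queue name that is unique per run."""
    # A short random suffix keeps the name readable while avoiding reuse.
    return f"{prefix}-{uuid.uuid4().hex[:8]}"
```

The caller would create the name once in the local entrypoint and hand it to the remote function as an argument, so both sides open the same queue.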