jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems
BSD 3-Clause "New" or "Revised" License

User/client website hangs indefinitely when client-lab crashes after server "waiting to connect" #230

Open afrankra opened 2 years ago

afrankra commented 2 years ago

Bug description

In a batch system environment, the hub uses a batchspawner of some sort to start client-labs. The hub waits for the batchspawner to report that the batch job containing the client-lab has started. After this report has been received, the hub waits for the lab to connect.

The JupyterHub interface for a user will hang if the client-lab crashes before the connection can be established (after the web page reports "waiting to connect", i.e. the client-lab job has started). In this circumstance the .base keeps waiting for the client-lab until the timeout expires and redirects the user to the waiting page, even though the client-lab has already crashed.

If the client-lab crashes after the connection is established, the user can just try again. If the client-lab crashes before the connection is established, the user is stuck waiting until the timeout.

Note that this should not be the responsibility of the existing timeout. That timeout can be very long, to give the batch system time to schedule the client-lab. If the "waiting to connect" state is reached and no connection can be made because of a crash, the user has to wait out the global timeout, which is meant to cover the start of the client-lab (and that will never happen after a crash). A second timeout could be used to determine how long to wait for a connection after the client-lab has started.

Expected behaviour

If the client-lab crashes after the connection is established, the user can just try again. If the client-lab server crashes before the connection is established but after the "waiting to connect" state is reached, the GUI should not hang. The user proxy redirect should be removed if the connection cannot be established within a timeout (not c.Spawner.start_timeout but a separate timeout).
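The second timeout requested above could look roughly like the following. This is only a sketch of the idea, not batchspawner or JupyterHub code: `wait_for_connection`, `LabCrashed`, `job_is_running`, `lab_is_connected`, and `connect_timeout` are hypothetical names chosen for illustration.

```python
import asyncio
import time


class LabCrashed(Exception):
    """Raised when the batch job ends before the lab has connected."""


async def wait_for_connection(job_is_running, lab_is_connected,
                              connect_timeout=60.0, poll_interval=1.0):
    """Wait for the lab to connect after its batch job has started.

    job_is_running and lab_is_connected are hypothetical async callables
    returning bool; a real spawner would query the batch system and the
    hub instead. connect_timeout is separate from the (long) start timeout.
    """
    deadline = time.monotonic() + connect_timeout
    while time.monotonic() < deadline:
        if await lab_is_connected():
            return True
        # Fail fast: a finished job can never connect, so stop waiting.
        if not await job_is_running():
            raise LabCrashed("batch job exited before the lab connected")
        await asyncio.sleep(poll_interval)
    raise asyncio.TimeoutError("lab did not connect within connect_timeout")
```

With a check like this, a crash is reported as soon as the job is seen to have ended, and a lab that never connects is given up on after the short connect timeout rather than the long scheduling timeout.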

Actual behaviour

The JupyterHub interface for a user hangs if the client-lab crashes before the connection can be established (after the web page reports "waiting to connect", i.e. the client-lab job has started).

How to reproduce

1. Run JupyterHub so that it starts JupyterLab instances with a batchspawner.
2. Force the batch script to exit before the lab starts, using exit 1, return 1, etc.
3. Set c.Spawner.start_timeout large enough for the job to be scheduled and started.
4. Start a lab for a user using the web GUI.
5. The batch job will eventually start, but the lab will not, because of the crash simulated with exit 1 / return 1.
6. The GUI changes from waiting to start to waiting to connect, and the connection never happens.
7. The user cannot try again to start the lab because of the redirects.
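As an illustration of the steps above, a configuration fragment along these lines might be used to simulate the crash. This is a sketch, not the reporter's actual setup: it assumes the plain `batchspawner.SlurmSpawner` class is used and that its `req_prologue` option is placed in the generated batch script ahead of the lab command.

```python
# jupyterhub_config.py (fragment); `c` is supplied by JupyterHub when it
# loads this file, so the fragment is not runnable on its own.

# Assumption: plain SlurmSpawner; the reporter's setup may use a different
# batchspawner-derived class.
c.JupyterHub.spawner_class = "batchspawner.SlurmSpawner"

# Long enough for the scheduler to queue and start the job.
c.Spawner.start_timeout = 1200

# Assumption: the prologue runs before the lab command in the batch script,
# so exiting here ends the job before the lab ever starts.
c.SlurmSpawner.req_prologue = "exit 1"
```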

Your personal set up

Slurm, CentOS 7

Full environment

```
alembic==1.7.5
anyio==3.4.0
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
async-generator==1.10
attrs==21.4.0
Babel==2.9.1
backcall==0.2.0
batchspawner==1.1.0
beautifulsoup4==4.10.0
bleach==4.1.0
bs4==0.0.1
certifi==2021.10.8
certipy==0.1.3
cffi==1.15.0
charset-normalizer==2.0.10
colorama==0.4.4
commonmark==0.9.1
contextvars==2.4
cryptography==36.0.1
dataclasses==0.8
decorator==5.1.0
defusedxml==0.7.1
entrypoints==0.3
greenlet==1.1.2
idna==3.3
immutables==0.16
importlib-metadata==4.8.3
importlib-resources==5.4.0
ipykernel==5.5.6
ipython==7.16.2
ipython-genutils==0.2.0
jedi==0.17.2
Jinja2==3.0.3
json5==0.9.6
jsonschema==3.2.0
jupyter-client==7.1.0
jupyter-core==4.9.1
jupyter-server==1.13.1
jupyter-telemetry==0.1.0
jupyterhub==2.0.1
jupyterhub-moss==1.1.1
jupyterlab==3.2.5
jupyterlab-pygments==0.1.2
jupyterlab-server==2.10.2
Mako==1.1.6
MarkupSafe==2.0.1
mistune==0.8.4
nbclassic==0.3.4
nbclient==0.5.9
nbconvert==6.0.7
nbformat==5.1.3
nest-asyncio==1.5.4
nodeenv==1.6.0
notebook==6.4.6
oauthenticator==14.2.0
oauthlib==3.1.1
packaging==21.3
pamela==1.0.0
pandocfilters==1.5.0
parso==0.7.1
pexpect==4.8.0
pickleshare==0.7.5
pip-search==0.0.10
prometheus-client==0.12.0
prompt-toolkit==3.0.24
ptyprocess==0.7.0
pycparser==2.21
Pygments==2.11.1
pyOpenSSL==21.0.0
pyparsing==3.0.6
pyrsistent==0.18.0
python-dateutil==2.8.2
python-json-logger==2.0.2
pytz==2021.3
pyzmq==22.3.0
requests==2.27.0
rich==10.16.2
ruamel.yaml==0.17.20
ruamel.yaml.clib==0.2.6
Send2Trash==1.8.0
six==1.16.0
sniffio==1.2.0
soupsieve==2.3.1
SQLAlchemy==1.4.29
sudospawner==0.5.2
terminado==0.12.1
testpath==0.5.0
tornado==6.1
traitlets==4.3.3
typing_extensions==4.0.1
urllib3==1.26.7
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.2.3
wrapspawner==1.0.0
zipp==3.6.0
```

Configuration

```python
import batchspawner
import jupyterhub_moss

c.Spawner.start_timeout = 1200
c.JupyterHub.log_level = 'DEBUG'
c.Spawner.debug = True
```

minrk commented 2 years ago

Thanks for the report! I've migrated the issue to the batchspawner repo, which should be responsible for handling fault tolerance talking to batch systems.

afrankra commented 2 years ago

FYI Sadly this issue will not be solvable at the spawner level.