jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems
BSD 3-Clause "New" or "Revised" License

batch jobs that started but failed before launching the server leave the hub in an unrecoverable state #253

Closed lexming closed 8 months ago

lexming commented 1 year ago

Bug description

Once the job scheduler reports that the job is running, batchspawner waits indefinitely to get the server port from the single-user server. https://github.com/jupyterhub/batchspawner/blob/0b15e4fd815fc07b0b8ec7330a7a5997a4b4badb/batchspawner/batchspawner.py#L455-L460

If the job fails to launch the server for any reason and stops, batchspawner will continue to wait until start_timeout is reached.
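Conceptually, the fix is to race the wait for the server port against a scheduler poll, so that a dead job aborts the spawn immediately instead of burning the whole start_timeout. A minimal sketch of that idea (the names `port_future` and `query_job_running` are hypothetical stand-ins, not batchspawner's actual API):

```python
import asyncio


async def wait_for_port_or_failure(port_future, query_job_running, poll_interval=1.0):
    """Wait for the single-user server to report its port, but abort as soon
    as the scheduler says the job is no longer running.

    port_future      -- future resolved when the server reports its port
    query_job_running -- async callable returning True while the job is alive
    Both are illustrative stand-ins for batchspawner internals.
    """

    async def watch_job():
        # Poll the scheduler; return once the job has disappeared or failed.
        while await query_job_running():
            await asyncio.sleep(poll_interval)

    watcher = asyncio.ensure_future(watch_job())
    port_task = asyncio.ensure_future(port_future)
    done, pending = await asyncio.wait(
        {port_task, watcher}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    # Reap cancelled tasks so nothing is left dangling on the loop.
    await asyncio.gather(*pending, return_exceptions=True)
    if watcher in done:
        raise RuntimeError("batch job exited before the server reported its port")
    return port_task.result()
```

Whichever side finishes first wins: a healthy job resolves the port future and the watcher is cancelled; a dead job makes the watcher return first and the spawn fails fast with a clear error.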

Expected behaviour

Batchspawner should catch failed jobs even if they already started, and abort Spawner.start().

Actual behaviour

JupyterHub waits until start_timeout and prints the usual timeout error. However, at this point it is impossible to start another server. Stopping the server does not work (as it is not running), and the hub shows a popup saying that a new server cannot be started because one is already pending. JupyterHub has to be restarted.

How to reproduce

  1. make a test `batch_script` with `exit 1` as the first script command
  2. launch the server as usual with batchspawner
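For reference, step 1 could look roughly like this in `jupyterhub_config.py` (a trimmed sketch assuming SlurmSpawner, not the full default batch script template):

```python
# jupyterhub_config.py -- minimal reproduction sketch (assumes SlurmSpawner)
c.JupyterHub.spawner_class = "batchspawner.SlurmSpawner"

# Inject `exit 1` before the single-user server ever starts: the scheduler
# briefly reports the job as RUNNING, then the job dies without connecting
# back to the hub, and the spawn hangs until start_timeout is reached.
c.SlurmSpawner.batch_script = """#!/bin/bash
#SBATCH --job-name=spawner-jupyterhub
#SBATCH --output=/dev/null
exit 1
{cmd}
"""
```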

Your personal set up

Full environment

```
alembic @ file:///tmp/wheelhouse/alembic-1.8.0-py3-none-any.whl
async-generator @ file:///tmp/wheelhouse/async_generator-1.10-py3-none-any.whl
attrs @ file:///tmp/wheelhouse/attrs-21.4.0-py2.py3-none-any.whl
batchspawner==1.2.0
certifi @ file:///tmp/wheelhouse/certifi-2022.5.18.1-py3-none-any.whl
certipy @ file:///tmp/wheelhouse/certipy-0.1.3-py3-none-any.whl
cffi @ file:///tmp/wheelhouse/cffi-1.15.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
charset-normalizer @ file:///tmp/wheelhouse/charset_normalizer-2.0.12-py3-none-any.whl
cryptography @ file:///tmp/wheelhouse/cryptography-37.0.2-cp36-abi3-manylinux_2_24_x86_64.whl
entrypoints @ file:///tmp/wheelhouse/entrypoints-0.4-py3-none-any.whl
future==0.18.2
greenlet @ file:///tmp/wheelhouse/greenlet-1.1.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
idna @ file:///tmp/wheelhouse/idna-3.3-py3-none-any.whl
importlib-metadata @ file:///tmp/wheelhouse/importlib_metadata-4.11.4-py3-none-any.whl
importlib-resources @ file:///tmp/wheelhouse/importlib_resources-5.7.1-py3-none-any.whl
Jinja2 @ file:///tmp/wheelhouse/Jinja2-3.1.2-py3-none-any.whl
jsonschema @ file:///tmp/wheelhouse/jsonschema-4.6.0-py3-none-any.whl
jupyter-telemetry @ file:///tmp/wheelhouse/jupyter_telemetry-0.1.0-py3-none-any.whl
jupyterhub @ file:///tmp/wheelhouse/jupyterhub-2.3.1-py3-none-any.whl
jupyterhub-moss @ https://github.com/vub-hpc/jupyterhub_moss/archive/refs/tags/v5.5.1.tar.gz
Mako @ file:///tmp/wheelhouse/Mako-1.2.0-py3-none-any.whl
MarkupSafe @ file:///tmp/wheelhouse/MarkupSafe-2.1.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
mock==4.0.3
oauthenticator==15.1.0
oauthlib @ file:///tmp/wheelhouse/oauthlib-3.2.0-py3-none-any.whl
packaging @ file:///tmp/wheelhouse/packaging-21.3-py3-none-any.whl
pamela @ file:///tmp/wheelhouse/pamela-1.0.0-py2.py3-none-any.whl
prometheus-client @ file:///tmp/wheelhouse/prometheus_client-0.14.1-py3-none-any.whl
pycparser @ file:///tmp/wheelhouse/pycparser-2.21-py2.py3-none-any.whl
pycurl==7.43.0.2
pyOpenSSL @ file:///tmp/wheelhouse/pyOpenSSL-22.0.0-py2.py3-none-any.whl
pyparsing @ file:///tmp/wheelhouse/pyparsing-3.0.9-py3-none-any.whl
pyrsistent @ file:///tmp/wheelhouse/pyrsistent-0.18.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
python-dateutil @ file:///tmp/wheelhouse/python_dateutil-2.8.2-py2.py3-none-any.whl
python-json-logger @ file:///tmp/wheelhouse/python_json_logger-2.0.2-py3-none-any.whl
requests @ file:///tmp/wheelhouse/requests-2.27.1-py2.py3-none-any.whl
ruamel.yaml @ file:///tmp/wheelhouse/ruamel.yaml-0.17.21-py3-none-any.whl
ruamel.yaml.clib @ file:///tmp/wheelhouse/ruamel.yaml.clib-0.2.6-cp38-cp38-manylinux1_x86_64.whl
six @ file:///tmp/wheelhouse/six-1.16.0-py2.py3-none-any.whl
SQLAlchemy @ file:///tmp/wheelhouse/SQLAlchemy-1.4.37-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
tornado @ file:///tmp/wheelhouse/tornado-6.1-cp38-cp38-manylinux2010_x86_64.whl
traitlets @ file:///tmp/wheelhouse/traitlets-5.2.2.post1-py3-none-any.whl
urllib3 @ file:///tmp/wheelhouse/urllib3-1.26.9-py2.py3-none-any.whl
vsc-base==3.4.9
vsc-config @ file:///usr/local/src/vsc-config-master.tar.gz
vsc-install==0.17.26
zipp @ file:///tmp/wheelhouse/zipp-3.8.0-py3-none-any.whl
```
Configuration

This issue is independent of any configuration settings.
Logs

```
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:05:26.019 JupyterHub user:728] Calling Spawner.start for vsc10122
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [I 2022-12-14 14:05:26.023 JupyterHub spawner:313] Used environment:
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [I 2022-12-14 14:05:26.023 JupyterHub spawner:314] Used default URL: /lab
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [I 2022-12-14 14:05:26.048 JupyterHub batchspawner:291] Spawner submitting job using sbatch --parsable
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [I 2022-12-14 14:05:26.049 JupyterHub batchspawner:292] Spawner submitted script:
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: #!/bin/bash
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: #SBATCH --job-name=spawner-jupyterhub
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: #SBATCH --chdir=/user/brussel/101/vsc10122
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: #SBATCH --output=/dev/null
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: #SBATCH --export=PATH,LANG,LC_ALL,JUPYTERHUB_API_TOKEN,JPY_API_TOKEN,JUPYTERHUB_CLIENT_ID,JUPYTERHUB_HOST,JUPYTERHUB_OAUTH_CALLBACK_URL,JUPYTERHUB_OAUTH_SCOPES,JUPYTERHUB_USER,JUPYTERHUB_SERVER_NAME,JUPYTERHUB_API_URL,JUPYTERHUB_ACTIVITY_URL,JUPYTERHUB_BASE_URL,JUPYTERHUB_SERVICE_PREFIX,JUPYTERHUB_SERVICE_URL,JUPYTERHUB_DEFAULT_URL,USER,HOME,SHELL
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: #SBATCH --get-user-env=L
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: #SBATCH --partition=broadwell
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: #SBATCH --time=1:00:00
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: #SBATCH --cpus-per-task=1
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: set -euo pipefail
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: trap 'echo SIGTERM received' TERM
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: exit 1
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: module load JupyterHub/2.3.1-GCCcore-10.3.0
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: srun --export=ALL batchspawner-singleuser jupyterhub-singleuser
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: echo "JupyterLab server ended gracefully"
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: EOF
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [I 2022-12-14 14:05:26.639 JupyterHub batchspawner:295] Job submitted. cmd: sbatch --parsable output: 7837730;hydra
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:05:26.648 JupyterHub batchspawner:322] Spawner querying job: squeue -h -j 7837730 -o '%T %B'
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [W 2022-12-14 14:05:26.796 JupyterHub base:188] Rolling back dirty objects IdentitySet([])
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [I 2022-12-14 14:05:26.819 JupyterHub log:189] 302 POST /hub/spawn -> /hub/spawn-pending/vsc10122 (vsc10122@134.184.143.131) 1032.47ms
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:05:26.849 JupyterHub scopes:491] Checking access via scope servers
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:05:26.850 JupyterHub scopes:402] Argument-based access to /hub/spawn-pending/vsc10122 via servers
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [I 2022-12-14 14:05:26.856 JupyterHub pages:401] vsc10122 is pending spawn
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [I 2022-12-14 14:05:26.870 JupyterHub log:189] 200 GET /hub/spawn-pending/vsc10122 (vsc10122@134.184.143.131) 30.26ms
Dec 14 15:05:27 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:05:27.199 JupyterHub batchspawner:437] Job 7837730 still pending
Dec 14 15:05:27 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:05:27.339 JupyterHub scopes:491] Checking access via scope read:servers
Dec 14 15:05:27 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:05:27.339 JupyterHub scopes:402] Argument-based access to /hub/api/users/vsc10122/server/progress via read:servers
Dec 14 15:05:27 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:05:27.706 JupyterHub batchspawner:322] Spawner querying job: squeue -h -j 7837730 -o '%T %B'
Dec 14 15:05:35 jupyterhub01.phoenix.os conmon[17742]: [W 2022-12-14 14:05:35.798 JupyterHub base:1063] User vsc10122 is slow to start (timeout=10)
Dec 14 15:10:09 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:10:09.386 JupyterHub proxy:821] Proxy: Fetching GET http://127.0.0.1:8001/api/routes
Dec 14 15:10:09 jupyterhub01.phoenix.os conmon[17742]: 14:10:09.400 [ConfigProxy] info: 200 GET /api/routes
Dec 14 15:10:09 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:10:09.409 JupyterHub proxy:346] Checking routes
Dec 14 15:15:09 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:15:09.386 JupyterHub proxy:821] Proxy: Fetching GET http://127.0.0.1:8001/api/routes
Dec 14 15:15:09 jupyterhub01.phoenix.os conmon[17742]: 14:15:09.397 [ConfigProxy] info: 200 GET /api/routes
Dec 14 15:15:09 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:15:09.401 JupyterHub proxy:346] Checking routes
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]: [W 2022-12-14 14:15:26.023 JupyterHub user:807] vsc10122's server failed to start in 600 seconds, giving up.
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]:
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]: Common causes of this timeout, and debugging tips:
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]:
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]: 1. Everything is working, but it took too long.
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]: To fix: increase `Spawner.start_timeout` configuration
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]: to a number of seconds that is enough for spawners to finish starting.
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]: 2. The server didn't finish starting,
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]: or it crashed due to a configuration issue.
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]: Check the single-user server's logs for hints at what needs fixing.
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]:
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:15:26.033 JupyterHub user:913] Stopping vsc10122
```
opoplawski commented 1 year ago

I've been dealing with this as well while trying to debug issues. It would be great if batchspawner kept checking the status of the job and cleaned up the pending spawn when the batch job disappears.
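The suggested behaviour amounts to a watchdog that keeps polling the scheduler while the spawn is pending. A minimal sketch, where `poll` and `stop` are hypothetical stand-ins for the spawner's job-status query and cleanup (not batchspawner's actual methods):

```python
import asyncio


async def watch_pending_job(poll, stop, interval=5.0):
    """Hypothetical watchdog for a pending spawn.

    poll -- async callable returning None while the job is alive, and an
            exit status once the job has disappeared from the scheduler
    stop -- async callable that cleans up the pending spawn
    """
    while True:
        status = await poll()
        if status is not None:
            # Job vanished: release the pending spawn so the user can start
            # a new server without restarting the hub.
            await stop()
            return status
        await asyncio.sleep(interval)
```

Running this alongside the port wait would let the hub mark the server as stopped as soon as the job dies, instead of leaving the spawn pending until start_timeout.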