jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems
BSD 3-Clause "New" or "Revised" License
190 stars 134 forks source link

jupyterhub cannot talk to session started in Slurm #231

Closed pelacables closed 2 years ago

pelacables commented 2 years ago

Bug description

BatchSpawner is able to submit the job, the job starts in the compute node but the Hub never connects to the session in the compute node.

Expected behaviour

The Hub connects to the session running in the compute node

Actual behaviour

The Hub stucks in "Cluster job running... waiting to connect"

How to reproduce

  1. Start Jupyterhub
  2. Select a session that starts a session in the cluster
  3. wait for hub to start and connect to the cluster
  4. Session fails with error like:

Spawn failed: Server at http://nodeXYZ:52303/user/arnau.bria@boehringer-ingelheim.com/local/ didn't respond in 300 seconds

Your personal set up

Using an environment module based installation. Batch system: SLURM wrapspawner.ProfilesSpawner to be able to choose between local and cluster sessions Auth used: OAuth (OpenID) In sudoers I've commented out secure_path

netaddr                       0.8.0
netifaces                     0.10.9
nose                          1.3.7
notebook                      6.4.8
numexpr                       2.7.3
numpy                         1.20.3
oauthenticator                14.2.0
oauthlib                      3.2.0
packaging                     21.3
pamela                        1.0.0
pandas                        1.2.4
pandocfilters                 1.5.0
paramiko                      2.7.2
parso                         0.8.3
pastel                        0.2.1
pathlib2                      2.3.5
pathspec                      0.9.0
paycheck                      1.0.2
pbr                           5.6.0
pexpect                       4.8.0
pickleshare                   0.7.5
pip                           21.1.1
pkginfo                       1.7.0
platformdirs                  2.4.1
pluggy                        0.13.1
poetry                        1.1.6
poetry-core                   1.0.3
prometheus-client             0.13.1
prompt-toolkit                3.0.26
psutil                        5.8.0
ptyprocess                    0.7.0
pure-eval                     0.2.2
py                            1.10.0
py-expression-eval            0.3.13
pyasn1                        0.4.8
pybind11                      2.6.2
pycparser                     2.21
pycrypto                      2.6.1
pycurl                        7.44.1
Pygments                      2.11.2
pyjet                         1.8.2
PyJWT                         2.3.0
pylev                         1.3.0
PyNaCl                        1.4.0
pyOpenSSL                     22.0.0
pyparsing                     3.0.7
pyrsistent                    0.18.1
pytest                        6.2.4
python-dateutil               2.8.2
python-json-logger            2.0.2
pytoml                        0.1.21
pytz                          2021.3
pyzmq                         22.3.0
regex                         2021.4.4
requests                      2.27.1
requests-toolbelt             0.9.1
ruamel.yaml                   0.17.20
ruamel.yaml.clib              0.2.6
scandir                       1.10.0
scipy                         1.6.3
SecretStorage                 3.3.1
semantic-version              2.8.5
Send2Trash                    1.8.0
setuptools                    56.2.0
setuptools-rust               0.12.1
setuptools-scm                6.0.1
shellingham                   1.4.0
simplegeneric                 0.8.1
simplejson                    3.17.2
six                           1.16.0
smmap                         5.0.0
sniffio                       1.2.0
snowballstemmer               2.1.0
sortedcontainers              2.3.0
Sphinx                        4.0.0
sphinx-bootstrap-theme        0.7.1
sphinxcontrib-applehelp       1.0.2
sphinxcontrib-devhelp         1.0.2
sphinxcontrib-htmlhelp        1.0.3
sphinxcontrib-jsmath          1.0.1
sphinxcontrib-qthelp          1.0.3
sphinxcontrib-serializinghtml 1.1.4
sphinxcontrib-websupport      1.2.4
SQLAlchemy                    1.4.31
stack-data                    0.1.4
tabulate                      0.8.9
terminado                     0.13.1
testpath                      0.5.0
threadpoolctl                 2.1.0
toml                          0.10.2
tomli                         2.0.0
tomlkit                       0.7.0
tornado                       6.1
tornado-proxy-handlers        0.0.5
traitlets                     5.1.1
traittypes                    0.2.1
typing-extensions             3.10.0.2
ujson                         4.0.2
urllib3                       1.26.8
virtualenv                    20.4.6
wcwidth                       0.2.5
webencodings                  0.5.1
websocket-client              1.2.3
wheel                         0.36.2
widgetsnbextension            3.5.2
wrapspawner                   1.0.0
xlrd                          2.0.1
zipp                          3.4.1

Multiple servers

c.JupyterHub.allow_named_servers = True c.JupyterHub.named_server_limit_per_user = 5

BatchSpawner

import batchspawner c.JupyterHub.bind_url = 'https://servername:8000/' c.JupyterHub.default_url = 'home' c.JupyterHub.hub_ip = 'servername' c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner' c.Spawner.http_timeout = 300 c.ProfilesSpawner.profiles = [ ("Local server", 'local', 'jupyterhub.spawner.LocalProcessSpawner', {'ip':'0.0.0.0'} ), ('Interactive Cluster - 2 cores, 4 GB, 8 hours', 'lab', 'batchspawner.SlurmSpawner', dict(req_nprocs='2', req_partition='short',)), ]

c.SlurmSpawner.batch_script = '''#!/bin/bash

SBATCH --output=/home/bria/jupyterhubslurmspawner%j.log

SBATCH --job-name=spawner-jupyterhub

SBATCH --chdir=/home/bria

SBATCH --get-user-env=L

SBATCH --partition=short

SBATCH --cpus-per-task=2

SBATCH --export=ALL

set -euo pipefail ml load jupyterhub/2.1.1-foss-2021a-Python-3.9.5

trap 'echo SIGTERM received' TERM

srun batchspawner-singleuser jupyterhub-singleuser echo "jupyterhub-singleuser ended gracefully" '''


- <details><summary>Logs</summary>
Server logs:

[I 2022-02-14 12:55:28.098 JupyterHub provider:607] Creating oauth client jupyterhub-user-arnau.bria%40boehringer-ingelheim.com-local [I 2022-02-14 12:55:28.129 JupyterHub batchspawner:262] Spawner submitting job using sudo -E -u arnau.bria@boehringer-ingelheim.com sbatch --parsable [I 2022-02-14 12:55:28.129 JupyterHub batchspawner:263] Spawner submitted script:

!/bin/bash

#SBATCH --output=/home/bria/jupyterhub_slurmspawner_%j.log
#SBATCH --job-name=spawner-jupyterhub
#SBATCH --chdir=/home/bria
#SBATCH --get-user-env=L
#SBATCH --partition=short
#SBATCH --cpus-per-task=2
#SBATCH --export=ALL

set -euo pipefail
env
ml load jupyterhub/2.1.1-foss-2021a-Python-3.9.5

trap 'echo SIGTERM received' TERM

srun batchspawner-singleuser jupyterhub-singleuser
echo "jupyterhub-singleuser ended gracefully"

[I 2022-02-14 12:55:28.283 JupyterHub batchspawner:266] Job submitted. cmd: sudo -E -u arnau.bria@boehringer-ingelheim.com sbatch --parsable output: 135 98295 [W 2022-02-14 12:55:29.078 JupyterHub base:188] Rolling back dirty objects IdentitySet([<Server(0.0.0.0:0)>]) [I 2022-02-14 12:55:29.086 JupyterHub log:189] 302 POST /hub/spawn/arnau.bria@boehringer-ingelheim.com/local -> /hub/spawn-pending/arnau.bria@boehringer -ingelheim.com/local (arnau.bria@boehringer-ingelheim.com@172.25.146.121) 1006.21ms [I 2022-02-14 12:55:29.140 JupyterHub pages:405] arnau.bria@boehringer-ingelheim.com:local is pending spawn [I 2022-02-14 12:55:29.145 JupyterHub log:189] 200 GET /hub/spawn-pending/arnau.bria@boehringer-ingelheim.com/local (arnau.bria@boehringer-ingelheim.com @172.25.146.121) 10.36ms [W 2022-02-14 12:55:38.079 JupyterHub base:1044] User arnau.bria@boehringer-ingelheim.com:local is slow to start (timeout=10) [I 2022-02-14 12:55:44.533 JupyterHub log:189] 200 POST /hub/api/batchspawner (arnau.bria@boehringer-ingelheim.com@148.191.96.65) 14.06ms [I 2022-02-14 12:55:44.852 JupyterHub batchspawner:419] Notebook server job 13598295 started at computenode:52303 [I 2022-02-14 12:55:47.869 JupyterHub log:189] 200 GET /hub/api (@148.191.96.65) 0.96ms [I 2022-02-14 12:55:47.896 JupyterHub log:189] 200 POST /hub/api/users/arnau.bria@boehringer-ingelheim.com/activity (arnau.bria@boehringer-ingelheim.com @148.191.96.65) 12.08ms 13:00:18.304 [ConfigProxy] info: 200 GET /api/routes [W 2022-02-14 13:00:33.138 JupyterHub user:812] arnau.bria@boehringer-ingelheim.com's server never showed up at http://inhccne0101:52303/user/arnau.bria @boehringer-ingelheim.com/local/ after 300 seconds. Giving up. [...] [I 2022-02-14 13:00:33.292 JupyterHub batchspawner:431] Stopping server job 13598295 [I 2022-02-14 13:00:33.293 JupyterHub batchspawner:316] Cancelling job 13598295: sudo -E -u arnau.bria@boehringer-ingelheim.com scancel 13598295


Client logs:

[I 2022-02-14 12:55:45.783 SingleUserLabApp mixins:615] Starting jupyterhub single-user server version 2.1.1 [I 2022-02-14 12:55:45.783 SingleUserLabApp mixins:629] Extending jupyterlab.labhubapp.SingleUserLabApp from jupyterlab 3.2.8 [I 2022-02-14 12:55:45.783 SingleUserLabApp mixins:629] Extending jupyter_server.serverapp.ServerApp from jupyter_server 1.13.4 /apps/prod/easybuild/sl7.x86_64.foss-2021a/software/jupyterhub/2.1.1-foss-2021a-Python-3.9.5/lib/python3.9/site-packages/jupyter_servermathjax/app.py:40: FutureWarning: The alias `()will be deprecated. Usei18n()` instead. help=("""The MathJax.js configuration file that is to be used."""), [I 2022-02-14 12:55:46.901 SingleUserLabApp manager:345] jupyter_resource_usage | extension was successfully linked. [I 2022-02-14 12:55:46.910 SingleUserLabApp manager:345] jupyter_server_mathjax | extension was successfully linked. [I 2022-02-14 12:55:46.924 SingleUserLabApp manager:345] jupyterlab | extension was successfully linked. [I 2022-02-14 12:55:46.924 SingleUserLabApp manager:345] jupyterlab_git | extension was successfully linked. [I 2022-02-14 12:55:46.924 SingleUserLabApp manager:345] jupyterlab_iframe | extension was successfully linked. [I 2022-02-14 12:55:47.594 SingleUserLabApp manager:345] nbclassic | extension was successfully linked. [I 2022-02-14 12:55:47.594 SingleUserLabApp manager:345] nbdime | extension was successfully linked. [I 2022-02-14 12:55:47.671 SingleUserLabApp manager:367] nbclassic | extension was successfully loaded. [I 2022-02-14 12:55:47.673 SingleUserLabApp manager:367] jupyter_resource_usage | extension was successfully loaded. [I 2022-02-14 12:55:47.673 SingleUserLabApp manager:367] jupyter_server_mathjax | extension was successfully loaded. [I 2022-02-14 12:55:47.675 LabApp] JupyterLab extension loaded from /apps/prod/easybuild/sl7.x86_64.foss-2021a/software/jupyterhub/2.1.1-foss-2021a-Python-3.9.5/lib/python3.9/site-packages/jupyterlab [I 2022-02-14 12:55:47.675 LabApp] JupyterLab application directory is /apps/prod/easybuild/sl7.x86_64.foss-2021a/software/jupyterhub/2.1.1-foss-2021a-Python-3.9.5/share/jupyter/lab [I 2022-02-14 12:55:47.680 SingleUserLabApp manager:367] jupyterlab | extension was successfully loaded. [I 2022-02-14 12:55:47.686 SingleUserLabApp manager:367] jupyterlab_git | extension was successfully loaded. [I 2022-02-14 12:55:47.686 SingleUserLabApp manager:367] jupyterlab_iframe | extension was successfully loaded. [I 2022-02-14 12:55:47.859 SingleUserLabApp manager:367] nbdime | extension was successfully loaded. [I 2022-02-14 12:55:47.860 SingleUserLabApp mixins:641] Starting jupyterhub-singleuser server version 2.1.1 [I 2022-02-14 12:55:47.869 SingleUserLabApp serverapp:2581] Serving notebooks from local directory: /home/bria [I 2022-02-14 12:55:47.869 SingleUserLabApp serverapp:2581] Jupyter Server 1.13.4 is running at: [I 2022-02-14 12:55:47.869 SingleUserLabApp serverapp:2581] http://computenode:52303/user/arnau.bria@boehringer-ingelheim.com/local/lab [I 2022-02-14 12:55:47.869 SingleUserLabApp serverapp:2581] or http://127.0.0.1:52303/user/arnau.bria@boehringer-ingelheim.com/local/lab [I 2022-02-14 12:55:47.869 SingleUserLabApp serverapp:2582] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). [I 2022-02-14 12:55:47.879 SingleUserLabApp mixins:597] Updating Hub with activity every 300 seconds srun: Job step aborted: Waiting up to 12 seconds for job step to finish. slurmstepd: error: JOB 13598295 ON inhccne0101 CANCELLED AT 2022-02-14T13:00:33 slurmstepd: error: STEP 13598295.0 ON inhccne0101 CANCELLED AT 2022-02-14T13:00:33 [C 2022-02-14 13:00:33.444 SingleUserLabApp serverapp:2037] received signal 15, stopping


</details>
welcome[bot] commented 2 years ago

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other other community members to contribute more effectively. welcome You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

jbaksta commented 2 years ago

Firewall perhaps? Have you looked at the state of configurable-http-proxy to make sure the proper route is there.

Also, it's been a while, but maybe get rid of the srun as well for a quick bit.

mbmilligan commented 2 years ago

Hello - is the provided configuration sample the literal content of your config.py or similar?

If so, it appears that you are overriding the SlurmSpawner.batch_script configuration with a template that does not include the various template tags necessary for BatchSpawner to correctly compose the job script. My recommendation would be to not use the batch_script configuration variable at all, and instead set the options you need using BatchSpawner's provided override variables.

Another possible area of concern - in your logs, I see the server call itself http://computenode:52303 while the Hub logs refer to both computenode:52303 and http://inhccne0101:52303. I would investigate if your DNS setup is possibly causing the Hub to look for the server at a different IP address than the server listens on.

pelacables commented 2 years ago

@jbaksta firewall is not a problem, I can reach the remote session from the jupyterhub server using lynx.

I've removed my sbatch script but srun still comes from batchspawner.py . But I still need my own sbatch script because I run everything under environment modules and need to prepare the env a bit, otherwise it does not find any executable. Could you please tell mw how it should look like?

pelacables commented 2 years ago

@mbmilligan the server names were omitted (except for inhccne0101 you've found). Apart from that the config is what I have, yes. as mentioned in my previous message, I need my custom sbatch because I need to load an environment module and make sure my job export ALL the env variables. Any example that is using the template tags correctly? It has been really hard for me to find good config examples.

mbmilligan commented 2 years ago

@pelacables you are right, we would like to have this better documented. Referencing what we have, the key parts are

  1. From the README: https://github.com/jupyterhub/batchspawner/blob/f4659678061dacf5837fb15ce340230dca18f3ca/README.md?plain=1#L50-L53

  2. You then have to look at the source for the spawner class you are using to see the template: https://github.com/jupyterhub/batchspawner/blob/f4659678061dacf5837fb15ce340230dca18f3ca/batchspawner/batchspawner.py#L699-L724

Putting those together the result is that your module load command would be dropped into the SlurmSpawner.req_prologue configurable, and I apologize that there isn't a more natural way to discover that usage. I've created issue #232 to remind us to improve that.

pelacables commented 2 years ago

thanks @mbmilligan . My conf now looks like:

c.ProfilesSpawner.profiles = [
   ("Local server", 'local', 'jupyterhub.spawner.LocalProcessSpawner', {'ip':'0.0.0.0'} ),
   ('Interactive Cluster - 2 cores, 4 GB, 8 hours', 'lab', 'batchspawner.SlurmSpawner',
      dict(req_nprocs='2', req_partition='short',req_prologue='ml load jupyterhub',req_keepvars='ALL')),
]

But I still have the conenction error. In jupyterhub I see:

Cluster job running... waiting to connect
Spawn failed: Server at http://computenode:59992/user/arnau.bria@domain.com/ didn't respond in 300 seconds

(computenode is the real node name and domain.com is company's name)

jupyterhub log:

[I 2022-03-08 13:26:07.226 JupyterHub batchspawner:262] Spawner submitting job using sudo -E -u arnau.bria@domain.com sbatch --parsable
[I 2022-03-08 13:26:07.226 JupyterHub batchspawner:263] Spawner submitted script:
    #!/bin/bash
    #SBATCH --output=/home/bria/jupyterhub_slurmspawner_%j.log
    #SBATCH --job-name=spawner-jupyterhub
    #SBATCH --chdir=/home/bria
    #SBATCH --export=ALL
    #SBATCH --get-user-env=L
    #SBATCH --partition=short
    #SBATCH --cpus-per-task=2

    set -euo pipefail

    trap 'echo SIGTERM received' TERM
    ml load jupyterhub
    which jupyterhub-singleuser
    srun batchspawner-singleuser jupyterhub-singleuser --debug
    echo "jupyterhub-singleuser ended gracefully"

[I 2022-03-08 13:26:07.416 JupyterHub batchspawner:266] Job submitted. cmd: sudo -E -u arnau.bria@domain.com sbatch --parsable output: 13834956
[W 2022-03-08 13:26:08.182 JupyterHub base:188] Rolling back dirty objects IdentitySet([<Server(0.0.0.0:0)>])
[I 2022-03-08 13:26:08.187 JupyterHub log:189] 302 POST /hub/spawn/arnau.bria@domain.com -> /hub/spawn-pending/arnau.bria@domain.com (arnau.bria@domain.com@148.190.212.180) 1007.66ms
[I 2022-03-08 13:26:08.246 JupyterHub pages:405] arnau.bria@domain.com is pending spawn
[W 2022-03-08 13:26:17.182 JupyterHub base:1044] User arnau.bria@domain.com is slow to start (timeout=10)
[I 2022-03-08 13:26:50.275 JupyterHub log:189] 200 POST /hub/api/batchspawner (arnau.bria@domain.com
@148.191.97.65) 14.46ms
[I 2022-03-08 13:26:50.276 JupyterHub batchspawner:419] Notebook server job 13834956 started at computenode:59992
[I 2022-03-08 13:27:05.676 JupyterHub log:189] 200 GET /hub/api (@148.191.97.65) 1.51ms
[I 2022-03-08 13:27:05.777 JupyterHub log:189] 200 POST /hub/api/users/arnau.bria@domain.com/activit
y (arnau.bria@domain.com@148.191.97.65) 64.91ms
13:30:11.894 [ConfigProxy] info: 200 GET /api/routes
[W 2022-03-08 13:31:23.622 JupyterHub user:812] arnau.bria@domain.com's server never showed up at ht
tp://computenode:59992/user/arnau.bria@domain.com/ after 300 seconds. Giving up.

and the job log can be found here

Firewall is not an issue, I can talk to the compute node and from the compute node to the server on each respective ports.

anything I can do to increase the verbosity? already have "c.Spawner.args = ['--debug' ]" in the config file. How can I check the routes? Maybe it is not being added as @jbaksta mentioned

jbaksta commented 2 years ago

I would absolutely double check your proxy routes to make sure it's correct. Your job logs and server logs show same port, so I'd certainly be curious if the route in the proxy is correct.

pelacables commented 2 years ago

@jbaksta you are right, the new route is not added:

# curl -H "Authorization: token $CONFIGPROXY_AUTH_TOKEN" http://127.0.0.1:8001/api/routes
{"/":{"hub":true,"target":"http://jupyterhub_server:8081","jupyterhub":true,"last_activity":"2022-03-09T17:44:15.927Z"}}

what can be the reason? how to debug?

jbaksta commented 2 years ago

Maybe just start with this in your hub configuration file c.ConfigurableHTTPProxy.debug = True and see what comes through your logs. Check Configurable HTTP Proxy directly and what JupyterHub thinks the routes are by using the REST API.

mbmilligan commented 2 years ago

You can increase the logging verbosity of the Hub via the configuration file, like so:

c.Application.log_level = 'DEBUG'

pelacables commented 2 years ago

I do not see JupyterHub adding any route at all. I just asks for the routes all the time:

[D 2022-03-10 14:24:04.192 JupyterHub scopes:301] Authenticated with token <APIToken('b99c...', user='arnau.bria@domain.com', client_id='jup
yterhub')>
[I 2022-03-10 14:24:04.195 JupyterHub log:189] 200 POST /hub/api/batchspawner (arnau.bria@domain.com@laptopIP) 13.31ms
[I 2022-03-10 14:24:04.545 JupyterHub batchspawner:419] Notebook server job 13851975 started at computenode:55969
[D 2022-03-10 14:24:04.552 JupyterHub spawner:1221] Polling subprocess every 30s
[I 2022-03-10 14:24:12.427 JupyterHub log:189] 200 GET /hub/api (@laptopIP) 0.89ms
[D 2022-03-10 14:24:13.424 JupyterHub scopes:301] Authenticated with token <APIToken('b99c...', user='arnau.bria@domain.com', client_id='jup
yterhub')>
[D 2022-03-10 14:24:13.425 JupyterHub scopes:491] Checking access via scope users:activity
[D 2022-03-10 14:24:13.425 JupyterHub scopes:402] Argument-based access to /hub/api/users/arnau.bria@domain.com/activity via users:activity
[D 2022-03-10 14:24:13.427 JupyterHub users:855] Activity for user arnau.bria@domain.com: 2022-03-10T13:24:12.042339Z
[D 2022-03-10 14:24:13.427 JupyterHub users:873] Activity on server arnau.bria@domain.com/: 2022-03-10T13:24:12.042339Z
[I 2022-03-10 14:24:13.431 JupyterHub log:189] 200 POST /hub/api/users/arnau.bria@domain.com/activity (arnau.bria@domain.com@computenodeIP) 9.68ms
[D 2022-03-10 14:24:34.554 JupyterHub batchspawner:288] Spawner querying job: sudo -E -u arnau.bria@domain.com squeue -h -j 13851975 -o '%T
%B'
[D 2022-03-10 14:24:58.861 JupyterHub proxy:821] Proxy: Fetching GET http://127.0.0.1:8001/api/routes
14:24:58.862 [ConfigProxy] info: 200 GET /api/routes
[D 2022-03-10 14:24:58.863 JupyterHub proxy:346] Checking routes
[...]

if I create a local session I do see a new route in configurable http proxy.

who/what is telling jupyterhub to add the route? is it the job once it starts?

[I 2022-03-10 11:05:51.864 SingleUserLabApp serverapp:2581] Jupyter Server 1.13.4 is running at:
[I 2022-03-10 11:05:51.864 SingleUserLabApp serverapp:2581] http://computenode:54949/user/arnau.bria@domain.com/lab
[I 2022-03-10 11:05:51.864 SingleUserLabApp serverapp:2581]  or http://127.0.0.1:54949/user/arnau.bria@domain.com/lab
[I 2022-03-10 11:05:51.864 SingleUserLabApp serverapp:2582] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 2022-03-10 11:05:52.217 SingleUserLabApp mixins:597] Updating Hub with activity every 300 seconds
[D 2022-03-10 11:05:52.217 SingleUserLabApp mixins:559] Notifying Hub of activity 2022-03-10T10:05:50.664183Z

sorry @jbaksta how can I ask jupyterhub about the routes? the link you gave me shows errors and I do not find anything in regards routes in here

jbaksta commented 2 years ago

The routes are added by the hub. In batchspawner, when a job starts, there is a call to make a POST to the rest API which is added by configuring batchspawner in the hub. below is what a snippet should look like when a route is added.

2022-03-10T07:03:04.002757-07:00 jupyterhub-dev: [D 2022-03-10 07:03:04.002 JupyterHub proxy:832] Proxy: Fetching POST http://127.0.0.1:8801/api/routes/dev/user/USER
2022-03-10T07:03:04.003915-07:00 jupyterhub-dev: 07:03:04.003 [ConfigProxy] #033[32minfo#033[39m: Adding route /dev/user/USER -> http://TARGET_HOST:55357
2022-03-10T07:03:04.004342-07:00 jupyterhub-dev: 07:03:04.004 [ConfigProxy] #033[32minfo#033[39m: 201 POST /api/routes/dev/user/USER
2022-03-10T07:03:59.806056-07:00 jupyterhub-dev: [D 2022-03-10 07:03:59.805 JupyterHub proxy:832] Proxy: Fetching DELETE http://127.0.0.1:8801/api/routes/dev/user/USER

All of this being said, I'm still running JupyterHub 1.5.0 and haven't migrated to the JupyterHub 2.0 series. Maybe batchspawner has a bug with JupyterHub 2.0+.

All in all, the basic idea for starting is:

  1. Request job via configured batch script to scheduler
  2. Poll for job start via scheduler/resource manager
  3. Job Starts, single user notebook starts and makes a POST to /hub/api/batchspawner regarding which port it used (can see this in your logs above)
  4. Hub takes port from the POST and adds it to the proxy as a route for the single user server
  5. Hub redirects to the newly added route where the proxy does it's thing

As far as debugging this further, again, inspect the routes inside the configurable-http-proxy itself. Read this. Also, query the routes from JupyterHub. You'll need a token with admin permissions IIRC. I'm not exactly sure that means for JupyterHub 2.0 scopes yet.

If you don't believe it's getting the routes, you might have to use the Python debugger to inspect some state and make sure you see a post to the CHP or add more logging in places, but the messages are usually sufficient.

mbmilligan commented 2 years ago

FWIW I don't think the problem is with the proxy setup. The Hub won't add the proxy route until the notebook/lab server is confirmed running, and that didn't respond in 300 seconds message is telling you that it was unable to get to that point. Specifically, as the last step before finalizing the spawn, the Hub attempts to connect (directly, not via the proxy) to the host:port where the server is running, and that message is generated if this attempt times out. Seeing this message means that either the server died before responding (unlikely, the client log you posted shows it living until the scancel fires), or a networking issue is preventing the Hub from completing that connection.

It does appear that the server is successfully making connections back to the Hub, but it's worthwhile to verify that connections in the other direction work (e.g. in some clusters only outbound connections from compute nodes are allowed). I would suggest testing this: from the Hub node, once the message Notebook server job 13598295 started at computenode:52303 appears, try to manually query it using netcat or curl. Expected behavior would be an HTTP error code response (since you won't have set up the request correctly) but given the above log message you will probably get some variation of connection refused instead.

pelacables commented 2 years ago
# curl  http://inhccne0501:51808/user/arnau.bria@boehringer-ingelheim.com/
curl: (52) Empty reply from server

if I use fqdn it works. There's a proxy that is blocking the connection! (I tried with lynx in the past and, after some errors, it managed to access it, I guess the proxy is adding the domain at some point. That's why I mentined that I could reach the jupyter server from the hub server. My fault on not using curl)

is there a way to tell jupyterhub to use fqdn? because the proxy is set in many places and will be hard to unset it everywhere (And I'm not the admin for the computenodes)

Already tried passing c.JupyterHub.subdomain_host = 'domain.com' but it keeps using nfqdn.

mbmilligan commented 2 years ago

Oh, good, we have a straightforward solution for that. The Hub blindly uses the hostname as reported by the spawner, and BatchSpawner uses the scheduler query mechanism (squeue in the case of Slurm) to get the compute node name. Since this isn't always the canonical reachable name, a mechanism is provided to transform that. Try adding the config line:

c.SlurmSpawner.state_exechost_exp = r'\1.fqdn.com'

pelacables commented 2 years ago

great!! now it's working! thanks a lot @mbmilligan. I close this issue.