jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems
BSD 3-Clause "New" or "Revised" License
190 stars 134 forks source link

Problem getting SlurmSpawner to work #161

Closed spalkovits closed 5 years ago

spalkovits commented 5 years ago

Hello,

I am reading the issues here now for quite some time but do not find any solution to my problem.

I want to use the SlurmSpawner with a JupyterHub. For testing purposes I installed a clean Ubuntu 18.04 in VirtualBox. The JupyterHub (and the Notebook and Jupyterlab) are installed via conda (Miniconda installed to /opt/conda) form the conda-forge channel. BatchSpawner and ProfileSpawner are the installed via pip from /opt/conda/bin.

Slurm seems to work properly. At least the jobs get queued and run. My jupyter_config.py looks like this:

import batchspawner

c = get_config()

c.JupyterHub.hub_ip = 'bionic'

c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'
c.Spawner.http_timeout = 1200

c.BatchSpawnerBase.req_host = 'bionic'
c.BatchSpawnerBase.req_runtime = '12:00:00'

c.SlurmSpawner.batch_script = '''#!/bin/bash
#SBATCH --partition={partition}
#SBATCH --time={runtime}
#SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
#SBATCH --job-name=jupyterhub-spawner
#SBATCH --cpus-per-task={nprocs}
#SBATCH --workdir=/home/{username}
#SBATCH --mem={memory}
export PATH=/opt/conda/bin:$PATH
{cmd}
'''

c.ProfilesSpawner.profiles = [
     ('Host process', 'local', 'jupyterhub.spawner.LocalProcessSpawner', {'ip':'0.0.0.0'}),
     ('SLURM', 'bionic', 'batchspawner.SlurmSpawner', dict(req_partition='compute', req_nprocs='1', req_runtime='8:00:00', req_memory='800'))
]

The slurm.conf like this

ControlMachine=bionic
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
NodeName=bionic CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=800
PartitionName=compute Nodes=bionic Shared=YES Default=YES MaxTime=INFINITE State=UP

in the JupyterHubs log I find nothing:

[I 2019-09-20 11:16:08.575 JupyterHub app:2120](B Using Authenticator: jupyterhub.auth.PAMAuthenticator-1.0.0
[I 2019-09-20 11:16:08.575 JupyterHub app:2120](B Using Spawner: wrapspawner.wrapspawner.ProfilesSpawner
[I 2019-09-20 11:16:08.581 JupyterHub app:1257](B Loading cookie_secret from /home/palkovits/jupyterhub_cookie_secret
[I 2019-09-20 11:16:08.598 JupyterHub proxy:460](B Generating new CONFIGPROXY_AUTH_TOKEN
[W 2019-09-20 11:16:08.600 JupyterHub app:1532](B No admin users, admin interface will be unavailable.
[W 2019-09-20 11:16:08.600 JupyterHub app:1534](B Add any administrative users to `c.Authenticator.admin_users` in config.
[I 2019-09-20 11:16:08.601 JupyterHub app:1563](B Not using whitelist. Any authenticated user will be allowed.
[I 2019-09-20 11:16:08.639 JupyterHub app:2337](B Hub API listening on http://bionic:8081/hub/
[W 2019-09-20 11:16:08.641 JupyterHub proxy:642](B Running JupyterHub without SSL.  I hope there is SSL termination happening somewhere else...
[I 2019-09-20 11:16:08.641 JupyterHub proxy:645](B Starting proxy @ http://:8000
[I 2019-09-20 11:16:09.249 JupyterHub proxy:319](B Checking routes
[I 2019-09-20 11:16:09.250 JupyterHub proxy:399](B Adding default route for Hub: / => http://bionic:8081
[I 2019-09-20 11:16:09.258 JupyterHub app:2422](B JupyterHub is now running at http://:8000
[W 2019-09-20 11:16:11.479 JupyterHub log:174](B 404 GET /hub/api/users/palkovits/server/progress (palkovits@::ffff:10.0.2.2) 26.63ms
[I 2019-09-20 11:16:13.639 JupyterHub log:174](B 200 GET /hub/home (palkovits@::ffff:10.0.2.2) 58.84ms
[I 2019-09-20 11:16:15.351 JupyterHub log:174](B 200 GET /hub/spawn/palkovits (palkovits@::ffff:10.0.2.2) 19.18ms
[I 2019-09-20 11:16:18.086 JupyterHub batchspawner:188](B Spawner submitting job using sudo -E -u palkovits sbatch --parsable
[I 2019-09-20 11:16:18.087 JupyterHub batchspawner:189](B Spawner submitted script:
    #!/bin/bash
    #SBATCH --partition=compute
    #SBATCH --time=8:00:00
    #SBATCH --output=/home/palkovits/jupyterhub_slurmspawner_%j.log
    #SBATCH --job-name=jupyterhub-spawner
    #SBATCH --cpus-per-task=1
    #SBATCH --workdir=/home/palkovits
    #SBATCH --mem=800
    export PATH=/opt/conda/bin:$PATH
    jupyterhub-singleuser --port=44643

[I 2019-09-20 11:16:18.119 JupyterHub batchspawner:192](B Job submitted. cmd: sudo -E -u palkovits sbatch --parsable output: 15
[I 2019-09-20 11:16:18.668 JupyterHub batchspawner:330](B Notebook server job 15 started at bionic:44643
[I 2019-09-20 11:16:19.271 JupyterHub log:174](B 200 GET /hub/api (@127.0.0.1) 0.75ms
[I 2019-09-20 11:16:19.308 JupyterHub log:174](B 200 POST /hub/api/users/palkovits/activity (palkovits@127.0.0.1) 22.31ms
[W 2019-09-20 11:16:28.025 JupyterHub base:932](B User palkovits is slow to become responsive (timeout=10)
[I 2019-09-20 11:16:28.049 JupyterHub log:174](B 302 POST /hub/spawn/palkovits -> /hub/spawn-pending/palkovits (palkovits@::ffff:10.0.2.2) 10060.63ms
[I 2019-09-20 11:16:28.067 JupyterHub pages:303](B palkovits is pending spawn
[I 2019-09-20 11:16:28.074 JupyterHub log:174](B 200 GET /hub/spawn-pending/palkovits (palkovits@::ffff:10.0.2.2) 16.31ms
[C 2019-09-20 11:17:26.777 JupyterHub app:2448] Received signal SIGINT, initiating shutdown...
[E 2019-09-20 11:17:26.781 JupyterHub web:1788](B Uncaught exception GET /hub/api/users/palkovits/server/progress (::ffff:10.0.2.2)
    HTTPServerRequest(protocol='http', host='127.0.0.1:8000', method='GET', uri='/hub/api/users/palkovits/server/progress', version='HTTP/1.1', remote_ip='::ffff:10.0.2.2')
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.7/site-packages/tornado/web.py", line 1699, in _execute
        result = await result
      File "/opt/conda/lib/python3.7/site-packages/jupyterhub/apihandlers/users.py", line 601, in get
        await asyncio.wait([spawn_future])
      File "/opt/conda/lib/python3.7/asyncio/tasks.py", line 363, in wait
        return await _wait(fs, timeout, return_when, loop)
      File "/opt/conda/lib/python3.7/asyncio/tasks.py", line 456, in _wait
        await waiter
    concurrent.futures._base.CancelledError

[E 2019-09-20 11:17:26.790 JupyterHub web:1194](B Cannot send error response after headers written
[I 2019-09-20 11:17:26.798 JupyterHub log:174](B 200 GET /hub/api/users/palkovits/server/progress (palkovits@::ffff:10.0.2.2) 58664.45ms
[E 2019-09-20 11:17:26.799 JupyterHub user:687](B Unhandled error waiting for palkovits's server to show up at http://bionic:44643/user/palkovits/: 
[I 2019-09-20 11:17:26.817 JupyterHub batchspawner:342](B Stopping server job 15
[I 2019-09-20 11:17:26.818 JupyterHub batchspawner:233](B Cancelling job 15: sudo -E -u palkovits scancel 15
[I 2019-09-20 11:17:27.929 JupyterHub app:2155](B Cleaning up single-user servers...
[I 2019-09-20 11:17:27.929 JupyterHub proxy:705](B Cleaning up proxy[3558]...
[I 2019-09-20 11:17:27.932 JupyterHub app:2187](B ...done

Neither do I in the log of the job:

[I 2019-09-20 11:16:19.265 SingleUserNotebookApp extension:155] JupyterLab extension loaded from /opt/conda/lib/python3.7/site-packages/jupyterlab
[I 2019-09-20 11:16:19.265 SingleUserNotebookApp extension:156] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 2019-09-20 11:16:19.267 SingleUserNotebookApp singleuser:561] Starting jupyterhub-singleuser server version 1.0.0
[I 2019-09-20 11:16:19.281 SingleUserNotebookApp notebookapp:1825] Serving notebooks from local directory: /home/palkovits
[I 2019-09-20 11:16:19.281 SingleUserNotebookApp notebookapp:1825] The Jupyter Notebook is running at:
[I 2019-09-20 11:16:19.281 SingleUserNotebookApp notebookapp:1825] http://127.0.0.1:44643/user/palkovits/
[I 2019-09-20 11:16:19.281 SingleUserNotebookApp notebookapp:1826] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 2019-09-20 11:16:19.284 SingleUserNotebookApp singleuser:542] Updating Hub with activity every 300 seconds
slurmstepd-bionic: error: *** JOB 15 ON bionic CANCELLED AT 2019-09-20T11:17:26 ***
[C 2019-09-20 11:17:26.834 SingleUserNotebookApp notebookapp:1560] received signal 15, stopping
[I 2019-09-20 11:17:26.834 SingleUserNotebookApp notebookapp:1694] Shutting down 0 kernels

I run the JuypterHub right now just interactively and not as a service

Any feedback would be appreciated.

Regards!

rkdarst commented 5 years ago

Hi,

Looking at this I don't see anything immediately wrong or any immediate cause. I do see this:

In the hub log:

[C 2019-09-20 11:17:26.777 JupyterHub app:2448] Received signal SIGINT, initiating shutdown... ... �[32m[I 2019-09-20 11:17:26.817 JupyterHub batchspawner:342]�(B�[m Stopping server job 15

Then in singleuser-server log:

slurmstepd-bionic: error: JOB 15 ON bionic CANCELLED AT 2019-09-20T11:17:26

So I guess the reason the server is stopping is that you interrupted the job, which makes the hub cancel the starting server. Did you do this yourself because the server didn't start? If not, that's the first place to look.

If the single-user server isn't starting fast enough...

in my own server logs, I see this:

...JupyterHub log:174] 200 POST /hub/api/batchspawner (darstr1@10.10.100.65) 11.58ms

which is how I know it communicates with the hub. I don't see the single-user server communicating with the hub, so could that be it? Does hub_connect_url work to connect back?

One thing I do for debugging is put env in the batch script, so I can check all the variables that get passed and see if something is wrong there. some of the JUPYTERHUB_* ones might give you a clue if it can connect back or not, or if something else isn't getting through or is wrong.

rkdarst commented 5 years ago

Wait... which version are you using, latest master or last release? Last release is quite old and might have some some issues with latest hub, it's been so long I don't remember. Everything in my previous message only applies to the latest master.

spalkovits commented 5 years ago

Hello, yes the Hub was intentionally shut down after a while. The versions I took where the following. For the Hub I used the latest release via conda from the the conda-forge channel and for the spawner the latest releases via pip. I think I will first try the hub_connect_url and see what comes out. If anyone has more suggestions I would be happy to try them. Thanks @rkdarst!

spalkovits commented 5 years ago

Hello,

setting the hub_connect_url it does not help. After a while I get a timeout. It seems that the server is not reachable? Here is the log:

[I 2019-09-22 14:56:30.803 JupyterHub app:2120](B Using Authenticator: jupyterhub.auth.PAMAuthenticator-1.0.0
[I 2019-09-22 14:56:30.804 JupyterHub app:2120](B Using Spawner: wrapspawner.wrapspawner.ProfilesSpawner
[I 2019-09-22 14:56:30.810 JupyterHub app:1257](B Loading cookie_secret from /home/palkovits/jupyterhub_cookie_secret
[I 2019-09-22 14:56:30.832 JupyterHub proxy:460](B Generating new CONFIGPROXY_AUTH_TOKEN
[W 2019-09-22 14:56:30.834 JupyterHub app:1532](B No admin users, admin interface will be unavailable.
[W 2019-09-22 14:56:30.835 JupyterHub app:1534](B Add any administrative users to `c.Authenticator.admin_users` in config.
[I 2019-09-22 14:56:30.835 JupyterHub app:1563](B Not using whitelist. Any authenticated user will be allowed.
[I 2019-09-22 14:56:30.883 JupyterHub app:2337](B Hub API listening on http://127.0.0.1:8081/hub/
[W 2019-09-22 14:56:30.884 JupyterHub proxy:642](B Running JupyterHub without SSL.  I hope there is SSL termination happening somewhere else...
[I 2019-09-22 14:56:30.885 JupyterHub proxy:645](B Starting proxy @ http://:8000
[I 2019-09-22 14:56:31.504 JupyterHub proxy:319](B Checking routes
[I 2019-09-22 14:56:31.505 JupyterHub proxy:399](B Adding default route for Hub: / => http://127.0.0.1:8081
[I 2019-09-22 14:56:31.513 JupyterHub app:2422](B JupyterHub is now running at http://:8000
[I 2019-09-22 14:56:34.098 JupyterHub log:174](B 200 GET /hub/home (palkovits@::ffff:10.0.2.2) 86.91ms
[I 2019-09-22 14:56:35.688 JupyterHub log:174](B 200 GET /hub/spawn/palkovits (palkovits@::ffff:10.0.2.2) 24.19ms
[I 2019-09-22 14:56:38.807 JupyterHub batchspawner:188](B Spawner submitting job using sudo -E -u palkovits sbatch --parsable
[I 2019-09-22 14:56:38.807 JupyterHub batchspawner:189](B Spawner submitted script:
    #!/bin/bash
    #SBATCH --partition=compute
    #SBATCH --time=8:00:00
    #SBATCH --output=/home/palkovits/jupyterhub_slurmspawner_%j.log
    #SBATCH --job-name=jupyterhub-spawner
    #SBATCH --cpus-per-task=1
    #SBATCH --workdir=/home/palkovits
    #SBATCH --mem=800
    export PATH=/opt/conda/bin:$PATH
    jupyterhub-singleuser --port=53169

[I 2019-09-22 14:56:38.834 JupyterHub batchspawner:192](B Job submitted. cmd: sudo -E -u palkovits sbatch --parsable output: 17
[I 2019-09-22 14:56:39.897 JupyterHub batchspawner:330](B Notebook server job 17 started at bionic:53169
[I 2019-09-22 14:56:40.203 JupyterHub log:174](B 200 GET /hub/api (@127.0.0.1) 0.94ms
[I 2019-09-22 14:56:40.235 JupyterHub log:174](B 200 POST /hub/api/users/palkovits/activity (palkovits@127.0.0.1) 24.62ms
[W 2019-09-22 14:56:48.751 JupyterHub base:932](B User palkovits is slow to become responsive (timeout=10)
[I 2019-09-22 14:56:48.784 JupyterHub log:174](B 302 POST /hub/spawn/palkovits -> /hub/spawn-pending/palkovits (palkovits@::ffff:10.0.2.2) 10068.71ms
[I 2019-09-22 14:56:48.798 JupyterHub pages:303](B palkovits is pending spawn
[I 2019-09-22 14:56:48.808 JupyterHub log:174](B 200 GET /hub/spawn-pending/palkovits (palkovits@::ffff:10.0.2.2) 18.55ms
[I 2019-09-22 15:01:31.521 JupyterHub proxy:319](B Checking routes
[I 2019-09-22 15:01:40.933 JupyterHub log:174](B 200 POST /hub/api/users/palkovits/activity (palkovits@127.0.0.1) 14.01ms
[I 2019-09-22 15:06:11.162 JupyterHub log:174](B 200 POST /hub/api/users/palkovits/activity (palkovits@127.0.0.1) 17.56ms
[I 2019-09-22 15:06:31.515 JupyterHub proxy:319](B Checking routes
[I 2019-09-22 15:11:13.126 JupyterHub log:174](B 200 POST /hub/api/users/palkovits/activity (palkovits@127.0.0.1) 17.98ms
[I 2019-09-22 15:11:31.518 JupyterHub proxy:319](B Checking routes
[I 2019-09-22 15:15:51.924 JupyterHub log:174](B 200 POST /hub/api/users/palkovits/activity (palkovits@127.0.0.1) 16.99ms
[W 2019-09-22 15:16:28.000 JupyterHub user:678](B palkovits's server never showed up at http://bionic:53169/user/palkovits/ after 1200 seconds. Giving up
[I 2019-09-22 15:16:28.029 JupyterHub batchspawner:342](B Stopping server job 17
[I 2019-09-22 15:16:28.030 JupyterHub batchspawner:233](B Cancelling job 17: sudo -E -u palkovits scancel 17
[E 2019-09-22 15:16:30.144 JupyterHub gen:599](B Exception in Future <Task finished coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /opt/conda/lib/python3.7/site-packages/jupyterhub/handlers/base.py:800> exception=TimeoutError("Server at http://bionic:53169/user/palkovits/ didn't respond in 1200 seconds")> after timeout
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 593, in error_callback
        future.result()
      File "/opt/conda/lib/python3.7/site-packages/jupyterhub/handlers/base.py", line 807, in finish_user_spawn
        await spawn_future
      File "/opt/conda/lib/python3.7/site-packages/jupyterhub/user.py", line 654, in spawn
        await self._wait_up(spawner)
      File "/opt/conda/lib/python3.7/site-packages/jupyterhub/user.py", line 701, in _wait_up
        raise e
      File "/opt/conda/lib/python3.7/site-packages/jupyterhub/user.py", line 669, in _wait_up
        http=True, timeout=spawner.http_timeout, ssl_context=ssl_context
      File "/opt/conda/lib/python3.7/site-packages/jupyterhub/utils.py", line 234, in wait_for_http_server
        timeout=timeout,
      File "/opt/conda/lib/python3.7/site-packages/jupyterhub/utils.py", line 177, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: Server at http://bionic:53169/user/palkovits/ didn't respond in 1200 seconds

[I 2019-09-22 15:16:30.165 JupyterHub log:174](B 200 GET /hub/api/users/palkovits/server/progress (palkovits@::ffff:10.0.2.2) 1181208.19ms
[I 2019-09-22 15:16:31.519 JupyterHub proxy:319](B Checking routes
[C 2019-09-22 15:18:16.413 JupyterHub app:2448] Received signal SIGINT, initiating shutdown...
[I 2019-09-22 15:18:16.417 JupyterHub app:2155](B Cleaning up single-user servers...
[I 2019-09-22 15:18:16.420 JupyterHub proxy:705](B Cleaning up proxy[2032]...
[I 2019-09-22 15:18:16.426 JupyterHub app:2187](B ...done

Regards...

rkdarst commented 5 years ago

Just wondering how you were able to solve it... always useful for the future.

spalkovits commented 5 years ago

Sorry I just hit the wrong button. My fault. Its still not working.

rkdarst commented 5 years ago

Try the latest batchspawner + wrapspawner from git. At least latest wrapspawner is needed. The last batchspawner release is probably out of date too, enough where I wouldn't trust it and don't remember what may have changed. Releases don't effectively happen yet.

The new batchspawner should use a singleuser command "batchspawner-singleuser" instead of "jupyterhub-singleuser" (this should be automatic). Just a way to see if newer batchspawner is doing what it should, but it didn't exist at the time of the last release.

spalkovits commented 5 years ago

Hello, the latest version from git worked. My VM behaves now like it should. On my production system I had to add the solution from this issue https://github.com/jupyterhub/jupyterhub/issues/774#issuecomment-249420931 Thank you for now, I will close this for the time being.