jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems
BSD 3-Clause "New" or "Revised" License

Setting up JupyterHub with HTCondor #125

Closed nifuki closed 6 years ago

nifuki commented 6 years ago

Hi,

I'm trying to make batchspawner work with HTCondor but I'm stuck with the following error:

[I 2018-11-09 13:35:39.816 JupyterHub batchspawner:242] Spawner submitting job using sudo -i -u testuser condor_submit
[I 2018-11-09 13:35:39.816 JupyterHub batchspawner:243] Spawner submitted script:

    Executable = /bin/sh
    RequestMemory = 4gb
    RequestCpus = 1
    Arguments = "-c 'exec batchspawner-singleuser --ip=""0.0.0.0""'"
    Remote_Initialdir = /home/testuser
    Output = /home/testuser/.jupyterhub.condor.out
    Error = /home/testuser/.jupyterhub.condor.err
    ShouldTransferFiles = False
    GetEnv = True
    Universe = vanilla
    Queue

[I 2018-11-09 13:35:40.119 JupyterHub batchspawner:246] Job submitted. cmd: sudo -i -u testuser condor_submit output: Submitting job(s).
    1 job(s) submitted to cluster 19.
[D 2018-11-09 13:35:40.120 JupyterHub batchspawner:269] Spawner querying job: sudo -i -u testuser condor_q 19 -format "%s, " JobStatus -format "%s, " RemoteHost -format "
    " True
[E 2018-11-09 13:35:40.356 JupyterHub batchspawner:215] Subprocess returned exitcode 1
[E 2018-11-09 13:35:40.357 JupyterHub batchspawner:216] Stdout:
[E 2018-11-09 13:35:40.357 JupyterHub batchspawner:217] b''
[E 2018-11-09 13:35:40.357 JupyterHub batchspawner:218] Stderr:
[E 2018-11-09 13:35:40.357 JupyterHub batchspawner:219] Error: -format requires format and attribute parameters
[E 2018-11-09 13:35:40.357 JupyterHub batchspawner:274] Error querying job 19
[W 2018-11-09 13:35:40.358 JupyterHub batchspawner:372] Job  neither pending nor running.

[E 2018-11-09 13:35:40.359 JupyterHub user:477] Unhandled error starting testuser's server: The Jupyter batch job has disappeared while pending in the queue or died immediately after starting.
[D 2018-11-09 13:35:40.373 JupyterHub user:578] Deleting oauth client jupyterhub-user-testuser
[E 2018-11-09 13:35:40.410 JupyterHub web:1670] Uncaught exception GET /hub/user/testuser/ (159.93.40.25)
    HTTPServerRequest(protocol='http', host='jupyterhub.jinr.ru', method='GET', uri='/hub/user/testuser/', version='HTTP/1.1', remote_ip='159.93.40.25')
    Traceback (most recent call last):
      File "/usr/share/anaconda3/lib/python3.7/site-packages/tornado/web.py", line 1592, in _execute
        result = yield result
      File "/usr/share/anaconda3/lib/python3.7/site-packages/jupyterhub/handlers/base.py", line 1052, in get
        await self.spawn_single_user(user)
      File "/usr/share/anaconda3/lib/python3.7/site-packages/jupyterhub/handlers/base.py", line 705, in spawn_single_user
        timedelta(seconds=self.slow_spawn_timeout), finish_spawn_future
      File "/usr/share/anaconda3/lib/python3.7/site-packages/jupyterhub/handlers/base.py", line 626, in finish_user_spawn
        await spawn_future
      File "/usr/share/anaconda3/lib/python3.7/site-packages/jupyterhub/user.py", line 489, in spawn
        raise e
      File "/usr/share/anaconda3/lib/python3.7/site-packages/jupyterhub/user.py", line 409, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
      File "/usr/share/anaconda3/lib/python3.7/site-packages/batchspawner/batchspawner.py", line 373, in start
        raise RuntimeError('The Jupyter batch job has disappeared'
    RuntimeError: The Jupyter batch job has disappeared while pending in the queue or died immediately after starting.

The condor_q command succeeds if run manually:

# sudo -i -u testuser condor_q 19 -format "%s, " JobStatus -format "%s, " RemoteHost -format "\n" True
1,

# echo $?
0
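
For reference, the `1,` printed above is the job's JobStatus followed by an empty RemoteHost. In HTCondor, status 1 means Idle (queued, not yet running) and 2 means Running. Below is a minimal Python sketch of decoding such a line, assuming the comma-separated layout produced by the -format arguments above. The function name `parse_condor_q` and the example host are hypothetical, and this is not batchspawner's own parsing code.

```python
# HTCondor JobStatus codes and their meanings.
JOB_STATUS = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
    6: "Transferring Output",
    7: "Suspended",
}


def parse_condor_q(line):
    """Split 'JobStatus, RemoteHost, ' output into (status name, host or None)."""
    fields = [f.strip() for f in line.split(",")]
    if not fields or not fields[0].isdigit():
        return ("Unknown", None)
    status = JOB_STATUS.get(int(fields[0]), "Unknown")
    # RemoteHost is empty until the job has been matched to a slot.
    host = fields[1] if len(fields) > 1 and fields[1] else None
    return (status, host)


print(parse_condor_q("1,"))
```

So the manual query reports that job 19 is still queued, which is a normal state right after submission.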

I'm using the latest batchspawner (from the master branch):

# pip list |grep batchspawner
batchspawner                       0.9.0.dev0

And the spawner configuration:

c.JupyterHub.spawner_class = 'batchspawner.CondorSpawner'
c.Spawner.http_timeout = 120

c.BatchSpawnerBase.req_nprocs = '1'
c.BatchSpawnerBase.req_memory = '1gb'
c.BatchSpawnerBase.req_runtime = '12:00:00'

c.CondorSpawner.exec_prefix = 'sudo -i -u {username}'

What could be causing this error?

Thanks

nifuki commented 6 years ago

I think I figured it out: the error is caused by the trailing -format "\n" True in the query command. After removing it, jobs are monitored correctly.
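
The debug log above shows the query command broken across two lines, which suggests that the `"\n"` reaches the shell as a real newline character rather than as a backslash followed by `n`. Here is a small Python sketch of that difference, assuming the query template is an ordinary Python string. The constant and function names are hypothetical. In batchspawner the template is presumably the `batch_query_cmd` option, but that is an assumption here, not something confirmed in this thread.

```python
import shlex

# Query template as it appears in the log. Because this is a normal (non-raw)
# Python string, "\n" becomes a real newline character.
QUERY_WITH_NEWLINE = (
    'condor_q {job_id} -format "%s, " JobStatus '
    '-format "%s, " RemoteHost -format "\n" True'
)

# The same template with the trailing -format "\n" True removed.
QUERY_WITHOUT_NEWLINE = (
    'condor_q {job_id} -format "%s, " JobStatus -format "%s, " RemoteHost'
)


def build_args(template, job_id):
    """Fill in the job id and split the command the way a POSIX shell would."""
    return shlex.split(template.format(job_id=job_id))


with_nl = build_args(QUERY_WITH_NEWLINE, 19)
without_nl = build_args(QUERY_WITHOUT_NEWLINE, 19)

print(with_nl)      # one argument is a bare newline
print(without_nl)   # no argument contains a newline
```

How a bare newline argument is handled once the command goes through `sudo -i` and a login shell is not shown here. The sketch only illustrates why the trimmed command avoids the question entirely.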

nifuki commented 6 years ago

Still struggling to make it work: the job is now submitted, but the server fails to start. Here is the command that is executed on the node: /bin/sh -c exec' '/usr/share/miniconda3/bin/batchspawner-singleuser' '--ip="0.0.0.0"

And in .jupyterhub.condor.err I only see: JUPYTERHUB_API_TOKEN env is required to run jupyterhub-singleuser. Did you launch it manually?

Maybe the environment is not set properly, but I can't figure out how to set this up.
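
One way to check this is to run a tiny diagnostic as the batch job instead of the notebook server and look at what it writes to the job's output file. The following is a minimal sketch, assuming only that the hub-provided variables start with `JUPYTERHUB_`. The function name `hub_env_report` is hypothetical, and the script deliberately prints variable names only, never the token value.

```python
import os

# The only variable named in the error message above.
REQUIRED = ("JUPYTERHUB_API_TOKEN",)


def hub_env_report(environ):
    """Return (names of JUPYTERHUB_* variables present, required names missing)."""
    present = sorted(k for k in environ if k.startswith("JUPYTERHUB_"))
    missing = [k for k in REQUIRED if not environ.get(k)]
    return present, missing


if __name__ == "__main__":
    present, missing = hub_env_report(os.environ)
    print("JUPYTERHUB_* variables present:", ", ".join(present) or "(none)")
    print("required but missing:", ", ".join(missing) or "(none)")
```

If the token is listed as missing on the execute node, the variables were dropped somewhere between the hub process and the job.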

Any help appreciated.

Thanks.

nifuki commented 6 years ago

Found the problem: sudo wasn't passing the environment variables. I changed the exec_prefix to sudo -E -u {username} and it now works. So, I'm closing the issue.
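
For anyone comparing the two settings: `sudo -i` starts a login shell with a freshly initialised environment, while `sudo -E` asks sudo to preserve the caller's environment (subject to the sudoers policy). The sketch below shows how the prefix template expands into the command seen in the log. It is a simplification with a hypothetical helper name, not batchspawner's actual code.

```python
def render_command(exec_prefix, username, batch_cmd):
    """Expand {username} in the prefix and prepend it to the batch command."""
    return " ".join([exec_prefix.format(username=username), batch_cmd])


# Original setting: login shell, hub environment is not carried over.
print(render_command("sudo -i -u {username}", "testuser", "condor_submit"))

# Working setting: the hub's environment, including JUPYTERHUB_API_TOKEN, is preserved.
print(render_command("sudo -E -u {username}", "testuser", "condor_submit"))
```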

mbmilligan commented 6 years ago

Ok, thanks for the followup. Quick question: had you changed the exec_prefix in your configuration file? sudo -E is supposed to be in the default prefix, so it would be helpful to know if that default setting is getting broken somehow.

Thanks!

loadnabox commented 5 years ago

I know I'm a bit late to this party, but I'm having a very similar issue.

I'm trying to run a centralized JupyterHub server for all users, so I'm executing it as root. Our environment very carefully sets environment variables on login. If those variables are changed, jobs do not run properly.

So I'm stuck: using -E passes the JUPYTERHUB_API_TOKEN env properly, but nothing will run (including batchspawner-singleuser) because the commands that load the needed packages and modules are broken by the overridden env.

If I use -i, the packages and modules load properly, but the JUPYTERHUB_API_TOKEN env is no longer passed, and the compute node fails to connect to or register with JupyterHub.

Advice would be greatly appreciated

heavenkong commented 5 years ago

I have the same issue. How did you solve this problem?