jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems
BSD 3-Clause "New" or "Revised" License

Error: Unable to parse job ID from text #148

Open Hoeze opened 5 years ago

Hoeze commented 5 years ago

Hi, I've got a slurm cluster with Kerberos and I'm running jupyterhub as non-root. My jupyterhub_config file looks like this:

c.JupyterHub.authenticator_class = 'jupyterhub.auth.PAMAuthenticator'
c.PAMAuthenticator.open_sessions = False

#------------------------------------------------------------------------------
# Spawner config
#------------------------------------------------------------------------------
# We're using SLURM:
# - use 'batchspawner.SlurmSpawner' as spawner class
# - use 'wrapspawner.ProfilesSpawner' to provide multiple spawner profiles which
#   override the default values
# 

c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'
#c.Spawner.http_timeout = 120
c.BatchSpawnerBase.req_nprocs = '2'
#c.BatchSpawnerBase.req_queue = 'jupyter'
#c.BatchSpawnerBase.req_host = 'mesabi.xyz.edu'
c.BatchSpawnerBase.req_runtime = '12:00:00'
c.BatchSpawnerBase.req_memory = '12gb'

# in slurm queues are called 'partitions'
c.SlurmSpawner.req_partition = 'jupyter'

c.SlurmSpawner.start_timeout = 240

c.SlurmSpawner.batch_script = '''#!/bin/bash
#SBATCH --partition={partition}
#SBATCH --time={runtime}
#SBATCH --output={homedir}/logs/jupyterhub_slurmspawner_%j.log
#SBATCH --job-name=jupyterhub-spawner
#SBATCH --cpus-per-task={nprocs}
#SBATCH --workdir={homedir}
#SBATCH --mem={memory}
#SBATCH {options}

# load necessary modules
module load anaconda/3-2018.12

# print environment
env

{cmd}
'''

c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'
#c.Spawner.http_timeout = 120

c.ProfilesSpawner.profiles = [
  (
    'SLURM CPU only node - 2 core, 12 GB, 48 hours',
    'slurm-2cpu',
    'batchspawner.SlurmSpawner',
    dict(
      req_nprocs='2',
      req_partition='slurm-cpu',
      req_runtime='48:00:00',
      req_memory='12000'
    )
  ),
...
  (
    "Test server",
    'local-test',
    'jupyterhub.spawner.LocalProcessSpawner',
    {
      'ip':'0.0.0.0'
    }
  )
 ]

However, I cannot spawn sessions due to an error when parsing the sbatch output (which includes auks messages):

[I 2019-07-07 03:08:22.789 JupyterHub batchspawner:188] Spawner submitting job using sudo -E -u <username> sbatch --parsable
[I 2019-07-07 03:08:22.789 JupyterHub batchspawner:189] Spawner submitted script:
#!/bin/bash
#SBATCH --partition=slurm-cpu
#SBATCH --time=48:00:00
#SBATCH --output=/home/<username>/logs/jupyterhub_slurmspawner_%j.log
#SBATCH --job-name=jupyterhub-spawner
#SBATCH --cpus-per-task=2
#SBATCH --workdir=/home/<username>
#SBATCH --mem=12000
#SBATCH
####SBATCH --uid=<username>
# load necessary modules
module load anaconda/3-2018.12
# print environment
env
jupyterhub-singleuser --port=45147 --NotebookApp.default_url=/lab
Sun Jul  7 03:08:22 2019 [INFO4] [euid=5534,pid=5818] auks_krb5_cred: kerberos authentication context successfully initialized
Sun Jul  7 03:08:23 2019 [INFO4] [euid=5534,pid=5818] auks_api: auks cred added using default file
132826
[E 2019-07-07 03:08:23.061 JupyterHub batchspawner:509] SlurmSpawner unable to parse job ID from text: Sun Jul  7 03:08:22 2019 [INFO4] [euid=5534,pid=5818] auks_krb5_cred: kerberos context successfully initialized
Sun Jul  7 03:08:23 2019 [INFO4] [euid=5534,pid=5818] auks_api: auks cred added using default file
132826
[E 2019-07-07 03:08:23.061 JupyterHub batchspawner:195] Job submission failed with exit code Sun Jul  7 03:08:22 2019 [INFO4] [euid=5534,pid=5818] auks_krb5_cred: kerberos context successfully initialized
Sun Jul  7 03:08:23 2019 [INFO4] [euid=5534,pid=5818] auks_api: auks cred added using default file
132826
[E 2019-07-07 03:08:23.062 JupyterHub user:626] Unhandled error starting <username>'s server:
[E 2019-07-07 03:08:23.081 JupyterHub pages:209] Failed to spawn single-user server with form
Traceback (most recent call last):
File "/opt/modules/anaconda/3-2018.12/lib/python3.7/site-packages/jupyterhub/handlers/pages.py", line 206, in post
await self.spawn_single_user(user, server_name=server_name, options=options)
File "/opt/modules/anaconda/3-2018.12/lib/python3.7/site-packages/jupyterhub/handlers/base.py", line 889, in spawn_single_user
timedelta(seconds=self.slow_spawn_timeout), finish_spawn_future
File "/opt/modules/anaconda/3-2018.12/lib/python3.7/site-packages/jupyterhub/handlers/base.py", line 807, in finish_user_spawn
await spawn_future
File "/opt/modules/anaconda/3-2018.12/lib/python3.7/site-packages/jupyterhub/user.py", line 642, in spawn
raise e
File "/opt/modules/anaconda/3-2018.12/lib/python3.7/site-packages/jupyterhub/user.py", line 546, in spawn
url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
File "/opt/modules/anaconda/3-2018.12/lib/python3.7/site-packages/batchspawner/batchspawner.py", line 309, in start
assert len(self.job_id) > 0
AssertionError

The local test-server profile, on the other hand, works flawlessly.

rkdarst commented 5 years ago

To me the key line is this:

[E 2019-07-07 03:08:23.061 JupyterHub batchspawner:509] SlurmSpawner unable to parse job ID from text: Sun Jul  7 03:08:22 2019 [INFO4] [euid=5534,pid=5818] auks_krb5_cred: kerberos context successfully initialized
Sun Jul  7 03:08:23 2019 [INFO4] [euid=5534,pid=5818] auks_api: auks cred added using default file
132826

Which tells me that the output from sbatch --parsable is this (note the embedded newlines):

Sun Jul  7 03:08:22 2019 [INFO4] [euid=5534,pid=5818] auks_krb5_cred: kerberos context successfully initialized
Sun Jul  7 03:08:23 2019 [INFO4] [euid=5534,pid=5818] auks_api: auks cred added using default file
132826

The last line is the job ID. The parse_job_id method tries to reduce this to just 132826 but fails because the Kerberos (auks) integration prints extra messages that the parser doesn't expect.

I could make a change, but it wouldn't be released anytime soon, so I recommend that you override the SlurmSpawner.parse_job_id method to find the job ID on the last line of the output. It would be a useful change to send as a PR, too; I'll do it when I can (next week).
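A minimal sketch of that last-line parsing (the helper name here is hypothetical; in practice the same logic would go into an overridden parse_job_id on a SlurmSpawner subclass):

```python
import re

def parse_job_id_last_line(output: str) -> str:
    """Hypothetical helper: keep only the last non-empty line of the
    sbatch --parsable output, where the numeric job ID is printed,
    ignoring any auks/Kerberos messages above it."""
    last_line = output.strip().splitlines()[-1]
    match = re.match(r"(\d+)", last_line)
    if match is None:
        raise RuntimeError("unable to parse job ID from: %r" % last_line)
    return match.group(1)
```

In a subclass, the overridden parse_job_id could strip the output down to its last line like this before delegating to the parent implementation.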

Let me know if this helps.

Hoeze commented 5 years ago

Thanks for the explanation @rkdarst.

You're right, the sbatch --parsable output still contains debug info. I've opened a bug for that issue with SchedMD.

As a workaround, I'll try to ignore everything except the last line of the sbatch --parsable output.

qlux commented 4 years ago

One easy thing we did to circumvent this was to change the default exec_prefix for a small wrapper script:

Spawner.exec_prefix = Unicode("/etc/jupyterhub/jupy_sudo {username}").tag(config=True)

and inside the script you run the default sudo command for squeue/scancel unchanged and tweak only sbatch:

#!/bin/bash
USER=$1
shift
SLURM_BIN=/opt/slurm/bin
CMD=$1
shift
if [ "$CMD" == "sbatch" ]; then
        # in your case, also add --auks and keep only the last line (the job ID):
        #   sudo -E -u "$USER" "$SLURM_BIN"/sbatch --parsable --auks | tail -n 1
        sudo -E -u "$USER" "$SLURM_BIN"/sbatch --parsable
else
        # squeue/scancel: pass through unchanged
        sudo -E -u "$USER" "$SLURM_BIN/$CMD" "$@"
fi

I don't know about your setup, but the machine where JupyterHub is running may not have a shell accessible to users, so they should not have any ticket to send. In that case you could just disable auks and let it use the last known-good ticket from auksd with --auks=done.
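The wrapper can also be wired in purely through configuration rather than patching the spawner source (a sketch, assuming the script is installed at /etc/jupyterhub/jupy_sudo and is executable by the Hub user):

```python
# jupyterhub_config.py: point batchspawner's exec_prefix at the wrapper.
# batchspawner expands {username} before prepending this to sbatch/squeue/scancel.
c.SlurmSpawner.exec_prefix = '/etc/jupyterhub/jupy_sudo {username}'
```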

rcthomas commented 4 years ago

We do this a lot in our deployment, so it seems to be a pattern; the trick is managing these additional custom scripts. We have multiple Slurm clusters, and to reach the "external system" ones we need to load a module (module load esslurm) before running any Slurm commands, so they all get wrapped.