Hoeze opened this issue 5 years ago
Hi, I've got a Slurm cluster with Kerberos and I'm running JupyterHub as non-root. My jupyterhub_config file looks like this:

However, I cannot spawn sessions due to an error with parsing the auks output:

Nevertheless, the test-server config works flawlessly.
To me the key line is this:
[E 2019-07-07 03:08:23.061 JupyterHub batchspawner:509] SlurmSpawner unable to parse job ID from text: Sun Jul 7 03:08:22 2019 [INFO4] [euid=5534,pid=5818] auks_krb5_cred: kerberos context successfully initialized
Sun Jul 7 03:08:23 2019 [INFO4] [euid=5534,pid=5818] auks_api: auks cred added using default file
132826
Which tells me that the output from sbatch --parsable
is this (note the embedded newlines):
Sun Jul 7 03:08:22 2019 [INFO4] [euid=5534,pid=5818] auks_krb5_cred: kerberos context successfully initialized
Sun Jul 7 03:08:23 2019 [INFO4] [euid=5534,pid=5818] auks_api: auks cred added using default file
132826
The last line is the job ID. The parse_job_id method tries to reduce this to just 132826, but it fails because the auks/Kerberos integration emits extra messages that it doesn't expect. I could make a change, but it wouldn't be accepted anytime soon, so I recommend that you modify the SlurmSpawner.parse_job_id method to find the job ID on the last line of the output. It would be a useful change to send as a PR, too; I'll do it when I can (next week).
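For example, here is a minimal sketch of that workaround in jupyterhub_config.py, assuming batchspawner's SlurmSpawner and its parse_job_id(output) signature; the subclass name AuksSlurmSpawner is just illustrative:

from batchspawner import SlurmSpawner

class AuksSlurmSpawner(SlurmSpawner):
    def parse_job_id(self, output):
        # auks prints its INFO lines before the job ID, so keep only
        # the last non-empty line and hand that to the stock parser
        last_line = output.strip().splitlines()[-1]
        return super().parse_job_id(last_line)

c.JupyterHub.spawner_class = AuksSlurmSpawner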
Let me know if this helps.
Thanks for the explanation @rkdarst.
You're right, the sbatch --parsable output still contains debug info. I've opened a bug for this issue with SchedMD.
As a workaround, I'll try to ignore everything except the last line of the sbatch --parsable output.
One easy thing we did to circumvent this was to change the default exec_prefix to point at a small wrapper script:
Spawner.exec_prefix = Unicode("/etc/jupyterhub/jupy_sudo {username}").tag(config=True)
and inside the script you run the default sudo command for squeue/scancel and tweak sbatch:
#!/bin/bash
# exec_prefix wrapper: $1 is the target username, $2 the Slurm command
USER=$1
shift
SLURM_BIN=/opt/slurm/bin
CMD=$1
shift
if [ "$CMD" == "sbatch" ]; then
    # re-add the flags we need (in your case also --auks);
    # tail keeps only the last line of the output, i.e. the job ID
    sudo -E -u "$USER" "$SLURM_BIN/sbatch" --parsable | tail -n 1
else
    # squeue/scancel pass through unchanged
    sudo -E -u "$USER" "$SLURM_BIN/$CMD" "$@"
fi
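For reference, a sketch of what the config-file side of this might look like, assuming batchspawner's SlurmSpawner and its exec_prefix trait (the wrapper path is the one from above; batchspawner expands {username} before running the command):

# jupyterhub_config.py (sketch, assuming batchspawner is installed)
c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'
# exec_prefix is prepended to every Slurm command the spawner runs
# (sbatch, squeue, scancel), so the wrapper sees them all
c.SlurmSpawner.exec_prefix = '/etc/jupyterhub/jupy_sudo {username}'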
I don't know about your setup, but the machine where JupyterHub is running may not give users shell access, in which case they shouldn't have any ticket to send anyway; you could just disable auks and let it use the last known good ticket from auksd with --auks=done.
We do this a lot in our deployment, so it seems to be a pattern; the trick is managing these additional custom scripts. We have multiple Slurm clusters, and to reach the "external system" ones we need to load a module (module load esslurm) before running any Slurm commands, so they all get wrapped.