jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems
BSD 3-Clause "New" or "Revised" License

Running a singularity image container with SLURM and batchspawner #243

Open geninv opened 2 years ago

geninv commented 2 years ago

Bug description

We are trying to launch a Singularity image container with SLURM. JupyterHub is installed in a virtual machine and launches the Singularity image containing JupyterLab in a job. The SLURM job is launched correctly, but it encounters an error before the process is created inside the job.
From what we can read in the logs, it seems that batchspawner expects a Python script to launch, but the generated command line uses the singularity binary.

One thing to note is that batchspawner worked with Singularity in version 0.8.2 but does not in version 1.1.0. We think this is because the batchspawner wrapper expects a Python script. Do you think it could work if we wrapped the call to the singularity binary in a Python script? Or is there some other way to make them work together?
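
For reference, the wrapper idea could look roughly like the sketch below. It is untested against batchspawner 1.1.0. It assumes that `batchspawner-singleuser` resolves the first command word on PATH, loads it as Python, and passes the remaining arguments (such as the port) through `sys.argv`. The image path, the `build_command` helper, and the fallback message are placeholders for illustration, not anything batchspawner provides.

```python
#!/usr/bin/env python
"""Hypothetical Python wrapper that hands control to `singularity run`."""
import os
import shutil
import sys

# Placeholders (assumptions): adjust to the local installation.
SINGULARITY_BIN = "singularity"
IMAGE = "/path/to/image.simg"


def build_command(extra_args):
    """Return the argv list for `singularity run`, appending the caller's arguments."""
    return [SINGULARITY_BIN, "run", "--nv", IMAGE, *extra_args]


if __name__ == "__main__":
    cmd = build_command(sys.argv[1:])
    if shutil.which(cmd[0]):
        # Replace the current Python process with singularity.
        os.execvp(cmd[0], cmd)
    else:
        # Binary not on PATH: report what would have been executed.
        print("singularity not found; would have run:", " ".join(cmd), file=sys.stderr)
```

The wrapper would then be the command the spawner launches instead of the singularity binary itself, so that the file batchspawner loads is valid Python.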

Expected behaviour

The job is launched and we get access to the jupyterlab inside the singularity image.

Actual behaviour

The job encounters an error. We get a Python error in the SLURM logs:

Traceback (most recent call last):
  File "/softs/rh7/conda-envs/pangeo_latest/bin/batchspawner-singleuser", line 6, in <module>
    main()
  File "/softs/rh7/conda-envs/pangeo_202202/lib/python3.9/site-packages/batchspawner/singleuser.py", line 23, in main
    run_path(cmd_path, run_name="__main__")
  File "/softs/rh7/conda-envs/pangeo_202202/lib/python3.9/runpy.py", line 269, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/softs/rh7/conda-envs/pangeo_202202/lib/python3.9/runpy.py", line 244, in _get_code_from_file
    code = compile(f.read(), fname, 'exec')
ValueError: source code string cannot contain null bytes
srun: error: node539: task 0: Exited with exit code 1
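
The traceback is consistent with `runpy.run_path` being handed a compiled binary rather than a Python source file. A minimal sketch that reproduces the same kind of failure without Singularity is shown below; the `try_run_path` helper and the fake ELF-like payload are made up for illustration. Depending on the Python version, the exception is a `ValueError` or a `SyntaxError`.

```python
import os
import runpy
import tempfile


def try_run_path(payload: bytes) -> str:
    """Write payload to a temp file, pass it to runpy.run_path, and report the outcome."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        try:
            runpy.run_path(path, run_name="__main__")
        except (ValueError, SyntaxError) as exc:
            return f"{type(exc).__name__}: {exc}"
        return "ran as Python"
    finally:
        os.remove(path)


# Plain Python source is compiled and executed normally.
print(try_run_path(b"x = 1\n"))
# Binary-looking content with null bytes fails at the compile() step.
print(try_run_path(b"\x7fELF" + b"\x00" * 8))
```

The second call should report an error mentioning null bytes, which matches the last frame of the traceback above.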

How to reproduce

Request a job running a singularity image using batchspawner.

Configuration

```python
# jupyterhub_config.py
c.JupyterHub.authenticator_class = 'ldapauthenticator.LDAPAuthenticator'
c.JupyterHub.bind_url = 'http://127.0.0.1:8000'
c.JupyterHub.cleanup_servers = False
c.JupyterHub.db_url = 'sqlite:////etc/jupyterhub/jupyterhub.sqlite'
c.JupyterHub.hub_ip = '0.0.0.0'
c.JupyterHub.hub_port =
import batchspawner
c.JupyterHub.spawner_class = 'wrapspawner.ProfilesCmdSpawner'
c.Spawner.http_timeout = 120
c.BatchSpawnerBase.req_nprocs = '1'
c.BatchSpawnerBase.req_runtime = '12:00:00'
c.BatchSpawnerBase.req_memory = '4000mb'
c.BatchSpawnerBase.req_prologue = '''
source ~/.bashrc
export JUPYTER_PATH=$JUPYTER_PATH:/softs/rh7/jupyter_kernels/
export PS1='hub-[\\u@\\h \\W]\\$'
module load latex
echo "INFO | Using default notebook env : pangeo_latest"
module load conda
conda activate /softs/rh7/conda-envs/pangeo_latest
unset PKG_CONFIG_PATH
unset PYTHONPATH
'''
c.BatchSpawnerBase.req_queue = 'qdev'
c.BatchSpawnerBase.exec_prefix = 'sudo -E -u {username} env PATH=$PATH'
c.SlurmSpawner.batch_script = '''#!/bin/sh
#SBATCH --output={{homedir}}/jupyterhub_slurmspawner_%j.log
#SBATCH --job-name=spawner-jupyterhub
#SBATCH --chdir={{homedir}}
#SBATCH --export=ALL
#SBATCH --get-user-env=L
{% if partition %}#SBATCH --partition={{partition}}
{% endif %}{% if runtime %}#SBATCH --time={{runtime}}
{% endif %}{% if memory %}#SBATCH --mem={{memory}}
{% endif %}{% if gres %}#SBATCH --gres={{gres}}
{% endif %}{% if nprocs %}#SBATCH --cpus-per-task={{nprocs}}
{% endif %}{% if reservation%}#SBATCH --reservation={{reservation}}
{% endif %}{% if options %}#SBATCH {{options}}{% endif %}

trap 'echo SIGTERM received' TERM
{{prologue}}
{% if srun %}srun {% endif %}{{cmd}}
echo "jupyterhub-singleuser ended gracefully"
{{epilogue}}
'''
c.ProfilesSpawner.profiles = [
    ('Standard (visu) - 1 core, 5 GB, 1 week -- Default', 'qnotebook1c5g', 'batchspawner.SlurmSpawner',
        dict(req_nprocs='1', req_queue='qnotebook', req_runtime='168:00:00', req_memory='5GB')),
    ('Standard (visu) - 4 cores, 20 GB, 1 week', 'qnotebook4c20g', 'batchspawner.SlurmSpawner',
        dict(req_nprocs='4', req_queue='qnotebook', req_runtime='168:00:00', req_memory='20GB')),
    ('Qdev - 1 cores, 4 GB, 12 hours', 'qdev1c4g', 'batchspawner.SlurmSpawner',
        dict(req_nprocs='1', req_queue='qdev', req_memory='4GB')),
    ('Qdev - 4 cores, 15 GB, 12 hours', 'qdev4c15g', 'batchspawner.SlurmSpawner',
        dict(req_nprocs='4', req_queue='qdev', req_memory='15GB')),
    ('Qdev full node - 16 cores, 60GB', 'qdevfull', 'batchspawner.SlurmSpawner',
        dict(req_nprocs='16', req_queue='qdev', req_memory='60GB')),
    ('Batch - 1 cores, 5 GB, 12 hours', 'batch1c5g12h', 'batchspawner.SlurmSpawner',
        dict(req_nprocs='1', req_queue='batch', req_runtime='12:00:00', req_memory='5GB')),
    ('Batch - 1 cores, 5 GB, 72 hours', 'batch1c5g12h', 'batchspawner.SlurmSpawner',
        dict(req_nprocs='1', req_queue='batch', req_runtime='72:00:00', req_memory='5GB')),
    ('Batch - 4 cores, 20 GB, 12 hours', 'batch4c20g12h', 'batchspawner.SlurmSpawner',
        dict(req_nprocs='4', req_queue='batch', req_runtime='12:00:00', req_memory='20GB')),
    ('Batch full node - 24 cores, 120 GB, 12 hours', 'batchfull12h', 'batchspawner.SlurmSpawner',
        dict(req_nprocs='24', req_queue='batch', req_runtime='12:00:00', req_memory='120GB')),
    ('Batch 2019 full node - 40 cores, 184 GB, 12 hours', 'batch2019full12h', 'batchspawner.SlurmSpawner',
        dict(req_nprocs='40', req_queue='batch', req_runtime='12:00:00', req_memory='184GB')),
    ('GPGPU - 1 gpgpu T4 -- Default to use for GPU, 8 cores, 92 GB, 4 hours', 'gpu4h', 'batchspawner.SlurmSpawner',
        dict(req_nprocs='8', req_queue='qgpgpudev', req_runtime='04:00:00', req_memory='92GB'))
]
SINGULARITY_BIND_OPTS = "$HOME:$HOME,/work/scratch/$USER:/scratch,/softs:/softs,/work:/work,/datalake:/datalake,/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem,/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt"
c.ProfilesCmdSpawner.env_list = [
    ('Default lab environnement - without VRE (all groups)', 'jupyter-labhub'),
    ('VRECNES (vrecnes group only)', f'{SINGULARITY_BIN} run --nv --add-caps CAP_NET_BIND_SERVICE --bind {SINGULARITY_BIND_OPTS} /softs/projets/datalabs/images/souche/vrecnes-stable.simg --notebook-dir=$HOME'),
    ('VREOT (vreot group only)', f'{SINGULARITY_BIN} run --nv --add-caps CAP_NET_BIND_SERVICE --bind {SINGULARITY_BIND_OPTS} /softs/projets/datalabs/images/thematique/OT/vreot-stable.simg --notebook-dir=$HOME'),
    ('VREOT (All kernels, vreot group only)', f'{SINGULARITY_BIN} run --nv --add-caps CAP_NET_BIND_SERVICE --bind {SINGULARITY_BIND_OPTS} /softs/projets/datalabs/images/thematique/OT/vreot-all_kernels.simg --notebook-dir=$HOME'),
    ('VREAI4GEO (ai4geo group only)', f'{SINGULARITY_BIN} run --nv --add-caps CAP_NET_BIND_SERVICE --bind {SINGULARITY_BIND_OPTS} /softs/projets/ai4geo/singularity/vreai4geo-stable.simg --notebook-dir=$HOME'),
    ('VRECESWOT (swotce_exp group only)', f'{SINGULARITY_BIN} run --nv --add-caps CAP_NET_BIND_SERVICE --bind {SINGULARITY_BIND_OPTS} /softs/projets/swotce/singularity/exp/vreceswot-stable.simg --notebook-dir=$HOME')
]
c.JupyterHub.pid_file = '/etc/jupyterhub/pid'
c.JupyterHub.services = [
    {
        "name": "service-token",
        "admin": True,
        "api_token": "",
    },
]
c.Spawner.cmd = ['${JUPYTERHUB_SINGLEUSER_CMD:-jupyter-labhub}']
c.Spawner.default_url = '/lab'
c.Spawner.ip = '0.0.0.0'
c.Spawner.poll_interval = 120
```
Logs

```
# Log Jupyterhub
Apr 6 09:46:51 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:51.113 JupyterHub log:189] 200 GET /hub/home (@XX.XX.XX.XX) 83.27ms
Apr 6 09:46:52 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:52.705 JupyterHub log:189] 200 GET /hub/spawn/XX (@XX.XX.XX.XX) 8.73ms
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:57.575 JupyterHub roles:477] Adding role server to token:
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:57.592 JupyterHub provider:607] Creating oauth client jupyterhub-user-XX
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:57.644 JupyterHub batchspawner:262] Spawner submitting job using sudo -E -u XX env PATH=$PATH sbatch --parsable
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:57.644 JupyterHub batchspawner:263] Spawner submitted script:
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: #!/bin/sh
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: #SBATCH --output=/home/XX/jupyterhub_slurmspawner_%j.log
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: #SBATCH --job-name=spawner-jupyterhub
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: #SBATCH --chdir=/home/XX
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: #SBATCH --export=ALL
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: #SBATCH --get-user-env=L
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: #SBATCH --time=168:00:00
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: #SBATCH --mem=5GB
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: #SBATCH --cpus-per-task=1
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: trap 'echo SIGTERM received' TERM
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: source ~/.bashrc
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: # Environnements hub par défaut
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: # export JUPYTER_PATH=$JUPYTER_PATH:/work/logiciels/rh7/Python/jupyter_data/share/jupyter/
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: export JUPYTER_PATH=$JUPYTER_PATH:/softs/rh7/jupyter_kernels/
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: # Ci-dessous PS1 est exporté pour permettre l'utilisation de LateX dans un job
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: # non interactif
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: export PS1='hub-[\u@\h \W]\$'
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: module load latex
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: # echo "INFO | Using default notebook env: pangeo_202106"
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: # module load conda
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: #conda activate /softs/rh7/conda-envs/pangeo_202106
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: echo "INFO | Using default notebook env : pangeo_latest"
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: module load conda
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: conda activate /softs/rh7/conda-envs/pangeo_latest
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: unset PKG_CONFIG_PATH
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: unset PYTHONPATH
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: srun batchspawner-singleuser ${JUPYTERHUB_SINGLEUSER_CMD:-/softs/rh7/singularity/3.5.3/bin/singularity run --nv --add-caps CAP_NET_BIND_SERVICE --bind $HOME:$HOME,/work/scratch/$USER:/scratch,/softs:/softs,/work:/work,/datalake:/datalake,/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem,/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt /softs/projets/datalabs/images/souche/vrecnes-stable.simg --notebook-dir=$HOME}
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: echo "jupyterhub-singleuser ended gracefully"
Apr 6 09:46:57 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:57.799 JupyterHub batchspawner:266] Job submitted. cmd: sudo -E -u XX env PATH=$PATH sbatch --parsable output: 975
Apr 6 09:46:58 tu-juphub-q01 jupyterhub: [W 2022-04-06 09:46:58.570 JupyterHub base:187] Rolling back dirty objects IdentitySet([])
Apr 6 09:46:58 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:58.590 JupyterHub log:189] 302 POST /hub/spawn/XX -> /hub/spawn-pending/XX (@XX.XX.XX.XX) 1011.48ms
Apr 6 09:46:58 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:58.653 JupyterHub pages:400] XX is pending spawn
Apr 6 09:46:58 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:46:58.659 JupyterHub log:189] 200 GET /hub/spawn-pending/XX (@XX.XX.XX.XX) 13.75ms
Apr 6 09:47:07 tu-juphub-q01 jupyterhub: [W 2022-04-06 09:47:07.570 JupyterHub base:1043] User XX is slow to start (timeout=10)
Apr 6 09:47:15 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:47:15.125 JupyterHub log:189] 200 POST /hub/api/batchspawner (@XX.XX.XX.XX) 17.82ms
Apr 6 09:47:15 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:47:15.573 JupyterHub batchspawner:419] Notebook server job 975 started at node539:52266
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: [W 2022-04-06 09:49:04.585 JupyterHub user:811] XX server never showed up at http://node539:52266/user/XX/ after 120 seconds. Giving up.
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: Common causes of this timeout, and debugging tips:
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: 1. The server didn't finish starting,
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: or it crashed due to a configuration issue.
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: Check the single-user server's logs for hints at what needs fixing.
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: 2. The server started, but is not accessible at the specified URL.
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: This may be a configuration issue specific to your chosen Spawner.
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: Check the single-user server logs and resource to make sure the URL
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: is correct and accessible from the Hub.
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: 3. (unlikely) Everything is working, but the server took too long to respond.
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: To fix: increase `Spawner.http_timeout` configuration
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: to a number of seconds that is enough for servers to become responsive.
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: [E 2022-04-06 09:49:04.791 JupyterHub gen:623] Exception in Future .finish_user_spawn() done, defined at /srv/conda-2022-04-06-09-20-34/lib/python3.8/site-packages/jupyterhub/handlers/base.py:934> exception=TimeoutError("Server at http://node539:52266/user/XX/ didn't respond in 120 seconds")> after timeout
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: Traceback (most recent call last):
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: File "/srv/conda-2022-04-06-09-20-34/lib/python3.8/site-packages/tornado/gen.py", line 618, in error_callback
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: future.result()
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: File "/srv/conda-2022-04-06-09-20-34/lib/python3.8/site-packages/jupyterhub/handlers/base.py", line 941, in finish_user_spawn
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: await spawn_future
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: File "/srv/conda-2022-04-06-09-20-34/lib/python3.8/site-packages/jupyterhub/user.py", line 792, in spawn
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: await self._wait_up(spawner)
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: File "/srv/conda-2022-04-06-09-20-34/lib/python3.8/site-packages/jupyterhub/user.py", line 836, in _wait_up
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: raise e
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: File "/srv/conda-2022-04-06-09-20-34/lib/python3.8/site-packages/jupyterhub/user.py", line 806, in _wait_up
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: resp = await server.wait_up(
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: File "/srv/conda-2022-04-06-09-20-34/lib/python3.8/site-packages/jupyterhub/utils.py", line 241, in wait_for_http_server
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: re = await exponential_backoff(
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: File "/srv/conda-2022-04-06-09-20-34/lib/python3.8/site-packages/jupyterhub/utils.py", line 189, in exponential_backoff
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: raise asyncio.TimeoutError(fail_message)
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: asyncio.exceptions.TimeoutError: Server at http://node539:52266/user/XX/ didn't respond in 120 seconds
Apr 6 09:49:04 tu-juphub-q01 jupyterhub: [I 2022-04-06 09:49:04.796 JupyterHub log:189] 200 GET /hub/api/users/geninv/server/progress XX@XX.XX.XX.XX 125979.14ms
```

```
# SLURM log
Traceback (most recent call last):
  File "/softs/rh7/conda-envs/pangeo_latest/bin/batchspawner-singleuser", line 6, in <module>
    main()
  File "/softs/rh7/conda-envs/pangeo_202202/lib/python3.9/site-packages/batchspawner/singleuser.py", line 23, in main
    run_path(cmd_path, run_name="__main__")
  File "/softs/rh7/conda-envs/pangeo_202202/lib/python3.9/runpy.py", line 269, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/softs/rh7/conda-envs/pangeo_202202/lib/python3.9/runpy.py", line 244, in _get_code_from_file
    code = compile(f.read(), fname, 'exec')
ValueError: source code string cannot contain null bytes
srun: error: node539: task 0: Exited with exit code 1
jupyterhub-singleuser ended gracefully
```

welcome[bot] commented 2 years ago

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template, as it helps other community members to contribute more effectively. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

jbeal-work commented 1 year ago

I have got this working with LSF. We have to ensure that batchspawner is installed in the Singularity instance, if that helps.
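
A rough sketch of what that approach might look like in `jupyterhub_config.py`, translated to the setup from the original post. It assumes batchspawner is installed inside the image and that the `batchspawner_singleuser_cmd` option is available in the installed version. The `containerized_singleuser_cmd` helper and the paths are hypothetical, and none of this has been tested on SLURM.

```python
def containerized_singleuser_cmd(singularity_bin, image, bind=None):
    """Build a command prefix that runs batchspawner-singleuser inside the image."""
    parts = [singularity_bin, "exec"]
    if bind:
        parts += ["--bind", bind]
    parts += [image, "batchspawner-singleuser"]
    # Joined without shell quoting so that $HOME and similar expand in the batch job.
    return " ".join(parts)


prefix = containerized_singleuser_cmd(
    "singularity",            # placeholder path to the binary
    "/path/to/image.simg",    # placeholder image
    bind="$HOME:$HOME",
)
print(prefix)

# In jupyterhub_config.py (assumption: option exists in the installed batchspawner):
# c.BatchSpawnerBase.batchspawner_singleuser_cmd = prefix
# c.Spawner.cmd = ['jupyter-labhub']
```

With this layout, the Python wrapper and the single-user server both run inside the container, so the wrapper only ever loads a Python entry point.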