jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems
BSD 3-Clause "New" or "Revised" License

not able to connect to spawned server with SLURM #217

Closed: hoba87 closed this issue 3 years ago

hoba87 commented 3 years ago

Bug description

I am trying to start a server on a cluster managed by the SLURM scheduler, but it does not work. So far I have set the full paths for the SLURM executables and a fixed port (35000) for the spawned server. The server job is queued and started, but no communication seems to be possible between the hub and the server.

Expected behaviour

After I start a server from the Control Panel, the server job is queued and started, and the hub redirects to the server.

Actual behaviour

After I start a server from the Control Panel, the server job is queued and started, but the hub stays stuck at "Cluster job running... waiting to connect". After the timeout the server is stopped and the hub reports "Spawn failed: Timeout".
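
One way to narrow down a "waiting to connect" timeout like this is to test plain TCP reachability in both directions: from the compute node to the hub, and from the hub to the spawned server. The following is a minimal sketch using only the Python standard library. The helper name `can_connect` is hypothetical and not part of batchspawner or JupyterHub; port 8081 is JupyterHub's default hub port and is assumed here because the configuration in this report does not set one.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, DNS errors and timeouts.
        return False


# Example calls (not executed here; addresses are taken from this report or assumed):
#   On a compute node, inside a batch job:
#       can_connect("10.50.10.20", 8081)      # hub_connect_ip, assumed default hub port
#   On the hub host, while the job is running (replace the placeholder host name):
#       can_connect("<compute-node>", 35000)  # fixed single-user port from this report
```

If the first check fails, the single-user server probably cannot reach the hub API; if the second fails, the hub probably cannot reach the server on the compute node. Either case would be consistent with the timeout described above, but the check only covers TCP reachability, not authentication or proxy routing.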

How to reproduce

  1. Go to 'Control Panel'
  2. Click on 'Start my server/start custom server'
  3. Select job profile "Compute Node ...."
  4. See error

Your personal set up

absl-py==0.10.0
aes-everywhere==1.2.10
aiohttp==3.6.2
alabaster==0.7.12
alembic==1.4.2
anybadge==1.7.0
anyio==3.0.0
appdirs==1.4.4
argon2-cffi==20.1.0
ase==3.21.1
astroid==2.5.1
astunparse==1.6.3
async-generator==1.10
async-timeout==3.0.1
attrs==20.1.0
Babel==2.9.0
backcall==0.2.0
batchspawner @ file:///home/common/software/batchspawner
bcrypt==3.2.0
bleach==3.1.5
cachetools==4.1.1
cadquery==2.0
cefpython3==66.0
certifi==2020.6.20
certipy==0.1.3
cffi==1.14.2
chardet==3.0.4
click==7.1.2
colorama==0.4.3
colorlog==4.2.1
cond-rnn @ file:///home/h.badorreck/cond_rnn
coverage==5.2.1
cryptography==3.0
cycler==0.10.0
Cython==0.29.22
dataclasses-json==0.5.2
decorator==4.4.2
defusedxml==0.6.0
deprecation==2.1.0
dill==0.3.2
docutils==0.16
entrypoints==0.3
flatbuffers==1.12
future==0.18.2
gast==0.3.3
gitdb==4.0.5
GitPython==3.1.7
google-auth==1.22.0
google-auth-oauthlib==0.4.1
google-pasta==0.2.0
grpcio==1.32.0
h5io==0.1.2
h5json==1.1.3
h5py==2.10.0
idna==2.10
ifaddr==0.1.7
imageio==2.9.0
imagesize==1.2.0
importlib-metadata==1.7.0
iniconfig==1.0.1
ipykernel==5.3.4
ipython==7.17.0
ipython-genutils==0.2.0
ipyupload==0.1.3
ipywidgets==7.5.1
isort==5.7.0
jedi==0.17.2
Jinja2==2.11.2
joblib==0.16.0
json5==0.9.5
jsonschema==3.2.0
jupyter==1.0.0
jupyter-client==6.1.6
jupyter-console==6.1.0
jupyter-contrib-core==0.3.3
jupyter-contrib-nbextensions==0.5.1
jupyter-core==4.6.3
jupyter-highlight-selected-word==0.2.0
jupyter-latex-envs==1.4.6
jupyter-multiselection==0.1.1
jupyter-nbextensions-configurator==0.4.1
jupyter-packaging==0.9.2
jupyter-server==1.6.2
jupyter-telemetry==0.1.0
jupyterhub==1.4.0
jupyterlab==3.0.14
jupyterlab-server==2.4.0
jwt==1.1.0
Keras==2.4.3
Keras-Preprocessing==1.1.2
Kivy==2.0.0
Kivy-Garden==0.1.4
kiwisolver==1.2.0
lazy-object-proxy==1.4.3
libcst==0.3.17
llvmlite==0.34.0
lxml==4.5.2
Mako==1.1.3
Markdown==3.2.2
markdown-kernel==0.2.2
MarkupSafe==1.1.1
marshmallow==3.10.0
marshmallow-enum==1.5.1
matplotlib==3.0.3
matplotlib-scalebar==0.6.2
mccabe==0.6.1
mendeleev==0.6.0
meshio==4.1.0
mistune==0.8.4
moldynpy @ file:///home/h.badorreck/moldyn
molmod==1.4.8
more-itertools==8.5.0
mpi4py==3.0.3
mpmath==1.1.0
multidict==4.7.6
mypy==0.812
mypy-extensions==0.4.3
nbclassic==0.2.7
nbconvert==5.6.1
nbformat==5.0.7
netdisco==2.8.2
networkx==2.5
nose==1.3.7
notebook==6.3.0
numba==0.51.2
numexpr==2.7.1
numpy==1.18.5
oauthlib==3.1.0
opencv-python==4.4.0.44
opt-einsum==3.3.0
packaging==20.4
pamela==1.0.0
pandas==1.1.1
pandocfilters==1.4.2
paramiko==2.7.1
parso==0.7.1
pathlib2==2.3.5
periodictable==1.5.2
pexpect==4.8.0
phonopy==2.7.1
pickleshare==0.7.5
picmcpy==0.1.2
Pillow==7.2.0
pipdate==0.5.3
pluggy==0.13.1
prometheus-client==0.8.0
prompt-toolkit==3.0.6
protobuf==3.13.0
psutil==5.7.2
ptyprocess==0.6.0
py==1.9.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycairo==1.19.1
pycparser==2.20
pycryptodomex==3.9.9
pydot==1.4.2
pyenchant==3.1.1
pyfiglet==0.8.post1
pyfileindex==0.0.4
pyflakes==2.3.0
Pygments==2.6.1
pygmsh==6.1.1
PyGObject==3.36.1
pyiron==0.2.17
pylint==2.7.2
pylint-exit==1.2.0
PyNaCl==1.4.0
PyOpenGL==3.1.5
pyOpenSSL==19.1.0
pyparsing==2.4.7
pyre-check==0.9.0
pyre-extensions==0.0.21
pyrsistent==0.16.0
PySDL2==0.9.7
pysqa==0.0.10
pytest==6.0.1
python-dateutil==2.8.1
python-editor==1.0.4
python-json-logger==0.1.11
python-xlib==0.29
pytz==2020.1
pywatchman==1.4.1
PyWavelets==1.1.1
PyYAML==5.3.1
pyzmq==19.0.2
qtconsole==4.7.6
QtPy==1.9.0
QuickFF==2.2.4
requests==2.24.0
requests-oauthlib==1.3.0
rsa==4.6
ruamel.yaml==0.16.10
ruamel.yaml.clib==0.2.0
scandir==1.10.0
scikit-image==0.17.2
scikit-learn==0.23.2
scipy==1.5.4
seekpath==2.0.1
Send2Trash==1.5.0
six==1.15.0
smmap==3.0.4
sniffio==1.2.0
snowballstemmer==2.0.0
spglib==1.16.0
Sphinx==3.4.3
sphinx-autoapi==1.5.1
sphinx-rtd-theme==0.5.1
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==1.0.3
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.4
SQLAlchemy==1.3.19
stringcase==1.2.0
sympy==1.6.2
tables==3.6.1
tabulate==0.8.9
tamkin==1.2.6
tensorboard==2.4.1
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.2
tensorflow-estimator==2.3.0
termcolor==1.1.0
terminado==0.8.3
testpath==0.4.4
threadpoolctl==2.1.0
tifffile==2020.8.13
toml==0.10.1
tomlkit==0.7.0
torch==1.6.0+cu101
torchvision==0.7.0+cu101
tornado==6.1
tqdm==4.48.2
traitlets==4.3.3
transforms3d==0.3.1
typed-ast==1.4.2
typing-extensions==3.7.4.3
typing-inspect==0.6.0
Unidecode==1.1.2
uPnPClient==0.0.8
urllib3==1.25.10
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==1.0.1
widgetsnbextension==3.5.1
wrapspawner==1.0.0
wrapt==1.12.1
yaff==1.4.2
yarl==1.6.0
zeroconf==0.28.6
zipp==3.1.0

c.JupyterHub.active_server_limit = 10 
c.JupyterHub.allow_named_servers = True
c.JupyterHub.hub_connect_ip = '10.50.10.20'
c.JupyterHub.hub_ip = '192.168.0.254'
import batchspawner
c.Spawner.http_timeout = 120
c.BatchSpawnerBase.req_nprocs = '1'
c.BatchSpawnerBase.req_queue = 'batch'
c.BatchSpawnerBase.req_partition = 'batch'
c.BatchSpawnerBase.req_host = 'lk-pma-cluster-head'
c.BatchSpawnerBase.req_runtime = '12:00:00'
c.BatchSpawnerBase.req_memory = '6gb'
c.BatchSpawnerBase.req_gres = ''
c.BatchSpawnerBase.ip = '0.0.0.0'

c.SlurmSpawner.batch_script = '''#!/bin/bash
#SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
#SBATCH --job-name=jupyter-spawner
#SBATCH --chdir={homedir}
#SBATCH --export={keepvars}
#SBATCH --get-user-env=L
#SBATCH --partition={partition}
#SBATCH --time={runtime}
#SBATCH --mem={memory}
#SBATCH --cpus-per-task={nprocs}
#SBATCH --gres={gres}
#SBATCH {options}
module load openmpi/4.0.5
module load python/3.7.9
module load libs/python-3.7.9
module load cuda/10.1
module load gmsh/4.6.0
module load octopus/10.1
set -euo pipefail
trap 'echo SIGTERM received' TERM
which jupyterhub-singleuser
/opt/slurm/slurm-20.02.5/bin/srun {cmd}
echo "jupyterhub-singleuser ended gracefully"
'''
c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'
c.ProfilesSpawner.ip = '0.0.0.0'
c.ProfilesSpawner.profiles = [
   ( "Head Node", 'local', 'jupyterhub.spawner.LocalProcessSpawner', {'ip': '0.0.0.0'}),
   ( "Compute Node, 1core, 6GB, 12 hours", 'compute-c1_r6_t0.5', 'batchspawner.SlurmSpawner', dict(ip='0.0.0.0')),
   ( "Compute Node, 1core, 6GB, 7 days", 'compute-c1_r6_t7', 'batchspawner.SlurmSpawner', dict(req_runtime='168:00:00')),
   ( "Compute Node, 20cores, 120GB, 7 days", 'compute-c20_r120_t7', 'batchspawner.SlurmSpawner', dict(req_nprocs='20', req_memory='120gb', req_runtime='168:00:00')),
   ( "Compute Node, 1cores, 15GB, 7 days, V100 GPU", 'compute-c1_r15_g-v100_t7', 'batchspawner.SlurmSpawner', dict(req_memory='15gb', req_runtime='168:00:00', req_gres='gpu:v100:1')),
]
c.JupyterHub.ssl_cert = '/etc/ssl/certs/jupyterhub-cert.pem'
c.JupyterHub.ssl_key = '/etc/ssl/keys/jupyterhub-key.pem'
c.Spawner.env_keep = ['PATH', 'LD_LIBRARY_PATH', 'PYTHONPATH', 'VIRTUAL_ENV', 'LANG', 'LC_ALL', 'MKL_NUM_THREADS', 'OMP_NUM_THREADS', 'OPENBLAS_NUM_THREADS', 'BASH_FUNC__moduleraw', 'BASH_FUNC_switchml', 'BASH_FUNC_module', 'MODULESHOME', 'MODULEPATH', 'MODULES_CMD']
c.Spawner.environment = {}
c.PAMAuthenticator.open_sessions = False
c.SingleUserNotebookApp.shutdown_no_activity_timeout = 7*24*60*60
c.NotebookApp.shutdown_no_activity_timeout = 7*24*60*60
c.MappingKernelManager.cull_idle_timeout = 7*24*60*60 # 1 week
c.MappingKernelManager.cull_interval = 24*60*60
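
The configuration above binds the hub to one address (`hub_ip`) and advertises a different one (`hub_connect_ip`). In JupyterHub, `hub_ip` is the address the hub process listens on, and `hub_connect_ip` is the address that spawned servers are told to use to reach it. The sketch below shows how those two settings combine. The function names are hypothetical, the port 8081 is JupyterHub's default and is assumed because the report does not set it, plain `http` is assumed (no `internal_ssl`), and both addresses are assumed to be IP literals. It is not a diagnosis of this issue, only a check worth running.

```python
from ipaddress import ip_address


def hub_api_url(hub_ip: str, hub_connect_ip: str = "", hub_port: int = 8081) -> str:
    """Build the URL a spawned server would use to reach the hub API.

    hub_connect_ip takes precedence over hub_ip when it is set.
    """
    host = hub_connect_ip or hub_ip
    return f"http://{host}:{hub_port}/hub/api"


def bind_covers_connect(hub_ip: str, hub_connect_ip: str) -> bool:
    """Return True if a socket bound to hub_ip would receive traffic sent to
    hub_connect_ip, assuming no NAT or port forwarding in between."""
    if not hub_connect_ip:
        return True
    bind = ip_address(hub_ip)
    if bind.is_unspecified:  # 0.0.0.0 or :: listens on every interface
        return True
    return bind == ip_address(hub_connect_ip)


# Values from the configuration in this report:
print(hub_api_url("192.168.0.254", "10.50.10.20"))
print(bind_covers_connect("192.168.0.254", "10.50.10.20"))
```

With the values from this report the second call returns `False`: the hub listens only on 192.168.0.254, while compute nodes are pointed at 10.50.10.20. Unless something forwards traffic between the two addresses, that mismatch could produce the timeout; binding the hub to `0.0.0.0` or to the address in `hub_connect_ip` would be one thing to try.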