jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems
BSD 3-Clause "New" or "Revised" License

IMPORTANT: Worker starts successfully but then gets killed by jupyterhub #194

Closed Hoeze closed 4 years ago

Hoeze commented 4 years ago

Hi, I have a serious problem: the worker starts normally, but JupyterHub kills it shortly afterwards. Logs of the worker:

+ batchspawner-singleuser jupyterhub-singleuser --ip=0.0.0.0 --NotebookApp.default_url=/lab
[I 2020-11-05 13:22:43.763 SingleUserNotebookApp manager:81] [nb_conda_kernels] enabled, 18 kernels found
[I 2020-11-05 13:22:44.808 SingleUserNotebookApp extension:162] JupyterLab extension loaded from /opt/modules/i12g/anaconda/envs/jupyterhub/lib/python3.7/site-packages/jupyterlab
[I 2020-11-05 13:22:44.808 SingleUserNotebookApp extension:163] JupyterLab application directory is /opt/modules/i12g/anaconda/envs/jupyterhub/share/jupyter/lab
[I 2020-11-05 13:22:44.988 SingleUserNotebookApp __init__:34] [Jupytext Server Extension] Deriving a JupytextContentsManager from LargeFileManager
[I 2020-11-05 13:22:44.989 SingleUserNotebookApp singleuser:561] Starting jupyterhub-singleuser server version 1.1.0
[I 2020-11-05 13:22:44.996 SingleUserNotebookApp notebookapp:2209] Serving notebooks from local directory: /data/nasif12/home_if12/hoelzlwi
[I 2020-11-05 13:22:44.996 SingleUserNotebookApp notebookapp:2209] Jupyter Notebook 6.1.4 is running at:
[I 2020-11-05 13:22:44.996 SingleUserNotebookApp notebookapp:2209] http://[...]:50758/
[I 2020-11-05 13:22:44.996 SingleUserNotebookApp notebookapp:2210] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 2020-11-05 13:22:45.010 SingleUserNotebookApp singleuser:542] Updating Hub with activity every 300 seconds
slurmstepd: error: *** JOB 377371 ON [...] CANCELLED AT 2020-11-05T13:23:39 ***

Logs of jupyterhub:


[I 2020-11-05 13:27:39.649 JupyterHub log:181] 302 POST /jupyter/hub/spawn/<user> -> /jupyter/hub/spawn-pending/<user> (<user>@192.168.16.11) 1013.71ms
[I 2020-11-05 13:27:39.761 JupyterHub pages:398] <user> is pending spawn
[I 2020-11-05 13:27:39.771 JupyterHub log:181] 200 GET /jupyter/hub/spawn-pending/<user> (<user>@192.168.16.11) 29.93ms
[I 2020-11-05 13:27:41.587 JupyterHub log:181] 200 POST /jupyter/hub/api/batchspawner (<user>@192.168.16.13) 24.47ms
[I 2020-11-05 13:27:43.792 JupyterHub log:181] 200 GET /jupyter/hub/api (@192.168.16.13) 2.84ms
[I 2020-11-05 13:27:43.843 JupyterHub log:181] 200 POST /jupyter/hub/api/users/<user>/activity (<user>@192.168.16.13) 36.08ms
[W 2020-11-05 13:27:48.647 JupyterHub base:995] User <user> is slow to start (timeout=10)
[W 2020-11-05 13:28:38.764 JupyterHub user:684] <user>'s server failed to start in 60 seconds, giving up
[I 2020-11-05 13:28:39.153 JupyterHub batchspawner:408] Stopping server job 377372
[I 2020-11-05 13:28:39.155 JupyterHub batchspawner:293] Cancelling job 377372: sudo -E -u <user> scancel 377372
[W 2020-11-05 13:28:51.948 JupyterHub batchspawner:419] Notebook server job 377372 at node03:0 possibly failed to terminate
[E 2020-11-05 13:28:52.010 JupyterHub gen:624] Exception in Future <Task finished coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /opt/modules/i12g/anaconda/envs/jupyterhub/lib/python3.7/site-packages/jupyterhub/handlers/base.py:884> exception=TimeoutError('Timeout')> after timeout
    Traceback (most recent call last):
      File "/opt/modules/i12g/anaconda/envs/jupyterhub/lib/python3.7/site-packages/tornado/gen.py", line 618, in error_callback
        future.result()
      File "/opt/modules/i12g/anaconda/envs/jupyterhub/lib/python3.7/site-packages/jupyterhub/handlers/base.py", line 891, in finish_user_spawn
        await spawn_future
      File "/opt/modules/i12g/anaconda/envs/jupyterhub/lib/python3.7/site-packages/jupyterhub/user.py", line 708, in spawn
        raise e
      File "/opt/modules/i12g/anaconda/envs/jupyterhub/lib/python3.7/site-packages/jupyterhub/user.py", line 607, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    tornado.util.TimeoutError: Timeout

[I 2020-11-05 13:28:52.019 JupyterHub log:181] 200 GET /jupyter/hub/api/users/<user>/server/progress (<user>@192.168.16.11) 71227.51ms
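For reference, the timing in the hub log above can be checked with a few lines of Python. This is only a sketch: the timestamps are copied from the log, the variable names are made up for illustration, and the spawn request time is an estimate (the log line's timestamp minus the 1013.71 ms request duration it reports).

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"


def parse(ts: str) -> datetime:
    """Parse a timestamp in the format used by the JupyterHub log."""
    return datetime.strptime(ts, FMT)


# Timestamps copied from the hub log above.
spawn_post_logged = parse("2020-11-05 13:27:39.649")  # 302 POST .../hub/spawn/<user>
activity_post = parse("2020-11-05 13:27:43.843")      # 200 POST .../users/<user>/activity
slow_warning = parse("2020-11-05 13:27:48.647")       # "is slow to start (timeout=10)"
gave_up = parse("2020-11-05 13:28:38.764")            # "failed to start in 60 seconds"

# Assumption: the request line is logged when the request finishes, so the
# spawn was requested about 1013.71 ms before that line's timestamp.
spawn_requested = spawn_post_logged - timedelta(milliseconds=1013.71)

to_activity = (activity_post - spawn_requested).total_seconds()
to_slow_warning = (slow_warning - spawn_requested).total_seconds()
to_give_up = (gave_up - spawn_requested).total_seconds()

print(f"first activity report after {to_activity:.2f} s")
print(f"slow-start warning after    {to_slow_warning:.2f} s")
print(f"hub gave up after           {to_give_up:.2f} s")
```

Under that assumption, the single-user server reports activity to the hub about 5 s after the spawn request, while the warning and the cancellation land at about 10 s and about 60 s, matching the two timeouts named in the log. So the job appears to be running and reachable, and the hub still waits out the full `spawner.start_timeout` seen in the traceback. The `node03:0` in the "possibly failed to terminate" line may point the same way (no port recorded for the server), though the log alone does not prove it.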


I am using Python 3.7, JupyterHub 1.2, batchspawner 1.0.1, and the current git version of wrapspawner.

What could be causing this problem?

Hoeze commented 4 years ago

I think it's a problem with the ProfilesSpawner. When I directly use batchspawner, it works.

Hoeze commented 4 years ago

I moved this to ProfilesSpawner. If the maintainers agree that this issue is independent of batchspawner, please close :)

rcthomas commented 4 years ago

Agreed, this can be closed here, since the repro at jupyterhub/wrapspawner#41 works without batchspawner at all.