jupyterhub / batchspawner

Custom Spawner for Jupyterhub to start servers in batch scheduled systems
BSD 3-Clause "New" or "Revised" License
190 stars 134 forks source link

Batchspawner spawning / keep-alive is instable #174

Closed Hoeze closed 3 years ago

Hoeze commented 4 years ago

Hi, I often end up in the following situation:

Also, after ~ 24h the jupyterlab-singleuser worker looses connection to the JupyterHub. However, depending on the configuration I would like to have the jupyter instances running for more than one day.

Does somebody know why the connection between JupyterHub and the worker is so instable? Could https://github.com/jupyterhub/jupyterhub/pull/2727 be the solution?

rkdarst commented 4 years ago

Can you provide more info - this isn't enough to debug (these logs show how it starts, not how it stops).

In particular, what kind of batch job are you running and what caused the batch job to terminate? Did the batch job exceed some time or memory limit? (check batch system and job logs) Did the hub itself lose contact and trigger batchspawner to cancel it? (check hub logs) This could be by the jupyter job ceasing to function before the batch job terminates, causing the periodic hub status check to fail.

The spawn callback PR is unrelated and would only affect the starting.

Hoeze commented 4 years ago

Thank you for your answer @rkdarst. At some point, the singleuser notebook fails to update the jupyterhub:

[E 2020-03-30 06:33:48.787 SingleUserNotebookApp singleuser:523] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/opt/anaconda/3-2019.10/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 521, in notify
        await client.fetch(req)
    tornado.httpclient.HTTPClientError: HTTP 403: Forbidden
[E 2020-03-30 06:33:49.025 SingleUserNotebookApp singleuser:523] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/opt/anaconda/3-2019.10/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 521, in notify
        await client.fetch(req)
    tornado.httpclient.HTTPClientError: HTTP 403: Forbidden
[E 2020-03-30 06:33:49.242 SingleUserNotebookApp singleuser:523] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/opt/anaconda/3-2019.10/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 521, in notify
        await client.fetch(req)
    tornado.httpclient.HTTPClientError: HTTP 403: Forbidden
[E 2020-03-30 06:46:51.369 SingleUserNotebookApp singleuser:523] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/opt/anaconda/3-2019.10/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 521, in notify
        await client.fetch(req)
    tornado.httpclient.HTTPClientError: HTTP 403: Forbidden
[E 2020-03-30 06:46:59.569 SingleUserNotebookApp singleuser:523] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/opt/anaconda/3-2019.10/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 521, in notify
        await client.fetch(req)
    tornado.httpclient.HTTPClientError: HTTP 403: Forbidden
[E 2020-03-30 06:46:59.569 SingleUserNotebookApp singleuser:548] Error notifying Hub of activity
    Traceback (most recent call last):
      File "/opt/anaconda/3-2019.10/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 546, in keep_activity_updated
        await self.notify_activity()
      File "/opt/anaconda/3-2019.10/lib/python3.7/site-packages/jupyterhub/singleuser.py", line 533, in notify_activity
        timeout=60,
      File "/opt/anaconda/3-2019.10/lib/python3.7/site-packages/jupyterhub/utils.py", line 177, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: Failed to notify Hub of activity

Jupyterhub logs around the time of failure:

Mar 30 06:32:34 cn02 run_jupyterhub[34349]: 06:32:34.331 [ConfigProxy] ESC[32minfoESC[39m: 200 GET /api/routes
Mar 30 06:32:34 cn02 run_jupyterhub[34349]: [I 2020-03-30 06:32:34.333 JupyterHub proxy:320] Checking routes
Mar 30 06:33:04 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:33:04.421 JupyterHub log:174] 403 POST /jupyter/hub/api/users/kp/activity (@192.168.16.12) 5.19ms
Mar 30 06:33:05 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:33:05.392 JupyterHub log:174] 403 POST /jupyter/hub/api/users/kp/activity (@192.168.16.12) 3.79ms
Mar 30 06:33:06 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:33:06.063 JupyterHub log:174] 403 POST /jupyter/hub/api/users/kp/activity (@192.168.16.12) 3.67ms
Mar 30 06:33:07 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:33:07.550 JupyterHub log:174] 403 POST /jupyter/hub/api/users/kp/activity (@192.168.16.12) 3.65ms
Mar 30 06:33:13 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:33:13.540 JupyterHub log:174] 403 POST /jupyter/hub/api/users/kp/activity (@192.168.16.12) 4.07ms
Mar 30 06:33:18 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:33:18.016 JupyterHub log:174] 403 POST /jupyter/hub/api/users/kp/activity (@192.168.16.12) 3.60ms
Mar 30 06:33:19 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:33:19.193 JupyterHub log:174] 403 POST /jupyter/hub/api/users/kp/activity (@192.168.16.12) 3.63ms
Mar 30 06:33:21 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:33:21.782 JupyterHub log:174] 403 POST /jupyter/hub/api/users/kp/activity (@192.168.16.12) 3.69ms
Mar 30 06:33:36 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:33:36.793 JupyterHub log:174] 403 POST /jupyter/hub/api/users/kp/activity (@192.168.16.12) 3.60ms
Mar 30 06:33:48 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:33:48.786 JupyterHub log:174] 403 POST /jupyter/hub/api/users/hi/activity (@192.168.20.12) 3.82ms
Mar 30 06:33:49 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:33:49.024 JupyterHub log:174] 403 POST /jupyter/hub/api/users/hi/activity (@192.168.20.12) 3.44ms
Mar 30 06:33:49 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:33:49.241 JupyterHub log:174] 403 POST /jupyter/hub/api/users/hi/activity (@192.168.20.12) 3.48ms
Mar 30 06:33:49 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:33:49.319 JupyterHub log:174] 403 POST /jupyter/hub/api/users/hi/activity (@192.168.20.12) 3.50ms
Mar 30 06:33:50 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:33:50.814 JupyterHub log:174] 403 POST /jupyter/hub/api/users/hi/activity (@192.168.20.12) 3.35ms
Mar 30 06:33:51 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:33:51.806 JupyterHub log:174] 403 POST /jupyter/hub/api/users/kp/activity (@192.168.16.12) 3.66ms
Mar 30 06:33:52 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:33:52.730 JupyterHub log:174] 403 POST /jupyter/hub/api/users/hi/activity (@192.168.20.12) 3.55ms
Mar 30 06:33:59 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:33:59.179 JupyterHub log:174] 403 POST /jupyter/hub/api/users/kp/activity (@192.168.16.12) 3.92ms
Mar 30 06:33:59 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:33:59.649 JupyterHub log:174] 403 POST /jupyter/hub/api/users/hi/activity (@192.168.20.12) 3.60ms
Mar 30 06:34:14 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:34:14.658 JupyterHub log:174] 403 POST /jupyter/hub/api/users/hi/activity (@192.168.20.12) 3.45ms
Mar 30 06:34:29 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:34:29.667 JupyterHub log:174] 403 POST /jupyter/hub/api/users/hi/activity (@192.168.20.12) 3.97ms
Mar 30 06:34:43 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:34:43.944 JupyterHub log:174] 403 POST /jupyter/hub/api/users/hi/activity (@192.168.20.12) 3.84ms
Mar 30 06:37:19 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:37:19.858 JupyterHub log:174] 403 POST /jupyter/hub/api/users/n/activity (@192.168.16.12) 4.11ms
Mar 30 06:37:20 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:37:20.317 JupyterHub log:174] 403 POST /jupyter/hub/api/users/n/activity (@192.168.16.12) 4.29ms
Mar 30 06:37:20 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:37:20.748 JupyterHub log:174] 403 POST /jupyter/hub/api/users/n/activity (@192.168.16.12) 4.01ms
Mar 30 06:37:21 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:37:21.339 JupyterHub log:174] 403 POST /jupyter/hub/api/users/n/activity (@192.168.16.12) 3.84ms
Mar 30 06:37:27 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:37:27.973 JupyterHub log:174] 403 POST /jupyter/hub/api/users/n/activity (@192.168.16.12) 3.61ms
Mar 30 06:37:28 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:37:28.225 JupyterHub log:174] 403 POST /jupyter/hub/api/users/n/activity (@192.168.16.12) 3.68ms
Hoeze commented 4 years ago

Ah, I think this might be caused by https://github.com/jupyterhub/batchspawner/issues/171:

Mar 30 06:28:37 cn02 run_jupyterhub[34349]: [E 2020-03-30 06:28:37.036 JupyterHub batchspawner:225] Subprocess returned exitcode 1
Mar 30 06:28:37 cn02 run_jupyterhub[34349]: [E 2020-03-30 06:28:37.036 JupyterHub batchspawner:226] slurm_load_jobs error: Unable to contact slurm controller (connect failure)
Mar 30 06:28:37 cn02 run_jupyterhub[34349]: [E 2020-03-30 06:28:37.036 JupyterHub batchspawner:279] Error querying job 301047
Mar 30 06:28:37 cn02 run_jupyterhub[34349]: [W 2020-03-30 06:28:37.037 JupyterHub base:1012] User <me> server stopped, with exit code: 1
Mar 30 06:28:37 cn02 run_jupyterhub[34349]: [I 2020-03-30 06:28:37.037 JupyterHub proxy:282] Removing user <me> from proxy (/jupyter/user/<me>/)
rkdarst commented 4 years ago

This will be fixed by #179.

linhbngo commented 4 years ago

@Hoeze We ran into the same issue at our cluster (OpenPBS). Since PR #187 has not been merged yet, we side-stepped by set the JupyterHub Spawner's poll_interval configuration to 1800 seconds (default is 30 seconds), and it seemed to help.

consideRatio commented 3 years ago

Closed by #187