jupyterhub / systemdspawner

Spawn JupyterHub single-user notebook servers with systemd
BSD 3-Clause "New" or "Revised" License
91 stars 49 forks source link

A user servers's systemd service in a failed state fails to start up properly again #87

Open lachlancampbell opened 5 years ago

lachlancampbell commented 5 years ago

I'm having problems where if I leave a hub running for an extended period of time (weeks or longer) it becomes slow to respond and sometimes unable to start a user's server, for example:

May 27 03:07:25 littlehub python3[1910]: [I 2019-05-27 03:07:25.408 JupyterHub proxy:301] Checking routes
May 27 03:07:33 littlehub python3[1910]: [W 2019-05-27 03:07:33.210 JupyterHub base:714] User lukeedwards is slow to start (timeout=10)
May 27 03:07:33 littlehub python3[1910]: [I 2019-05-27 03:07:33.212 JupyterHub log:158] 202 POST /hub/api/users/lukeedwards/server (lachlan@106.69.254.220) 10064.93ms
May 27 03:08:02 littlehub python3[1910]: [I 2019-05-27 03:08:02.138 JupyterHub log:158] 200 GET /hub/admin (lachlan@106.69.254.220) 49.57ms
May 27 03:08:23 littlehub python3[1910]: [W 2019-05-27 03:08:23.294 JupyterHub user:471] lukeedwards's server failed to start in 60 seconds, giving up
May 27 03:08:23 littlehub python3[1910]: [E 2019-05-27 03:08:23.380 JupyterHub gen:593] Exception in Future <Task finished coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /opt/tljh/hub/lib/python3.6/site-packages/jupyterhub/handlers/base.py:619> exception=TimeoutError('Timeout',)> after timeout
May 27 03:08:23 littlehub python3[1910]:     Traceback (most recent call last):
May 27 03:08:23 littlehub python3[1910]:       File "/opt/tljh/hub/lib/python3.6/site-packages/tornado/gen.py", line 589, in error_callback
May 27 03:08:23 littlehub python3[1910]:         future.result()
May 27 03:08:23 littlehub python3[1910]:       File "/opt/tljh/hub/lib/python3.6/site-packages/jupyterhub/handlers/base.py", line 626, in finish_user_spawn
May 27 03:08:23 littlehub python3[1910]:         await spawn_future
May 27 03:08:23 littlehub python3[1910]:       File "/opt/tljh/hub/lib/python3.6/site-packages/jupyterhub/user.py", line 489, in spawn
May 27 03:08:23 littlehub python3[1910]:         raise e
May 27 03:08:23 littlehub python3[1910]:       File "/opt/tljh/hub/lib/python3.6/site-packages/jupyterhub/user.py", line 409, in spawn
May 27 03:08:23 littlehub python3[1910]:         url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
May 27 03:08:23 littlehub python3[1910]:     tornado.util.TimeoutError: Timeout

I've tried reloading the hub, and restarting the VM (Ubuntu 18.04 on Openstack) with no success.

lachlancampbell commented 5 years ago

I also tried deleting the user and creating a new user, but faced the same issues. At this stage I'm going to trash the VM and start again, but I'd like to know a better solution :)

lachlancampbell commented 5 years ago

Don't know if it's relevant, but the traefik logs show a lot of the following errors:

    "__CURSOR" : "s=440e4b14ca704e02b779fd869a74565b;i=ea0d0;b=09c8ee45c9134834a058a7078bc785bc;m=4ec0605bad;t=589d59c2ddaee;x=ef952bcfedaec3d7",
    "__REALTIME_TIMESTAMP" : "1558925424843502",
    "__MONOTONIC_TIMESTAMP" : "338234989485",
    "_BOOT_ID" : "09c8ee45c9134834a058a7078bc785bc",
    "PRIORITY" : "6",
    "SYSLOG_FACILITY" : "3",
    "_MACHINE_ID" : "a95e814a37844c22b16c0168297d3050",
    "_HOSTNAME" : "littlehub",
    "_TRANSPORT" : "stdout",
    "_UID" : "0",
    "_GID" : "0",
    "_SELINUX_CONTEXT" : "unconfined\n",
    "_SYSTEMD_SLICE" : "system.slice",
    "SYSLOG_IDENTIFIER" : "traefik",
    "_COMM" : "traefik",
    "_EXE" : "/opt/tljh/hub/bin/traefik",
    "_CMDLINE" : "/opt/tljh/hub/bin/traefik -c /opt/tljh/state/traefik.toml",
    "_CAP_EFFECTIVE" : "3ff7fcffff",
    "_SYSTEMD_CGROUP" : "/system.slice/traefik.service",
    "_SYSTEMD_UNIT" : "traefik.service",
    "_STREAM_ID" : "69f2bb4f58c14dadb0e8bc0a73300da4",
    "_PID" : "1121",
    "_SYSTEMD_INVOCATION_ID" : "0499aef90ffd4afab1e65e84d93e5d84",
    "MESSAGE" : "time=\"2019-05-27T02:50:24Z\" level=error msg=\"Error while closing Hijacked conn: close tcp 192.168.1.13:80->52.63.84.227:48524: use of closed network connection\""
dmpe commented 5 years ago

Also facing similar issues. Hub starts but spawning notebooks leads to the timeout issue

arinbasu commented 5 years ago

I installed tljh on a local instance of vultr.com, but several errors:

lachlancampbell commented 5 years ago

Still being affected by this issue. Started up a new instance, new install, added users, began logging in using user accounts to check server startup. Worked fine for 10, then would timeout. Would work fine again for 5, then timeout again. No traefik errors showing this time, nothing in the users server logs, just the timeout errors in the jupyterhub logs.

GeorgianaElena commented 5 years ago

@lachlancampbell, there's a service we added not so long ago that's culling the notebook servers that have been idle for more than 10 min (this is the default value). If you're using the latest version on master, this might be what's stopping the notebooks.

Checkout the docs on how to change the idle timeout and other defaults to fit your needs.

lachlancampbell commented 5 years ago

Okay, thanks. I'll try turning the culling service off, although would it stop servers from starting up, as that's what is happening?

GeorgianaElena commented 5 years ago

@lachlancampbell, no the idle culler shouldn't stop servers from starting up, it just stops them after 10 min of inactivity. The timeout error you're receiving on this fresh install is the same with the one posted in the initial comment? Do you see anything in the logs related to the idle culler?

lachlancampbell commented 5 years ago

I believe it's the same error, but I've included a section of the full log below. As you can see, the idle culler does appear but not sure that it's related.

-- Logs begin at Wed 2019-07-17 05:49:40 UTC, end at Tue 2019-07-23 01:56:11 UTC. --
Jul 17 06:56:42 littlehub python3[12245]: [I 2019-07-17 06:56:42.935 JupyterHub base:663] User logged in: cou000
Jul 17 06:56:42 littlehub python3[12245]: [I 2019-07-17 06:56:42.941 JupyterHub log:174] 302 POST /hub/login?next= -> /hub/spawn (cou000@130.116.150.122) 458.34ms
Jul 17 06:56:43 littlehub useradd[12478]: new group: name=jupyter-cou000, GID=1005
Jul 17 06:56:43 littlehub useradd[12478]: new user: name=jupyter-cou000, UID=1003, GID=1005, home=/home/jupyter-cou000, shell=/bin/sh
Jul 17 06:56:43 littlehub python3[12245]: Adding user jupyter-cou000 to group jupyterhub-users
Jul 17 06:56:43 littlehub gpasswd[12485]: user jupyter-cou000 added by root to group jupyterhub-users
Jul 17 06:56:43 littlehub systemd[1]: Started /bin/bash -c cd /home/jupyter-cou000 && exec jupyterhub-singleuser --port=39851.
Jul 17 06:56:43 littlehub python3[12245]: Running as unit: jupyter-cou000.service
Jul 17 06:56:43 littlehub python3[12245]: [I 2019-07-17 06:56:43.987 JupyterHub log:174] 302 GET /hub/spawn -> /hub/spawn-pending/cou000 (cou000@130.116.150.122) 1033.83ms
Jul 17 06:56:44 littlehub python3[12245]: [I 2019-07-17 06:56:44.050 JupyterHub pages:303] cou000 is pending spawn
Jul 17 06:56:44 littlehub python3[12245]: [I 2019-07-17 06:56:44.053 JupyterHub log:174] 200 GET /hub/spawn-pending/cou000 (cou000@130.116.150.122) 26.72ms
Jul 17 06:56:53 littlehub python3[12245]: [W 2019-07-17 06:56:53.004 JupyterHub base:932] User cou000 is slow to become responsive (timeout=10)
Jul 17 06:57:13 littlehub python3[12245]: [W 2019-07-17 06:57:13.993 JupyterHub user:678] cou000's server never showed up at http://127.0.0.1:39851/user/cou000/ after 30 seconds. Giving up
Jul 17 06:57:14 littlehub systemd[1]: Stopping /bin/bash -c cd /home/jupyter-cou000 && exec jupyterhub-singleuser --port=39851...
Jul 17 06:57:14 littlehub systemd[1]: Stopped /bin/bash -c cd /home/jupyter-cou000 && exec jupyterhub-singleuser --port=39851.
Jul 17 06:57:14 littlehub python3[12245]: [E 2019-07-17 06:57:14.103 JupyterHub gen:599] Exception in Future <Task finished coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /opt/tljh/hub/lib/python3.6/site-packages/jupyterhub/handlers/base.py:800> exception=TimeoutError("Server at http://127.0.0.1:39851/user/cou000/ didn't respond in 30 seconds",)> after timeout
Jul 17 06:57:14 littlehub python3[12245]:     Traceback (most recent call last):
Jul 17 06:57:14 littlehub python3[12245]:       File "/opt/tljh/hub/lib/python3.6/site-packages/tornado/gen.py", line 593, in error_callback
Jul 17 06:57:14 littlehub python3[12245]:         future.result()
Jul 17 06:57:14 littlehub python3[12245]:       File "/opt/tljh/hub/lib/python3.6/site-packages/jupyterhub/handlers/base.py", line 807, in finish_user_spawn
Jul 17 06:57:14 littlehub python3[12245]:         await spawn_future
Jul 17 06:57:14 littlehub python3[12245]:       File "/opt/tljh/hub/lib/python3.6/site-packages/jupyterhub/user.py", line 654, in spawn
Jul 17 06:57:14 littlehub python3[12245]:         await self._wait_up(spawner)
Jul 17 06:57:14 littlehub python3[12245]:       File "/opt/tljh/hub/lib/python3.6/site-packages/jupyterhub/user.py", line 701, in _wait_up
Jul 17 06:57:14 littlehub python3[12245]:         raise e
Jul 17 06:57:14 littlehub python3[12245]:       File "/opt/tljh/hub/lib/python3.6/site-packages/jupyterhub/user.py", line 669, in _wait_up
Jul 17 06:57:14 littlehub python3[12245]:         http=True, timeout=spawner.http_timeout, ssl_context=ssl_context
Jul 17 06:57:14 littlehub python3[12245]:       File "/opt/tljh/hub/lib/python3.6/site-packages/jupyterhub/utils.py", line 234, in wait_for_http_server
Jul 17 06:57:14 littlehub python3[12245]:         timeout=timeout,
Jul 17 06:57:14 littlehub python3[12245]:       File "/opt/tljh/hub/lib/python3.6/site-packages/jupyterhub/utils.py", line 177, in exponential_backoff
Jul 17 06:57:14 littlehub python3[12245]:         raise TimeoutError(fail_message)
Jul 17 06:57:14 littlehub python3[12245]:     TimeoutError: Server at http://127.0.0.1:39851/user/cou000/ didn't respond in 30 seconds
Jul 17 06:57:14 littlehub python3[12245]:     
Jul 17 06:57:14 littlehub python3[12245]: [I 2019-07-17 06:57:14.114 JupyterHub log:174] 200 GET /hub/api/users/cou000/server/progress (cou000@130.116.150.122) 29913.91ms
Jul 17 06:57:28 littlehub python3[12245]: [I 2019-07-17 06:57:28.119 JupyterHub log:174] 200 GET /hub/api/users (cull-idle@127.0.0.1) 40.25ms
Jul 17 06:57:58 littlehub systemd-timesyncd[698]: Network configuration changed, trying to establish connection.
Jul 17 06:57:58 littlehub systemd-timesyncd[698]: Synchronized to time server 91.189.91.157:123 (ntp.ubuntu.com).
Jul 17 06:58:13 littlehub sudo[12525]:   ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/journalctl -u jupyterhub
Jul 17 06:58:13 littlehub sudo[12525]: pam_unix(sudo:session): session opened for user root by ubuntu(uid=0)
Jul 17 06:58:28 littlehub python3[12245]: [I 2019-07-17 06:58:28.118 JupyterHub log:174] 200 GET /hub/api/users (cull-idle@127.0.0.1) 39.38ms
Jul 17 06:59:28 littlehub python3[12245]: [I 2019-07-17 06:59:28.114 JupyterHub log:174] 200 GET /hub/api/users (cull-idle@127.0.0.1) 36.45ms
Jul 17 07:00:26 littlehub sudo[12525]: pam_unix(sudo:session): session closed for user root
Jul 17 07:00:28 littlehub python3[12245]: [I 2019-07-17 07:00:28.119 JupyterHub log:174] 200 GET /hub/api/users (cull-idle@127.0.0.1) 40.36ms
Jul 17 07:00:42 littlehub sudo[12537]:   ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/journalctl -u traefik
Jul 17 07:00:42 littlehub sudo[12537]: pam_unix(sudo:session): session opened for user root by ubuntu(uid=0)
Jul 17 07:01:09 littlehub sudo[12537]: pam_unix(sudo:session): session closed for user root
Jul 17 07:01:16 littlehub python3[12245]: [I 2019-07-17 07:01:16.651 JupyterHub login:43] User logged out: cou000
Jul 17 07:01:16 littlehub python3[12245]: [I 2019-07-17 07:01:16.695 JupyterHub log:174] 302 GET /hub/logout -> /hub/login (cou000@130.116.150.122) 79.37ms
Jul 17 07:01:28 littlehub python3[12245]: [I 2019-07-17 07:01:28.114 JupyterHub log:174] 200 GET /hub/api/users (cull-idle@127.0.0.1) 36.36ms
Jul 17 07:01:28 littlehub python3[12245]: [I 2019-07-17 07:01:28.397 JupyterHub proxy:319] Checking routes
Jul 17 07:01:29 littlehub python3[12245]: [I 2019-07-17 07:01:29.029 JupyterHub base:663] User logged in: cou000
Jul 17 07:01:29 littlehub python3[12245]: [I 2019-07-17 07:01:29.030 JupyterHub log:174] 302 POST /hub/login?next= -> /hub/spawn (cou000@130.116.150.122) 475.30ms
Jul 17 07:01:29 littlehub systemd[1]: Started /bin/bash -c cd /home/jupyter-cou000 && exec jupyterhub-singleuser --port=52651.
Jul 17 07:01:29 littlehub python3[12245]: Running as unit: jupyter-cou000.service
Jul 17 07:01:29 littlehub traefik[12244]: time="2019-07-17T07:01:29Z" level=warning msg="A new release has been found: 1.7.12. Please consider updating."
Jul 17 07:01:29 littlehub python3[12245]: [I 2019-07-17 07:01:29.734 JupyterHub log:174] 200 POST /hub/api/users/lachlan/activity (lachlan@127.0.0.1) 40.94ms
Jul 17 07:01:30 littlehub python3[12245]: [I 2019-07-17 07:01:30.100 JupyterHub log:174] 302 GET /hub/spawn -> /hub/spawn-pending/cou000 (cou000@130.116.150.122) 1031.31ms
Jul 17 07:01:30 littlehub python3[12245]: [I 2019-07-17 07:01:30.133 JupyterHub pages:303] cou000 is pending spawn
Jul 17 07:01:30 littlehub bash[12547]: [I 2019-07-17 07:01:30.751 SingleUserNotebookApp extension:168] JupyterLab extension loaded from /opt/tljh/user/lib/python3.6/site-packages/jupyterlab
Jul 17 07:01:30 littlehub bash[12547]: [I 2019-07-17 07:01:30.751 SingleUserNotebookApp extension:169] JupyterLab application directory is /opt/tljh/user/share/jupyter/lab
Jul 17 07:01:30 littlehub bash[12547]: [I 2019-07-17 07:01:30.800 SingleUserNotebookApp extension:22] nteract extension loaded from /opt/tljh/user/lib/python3.6/site-packages/nteract_on_jupyter
Jul 17 07:01:30 littlehub bash[12547]: [I 2019-07-17 07:01:30.804 SingleUserNotebookApp singleuser:561] Starting jupyterhub-singleuser server version 1.0.0
Jul 17 07:01:30 littlehub python3[12245]: [I 2019-07-17 07:01:30.814 JupyterHub log:174] 200 GET /hub/api (@127.0.0.1) 1.80ms
Jul 17 07:01:30 littlehub bash[12547]: [I 2019-07-17 07:01:30.818 SingleUserNotebookApp notebookapp:1774] Serving notebooks from local directory: /home/jupyter-cou000
Jul 17 07:01:30 littlehub bash[12547]: [I 2019-07-17 07:01:30.818 SingleUserNotebookApp notebookapp:1774] The Jupyter Notebook is running at:
Jul 17 07:01:30 littlehub bash[12547]: [I 2019-07-17 07:01:30.818 SingleUserNotebookApp notebookapp:1774] http://127.0.0.1:52651/user/cou000/
Jul 17 07:01:30 littlehub bash[12547]: [I 2019-07-17 07:01:30.818 SingleUserNotebookApp notebookapp:1775] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Jul 17 07:01:30 littlehub bash[12547]: [I 2019-07-17 07:01:30.829 SingleUserNotebookApp singleuser:542] Updating Hub with activity every 300 seconds
Jul 17 07:01:30 littlehub python3[12245]: [I 2019-07-17 07:01:30.884 JupyterHub log:174] 200 POST /hub/api/users/cou000/activity (cou000@127.0.0.1) 50.41ms
Jul 17 07:01:31 littlehub bash[12547]: [I 2019-07-17 07:01:31.779 SingleUserNotebookApp log:174] 302 GET /user/cou000/ -> /user/cou000/tree? (@127.0.0.1) 1.52ms
Jul 17 07:01:31 littlehub python3[12245]: [I 2019-07-17 07:01:31.780 JupyterHub base:810] User cou000 took 2.683 seconds to start
Jul 17 07:01:31 littlehub python3[12245]: [I 2019-07-17 07:01:31.781 JupyterHub proxy:261] Adding user cou000 to proxy /user/cou000/ => http://127.0.0.1:52651
Jul 17 07:01:31 littlehub python3[12245]: [I 2019-07-17 07:01:31.783 JupyterHub proxy:114] Waiting for /user/cou000/ to register with traefik
Jul 17 07:01:31 littlehub traefik[12244]: time="2019-07-17T07:01:31Z" level=info msg="Server configuration reloaded on :80"
Jul 17 07:01:31 littlehub traefik[12244]: time="2019-07-17T07:01:31Z" level=info msg="Server configuration reloaded on :443"
Jul 17 07:01:31 littlehub traefik[12244]: time="2019-07-17T07:01:31Z" level=info msg="Server configuration reloaded on 127.0.0.1:8099"
Jul 17 07:01:31 littlehub python3[12245]: [I 2019-07-17 07:01:31.995 JupyterHub users:606] Server cou000 is ready
Jul 17 07:01:31 littlehub python3[12245]: [I 2019-07-17 07:01:31.998 JupyterHub log:174] 200 GET /hub/api/users/cou000/server/progress (cou000@130.116.150.122) 1683.14ms
Jul 17 07:01:32 littlehub python3[12245]: [I 2019-07-17 07:01:32.038 JupyterHub log:174] 302 GET /hub/spawn-pending/cou000 -> /user/cou000/ (cou000@130.116.150.122) 27.45ms
Jul 17 07:01:32 littlehub bash[12547]: [I 2019-07-17 07:01:32.050 SingleUserNotebookApp log:174] 302 GET /user/cou000/ -> /user/cou000/tree? (@130.116.150.122) 2.14ms
Jul 17 07:01:32 littlehub bash[12547]: [I 2019-07-17 07:01:32.068 SingleUserNotebookApp log:174] 302 GET /user/cou000/tree -> /hub/api/oauth2/authorize?client_id=jupyterhub-user-cou000&redirect_uri=%2Fuser%2Fcou000%2Foauth_callback&response_type=code&state=[secret] (@130.116.150.122) 6.04ms
Jul 17 07:01:32 littlehub python3[12245]: [I 2019-07-17 07:01:32.151 JupyterHub log:174] 302 GET /hub/api/oauth2/authorize?client_id=jupyterhub-user-cou000&redirect_uri=%2Fuser%2Fcou000%2Foauth_callback&response_type=code&state=[secret] -> /user/cou000/oauth_callback?code=[secret]&state=[secret] (cou000@130.116.150.122) 69.34ms
Jul 17 07:01:32 littlehub python3[12245]: [I 2019-07-17 07:01:32.256 JupyterHub log:174] 200 POST /hub/api/oauth2/token (cou000@127.0.0.1) 85.37ms
Jul 17 07:01:32 littlehub python3[12245]: [I 2019-07-17 07:01:32.338 JupyterHub log:174] 200 GET /hub/api/authorizations/token/[secret] (cou000@127.0.0.1) 71.52ms
Jul 17 07:01:32 littlehub bash[12547]: [I 2019-07-17 07:01:32.340 SingleUserNotebookApp auth:981] Logged-in user {'kind': 'user', 'name': 'cou000', 'admin': False, 'groups': [], 'server': '/user/cou000/', 'pending': None, 'created': '2019-07-17T06:56:14.397855Z', 'last_activity': '2019-07-17T07:01:32.318000Z', 'servers': None}
Jul 17 07:01:32 littlehub bash[12547]: [I 2019-07-17 07:01:32.342 SingleUserNotebookApp log:174] 302 GET /user/cou000/oauth_callback?code=[secret]&state=[secret] -> /user/cou000/tree (@130.116.150.122) 182.35ms
Jul 17 07:01:32 littlehub bash[12547]: [I 2019-07-17 07:01:32.541 SingleUserNotebookApp log:174] 200 GET /user/cou000/tree (cou000@130.116.150.122) 130.01ms
Jul 17 07:01:32 littlehub bash[12547]: [I 2019-07-17 07:01:32.963 SingleUserNotebookApp log:174] 200 GET /user/cou000/api/config/common?_=1563346892845 (cou000@130.116.150.122) 7.40ms
Jul 17 07:01:32 littlehub bash[12547]: [I 2019-07-17 07:01:32.970 SingleUserNotebookApp log:174] 200 GET /user/cou000/api/config/tree?_=1563346892844 (cou000@130.116.150.122) 5.83ms
Jul 17 07:01:32 littlehub bash[12547]: [I 2019-07-17 07:01:32.979 SingleUserNotebookApp log:174] 200 GET /user/cou000/api/sessions?_=1563346892847 (cou000@130.116.150.122) 4.65ms
Jul 17 07:01:32 littlehub bash[12547]: [I 2019-07-17 07:01:32.983 SingleUserNotebookApp log:174] 200 GET /user/cou000/api/terminals?_=1563346892846 (cou000@130.116.150.122) 1.96ms
Jul 17 07:01:32 littlehub bash[12547]: [I 2019-07-17 07:01:32.991 SingleUserNotebookApp log:174] 200 GET /user/cou000/api/kernelspecs (cou000@130.116.150.122) 7.23ms
Jul 17 07:01:32 littlehub bash[12547]: [I 2019-07-17 07:01:32.995 SingleUserNotebookApp log:174] 200 GET /user/cou000/api/terminals?_=1563346892848 (cou000@130.116.150.122) 2.22ms
Jul 17 07:01:33 littlehub bash[12547]: [I 2019-07-17 07:01:33.029 SingleUserNotebookApp log:174] 200 GET /user/cou000/api/contents?type=directory&_=1563346892849 (cou000@130.116.150.122) 7.16ms
Jul 17 07:01:41 littlehub bash[12547]: [I 2019-07-17 07:01:41.676 SingleUserNotebookApp log:174] 302 GET /user/cou000/logout -> /hub/logout (cou000@130.116.150.122) 4.19ms
Jul 17 07:01:41 littlehub python3[12245]: [I 2019-07-17 07:01:41.709 JupyterHub login:43] User logged out: cou000
Jul 17 07:01:41 littlehub python3[12245]: [I 2019-07-17 07:01:41.759 JupyterHub log:174] 302 GET /hub/logout -> /hub/login (cou000@130.116.150.122) 74.40ms
Jul 17 07:01:56 littlehub sudo[12572]:   ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/journalctl -f -u jupyterhub
Jul 17 07:01:56 littlehub sudo[12572]: pam_unix(sudo:session): session opened for user root by ubuntu(uid=0)
Jul 17 07:02:26 littlehub systemd-timesyncd[698]: Network configuration changed, trying to establish connection.
Jul 17 07:02:26 littlehub systemd-timesyncd[698]: Synchronized to time server 91.189.91.157:123 (ntp.ubuntu.com).
rikedyp commented 5 years ago
Jul 23 10:00:30 DyalogDemos python3[25943]: [I 2019-07-23 10:00:30.786 JupyterHub log:174] 200 GET /hub/home (dyalog@10.0.60.180) 27.30ms
Jul 23 10:00:32 DyalogDemos python3[25943]: [I 2019-07-23 10:00:32.816 JupyterHub log:174] 200 GET /hub/token (dyalog@10.0.60.180) 34.88ms
Jul 23 10:00:33 DyalogDemos python3[25943]: [I 2019-07-23 10:00:33.588 JupyterHub log:174] 200 GET /hub/admin (dyalog@10.0.60.180) 10.67ms
Jul 23 10:00:36 DyalogDemos python3[25943]: [I 2019-07-23 10:00:36.224 JupyterHub log:174] 200 GET /hub/home (dyalog@10.0.60.180) 13.75ms
Jul 23 10:00:37 DyalogDemos python3[25943]: [I 2019-07-23 10:00:37.335 JupyterHub systemdspawner:217] user:dyalog Unit jupyter-dyalog in a failed state. Resetting state.
Jul 23 10:00:37 DyalogDemos python3[25943]: Running as unit: jupyter-dyalog.service
Jul 23 10:00:38 DyalogDemos python3[25943]: [I 2019-07-23 10:00:38.281 JupyterHub log:174] 302 GET /hub/spawn/dyalog -> /hub/spawn-pending/dyalog (dyalog@10.0.60.180) 1010.83ms
Jul 23 10:00:38 DyalogDemos python3[25943]: [I 2019-07-23 10:00:38.314 JupyterHub pages:303] dyalog is pending spawn
Jul 23 10:00:38 DyalogDemos python3[25943]: [I 2019-07-23 10:00:38.322 JupyterHub log:174] 200 GET /hub/spawn-pending/dyalog (dyalog@10.0.60.180) 18.92ms
Jul 23 10:00:47 DyalogDemos python3[25943]: [W 2019-07-23 10:00:47.280 JupyterHub base:900] User dyalog is slow to start (timeout=10)
Jul 23 10:01:19 DyalogDemos python3[25943]: [W 190723 10:01:19 cull_idle_servers:129] Not culling server dyalog with pending spawn
Jul 23 10:01:19 DyalogDemos python3[25943]: [I 2019-07-23 10:01:19.547 JupyterHub log:174] 200 GET /hub/api/users (cull-idle@127.0.0.1) 12.79ms
Jul 23 10:01:37 DyalogDemos python3[25943]: [W 2019-07-23 10:01:37.318 JupyterHub user:618] dyalog's server failed to start in 60 seconds, giving up
Jul 23 10:01:37 DyalogDemos python3[25943]: [E 2019-07-23 10:01:37.346 JupyterHub gen:599] Exception in Future <Task finished coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn()
Jul 23 10:01:37 DyalogDemos python3[25943]:     Traceback (most recent call last):
Jul 23 10:01:37 DyalogDemos python3[25943]:       File "/opt/tljh/hub/lib/python3.7/site-packages/tornado/gen.py", line 593, in error_callback
Jul 23 10:01:37 DyalogDemos python3[25943]:         future.result()
Jul 23 10:01:37 DyalogDemos python3[25943]:       File "/opt/tljh/hub/lib/python3.7/site-packages/jupyterhub/handlers/base.py", line 807, in finish_user_spawn
Jul 23 10:01:37 DyalogDemos python3[25943]:         await spawn_future
Jul 23 10:01:37 DyalogDemos python3[25943]:       File "/opt/tljh/hub/lib/python3.7/site-packages/jupyterhub/user.py", line 642, in spawn
Jul 23 10:01:37 DyalogDemos python3[25943]:         raise e
Jul 23 10:01:37 DyalogDemos python3[25943]:       File "/opt/tljh/hub/lib/python3.7/site-packages/jupyterhub/user.py", line 546, in spawn
Jul 23 10:01:37 DyalogDemos python3[25943]:         url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
Jul 23 10:01:37 DyalogDemos python3[25943]:     tornado.util.TimeoutError: Timeout
Jul 23 10:01:37 DyalogDemos python3[25943]:     
Jul 23 10:01:37 DyalogDemos python3[25943]: [I 2019-07-23 10:01:37.348 JupyterHub log:174] 200 GET /hub/api/users/dyalog/server/progress (dyalog@10.0.60.180) 58827.07ms
Jul 23 10:02:19 DyalogDemos python3[25943]: [I 2019-07-23 10:02:19.544 JupyterHub log:174] 200 GET /hub/api/users (cull-idle@127.0.0.1) 10.43ms
Jul 23 10:02:52 DyalogDemos python3[25943]: [I 2019-07-23 10:02:52.167 JupyterHub log:174] 302 GET / -> /hub/ (@127.0.0.1) 0.65ms
Jul 23 10:02:52 DyalogDemos python3[25943]: [I 2019-07-23 10:02:52.188 JupyterHub log:174] 302 GET /hub/ -> /hub/login (@127.0.0.1) 0.58ms
Jul 23 10:02:52 DyalogDemos python3[25943]: [I 2019-07-23 10:02:52.236 JupyterHub log:174] 200 GET /hub/login (@127.0.0.1) 17.13ms
Jul 23 10:02:56 DyalogDemos python3[25943]: [I 2019-07-23 10:02:56.403 JupyterHub base:663] User logged in: dyalog
Jul 23 10:02:56 DyalogDemos python3[25943]: [I 2019-07-23 10:02:56.409 JupyterHub log:174] 302 POST /hub/login?next= -> /hub/spawn (dyalog@127.0.0.1) 308.13ms
Jul 23 10:02:56 DyalogDemos python3[25943]: [I 2019-07-23 10:02:56.477 JupyterHub systemdspawner:217] user:dyalog Unit jupyter-dyalog in a failed state. Resetting state.
Jul 23 10:02:56 DyalogDemos python3[25943]: Running as unit: jupyter-dyalog.service
Jul 23 10:02:57 DyalogDemos python3[25943]: [I 2019-07-23 10:02:57.435 JupyterHub log:174] 302 GET /hub/spawn -> /hub/spawn-pending/dyalog (dyalog@127.0.0.1) 1022.26ms
Jul 23 10:02:57 DyalogDemos python3[25943]: [I 2019-07-23 10:02:57.458 JupyterHub pages:303] dyalog is pending spawn
Jul 23 10:02:57 DyalogDemos python3[25943]: [I 2019-07-23 10:02:57.461 JupyterHub log:174] 200 GET /hub/spawn-pending/dyalog (dyalog@127.0.0.1) 14.52ms
Jul 23 10:03:06 DyalogDemos python3[25943]: [W 2019-07-23 10:03:06.433 JupyterHub base:900] User dyalog is slow to start (timeout=10)
minrk commented 5 years ago

@lachlancampbell try checking the logs of the server that failed to show up:

systemctl -u jupyter-cou000

You might also check for load on the system with htop (might require apt -y install htop).

What's interesting from the logs is that we see systemd attempting to start the server with:

Jul 17 06:56:43 littlehub systemd[1]: Started /bin/bash -c cd /home/jupyter-cou000 && exec jupyterhub-singleuser --port=39851.

But no output from that process. I think that's the root, that systemd is actually taking a long time to launch the requested process, or even failing entirely to do so. It's especially strange that this same action succeeds in one second, five minutes later:

Jul 17 07:01:29 littlehub systemd[1]: Started /bin/bash -c cd /home/jupyter-cou000 && exec jupyterhub-singleuser --port=52651.
Jul 17 07:01:29 littlehub python3[12245]: Running as unit: jupyter-cou000.service
...
# first output from the started server (note SingleUserNotebookApp):
Jul 17 07:01:30 littlehub bash[12547]: [I 2019-07-17 07:01:30.751 SingleUserNotebookApp extension:168] JupyterLab extension loaded from /opt/tljh/user/lib/python3.6/site-packages/jupyterlab

I don't have a good answer yet for what could cause this to take a long time, or perhaps fail altogether. It might be worth checking systemctl status jupyter-cou000 while a launch that's going to fail is in the middle of trying to start. Or just systemctl status. That might shed some light on what's going on, such as exhausting some resources.

Are there other active users when this is happening?

lachlancampbell commented 5 years ago

Logs for jupyter-cou000 don't show anything more:

Jul 17 06:56:43 littlehub systemd[1]: Started /bin/bash -c cd /home/jupyter-cou000 && exec jupyterhub-singleuser --port=39851.
Jul 17 06:57:14 littlehub systemd[1]: Stopping /bin/bash -c cd /home/jupyter-cou000 && exec jupyterhub-singleuser --port=39851...
Jul 17 06:57:14 littlehub systemd[1]: Stopped /bin/bash -c cd /home/jupyter-cou000 && exec jupyterhub-singleuser --port=39851.

When this occurred I was the only user, so shouldn't have been that. I'll try and replicate it and check the system status as you suggest.

It's possible it could be an intermittent problem with our cloud, although since others reported the same issue I was hoping it wasn't that.

rgaushell commented 4 years ago

I've hit the exact same issue as everyone else here. I can't exactly come up with reproduction steps, but it's something like:

I have a workaround I've been using, but would really like to find and/or help you patch a permanent solution. If someone can point me at the code that is breaking, I have a team of engineers who would happily submit a PR.

Workaround:

sudo bash
systemctl stop jupyter-username.service
systemctl start jupyter-username.service

Environment: Ubuntu 18.04.1 LTS (GNU/Linux 4.15.0-1027-aws x86_64) hub: Python 3.6.8 (default, Jan 14 2019, 11:02:34) user: Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)

rgaushell commented 4 years ago

I also just found that if I run: systemctl reset-failed it clears out failed services and then TLJH can start a server that wouldn't previously start.

lachlancampbell commented 4 years ago

Still haven't found a resolution for this. Ran two more training workshops this week, faced the same startup inconsistencies, and still haven't been able to diagnose it. Unfortunately resetting failed services didn't work for me, as the user notebook servers are not listed as failed. The hubs worked well during the sessions, but the problems starting up servers is still irritating. Would be grateful if someone could suggest another way of attempting to diagnose what's going on.

dwr-psandhu commented 4 years ago

I had the same problems, however when I looked into the logs (/var/log/system...) it seemed like python was failing on import jupyterhub.

I had to sudo to jupyterhub-admin-user and then run conda init as advised in the anaconda faq

mrwen00 commented 4 years ago

I've got the same problem. fixed it by deleting your created account, for example user jupyterhub-admin. Delete it with "sudo userdel jupyterhub-admin". After removing the account and create again, it works perfectly.

quinten-goens commented 4 years ago

I also just found that if I run: systemctl reset-failed it clears out failed services and then TLJH can start a server that wouldn't previously start.

This indeed solved the issue for me. Many thanks @rgaushell

I caused this issue by opening too many tabs in Jupyterlab which crashed the service.

junqfisica commented 4 years ago

My problem is that after a few days running the TLJH the server reaches 100% cpu usage. This causes the server to stop responding.

consideRatio commented 2 years ago

Issue takeaways

It seems that starting a user server in TLJH means to start a systemd service via jupyterhub-systemdspawner. A systemd service may sometimes fail, and if a user server's systemd service in a failed state, it can block startup of a user server.

An example reproduction strategy could be the following: https://github.com/jupyterhub/the-littlest-jupyterhub/issues/351#issuecomment-525379200, but I think we can reproduce it faster by crashing the user server in some other way.

I would guess this to be a systemdspawner issue rather than a TLJH issue, but I'm not sure and could not find a bug reported about that in that repo at this point. I'm moving this issue from TLJH to there.

Action points

welcome[bot] commented 2 years ago

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other other community members to contribute more effectively. welcome You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

consideRatio commented 1 year ago

It seems like use of systemctl reset-failed has worked well, so maybe we should check for a failed state and use systemctl reset-failed <user-servers-unit-name> if needed before starting a user server?