jupyterhub / binderhub


Idle culling will stop working with traefik proxy #831

Open minrk opened 5 years ago

minrk commented 5 years ago

Related to https://github.com/jupyterhub/zero-to-jupyterhub-k8s/pull/1162 and https://github.com/jupyterhub/jupyterhub/pull/2346

Binder currently relies on JupyterHub's activity tracking. With JupyterHub < 1.0, this information comes solely from configurable-http-proxy. JupyterHub 1.0 moves the onus for this to jupyterhub-singleuser because alternative proxies like traefik do not track activity. This is better for JupyterHub in general, but since binder launches vanilla notebook servers and not jupyterhub-singleuser, this activity is not tracked at the network level.

Additionally, we have learned on mybinder.org that the notebook's internal activity tracking is better and more reliable since it can make more fine-grained activity decisions (e.g. choosing to cull with idle but connected websockets).

So we have some facts:

  1. cull_idle_servers.py assumes the behavior of jupyterhub-singleuser or configurable-http-proxy for activity tracking
  2. binderhub assumes a notebook server, but not jupyterhub-singleuser
  3. notebook servers track activity internally
  4. we've stated a few places that we only want to assume an http server for Binder, but this cannot be achieved if we want to keep the activity tracking necessary for idle culling, unless we deploy a sidecar container in user pods that implements activity tracking (this pod could be configurable-http-proxy!)

I'm not sure exactly what we should do about this, but it's a pretty big issue and a blocker for adopting more resilient proxies in BinderHub.

If we continue to assume the notebook server in BinderHub, we can write a new idle-culler that talks directly to the notebook API, ignoring the Hub's activity data. This will be quite inefficient, as it requires many more requests to notebook servers rather than a single request to the Hub (this can be scaled by sharding the culler). It is also getting closer and closer to not using JupyterHub for anything at all.

If we want to skip over that and remove the notebook server assumption, we need to get to work on a sidecar container that at least implements activity tracking (reintroducing the problem of network activity not being as good as internal activity tracking), and possibly also implements auth.
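A direct-to-notebook culler along those lines might look roughly like the sketch below. It assumes the notebook server's `/api/status` endpoint, which reports a `last_activity` timestamp; the helper names and the one-hour threshold are illustrative, not an existing BinderHub component.

```python
# Sketch of a culler that bypasses the Hub and asks each notebook
# server directly. Helper names and the 1-hour threshold are
# illustrative assumptions.
from datetime import datetime, timedelta, timezone

IDLE_TIMEOUT = timedelta(hours=1)

def parse_activity(ts: str) -> datetime:
    # notebook timestamps look like "2019-01-01T12:00:00.000000Z"
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(
        tzinfo=timezone.utc
    )

def should_cull(last_activity: str, now: datetime,
                timeout: timedelta = IDLE_TIMEOUT) -> bool:
    """True if the server has been idle for longer than `timeout`."""
    return now - parse_activity(last_activity) > timeout

# For each launched pod, something like (pseudocode, helpers hypothetical):
#   status = json.load(urlopen(f"{pod_url}/api/status"))
#   if should_cull(status["last_activity"], datetime.now(timezone.utc)):
#       delete_pod(pod)
```

Sharding the culler would then just mean partitioning the set of pods each culler instance polls.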

minrk commented 5 years ago

The main reason we don't require jupyterhub-singleuser in repo2docker is that it would be a pain to put it in the containers, since it is sensitive to the version in the Hub pod.

This is another case, I think, for the transparent jupyterhub-auth-proxy discussed occasionally with @yuvipanda, which should implement at least:

  1. auth with hub
  2. activity tracking

Since I think this is mostly useful in container-based deployments like kubernetes, requiring Python so that it can import the auth implementation from jupyterhub itself is probably the best way to go from a maintainability standpoint (as opposed to go, which has benefits for portability, but isn't something I think our team has the capacity to develop and maintain at this point). Then this proxy would go in a sidecar container, exerting no requirements on the user container beyond an http endpoint that can run on a prefix.

With that, we could preserve the assumption that the user pods are fully equipped jupyterhub pods (implementing auth, etc.), while also separating it from the user env.

yuvipanda commented 5 years ago

@minrk I think for activity tracking, we can say something like 'we will hit /activity, and it should return the timestamp of last activity' (or something like that) that then becomes a protocol that can be implemented by multiple server implementations. We can do that in a sidecar, in singleuser, etc as we see fit. If we know we're running notebook, this can just use the internal notebook activity tracking. Else it can rely on network or some other mechanism. I don't think tracking this through would be too slow - this is the same as prometheus's model. I think JupyterHub should do this tracking itself internally if possible...
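As an illustration, the `/activity` protocol could be as small as the sidecar sketch below. The endpoint name and JSON shape are assumptions from this discussion, not an existing API; how `last_activity` gets updated (network sniffing, notebook internals, etc.) is left to the implementation.

```python
# Minimal sketch of a hypothetical '/activity' protocol: any server
# (sidecar, singleuser, notebook) answers GET /activity with a JSON
# timestamp of the last observed activity.
import json
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer

# updated by whatever activity tracker the implementation uses
last_activity = datetime(2019, 1, 1, 12, 0, tzinfo=timezone.utc)

def activity_payload(ts: datetime) -> bytes:
    return json.dumps({"last_activity": ts.isoformat()}).encode()

class ActivityHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/activity":
            body = activity_payload(last_activity)
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

# HTTPServer(("", 8888), ActivityHandler).serve_forever()  # run in a sidecar
```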

In general, I think it'll be great to explicitly define what a 'jupyterhub equipped pod' means, and we can go from there.

I agree re: go. My ideal would be to find a maintained proxy from somewhere else that can give us the behavior we want purely through configuration rather than requiring us to write and maintain code.

minrk commented 5 years ago

I think JupyterHub should do this tracking itself internally if possible...

I'm not sure what you mean by internally tracking here. The design of JupyterHub is that the Hub is completely not involved during normal user interaction with their own server(s). So it is 100% on the proxy and/or server to implement activity tracking. It is already the case that the Hub is responsible for storing the activity, so it's only one request for Hub API clients to check last_activity of all servers, if that's what you are referring to.

we will hit /activity, and it should return the timestamp of last activity

That's an interesting idea. This will have to be configurable, but we can do it. JupyterHub 1.0 reverses this - singleuser servers push activity rather than the Hub pulling it. The auth circumvention in binderhub makes pull a challenge, since jupyterhub's token auth to the API doesn't work.
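For reference, from the singleuser side the 1.0 push model amounts to periodically POSTing a last-activity timestamp to the Hub's REST API, authenticated with the server's API token. The sketch below only builds such a request; the function name is illustrative, and the URL/body shape follows my reading of the Hub activity endpoint.

```python
# Sketch of the JupyterHub 1.0 "push" model: the singleuser side POSTs
# its last-activity timestamp to the Hub. build_activity_request is a
# hypothetical helper, not part of jupyterhub.
import json
from urllib.request import Request

def build_activity_request(hub_api_url: str, user: str, server_name: str,
                           last_activity_iso: str, token: str) -> Request:
    body = json.dumps({
        "servers": {server_name: {"last_activity": last_activity_iso}},
        "last_activity": last_activity_iso,
    }).encode()
    return Request(
        f"{hub_api_url}/users/{user}/activity",
        data=body,
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )

# urlopen(build_activity_request(api_url, user, "", now_iso, api_token))
# where api_url/api_token come from JUPYTERHUB_API_URL / JUPYTERHUB_API_TOKEN
```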

In general, I think it'll be great to explicitly define what a 'jupyterhub equipped pod' means, and we can go from there.

Yeah, we can work on this. I believe what we have with JupyterHub 1.0 is:

  1. it's authenticated with oauth (I view binderhub's circumvention of hub auth as something that shouldn't be part of a jupyterhub spec)
  2. it's http(s) on given url
  3. it supports running on a /user/$name/$servername prefix
  4. it tracks internal activity and publishes it to the jupyterhub activity API (not strictly necessary, but required for idle culling with non-default proxies)
  5. we specify the relevant environment variables that contain this info

The biggest challenge with activity tracking when using the jupyterhub-naïve notebook server, either push or pull, is that it requires server extensions and coordinated auth (jupyterhub can't make authenticated requests to binder notebooks). This is a challenge for Binder, where we don't have a good answer for installing deployment-sensitive extensions.

yuvipanda commented 5 years ago

Thanks for the detailed response, @minrk!

By 'internally', I meant the code to make these external calls and keep a note of their last activity status should be in JupyterHub. However, if the tracking is 'push' in 1.0 (I didn't know this!) that sounds awesome and much more efficient. I guess this too can be part of the JupyterHub server protocol as you mention.

Do you think we can formally write that up somewhere?

minrk commented 5 years ago

Do you think we can formally write that up somewhere?

Yes, absolutely. I put the skeleton of what should be included above as a note because I only had a few minutes, but the detailed version of this definitely belongs in the jupyterhub docs. Probably a new page.

adriendelsalle commented 5 years ago

Hi everyone

I'm currently implementing a JupyterHub and a BinderHub and I'm facing some issues with culling. I think it could help to have this little memo alongside @minrk's list of facts:

We currently have those options:

Other options control the frequency of activity checks, culling, user culling, etc. I focused on kernel/server culling.

It looks like a lot of work to do, mainly at the notebook/JupyterLab level. I don't know how the Jupyter project handles priorities, consistency, and user experience across notebook/lab/hub/binder. Can you raise this point as active members of Jupyter?

Feel free to complete/correct my understanding of the situation. I would be happy to contribute!

minrk commented 5 years ago

All of the classic notebook's culling features are available in JupyterLab because those are server-side features and jupyterlab uses the same server (soon a fork of the same server with the same features, but still). JupyterHub's culling in general works just fine with JupyterLab, but can be hindered somewhat by JupyterLab's sometimes overzealous polling behavior (I believe this is the linked lab issue).

I don't think there's necessarily a whole lot to do. Adding an internal max-age is easy to do, even via a server extension:

```python
from tornado.ioloop import IOLoop
from traitlets.config.application import Application

max_age = 3600  # one hour

def shutdown():
    # stop the running (notebook) application singleton
    Application.instance().stop()

# schedule an unconditional shutdown once max_age seconds have elapsed
IOLoop.current().call_later(max_age, shutdown)
```

Culling terminals with similar parameters to kernels makes perfect sense.

The JupyterLab polling is a recurring issue, and getting JupyterLab to do less while "idle" (and deciding what counts as idle) remains an open question.

yuvipanda commented 11 months ago

@minrk I think this can be closed, right?

minrk commented 11 months ago

I don't think so. If the JupyterHub chart switched to traefik from chp, binderhub would have to disable the idle culler because it wouldn't work, as the Hub would have no sources of activity for binder pods (unless auth is enabled).

https://github.com/jupyterhub/traefik-proxy/issues/151 is the issue for activity tracking in traefik-proxy, which I think is doable (if we can assume prometheus), but a nontrivial amount of work and has some tricky decisions with tradeoffs to consider.
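For what it's worth, if traefik's per-route request counters were scraped into prometheus, "last activity" could be derived as the time a route's counter last increased. A toy version of that derivation (metric scraping and all names are illustrative assumptions, not traefik-proxy code):

```python
# Toy derivation of last activity from a cumulative request counter,
# as scraped from prometheus for one route. The scraping itself and
# any metric/label names are assumptions, not implemented anywhere.
def last_activity_from_counter(samples):
    """samples: list of (unix_ts, cumulative_request_count) pairs,
    ascending by time. Returns the timestamp of the most recent
    counter increase, or None if the counter never moved."""
    last = None
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v1 > v0:
            last = t1
    return last
```

The resolution would be limited by the scrape interval, which is one of the tradeoffs mentioned above.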