jupyterhub / binderhub


Idle culling will stop working with traefik proxy #831

Open minrk opened 5 years ago

minrk commented 5 years ago

Related to https://github.com/jupyterhub/zero-to-jupyterhub-k8s/pull/1162 and https://github.com/jupyterhub/jupyterhub/pull/2346

Binder currently relies on JupyterHub's activity tracking. With JupyterHub < 1.0, this information comes solely from configurable-http-proxy. JupyterHub 1.0 moves the onus for this to jupyterhub-singleuser because alternative proxies like traefik do not track activity. This is better for JupyterHub in general, but since binder launches vanilla notebook servers and not jupyterhub-singleuser, this activity is not tracked at the network level.

Additionally, we have learned on mybinder.org that the notebook's internal activity tracking is better and more reliable since it can make more fine-grained activity decisions (e.g. choosing to cull with idle but connected websockets).

So we have some facts:

  1. cull_idle_servers.py assumes the behavior of jupyterhub-singleuser or configurable-http-proxy for activity tracking
  2. binderhub assumes a notebook server, but not jupyterhub-singleuser
  3. notebook servers track activity internally
  4. we've stated a few places that we only want to assume an http server for Binder, but this cannot be achieved if we want to keep the activity tracking necessary for idle culling, unless we deploy a sidecar container in user pods that implements activity tracking (this pod could be configurable-http-proxy!)

I'm not sure exactly what we should do about this, but it's a pretty big issue and a blocker for adopting more resilient proxies in BinderHub.

If we continue to assume the notebook server in BinderHub, we can write a new idle-culler that talks directly to the notebook API, ignoring the Hub's activity data. This will be quite inefficient, as it requires many more requests to notebook servers rather than a single request to the Hub (this can be scaled by sharding the culler). It is also getting closer and closer to not using JupyterHub for anything at all.

If we want to skip over that and remove the notebook server assumption, we need to get to work on a sidecar container that at least implements activity tracking (reintroducing the problem of network activity not being as good as internal activity tracking), and possibly also implements auth.
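A direct-to-notebook culler along those lines might look roughly like the sketch below. It assumes the notebook server's `/api/status` endpoint, which reports a `last_activity` timestamp; the helper names and the one-hour threshold are illustrative, not an existing BinderHub component.

```python
# Sketch of a culler that bypasses the Hub and asks each notebook
# server directly. Helper names and the 1-hour threshold are
# illustrative assumptions.
from datetime import datetime, timedelta, timezone

IDLE_TIMEOUT = timedelta(hours=1)

def parse_activity(ts: str) -> datetime:
    # notebook timestamps look like "2019-01-01T12:00:00.000000Z"
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(
        tzinfo=timezone.utc
    )

def should_cull(last_activity: str, now: datetime,
                timeout: timedelta = IDLE_TIMEOUT) -> bool:
    """True if the server has been idle for longer than `timeout`."""
    return now - parse_activity(last_activity) > timeout

# For each launched pod, something like (pseudocode, helpers hypothetical):
#   status = json.load(urlopen(f"{pod_url}/api/status"))
#   if should_cull(status["last_activity"], datetime.now(timezone.utc)):
#       delete_pod(pod)
```

Sharding the culler would then just mean partitioning the set of pods each culler instance polls.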

minrk commented 5 years ago

The main reason we don't require jupyterhub-singleuser in repo2docker is that it would be a pain to put it in the containers, since it is sensitive to the version in the Hub pod.

This is another case, I think, for the transparent jupyterhub-auth-proxy discussed occasionally with @yuvipanda, which should implement at least:

  1. auth with hub
  2. activity tracking

Since I think this is mostly useful in container-based deployments like kubernetes, requiring Python so that it can import the auth implementation from jupyterhub itself is probably the best way to go from a maintainability standpoint (as opposed to go, which has benefits for portability, but isn't something I think our team has the capacity to develop and maintain at this point). Then this proxy would go in a sidecar container, exerting no requirements on the user container beyond an http endpoint that can run on a prefix.

With that, we could preserve the assumption that the user pods are fully equipped jupyterhub pods (implementing auth, etc.), while also separating it from the user env.

yuvipanda commented 5 years ago

@minrk I think for activity tracking, we can say something like 'we will hit /activity, and it should return the timestamp of last activity' (or something like that) that then becomes a protocol that can be implemented by multiple server implementations. We can do that in a sidecar, in singleuser, etc as we see fit. If we know we're running notebook, this can just use the internal notebook activity tracking. Else it can rely on network or some other mechanism. I don't think tracking this through would be too slow - this is the same as prometheus's model. I think JupyterHub should do this tracking itself internally if possible...
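As an illustration, the `/activity` protocol could be as small as the sidecar sketch below. The endpoint name and JSON shape are assumptions from this discussion, not an existing API; how `last_activity` gets updated (network sniffing, notebook internals, etc.) is left to the implementation.

```python
# Minimal sketch of a hypothetical '/activity' protocol: any server
# (sidecar, singleuser, notebook) answers GET /activity with a JSON
# timestamp of the last observed activity.
import json
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer

# updated by whatever activity tracker the implementation uses
last_activity = datetime(2019, 1, 1, 12, 0, tzinfo=timezone.utc)

def activity_payload(ts: datetime) -> bytes:
    return json.dumps({"last_activity": ts.isoformat()}).encode()

class ActivityHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/activity":
            body = activity_payload(last_activity)
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

# HTTPServer(("", 8888), ActivityHandler).serve_forever()  # run in a sidecar
```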

In general, I think it'll be great to explicitly define what a 'jupyterhub equipped pod' means, and we can go from there.

I agree re: go. My ideal would be to find a maintained proxy from somewhere else that can give us the behavior we want purely through configuration rather than requiring us to write and maintain code.

minrk commented 5 years ago

I think JupyterHub should do this tracking itself internally if possible...

I'm not sure what you mean by internally tracking here. The design of JupyterHub is that the Hub is completely not involved during normal user interaction with their own server(s). So it is 100% on the proxy and/or server to implement activity tracking. It is already the case that the Hub is responsible for storing the activity, so it's only one request for Hub API clients to check last_activity of all servers, if that's what you are referring to.

we will hit /activity, and it should return the timestamp of last activity

That's an interesting idea. This will have to be configurable, but we can do it. JupyterHub 1.0 reverses this - singleuser servers push activity rather than the Hub pulling it. The auth circumvention in binderhub makes pull a challenge, since jupyterhub's token auth to the API doesn't work.
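For reference, from the singleuser side the 1.0 push model amounts to periodically POSTing a last-activity timestamp to the Hub's REST API, authenticated with the server's API token. The sketch below only builds such a request; the function name is illustrative, and the URL/body shape follows my reading of the Hub activity endpoint.

```python
# Sketch of the JupyterHub 1.0 "push" model: the singleuser side POSTs
# its last-activity timestamp to the Hub. build_activity_request is a
# hypothetical helper, not part of jupyterhub.
import json
from urllib.request import Request

def build_activity_request(hub_api_url: str, user: str, server_name: str,
                           last_activity_iso: str, token: str) -> Request:
    body = json.dumps({
        "servers": {server_name: {"last_activity": last_activity_iso}},
        "last_activity": last_activity_iso,
    }).encode()
    return Request(
        f"{hub_api_url}/users/{user}/activity",
        data=body,
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )

# urlopen(build_activity_request(api_url, user, "", now_iso, api_token))
# where api_url/api_token come from JUPYTERHUB_API_URL / JUPYTERHUB_API_TOKEN
```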

In general, I think it'll be great to explicitly define what a 'jupyterhub equipped pod' means, and we can go from there.

Yeah, we can work on this. I believe what we have with JupyterHub 1.0 is:

  1. it's authenticated with oauth (I view binderhub's circumvention of hub auth as something that shouldn't be part of a jupyterhub spec)
  2. it's http(s) on given url
  3. it supports running on a /user/$name/$servername prefix
  4. it tracks internal activity and publishes it to the jupyterhub activity API (not strictly necessary, but required for idle culling with non-default proxies)
  5. we specify the relevant environment variables that contain this info

The biggest challenge with activity tracking when using the jupyterhub-naïve notebook server, either push or pull, is that it requires server extensions and coordinated auth (jupyterhub can't make authenticated requests to binder notebooks). This is a challenge for Binder, where we don't have a good answer for installing deployment-sensitive extensions.

yuvipanda commented 5 years ago

Thanks for the detailed response, @minrk!

By 'internally', I meant the code to make these external calls and keep a note of their last activity status should be in JupyterHub. However, if the tracking is 'push' in 1.0 (I didn't know this!) that sounds awesome and much more efficient. I guess this too can be part of the JupyterHub server protocol as you mention.

Do you think we can formally write that up somewhere?

minrk commented 5 years ago

Do you think we can formally write that up somewhere?

Yes, absolutely. I put the skeleton of what should be included above as a note because I only had a few minutes, but the detailed version of this definitely belongs in the jupyterhub docs. Probably a new page.

adriendelsalle commented 5 years ago

Hi everyone

I'm currently implementing a JupyterHub and a BinderHub and I'm facing some issues with culling. I think it could help to have this little memo alongside @minrk's list of facts:

We currently have those options:

Other options control the frequency of activity checks, culling, user culling, etc. I focused on kernel/server culling.

It looks like a lot of work to do, mainly at the notebook/JupyterLab level. I don't know how the Jupyter project handles priorities, consistency, and user experience across notebook/lab/hub/binder. Can you raise this point as active members of Jupyter?

Feel free to complete/correct my understanding of the situation. I would be happy to contribute!

minrk commented 5 years ago

All of the classic notebook's culling features are available in JupyterLab because those are server-side features and jupyterlab uses the same server (soon a fork of the same server with the same features, but still). JupyterHub's culling in general works just fine with JupyterLab, but can be hindered somewhat by JupyterLab's sometimes overzealous polling behavior (I believe this is the linked lab issue).

I don't think there's necessarily a whole lot to do. Adding an internal max-age is easy to do, even via a server extension:

```python
from tornado.ioloop import IOLoop
from traitlets.config.application import Application

max_age = 3600  # one hour

def shutdown():
    # stop the running (notebook) application singleton
    Application.instance().stop()

# schedule an unconditional shutdown once max_age seconds have elapsed
IOLoop.current().call_later(max_age, shutdown)
```

Culling terminals with similar parameters to kernels makes perfect sense.

The JupyterLab polling is a recurring issue, and getting JupyterLab to do less while "idle" (and deciding what counts as idle) remains an open question.

yuvipanda commented 11 months ago

@minrk I think this can be closed, right?

minrk commented 11 months ago

I don't think so. If the JupyterHub chart switched to traefik from chp, binderhub would have to disable the idle culler because it wouldn't work, as the Hub would have no sources of activity for binder pods (unless auth is enabled).

https://github.com/jupyterhub/traefik-proxy/issues/151 is the issue for activity tracking in traefik-proxy, which I think is doable (if we can assume prometheus), but a nontrivial amount of work and has some tricky decisions with tradeoffs to consider.
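For what it's worth, if traefik's per-route request counters were scraped into prometheus, "last activity" could be derived as the time a route's counter last increased. A toy version of that derivation (metric scraping and all names are illustrative assumptions, not traefik-proxy code):

```python
# Toy derivation of last activity from a cumulative request counter,
# as scraped from prometheus for one route. The scraping itself and
# any metric/label names are assumptions, not implemented anywhere.
def last_activity_from_counter(samples):
    """samples: list of (unix_ts, cumulative_request_count) pairs,
    ascending by time. Returns the timestamp of the most recent
    counter increase, or None if the counter never moved."""
    last = None
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v1 > v0:
            last = t1
    return last
```

The resolution would be limited by the scrape interval, which is one of the tradeoffs mentioned above.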