kubeflow / kubeflow

Machine Learning Toolkit for Kubernetes
https://www.kubeflow.org/
Apache License 2.0
14.16k stars 2.37k forks

Jupyter - Culling of Idle Pods #1394

Closed tlkh closed 3 years ago

tlkh commented 6 years ago

JupyterHub has a demo of culling idle pods using a script, which is present as an example JupyterHub service. I have a somewhat hacky solution for implementing this within Kubeflow using the ConfigMap.

We add the following configuration to the ConfigMap jupyterhub-config:

c.JupyterHub.services = [
    {
        # Managed service that downloads the culling script
        'name': 'wget-cull-idle',
        'admin': True,
        'command': ['wget', 'https://raw.githubusercontent.com/jupyterhub/jupyterhub/master/examples/cull-idle/cull_idle_servers.py', '-N']
    },

    {
        # Managed service that runs the culler with a one-hour idle timeout
        'name': 'cull-idle',
        'admin': True,
        'command': ['python', 'cull_idle_servers.py', '--timeout=3600']
    }
]

The solution works by having JupyterHub start two managed services: one to fetch the cull_idle_servers.py file and one to start it. Hope this is useful!

I'm actually wondering if we should start a specific section in the documentation that deals with configuration that can easily be done through the ConfigMap (like this one). We could talk about things like form customisation, adding more fields (like node selection via labels), etc. I feel there's a lot of potential there, and I'd be glad to contribute to that documentation.

jlewi commented 5 years ago

It looks like this information is also available from the Jupyter servers themselves. I think it's last_activity: http://petstore.swagger.io/?url=https://raw.githubusercontent.com/jupyter/notebook/master/notebook/services/api/api.yaml#/kernels/get_api_kernels__kernel_id_
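For reference, a minimal sketch (in Go, since the notebook controller is written in Go) of reading last_activity from a notebook server's kernels API. The hostname, port, and lack of token auth are assumptions; a real single-user server usually requires an API token or sits behind the Hub proxy:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// kernel mirrors the fields of interest in the Jupyter REST API's
// GET /api/kernels response.
type kernel struct {
	ID             string    `json:"id"`
	LastActivity   time.Time `json:"last_activity"`
	ExecutionState string    `json:"execution_state"`
	Connections    int       `json:"connections"`
}

func main() {
	// Placeholder in-cluster address of a single-user notebook server.
	resp, err := http.Get("http://notebook.example.svc.cluster.local:8888/api/kernels")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var kernels []kernel
	if err := json.NewDecoder(resp.Body).Decode(&kernels); err != nil {
		panic(err)
	}
	for _, k := range kernels {
		fmt.Printf("kernel %s: state=%s last_activity=%s\n", k.ID, k.ExecutionState, k.LastActivity)
	}
}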

I mention this because we are considering getting rid of JupyterHub (#1630).

clkao commented 5 years ago

A bit more info on notebook self-culling vs. cull_idle_servers: https://github.com/kubeflow/kubeflow/issues/56#issuecomment-414688426. The former is also useful in non-JupyterHub scenarios, if we ever get there.

Note that this needs the user pod's restartPolicy set to OnFailure, which was broken in kubespawner until https://github.com/jupyterhub/kubespawner/pull/251 was merged.

jlewi commented 5 years ago

Adding this to 0.6.

It would be great to make this an option in the Jupyter notebook controller and support it in the UI. A CUJ (critical user journey) might be helpful.

For example, it might be nice to terminate idle Jupyter pods without deleting the Jupyter resource, so that a user could restart it from the UI with one click.

/cc @vkoukis @zabbasi

lluunn commented 5 years ago

cc

kimwnasptd commented 5 years ago

@jlewi we would like to help drive this forward. /assign kimwnasptd

jlewi commented 5 years ago

We should try to get this into the controller in 0.7 in preparation for going 1.0.

kimwnasptd commented 5 years ago

@jlewi yes, I was planning to work on this one for 0.7

The overall experience could be the following: the UI would support stopping/starting Notebook Pods, and we would have a culler, as a sidecar container to the controller, which would stop idle Notebooks. The user could then start a stopped Notebook from the UI.

One thing we need to clarify is what counts as an idle Pod. If we take for granted that Istio is installed in the cluster, so we can use metrics from Prometheus, then an idle Pod could be one that doesn't receive HTTP traffic for a {configurable} amount of time.
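As a rough illustration of that idle check, here is a sketch of querying Prometheus for a notebook's Istio traffic with the Prometheus Go client. The Prometheus address, the istio_requests_total label selector, and the one-hour window are assumptions, not the final design:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed in-cluster address of the Prometheus installed with Istio.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.istio-system.svc:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// Hypothetical query: request rate to the notebook's workload over the
	// last hour. A zero result would mean the notebook received no traffic.
	query := `sum(rate(istio_requests_total{destination_workload="my-notebook"}[1h]))`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println("traffic in the last hour:", result)
}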

For this UX, the changes we would need to make would be:

Notebook Controller:

  1. Add a field to the Notebook Spec to specify the number of Pods to be created. We need this so we can set the Pods in the StatefulSet to zero, which stops the Notebook.

Notebook Culler:

It will be a sidecar container in the notebook controller. It will periodically check the existing Notebook CRs, query Prometheus for their traffic, and stop any idle Notebooks. To stop a Notebook, it will PATCH the CR and set the number of Pods to zero (a rough sketch of this PATCH follows after this list).

Notebooks Manager UI:

  1. Give users the option to stop/start a Notebook. Under the hood, the backend will also modify the number of Pods in the CR.
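A rough sketch of the stop operation described above (the culler, or the UI backend, patching the Notebook CR so its StatefulSet scales to zero), using client-go's dynamic client. The group/version, the spec field name, and the namespace/name are all placeholders, since the actual field is what this proposal would add:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Assumed GVR for the Notebook CRD; the real group/version may differ.
	gvr := schema.GroupVersionResource{
		Group:    "kubeflow.org",
		Version:  "v1alpha1",
		Resource: "notebooks",
	}

	// Hypothetical spec field proposed above: the number of Pods in the
	// underlying StatefulSet. Setting it to zero stops the Notebook.
	patch := []byte(`{"spec":{"replicas":0}}`)

	_, err = dyn.Resource(gvr).Namespace("user-namespace").Patch(
		context.TODO(), "my-notebook", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("notebook scaled down to zero")
}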

@jlewi WDYT

jlewi commented 5 years ago

Why does the culling logic need to be in a sidecar?

Why can't we just make this a part of the regular notebook controller logic and invoke it in the reconcile function for each notebook?

kimwnasptd commented 5 years ago

I was thinking that we might need this (or similar) logic for the Viewer CRD as well. So instead of repeating the logic in the Viewer controller, we could have it in a central place. Then we could launch the culling container as a sidecar to the controllers and only change the culler's configuration for each case.

Regarding the Viewer CR: right now the controller has a maximum number of allowed live instances. If a new Viewer is created and the total number exceeds this maximum, then the oldest one is deleted. Wouldn't we want similar culling logic here?

jlewi commented 5 years ago

@kimwnasptd If we need similar logic in other CRDs, can't we just make it a Go library and import that into the viewer controller?

kimwnasptd commented 5 years ago

@jlewi yes, that would also work. I don't have any hard preference as long as we make the code reusable.

Regarding making this part of the reconciliation loop: we would be making an HTTP request to the Prometheus server in istio-system for each Notebook. Since the Reconcile function is called with high frequency, this would produce a fair number of requests. That is why I was thinking of making it a standalone process/container that is triggered periodically.

But if you think this shouldn't be much of a problem, then I can just make it a Go package and trigger the logic in the Reconcile function of the controller.

kimwnasptd commented 5 years ago

I think we should be fine with having this in the Reconcile function. We can also set the reconcile to re-trigger at specific time intervals. I opened a PR to move forward with the implementation.
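For context, re-triggering the reconcile at a fixed interval is a matter of returning RequeueAfter from the Reconcile function. A minimal controller-runtime sketch (the struct and the interval are illustrative, not the actual implementation):

package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// NotebookReconciler stands in for the real notebook controller struct.
type NotebookReconciler struct {
	client.Client
}

// Reconcile asks controller-runtime to call it again after a fixed interval,
// so the culling check runs periodically rather than only on watch events.
// (Older controller-runtime versions use a signature without the context.)
func (r *NotebookReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... fetch the Notebook, read its activity signal, and scale it down
	// if it has been idle longer than the configured threshold ...

	return ctrl.Result{RequeueAfter: time.Minute}, nil
}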

jlewi commented 5 years ago

Per the discussion in #3856, I think we might try to use the idle signal that the Jupyter kernel provides.

jlewi commented 5 years ago

Based on the discussion in #3856, I wanted to capture some of the next steps here before we mark this feature done.

kimwnasptd commented 5 years ago

For having a specialized idle time per Notebook, we will need to store that information somewhere on the Notebook CR instances. I can think of two options: (1) an annotation on the Notebook CR, or (2) a dedicated field in the Notebook Spec.

In either case, if this field is not specified, the controller will stop Notebooks based on the default idle time, or it will not stop any Notebook if culling is disabled altogether.

I have a preference for the first option (the annotation), since it might be a little more portable to other CRs, but I'm good to go with whichever option we decide on.
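For illustration, a sketch of how the controller could resolve the per-Notebook idle time from an annotation, falling back to a default; the annotation key and the default value are made up, not the agreed-upon names:

package culler

import (
	"strconv"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical annotation key for a per-Notebook idle threshold in minutes.
const idleTimeAnnotation = "notebooks.kubeflow.org/idle-time-minutes"

// Assumed controller-wide default, used when a Notebook has no override.
const defaultIdleTime = 60 * time.Minute

// idleTimeFor returns the idle threshold for a Notebook, falling back to
// the default when the annotation is missing or malformed.
func idleTimeFor(meta metav1.ObjectMeta) time.Duration {
	if v, ok := meta.GetAnnotations()[idleTimeAnnotation]; ok {
		if minutes, err := strconv.Atoi(v); err == nil && minutes > 0 {
			return time.Duration(minutes) * time.Minute
		}
	}
	return defaultIdleTime
}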

jlewi commented 5 years ago

Can we make the behavior

I don't have a strong preference between an annotation and extending the CR.

I think the main advantage of making it part of the Spec is that it's part of the notebook's API in a more discoverable way, which decreases the likelihood of user error.

I'm ok with either option.

I'd optimize for landing this feature in time for 0.7.

therc commented 5 years ago

What about automatic unidling? Both Red Hat (in OpenShift) and Deis have idling/unidling controllers that can scale a resource down to zero, then scale it back up when new requests come in, with a proxy forwarding any requests that arrived while no pods were up.

See e.g. https://github.com/deislabs/osiris

jlewi commented 4 years ago

@therc If people want to unidle they would do that through the UI.

@kimwnasptd Any update on this?

jlewi commented 4 years ago

@kimwnasptd any update on this?

Since notebooks are going beta in 0.7 and we don't want more API changes after that, per-notebook idle times will likely need to be set via annotation.

jlewi commented 4 years ago

@kimwnasptd Could you provide an update on this? Specifically, what do you think the 1.0 behavior will be?

kimwnasptd commented 4 years ago

Sure! For 1.0 I was thinking of having 2-3 PodDefaults that represent different culling behaviors. The user will then be able to select which one of them to apply to a new Notebook they create.

By default, JWA won't have any PodDefaults preselected (although the admin can change that in JWA's ConfigMap). This way new Notebooks will not be culled unless the user explicitly requests it, and the culling behavior will be tailored on a per-Notebook basis.

Also, the UI will allow the user to start a culled Notebook or stop a running one.

I want to ensure this functionality lands for 1.0, and if I have the time we can iterate and add more functionality. @jlewi WDYT?

jlewi commented 4 years ago

@kimwnasptd Why would we use PodDefaults? Why not just add a field in the UI that allows people to set the culling time?

kimwnasptd commented 4 years ago

@jlewi I was thinking that users will mostly select the same culling times for their different Notebook classes. Typically they would want their GPU Notebooks to be up as little as possible, and might have two or three different culling times for their normal Notebooks.

Because of this, I thought it would make more sense to have a drop-down for users to select a predefined culling time rather than setting it directly. This drop-down would also need to be configurable by the admin in JWA's config.

Both of these requirements could be satisfied with PodDefaults, so I preferred that approach. Do you think this behavior should be more visible and have its own section?

jlewi commented 4 years ago

@kimwnasptd Thanks for the explanation. Can we add docs and examples for the PodDefaults that people can use to set up this behavior? Then we can close this issue out.

nsshah1288 commented 4 years ago

Is there any action on this? Is there any ability in Kubeflow to cull idle pods? We want the ability to shut down Jupyter notebooks after a certain period of idle time.

thesuperzapper commented 4 years ago

It looks like this may have been added in: #3856

But I am not sure we should be tightly integrating the status check with Jupyter itself (so that we can support other notebook types).

jshelman commented 4 years ago

Seeing issues with the changes merged from #3856. The activity check at ../api/status fails with a 403 error: RBAC: access denied.

I'm also seeing log entries in the notebook controller: 2020-04-11T01:50:31.165Z INFO culler Warning: GET to http://testnb.namespace.svc.cluster.local/notebook/namespace/testnb/api/status: 403

This is after enabling ENABLE_CULLING: true as an env for the deployment.

descampsk commented 4 years ago

Got the same issue.

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label: kind/question (probability: 0.65)


jlewi commented 4 years ago

@kimwnasptd thoughts?

jshelman commented 4 years ago

I got around this with a simple hack in this case: setting the kubeflow-userid header to a value in the culler allowed the request to complete.

descampsk commented 4 years ago

Could you provide more details on your hack? :)

jshelman commented 4 years ago

I consider it a hack because I didn't have time to understand it well enough to fix it correctly. I assume this works because the Istio service mesh and/or Dex expects this header, even though the API status call is unsecured.

git diff:

--- a/components/notebook-controller/pkg/culler/culler.go
+++ b/components/notebook-controller/pkg/culler/culler.go
@@ -142,7 +142,15 @@ func getNotebookApiStatus(nm, ns string) *NotebookStatus {
                "http://%s.%s.svc.%s/notebook/%s/%s/api/status",
                nm, ns, domain, ns, nm)

-       resp, err := client.Get(url)
+       req, err := http.NewRequest("GET", url, nil)
+       if err != nil {
+               log.Info(fmt.Sprintf("Error creating request to %s", url), "error", err)
+               return nil
+       }
+
+       req.Header.Set("kubeflow-userid", "user@email.com")
+
+       resp, err := client.Do(req)
        if err != nil {

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.