kubeflow / notebooks

Kubeflow Notebooks lets you run web-based development environments on your Kubernetes cluster by running them inside Pods.
Apache License 2.0

Notebooks 2.0 // Controller // Implement Culling #38

Open thesuperzapper opened 2 months ago

thesuperzapper commented 2 months ago

related https://github.com/kubeflow/kubeflow/issues/7156

Now that https://github.com/kubeflow/notebooks/pull/34 is merged, only a small number of tasks remain before the "Kubeflow Workspaces Controller" is finished.

One task is to implement the culling functionality of the WorkspaceKinds, which automatically pauses Workspaces that have not been used for a period of time. Unlike in Notebooks 1.0, we now support arbitrary "activity probes", in addition to a direct integration with JupyterLab.

Implementation

We will create a separate controller loop that checks, on a regular interval, for Workspaces needing a probe and triggers a “reconcile” for each one (this is actually a “probe” in this case, but the function is still called Reconcile()). The reconcile then either culls the Workspace or updates its status information.

Definition of “workspace needing probe”:

CRDs

This section of the WorkspaceKind definition is where users will define and configure their culling probes:

spec:
  podTemplate:

    ## activity culling configs (MUTABLE)
    ##  - for pausing inactive Workspaces
    ##
    culling:

      ## if the culling feature is enabled
      ##
      enabled: true

      ## the maximum number of seconds a Workspace can be inactive
      ##
      maxInactiveSeconds: 86400

      ## the probe used to determine if the Workspace is active
      ##
      activityProbe:

        ## OPTION 1: a shell command probe
        ##  - if the Workspace had activity in the last 60 seconds this command
        ##    should return status 0, otherwise it should return status 1
        ##
        #exec:
        #  command:
        #    - "bash"
        #    - "-c"
        #    - "exit 0"

        ## OPTION 2: a Jupyter-specific probe
        ##  - will poll the `/api/status` endpoint of the Jupyter API, and use the `last_activity` field
        ##    https://github.com/jupyter-server/jupyter_server/blob/v2.13.0/jupyter_server/services/api/handlers.py#L62-L67
        ##  - note, users need to be careful that their other probes don't trigger a "last_activity" update
        ##    e.g. they should only check the health of Jupyter using the `/api/status` endpoint
        ##
        jupyter:
          lastActivity: true

          ## NOTE: this is NEW, otherwise, if multiple ports are defined, we don't know which to probe
          portId: "jupyterlab"

Changes from current CRDs

After designing and discussing this implementation, we figured that we need to make some small additions to the existing Notebooks 2.0 CRDs:

Probes

Jupyter Probe

The culling controller will do the following on its "reconciliation" loop for the Jupyter-type probe:

  1. Make an HTTP request to the /api/status endpoint of the Pod:
    • ?? it might be possible to use the Kubernetes /exec api to make an HTTP request also, not sure
  2. Set the “Last update” to the time we STARTED sending the request.
  3. Set the “Last activity” to the last_activity field returned by the Jupyter API.
  4. Update the status of the Workspace resource with the new activity values.
  5. Apply culling:
    • If the “last activity” is more than “Max Inactive Seconds” ago (with 5 seconds of buffer), cull the Workspace (unless “disable culling” is true for the Workspace)
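
Steps 2 to 5 above can be sketched as follows. This is only an illustration under stated assumptions, not the controller's implementation: the function and parameter names are hypothetical, the HTTP request itself is left out (the function takes the raw response body), and the "5 seconds of buffer" is interpreted as extra time added on top of Max Inactive Seconds before culling.

```python
import json
from datetime import datetime, timedelta, timezone

CULL_BUFFER_SECONDS = 5  # assumed meaning of the "5 seconds of buffer" in step 5


def parse_jupyter_time(value):
    """Parse an ISO 8601 timestamp such as '2024-03-14T16:55:03.000000Z'."""
    # Replace the trailing 'Z' so older Python versions can parse it too.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def evaluate_jupyter_probe(status_body, probe_started_at, now,
                           max_inactive_seconds, culling_disabled=False):
    """Apply steps 2-5 to the raw body of a GET /api/status response."""
    last_activity = parse_jupyter_time(json.loads(status_body)["last_activity"])
    inactive_for = now - last_activity
    limit = timedelta(seconds=max_inactive_seconds + CULL_BUFFER_SECONDS)
    return {
        "last_update": probe_started_at,  # step 2: time the request STARTED
        "last_activity": last_activity,   # step 3: value reported by Jupyter
        "should_cull": (not culling_disabled) and inactive_for > limit,  # step 5
    }


# Example: last activity was one day and six seconds before "now".
body = '{"started": "2024-03-13T00:00:00.000000Z", "last_activity": "2024-03-14T00:00:00.000000Z"}'
start = datetime(2024, 3, 15, 0, 0, 5, tzinfo=timezone.utc)
result = evaluate_jupyter_probe(body, start, start + timedelta(seconds=1),
                                max_inactive_seconds=86400)
print(result["should_cull"])
```

Keeping the decision separate from the HTTP call makes the buffer and the "disable culling" override easy to unit test.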

Bash Probe

The culling controller will do the following on its "reconciliation" loop for the Bash-type probe:

  1. Send bash commands to the pod via the Kubernetes /exec API:
    • If the bash command exits with anything other than 1 or 0, don’t continue and possibly raise a warning in the logs.
    • NOTE: exit 0 -> there has not been activity in the past 60 seconds
    • NOTE: exit 1 -> there has been activity in the past 60 seconds
    • We should time out the bash command after 55 seconds (and assume that there WAS activity):
      • !! we need to document that the bash probe must take less than 55 seconds.
  2. Set “Last Update” to the time we STARTED sending the bash command to the pod.
  3. Set “Last Activity”:
    • If the bash probe exits with 0, the “last activity” should not be updated (unless it’s currently 0, then it should be the current time)
    • If the bash probe exits with 1, the “last activity” should be set to the current time.
  4. Update the status of the Workspace resource with the new activity values.
  5. Apply culling:
    • If the “last activity” is more than “Max Inactive Seconds” ago (with 5 seconds of buffer), cull the Workspace (unless “disable culling” is true for the Workspace)
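
The "Last Activity" rules in steps 1 and 3 can be sketched as a single function. The names here are hypothetical, times are assumed to be epoch seconds, and the exit-code mapping follows the notes in step 1 of this list (exit 1 means activity). The comment in the CRD example above describes the opposite convention (status 0 means activity), so the mapping would need to be flipped if that convention is the one adopted.

```python
import logging

log = logging.getLogger("culling")

PROBE_TIMEOUT_SECONDS = 55  # step 1: the command is aborted after 55 seconds


def next_last_activity(exit_code, timed_out, last_activity, now):
    """Return the new "last activity" epoch (seconds), or None to skip the update.

    exit_code     -- exit status of the probe command (ignored if timed_out)
    timed_out     -- True if the command exceeded PROBE_TIMEOUT_SECONDS
    last_activity -- currently stored value, with 0 meaning "never recorded"
    now           -- current time as epoch seconds
    """
    if timed_out:
        return now  # a timeout is treated as "there WAS activity"
    if exit_code == 1:
        return now  # activity in the past 60 seconds
    if exit_code == 0:
        # No recent activity: keep the stored value, but initialise it if unset.
        return now if last_activity == 0 else last_activity
    log.warning("activity probe exited with unexpected code %d", exit_code)
    return None  # don't continue: leave the Workspace status untouched
```

The caller would then run the same culling check as step 5 against the returned value, skipping it entirely when the result is None.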

Future Work

thesuperzapper commented 2 months ago

@jiridanek, @Adembc, @kimwnasptd, or @ederign: you might be interested in picking this up. It's one of the two remaining tasks to finish the Notebooks 2.0 controller.

Adembc commented 2 months ago

I will take it @thesuperzapper

thesuperzapper commented 2 months ago

@Adembc I realized that we also need to add spec.podTemplate.culling.activityProbe.jupyter.portId so we know which port to probe with the Jupyter HTTP requests.

thesuperzapper commented 1 month ago

@Adembc as we discussed today, we need to make a few changes to the implementation:

  1. Rename minimumProbeInterval to maxProbeInterval, because it is more correct to describe this value as the "maximum time between probes": it is the maximum INTERVAL.
  2. Add a new minProbeInterval (on WorkspaceKind) which limits how frequently the probes can be made (to avoid spamming with probes on failure)
  3. To enable minProbeInterval, we need to add the following new status fields to Workspace:
status:
  activity:
    ...

    ## information about the last activity probe
    lastProbe:
      ## the time the probe was started (UNIX epoch in milliseconds)
      startTimeMs: 1710435303000

      ## the time the probe was completed (UNIX epoch in milliseconds)
      endTimeMs: 1710435305000

      ## the result of the probe
      ##  - ENUM: "Success" | "Failure" | "Timeout"
      result: "Success"

      ## a human-readable message about the probe result
      ##  - WARNING: this field is NOT FOR MACHINE USE, subject to change without notice
      ##  - EXAMPLES:
      ##     - "Jupyter probe succeeded"
      ##     - "Jupyter probe failed: HTTP 500"
      ##     - "Jupyter probe failed: invalid response body"
      ##     - "Jupyter probe failed: timeout after 5000ms"
      ##     - "Bash probe succeeded"
      ##     - "Bash probe failed: unexpected exit code 100"
      ##     - "Bash probe failed: timeout after 5000ms"
      message: ""
  4. When implementing the bash probe, we probably need to either:
    1. Import the whole kubectl package into the controller
    2. OR: if possible, use the existing kubernetes-go library, similar to what this Stack Overflow post describes. However, this might not work because the Kubernetes exec API recently switched to using WebSockets.
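
As a rough illustration of how the new lastProbe status could support minProbeInterval, here is a small sketch. The helper names are hypothetical, and two details are assumptions not settled in this thread: that minProbeInterval is expressed in seconds, and that it is measured from the end of the previous probe (endTimeMs).

```python
PROBE_RESULTS = ("Success", "Failure", "Timeout")  # the ENUM from the status comment


def build_last_probe(start_time_ms, end_time_ms, result, message):
    """Build the status.activity.lastProbe object for a finished probe."""
    if result not in PROBE_RESULTS:
        raise ValueError(f"unknown probe result: {result!r}")
    return {
        "startTimeMs": start_time_ms,
        "endTimeMs": end_time_ms,
        "result": result,
        "message": message,  # human-readable only, not for machine use
    }


def probe_allowed(last_probe, now_ms, min_probe_interval_seconds):
    """Return True if a new probe may start without violating minProbeInterval."""
    if last_probe is None:
        return True  # the Workspace has never been probed
    elapsed_ms = now_ms - last_probe["endTimeMs"]
    return elapsed_ms >= min_probe_interval_seconds * 1000


last_probe = build_last_probe(1710435303000, 1710435305000,
                              "Failure", "Jupyter probe failed: HTTP 500")
print(probe_allowed(last_probe, 1710435305000 + 10_000, 30))
```

Because a failed probe still records endTimeMs, repeated failures are rate limited in the same way as successes.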

thesuperzapper commented 1 month ago

@Adembc As discussed in the meeting, we should probably extend the exec type probe to actually be a script which writes a file to the disk, rather than an exit-code-based system.

For example, we might update the WorkspaceKind to have the following fields under spec.podTemplate.culling.activityProbe.exec:

spec:
  podTemplate:
    culling:

      ## the probe used to determine if the Workspace is active
      ##
      activityProbe:

        ## OPTION 1: a custom probe
        ##
        exec:
          ## the script should write a JSON file at this path
          ##  - any existing file at this path will be REMOVED before the script is run
          ##  - the JSON object should have ONE of the following fields:
          ##     - `has_activity`: a boolean indicating if the Workspace was active in the last 60 seconds
          ##     - `last_activity`: the last activity time in ISO 8601 format (e.g. "2030-01-01T00:00:00Z")
          ##  - if both fields are present, `has_activity` will be used
          ##
          outputPath: "/tmp/activity_probe.json"

          ## the number of seconds to wait for the script to complete
          ##  - the probe will be considered a failure if the script does not complete in time
          ##  - workspaces with failing activity probes will NOT be culled
          ##
          timeoutSeconds: 60

          ## the script to run to determine if the Workspace is active
          ##  - the script must exit with a 0 status code unless there is an error
          ##  - workspaces with failing activity probes will NOT be culled
          ##  - the script must have a shebang (e.g. `#!/usr/bin/env bash` or `#!/usr/bin/env python`)
          ##  - the script should be idempotent and without side effects, it may be run multiple times
          ##  - typically, it will be more efficient to write a probe which checks for a specific
          ##    activity indicator agreed with your users, rather than checking the entire filesystem
          ##
          script: |-
            #!/usr/bin/env bash

            set -euo pipefail

            # Define the output path
            output_path="/tmp/activity_probe.json"

            # Find the most recent modification time in the $HOME directory
            last_activity_epoch=$(find "$HOME" -type f -printf '%T@\n' 2>/dev/null | awk 'max < $1 { max = $1 } END { print max }')

            # Write the last activity time to the output path
            if [ -n "$last_activity_epoch" ]; then
                # Convert epoch time to ISO 8601 format
                last_activity=$(date -d "@$last_activity_epoch" -Iseconds)
                echo "{\"last_activity\": \"$last_activity\"}" > "$output_path"
            else
                # Handle the case where no files are found
                echo "{\"last_activity\": null}" > "$output_path"
            fi
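
On the controller side, reading the file at outputPath might look like the sketch below. It is a minimal illustration with hypothetical names, assuming that `has_activity: true` is recorded as activity at the current time, and that `has_activity: false` or `last_activity: null` means there is nothing new to record.

```python
import json
from datetime import datetime, timezone


def last_activity_from_output(raw, now):
    """Return the last-activity time implied by the probe's JSON output.

    Returns None when the output carries no new activity information.
    Raises ValueError when the output matches neither documented field.
    """
    data = json.loads(raw)  # invalid JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("probe output must be a JSON object")
    if "has_activity" in data:  # takes precedence when both fields are present
        if not isinstance(data["has_activity"], bool):
            raise ValueError("has_activity must be a boolean")
        return now if data["has_activity"] else None
    if "last_activity" in data:
        value = data["last_activity"]
        if value is None:  # the example script writes null when no files exist
            return None
        if not isinstance(value, str):
            raise ValueError("last_activity must be an ISO 8601 string or null")
        # Accept both "...Z" and the "+00:00" style produced by `date -Iseconds`.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    raise ValueError("probe output has neither has_activity nor last_activity")


now = datetime(2030, 1, 2, tzinfo=timezone.utc)
print(last_activity_from_output('{"last_activity": "2030-01-01T00:00:00Z"}', now))
```

Any ValueError here would be reported as a failed probe, which per the comments above means the Workspace is NOT culled.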