Open thesuperzapper opened 2 months ago
@jiridanek, @Adembc, @kimwnasptd, or @ederign: you might be interested in picking this up. It's one of the two remaining tasks to finish the Notebooks 2.0 controller.
I will take it @thesuperzapper
@Adembc I realized that we also need to add `spec.podTemplate.culling.activityProbe.jupyter.portId` so we know which port to probe with the Jupyter HTTP requests.
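To make the role of the port concrete, here is a minimal Python sketch of the response handling a Jupyter probe might need. It assumes the probe queries the `/api/status` endpoint mentioned in the issue description and reads its `last_activity` timestamp. The names `status_url`, `parse_last_activity`, `is_idle`, and the `base_path` parameter are illustrative only and not part of the proposal.

```python
import json
from datetime import datetime, timedelta, timezone


def status_url(host: str, port: int, base_path: str = "") -> str:
    """Build the status URL for the port selected by `portId` (hypothetical helper)."""
    return f"http://{host}:{port}{base_path.rstrip('/')}/api/status"


def parse_last_activity(status_body: str) -> datetime:
    """Extract `last_activity` from a Jupyter `/api/status` JSON body."""
    status = json.loads(status_body)
    raw = status["last_activity"]
    # Older Python versions do not accept a trailing "Z", so normalise it first.
    parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def is_idle(status_body: str, now: datetime, idle_threshold: timedelta) -> bool:
    """True if the server reports no activity for at least `idle_threshold`."""
    return now - parse_last_activity(status_body) >= idle_threshold
```

The actual HTTP request is left out on purpose, since how the controller reaches the Pod is still an open question in this thread.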
@Adembc as we discussed today, we need to make a few changes to the implementation:
- Rename `minimumProbeInterval` to `maxProbeInterval`, as it is more correct to say the "maximum time between probes", because this is the maximum INTERVAL.
- Add `minProbeInterval` (on WorkspaceKind), which limits how frequently the probes can be made (to avoid spamming with probes on failure).
- For `minProbeInterval`, we need to add the following new status fields to Workspace:
status:
  activity:
    ...
    ## information about the last activity probe
    lastProbe:
      ## the time the probe was started (UNIX epoch in milliseconds)
      startTimeMs: 1710435303000
      ## the time the probe was completed (UNIX epoch in milliseconds)
      endTimeMs: 1710435305000
      ## the result of the probe
      ##  - ENUM: "Success" | "Failure" | "Timeout"
      result: "Success"
      ## a human-readable message about the probe result
      ##  - WARNING: this field is NOT FOR MACHINE USE, subject to change without notice
      ##  - EXAMPLES:
      ##     - "Jupyter probe succeeded"
      ##     - "Jupyter probe failed: HTTP 500"
      ##     - "Jupyter probe failed: invalid response body"
      ##     - "Jupyter probe failed: timeout after 5000ms"
      ##     - "Bash probe succeeded"
      ##     - "Bash probe failed: unexpected exit code 100"
      ##     - "Bash probe failed: timeout after 5000ms"
      message: ""
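
As a rough illustration of the shape of these status fields, the following Python sketch builds a `lastProbe` record and rejects results outside the proposed enum. `ProbeResult` and `build_last_probe` are hypothetical names used only for this example. The real controller would define its own types.

```python
from enum import Enum


class ProbeResult(str, Enum):
    """The three result values proposed for `status.activity.lastProbe.result`."""
    SUCCESS = "Success"
    FAILURE = "Failure"
    TIMEOUT = "Timeout"


def build_last_probe(start_time_ms: int, end_time_ms: int,
                     result: str, message: str = "") -> dict:
    """Return a dict mirroring the proposed `lastProbe` status fields."""
    if end_time_ms < start_time_ms:
        raise ValueError("probe cannot end before it starts")
    return {
        "startTimeMs": start_time_ms,
        "endTimeMs": end_time_ms,
        # Raises ValueError if `result` is not one of the enum values.
        "result": ProbeResult(result).value,
        # Human-readable only; consumers should branch on `result` instead.
        "message": message,
    }
```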
- Embed the `kubectl` package into the controller.
- Use the `kubernetes-go` library, similar to what this Stack Overflow answer said. However, this might not work because they recently switched to using WebSockets on the Kubernetes exec API.

@Adembc As discussed in the meeting, we should probably extend the `exec` type probe to actually be a script which writes a file to disk, rather than an exit-code-based system.
For example, we might update the WorkspaceKind to have the following fields under `spec.podTemplate.culling.activityProbe.exec`:
spec:
  podTemplate:
    culling:
      ## the probe used to determine if the Workspace is active
      ##
      activityProbe:
        ## OPTION 1: a custom probe
        ##
        exec:
          ## the script should write a JSON file at this path
          ##  - any existing file at this path will be REMOVED before the script is run
          ##  - the JSON object should have ONE of the following fields:
          ##     - `has_activity`: a boolean indicating if the Workspace was active in the last 60 seconds
          ##     - `last_activity`: the last activity time in ISO 8601 format (e.g. "2030-01-01T00:00:00Z")
          ##  - if both fields are present, `has_activity` will be used
          ##
          outputPath: "/tmp/activity_probe.json"
          ## the number of seconds to wait for the script to complete
          ##  - the probe will be considered a failure if the script does not complete in time
          ##  - workspaces with failing activity probes will NOT be culled
          ##
          timeoutSeconds: 60
          ## the script to run to determine if the Workspace is active
          ##  - the script must exit with a 0 status code unless there is an error
          ##  - workspaces with failing activity probes will NOT be culled
          ##  - the script must have a shebang (e.g. `#!/usr/bin/env bash` or `#!/usr/bin/env python`)
          ##  - the script should be idempotent and without side effects, it may be run multiple times
          ##  - typically, it will be more efficient to write a probe which checks for a specific
          ##    activity indicator agreed with your users, rather than checking the entire filesystem
          ##
          script: |-
            #!/usr/bin/env bash
            set -euo pipefail
            # Define the output path
            output_path="/tmp/activity_probe.json"
            # Find the most recent modification time in the $HOME directory
            last_activity_epoch=$(find "$HOME" -type f -printf '%T@\n' 2>/dev/null | awk 'max < $1 { max = $1 } END { print max }')
            # Write the last activity time to the output path
            if [ -n "$last_activity_epoch" ]; then
              # Convert epoch time to ISO 8601 format
              last_activity=$(date -d "@$last_activity_epoch" -Iseconds)
              echo "{\"last_activity\": \"$last_activity\"}" > "$output_path"
            else
              # Handle the case where no files are found
              echo "{\"last_activity\": null}" > "$output_path"
            fi
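
The rules in the comments above (one of two fields, `has_activity` wins if both are present) could be interpreted on the controller side roughly as in the Python sketch below. `interpret_probe_output` is a hypothetical name, and two behaviours are assumptions rather than part of the proposal: mapping `has_activity: true` to "active as of now", and treating timestamps without an offset as UTC.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def interpret_probe_output(raw_json: str, now: datetime) -> Optional[datetime]:
    """Return the latest known activity time, or None if no activity is reported.

    Raises ValueError when the file content does not follow the proposed format,
    which the controller would presumably record as a failed probe.
    """
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("probe output must be a JSON object")

    # `has_activity` takes precedence when both fields are present.
    if "has_activity" in data:
        if not isinstance(data["has_activity"], bool):
            raise ValueError("`has_activity` must be a boolean")
        return now if data["has_activity"] else None

    if "last_activity" in data:
        raw = data["last_activity"]
        if raw is None:
            # The example script writes null when it finds no files.
            return None
        parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed

    raise ValueError("expected `has_activity` or `last_activity`")
```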
Related: https://github.com/kubeflow/kubeflow/issues/7156
Now that https://github.com/kubeflow/notebooks/pull/34 is merged, there are only a small number of remaining tasks for the "Kubeflow Workspaces Controller" to be finished.
One task is to implement the culling functionality of the WorkspaceKinds, which will automatically pause Workspaces which are not used for a period of time. Unlike in Notebooks 1.0, we now support arbitrary "activity probes", in addition to a direct integration with JupyterLab.
Implementation
We will create a separate controller loop that checks for Workspaces needing a probe on some regular interval and triggers a “reconcile” (which is actually a “probe” in this case, but the function is still called `Reconcile()`), and then either culls the Workspace or updates its status information.
Definition of “workspace needing probe”:
CRDs
This section of the `WorkspaceKind` definition is where users will define and configure their culling probes:
Changes from current CRDs
After designing and discussing this implementation, we figured that we need to make some small additions to the existing Notebooks 2.0 CRDs:
- `spec.disableCulling`, which defaults to `false`; if `true`, we will never cull the Workspace.
- `status.activity.lastProbe` (see https://github.com/kubeflow/notebooks/issues/38#issuecomment-2338852756).
- `spec.podTemplate.culling.minProbeInterval`, which determines the minimum period between each Workspace probe (this prevents too many probes being made when the probe is failing).
- `spec.podTemplate.culling.maxProbeInterval`, which determines the maximum period between each Workspace probe (this ensures we probe at least this frequently, so that our UI information stays "fresh").
- `spec.podTemplate.culling.activityProbe.jupyter.portId`, to select which port to run the probe against.

Probes
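The two interval fields described above bound when a Workspace may be probed: never sooner than `minProbeInterval` after the last probe, and never later than `maxProbeInterval`. A minimal Python sketch of that check follows. `needs_probe` and the `probe_wanted` flag are hypothetical. The flag stands in for whatever other condition makes the controller want a fresh probe between the two bounds, which this sketch does not try to define.

```python
from typing import Optional


def needs_probe(now_ms: int,
                last_probe_end_ms: Optional[int],
                min_interval_ms: int,
                max_interval_ms: int,
                probe_wanted: bool = False) -> bool:
    """Decide whether a Workspace should be probed now (illustrative only)."""
    if last_probe_end_ms is None:
        # Never probed before, so there is nothing to rate-limit against.
        return True

    elapsed_ms = now_ms - last_probe_end_ms
    if elapsed_ms < min_interval_ms:
        # Too soon: avoids spamming probes, e.g. while a probe keeps failing.
        return False
    if elapsed_ms >= max_interval_ms:
        # Status is getting stale: probe regardless of anything else.
        return True
    # Between the bounds, probe only if something else asks for it.
    return probe_wanted
```

Here `last_probe_end_ms` would come from `status.activity.lastProbe.endTimeMs`.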
Jupyter Probe
The culling controller will do the following on its "reconciliation" loop for the Jupyter-type probe:
- Request the `/api/status` endpoint of the Pod:
  - We could use the `/exec` API to make the HTTP request also, not sure.
- Check the `last_activity` field returned by the Jupyter API:

Bash Probe
The culling controller will do the following on its "reconciliation" loop for the Bash-type probe:
- Run the probe through the `/exec` API:

Future Work