ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0
990 stars 330 forks

feat: record last state transition times #2053

Closed davidxia closed 2 months ago

davidxia commented 3 months ago

Why are these changes needed?

Problem Statement

My ML platform team runs the kuberay ray-operator. We want to measure how long it takes for RayClusters to transition from their initial "unhealthy" state to some other state. This metric matters to us because our users want their RayClusters to start in a timely manner. As far as we can tell, neither the ray-operator nor the RayCluster resource currently exposes this information.

Design

Add a new .status.stateTransitionTimes field to the RayCluster custom resource. This field is a map[ClusterState]*metav1.Time that indicates the time of the last state transition for each state. This field is updated whenever the .status.state changes.

Manual testing steps:

  1. make manifests generate
  2. kubectl config use-context CONTEXT
  3. make docker-build docker-push deploy IMG=europe-west4-docker.pkg.dev/spotify-workbench/images/operator:$USER-$(git rev-parse --short=7 HEAD)
  4. kubectl --context CONTEXT apply -f /path/to/raycluster.yaml
  5. verify there's a .status.stateTransitionTimes in the output of kubectl --context CONTEXT -n NAMESPACE get rayclusters NAME -o yaml

Related issue number

Checks

kevin85421 commented 2 months ago

@sydneyw-spotify just saw you update this PR recently. Feel free to ping me whenever this PR is ready for review.

davidxia commented 2 months ago

@kevin85421, thanks. This is ready for review now. Example input and output are in the gist linked in the PR description.

davidxia commented 2 months ago

LGTM. I will open a follow up to move the envtest to a better place.

Thank you!