ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0
963 stars 328 forks source link

[Umbrella] Ray Autoscaling tests #2173

Open kevin85421 opened 1 month ago

kevin85421 commented 1 month ago

Search before asking

Description

Currently, there are some autoscaling end-to-end tests for KubeRay in the Ray repository. However, there are some issues with the tests:

Here, we plan to build new autoscaling end-to-end tests in the KubeRay repository.

Progress

Use case

No response

Related issues

No response

Are you willing to submit a PR?

Irvingwangjr commented 3 weeks ago

https://kuttl.dev/docs/#pre-requisites maybe we can consider using this tool?

kevin85421 commented 3 weeks ago

@Irvingwangjr Kuttl looks cool! I read the README. It seems suitable for testing some sample YAMLs, but it doesn't seem to support well the case where we need to execute some commands in the Pods to trigger certain behaviors.

Irvingwangjr commented 3 weeks ago

@Irvingwangjr Kuttl looks cool! I read the README. It seems suitable for testing some sample YAMLs, but it doesn't seem to support well the case where we need to execute some commands in the Pods to trigger certain behaviors.

yeah, that might be a problem, right now we use TestStep to execute scripts to trigger behaviors. here is an example, we try to simulate the eviction of ray head, and check whether the RayJob will eventually enter the failed state:

apiVersion: kuttl.dev/v1beta1
kind: TestStep
commands:
  - script: |
      pod_name=$(kubectl get pod -l "xx.com/ray-pod-name=mlrayjob-head-failed-on-running-head-0" -n=ns -o jsonpath="{.items[0].metadata.name}")

      cmd="kubectl exec -it $pod_name -n=ns -c ray-container -- dd if=/dev/zero of=/tmp/test.txt bs=1M count=100000"

      $cmd &

      cmd_pid=$!

      wait $cmd_pid

      exit_code=$?
      if [ $exit_code -eq 137 ]; then
          echo "the process was killed, exit with return code of 137"
      else
          echo "the process was killed, exit with return code of $exit_code"
      fi

then we assert the RayJob to be in status of failed

apiVersion: kuttl.dev/v1beta1
kind: TestAssert
commands:
- command: kubectl assert exist-enhanced rayjob/mlrayjob-head-failed-on-running -n=ns --field-selector status.phase=Failed

we also use kube-assert here, it might be helpful.

Irvingwangjr commented 3 weeks ago

https://github.com/open-feature/open-feature-operator OpenFeature(an CNCF project) adopt this tool, it also provides some examples