ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] when using workerGroupSpec by KubeApiServer, kuberay lost the NumOfHosts parameter #2044

Closed: kuaikuai closed this issue 3 months ago

kuaikuai commented 3 months ago

KubeRay Component

ray-operator, apiserver

What happened + What you expected to happen

I am using KubeRay 1.1.0 and created a cluster through the API server with the following request:

curl --silent -X 'POST' \
  'http://localhost:31888/apis/v1/namespaces/ray-system/clusters' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "name": "test-cluster",
  "namespace": "ray-system",
  "user": "3cpo",
  "version": "2.9.0",
  "environment": "DEV",
  "clusterSpec": {
    "headGroupSpec": {
      "computeTemplate": "default-template",
      "image": "rayproject/ray:2.9.0",
      "serviceType": "NodePort",
      "rayStartParams": {
        "dashboard-host": "0.0.0.0",
        "metrics-export-port": "8080"
      },
      "volumes": []
    },
    "workerGroupSpec": [
      {
        "groupName": "small-wg",
        "computeTemplate": "default-template",
        "image": "rayproject/ray:2.9.0",
        "replicas": 1,
        "minReplicas": 1,
        "maxReplicas": 1,
        "numOfHosts": 1,
        "rayStartParams": {
          "node-ip-address": "$MY_POD_IP"
        }
      }
    ]
  }
}'

I expected one worker pod for the small-wg group, but no small-wg pods were created in Kubernetes.

Reproduction script

```go
func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv1.RayCluster) error {
	// ...
	runningPods := corev1.PodList{}
	for _, pod := range workerPods.Items {
		if _, ok := deletedWorkers[pod.Name]; !ok {
			runningPods.Items = append(runningPods.Items, pod)
		}
	}
	// A replica can contain multiple hosts, so we need to calculate this based on the number of hosts per replica.
	// HERE: worker.NumOfHosts is always 0, numExpectedPods == 0, worker pods will not be created
	numExpectedPods := workerReplicas * worker.NumOfHosts
	diff := numExpectedPods - int32(len(runningPods.Items))
	// ...
}
```
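
To make the arithmetic in the excerpt concrete, here is a minimal standalone sketch. The `workerGroup` type and the `podDiff` helper are hypothetical stand-ins for illustration, not KubeRay's actual types; only the multiplication and subtraction are taken from the excerpt.

```go
package main

import "fmt"

// workerGroup is a hypothetical, trimmed-down stand-in for a worker group
// spec; only the two fields used in the pod-count calculation are kept.
type workerGroup struct {
	Replicas   int32
	NumOfHosts int32 // stays at Go's zero value when the field is never set
}

// podDiff mirrors the arithmetic from the excerpt: expected pods are
// replicas * hosts-per-replica, and the diff is how many pods are missing.
func podDiff(w workerGroup, running int) int32 {
	numExpectedPods := w.Replicas * w.NumOfHosts
	return numExpectedPods - int32(running)
}

func main() {
	missing := workerGroup{Replicas: 1}                  // NumOfHosts lost
	defaulted := workerGroup{Replicas: 1, NumOfHosts: 1} // NumOfHosts present

	fmt.Println(podDiff(missing, 0))   // prints 0: nothing to create
	fmt.Println(podDiff(defaulted, 0)) // prints 1: one worker pod to create
}
```

With `NumOfHosts` at zero the diff is zero no matter how many replicas are requested, which matches the symptom of no worker pods appearing.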

Anything else

No response

heiruwu commented 3 months ago

Hi @kuaikuai I'm facing the same issue, did you find the cause or any solution to this?

kuaikuai commented 3 months ago

@heiruwu I had upgraded KubeRay from version 1.0.0 to version 1.1.0. However, when I uninstalled kuberay-operator and kuberay-apiserver with Helm, the custom resource definitions (CRDs) were not removed, so the 1.0.0 CRDs were still installed. The KubeRay 1.1.0 CRDs have a new field, "NumOfHosts", with a default value of 1, which is not present in 1.0.0. With the old CRDs in place the field never received its default, so the operator read it as 0. To resolve this, I deleted the CRDs and reinstalled version 1.1.0.

To delete the CRDs, I used the following commands:

kubectl delete crd rayclusters.ray.io -n mynamespace
kubectl delete crd rayjobs.ray.io -n mynamespace
kubectl delete crd rayservices.ray.io -n mynamespace
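
For readers wondering why a stale CRD leads to a zero value, the following is a rough model of the effect, not Kubernetes code. It assumes that fields unknown to the installed schema are dropped when an object is stored and that schema defaults are filled in for missing fields. The `schema` type and `admit` function are hypothetical names used only for this illustration.

```go
package main

import "fmt"

// schema is a hypothetical, minimal model of a CRD schema: it maps each
// known field to its default value (nil means the field has no default).
type schema map[string]any

// admit roughly models storing an object under a given schema: fields the
// schema does not know are dropped, and missing fields that have a default
// are filled in.
func admit(s schema, obj map[string]any) map[string]any {
	out := map[string]any{}
	for k, v := range obj {
		if _, known := s[k]; known {
			out[k] = v
		}
	}
	for k, def := range s {
		if _, set := out[k]; !set && def != nil {
			out[k] = def
		}
	}
	return out
}

func main() {
	oldCRD := schema{"replicas": nil}                  // 1.0.0-style: no numOfHosts
	newCRD := schema{"replicas": nil, "numOfHosts": 1} // 1.1.0-style: default of 1

	// Under the old schema the field is dropped even if the client sends it.
	fmt.Println(admit(oldCRD, map[string]any{"replicas": 1, "numOfHosts": 1}))
	// Under the new schema the field is defaulted even if the client omits it.
	fmt.Println(admit(newCRD, map[string]any{"replicas": 1}))
}
```

Under this model, an operator built for 1.1.0 that reads an object stored through the old schema sees no `numOfHosts` at all, which decodes to 0 in Go.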

heiruwu commented 3 months ago

Thank you so much! This is exactly what I needed. I had read somewhere that the CRDs need to be manually deleted when upgrading KubeRay from 1.0.0 to 1.1.0, but I totally forgot about it.