ray-project / ray

Ray Serve: Fail to create Serve applications #46308

Closed · Galeos93 closed this 2 months ago

Galeos93 commented 2 months ago

What happened + What you expected to happen

I followed this tutorial to deploy an application using Ray Serve. I get the following error events when running kubectl describe rayservice rayservice-sample:

  Type    Reason                       Age                  From                   Message
  ----    ------                       ----                 ----                   -------
  Normal  ServiceNotReady              10m (x9 over 10m)    rayservice-controller  The service is not ready yet. Controller will perform a round of actions in 2s.
  Normal  WaitForServeDeploymentReady  47s (x293 over 10m)  rayservice-controller  Fail to create / update Serve applications. If you observe this error consistently, please check "Issue 5: Fail to create / update Serve applications." in https://docs.ray.io/en/master/cluster/kubernetes/troubleshooting/rayservice-troubleshooting.html#kuberay-raysvc-troubleshoot for more details. err: UpdateDeployments fail: 404 Not Found 404: Not Found

Also, running kubectl logs kuberay-operator-7f85d8578-mj4bs | tee operator-log gives the following logs:

{"level":"info","ts":"2024-06-27T20:02:25.020Z","logger":"controllers.RayService","msg":"Check the head Pod status of the pending RayCluster","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"564ff6f4-8af6-4021-ae79-2fb461da806c","RayCluster name":"rayservice-sample-raycluster-gzvwz"}
{"level":"info","ts":"2024-06-27T20:02:25.020Z","logger":"controllers.RayService","msg":"FetchHeadServiceURL","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"564ff6f4-8af6-4021-ae79-2fb461da806c","head service name":"rayservice-sample-raycluster-gzvwz-head-svc","namespace":"default"}
{"level":"info","ts":"2024-06-27T20:02:25.020Z","logger":"controllers.RayService","msg":"FetchHeadServiceURL","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"564ff6f4-8af6-4021-ae79-2fb461da806c","head service URL":"rayservice-sample-raycluster-gzvwz-head-svc.default.svc.cluster.local:8265","port":"dashboard"}
{"level":"info","ts":"2024-06-27T20:02:25.020Z","logger":"controllers.RayService","msg":"shouldUpdate","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"564ff6f4-8af6-4021-ae79-2fb461da806c","shouldUpdateServe":true,"reason":"Nothing has been cached for cluster rayservice-sample-raycluster-gzvwz with key default/rayservice-sample/rayservice-sample-raycluster-gzvwz"}
{"level":"info","ts":"2024-06-27T20:02:25.020Z","logger":"controllers.RayService","msg":"updateServeDeployment","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"564ff6f4-8af6-4021-ae79-2fb461da806c","V2 config":"applications:\n  - name: text_ml_app\n    import_path: text_ml.app\n    route_prefix: /summarize_translate\n    runtime_env:\n      working_dir: \"https://github.com/ray-project/serve_config_examples/archive/36862c251615e258a58285934c7c41cffd1ee3b7.zip\"\n      pip:\n        - torch\n        - transformers\n    deployments:\n      - name: Translator\n        num_replicas: 1\n        ray_actor_options:\n          num_cpus: 0.1\n        user_config:\n          language: french\n      - name: Summarizer\n        num_replicas: 1\n        ray_actor_options:\n          num_cpus: 0.1\n"}
{"level":"info","ts":"2024-06-27T20:02:25.020Z","logger":"controllers.RayService","msg":"updateServeDeployment","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"564ff6f4-8af6-4021-ae79-2fb461da806c","MULTI_APP json config":"{\"applications\":[{\"deployments\":[{\"name\":\"Translator\",\"num_replicas\":1,\"ray_actor_options\":{\"num_cpus\":0.1},\"user_config\":{\"language\":\"french\"}},{\"name\":\"Summarizer\",\"num_replicas\":1,\"ray_actor_options\":{\"num_cpus\":0.1}}],\"import_path\":\"text_ml.app\",\"name\":\"text_ml_app\",\"route_prefix\":\"/summarize_translate\",\"runtime_env\":{\"pip\":[\"torch\",\"transformers\"],\"working_dir\":\"https://github.com/ray-project/serve_config_examples/archive/36862c251615e258a58285934c7c41cffd1ee3b7.zip\"}}]}"}
{"level":"error","ts":"2024-06-27T20:02:25.034Z","logger":"controllers.RayService","msg":"Fail to reconcileServe.","RayService":{"name":"rayservice-sample","namespace":"default"},"reconcileID":"564ff6f4-8af6-4021-ae79-2fb461da806c","error":"Fail to create / update Serve applications. If you observe this error consistently, please check \"Issue 5: Fail to create / update Serve applications.\" in https://docs.ray.io/en/master/cluster/kubernetes/troubleshooting/rayservice-troubleshooting.html#kuberay-raysvc-troubleshoot for more details. err: UpdateDeployments fail: 404 Not Found 404: Not Found","stacktrace":"github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayServiceReconciler).Reconcile\n\t/home/runner/work/kuberay/kuberay/ray-operator/controllers/ray/rayservice_controller.go:169\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227"}
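
The 404 Not Found comes from the operator's UpdateDeployments request against the head node's HTTP API. One way to narrow this down is to query the Serve REST API by hand (a sketch; the head service name is taken from the logs above, and the /api/serve/applications/ path assumes the multi-app Serve REST API, which older Ray versions may not serve on the dashboard port 8265):

# Forward the dashboard port of the pending RayCluster's head service
kubectl port-forward svc/rayservice-sample-raycluster-gzvwz-head-svc 8265:8265

# In a second shell, hit the multi-app Serve REST API; a 404 here
# reproduces the operator error outside of KubeRay
curl -i http://localhost:8265/api/serve/applications/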

I expected the RayService to start without issues, as shown in the tutorial.

Versions / Dependencies

Reproduction script

I followed the tutorial here after deploying an EKS cluster on AWS, using 2 nodes of type t3.medium. The service configuration I use is not the same as in the tutorial; I have set fewer resources:

# Make sure to increase resource requests and limits before using this example in production.
# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for Ray Serve applications. Default value is 900.
  deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for Ray dashboard agent. Default value is 300.
  # serveConfigV2 takes a yaml multi-line scalar, which should be a Ray Serve multi-application config. See https://docs.ray.io/en/latest/serve/multi-app.html.
  # Only one of serveConfig and serveConfigV2 should be used.
  serveConfigV2: |
    applications:
      - name: text_ml_app
        import_path: text_ml.app
        route_prefix: /summarize_translate
        runtime_env:
          working_dir: "https://github.com/ray-project/serve_config_examples/archive/36862c251615e258a58285934c7c41cffd1ee3b7.zip"
          pip:
            - torch
            - transformers
        deployments:
          - name: Translator
            num_replicas: 1
            ray_actor_options:
              num_cpus: 0.1
            user_config:
              language: french
          - name: Summarizer
            num_replicas: 1
            ray_actor_options:
              num_cpus: 0.1
  rayClusterConfig:
    rayVersion: '2.6.3' # should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      #pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.6.3
              resources:
                limits:
                  cpu: "500m"
                  memory: 1Gi
                requests:
                  cpu: "500m"
                  memory: 1Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # The number of pod replicas in this worker group.
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # Logical group name; this one is called small-group, but it can also be a functional name.
        groupName: small-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        #pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: rayproject/ray:2.6.3
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                resources:
                  limits:
                    cpu: "500m"
                    memory: "1Gi"
                  requests:
                    cpu: "500m"
                    memory: "1Gi"
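
For reference, these are the commands I run to apply the manifest and watch the result (assuming it is saved as ray-service.sample.yaml):

kubectl apply -f ray-service.sample.yaml
kubectl get rayservice rayservice-sample        # overall status
kubectl describe rayservice rayservice-sample   # shows the events quoted above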

Issue Severity

High: It blocks me from completing my task.

Galeos93 commented 2 months ago

After using a newer Ray version (2.9.0), the issue was solved. Here is the YAML I used:

# Make sure to increase resource requests and limits before using this example in production.
# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for Ray Serve applications. Default value is 900.
  deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for Ray dashboard agent. Default value is 300.
  # serveConfigV2 takes a yaml multi-line scalar, which should be a Ray Serve multi-application config. See https://docs.ray.io/en/latest/serve/multi-app.html.
  # Only one of serveConfig and serveConfigV2 should be used.
  serveConfigV2: |
    applications:
      - name: text_ml_app
        import_path: text_ml.app
        route_prefix: /summarize_translate
        runtime_env:
          working_dir: "https://github.com/ray-project/serve_config_examples/archive/36862c251615e258a58285934c7c41cffd1ee3b7.zip"
          pip:
            - torch
            - transformers
        deployments:
          - name: Translator
            num_replicas: 1
            ray_actor_options:
              num_cpus: 0.2
            user_config:
              language: french
          - name: Summarizer
            num_replicas: 1
            ray_actor_options:
              num_cpus: 0.2
  rayClusterConfig:
    rayVersion: '2.9.0' # should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      #pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
              resources:
                limits:
                  cpu: 1
                  memory: 2Gi
                requests:
                  cpu: 1
                  memory: 2Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      # The number of pod replicas in this worker group.
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # Logical group name; this one is called small-group, but it can also be a functional name.
        groupName: small-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        #pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: rayproject/ray:2.9.0
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                resources:
                  limits:
                    cpu: "1"
                    memory: "2Gi"
                  requests:
                    cpu: "500m"
                    memory: "2Gi"
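
With the 2.9.0 config applied, the application can be smoke-tested by port-forwarding the serve service and sending a request (a sketch; the service name follows KubeRay's <rayservice-name>-serve-svc convention, and the request body matches the text_ml example from the tutorial):

kubectl port-forward svc/rayservice-sample-serve-svc 8000

# In a second shell: the app expects a JSON-encoded string of English text
curl -X POST -H 'Content-Type: application/json' \
  localhost:8000/summarize_translate/ \
  -d '"It was the best of times, it was the worst of times."'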