ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] Application-Level Metrics works locally but fails when deployed as rayservice #2553

Open rajendra-avesha opened 6 days ago

rajendra-avesha commented 6 days ago

KubeRay Component

ray-operator

What happened + What you expected to happen

I need to expose application-level metrics from a Ray Serve application so that Prometheus can consume them. I tried to use the Gauge from the ray.serve.metrics module. The reference code is as follows:

from ray import serve
from ray.serve.metrics import Gauge

from starlette.responses import JSONResponse
import psutil

@serve.deployment
class MyDeployment:
    def __init__(self):
        self.num_requests = 0
        self.my_gauge = Gauge(
            "memory_usage_bytes",
            description="Memory usage of the current process in bytes.",
            tag_keys=("model",),
        )
        self.my_gauge.set_default_tags({"model": "123"})

    async def __call__(self, request):
        # Update the request count
        self.num_requests += 1

        # Get current memory usage
        process = psutil.Process()
        memory_usage = process.memory_info().rss

        # Update the gauge metric
        self.my_gauge.set(memory_usage)

        # Return a response
        return JSONResponse({
            "message": "Metrics updated!",
            "memory_usage_bytes": memory_usage,
            "total_requests": self.num_requests,
        })

app = MyDeployment.bind()

This is the sample code provided in the Ray documentation. When it is run locally using serve run, the output is as follows:

serve run deploy:app
2024-11-18 22:47:15,041 INFO scripts.py:499 -- Running import path: 'deploy:app'.
2024-11-18 22:47:15,054 INFO worker.py:1568 -- Connecting to existing Ray cluster at address: 192.168.225.219:6379...
2024-11-18 22:47:15,060 INFO worker.py:1744 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
2024-11-18 22:47:16,419 INFO handle.py:126 -- Created DeploymentHandle '9y048115' for Deployment(name='MyDeployment', app='default').
2024-11-18 22:47:16,419 INFO handle.py:126 -- Created DeploymentHandle 'pu69dec0' for Deployment(name='MyDeployment', app='default').
(ServeController pid=77471) INFO 2024-11-18 22:47:16,454 controller 77471 deployment_state.py:1598 - Deploying new version of Deployment(name='MyDeployment', app='default') (initial target replicas: 1).
(ProxyActor pid=77474) INFO 2024-11-18 22:47:16,397 proxy 192.168.225.219 proxy.py:1165 - Proxy starting on node 9c1fd0028a2d1265ec47f7e6105d318b0176767ca6800b6754419452 (HTTP port: 8000).
(ServeController pid=77471) INFO 2024-11-18 22:47:16,556 controller 77471 deployment_state.py:1844 - Adding 1 replica to Deployment(name='MyDeployment', app='default').
2024-11-18 22:47:17,428 INFO handle.py:126 -- Created DeploymentHandle 'b8lhc4lw' for Deployment(name='MyDeployment', app='default').
2024-11-18 22:47:17,429 INFO api.py:584 -- Deployed app 'default' successfully.
(ServeReplica:default:MyDeployment pid=77479) INFO 2024-11-18 22:47:20,498 default_MyDeployment 9goiuke5 8e990763-875e-42d8-a014-1b8047f9a9c1 /GenericModelApp1/GM1v1 replica.py:373 - __CALL__ OK 1.7ms
^C2024-11-18 22:48:23,304       WARNING api.py:592 -- Got KeyboardInterrupt, exiting...
2024-11-18 22:48:23,305 INFO scripts.py:585 -- Got KeyboardInterrupt, shutting down...
(ServeController pid=77471) INFO 2024-11-18 22:48:23,351 controller 77471 deployment_state.py:1860 - Removing 1 replica from Deployment(name='MyDeployment', app='default').
(ServeController pid=77471) INFO 2024-11-18 22:48:25,388 controller 77471 deployment_state.py:2182 - Replica(id='9goiuke5', deployment='MyDeployment', app='default') is stopped.

The custom metric ray_memory_usage_bytes is available at http://127.0.0.1:8080/ (please refer to serverun.txt). However, the metric goes missing when the same source file is containerised and deployed with the following RayService.yaml:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-customer1
  namespace: customer1
spec:
  serviceUnhealthySecondThreshold: 300 # Config for the health check threshold for Ray Serve applications. Default value is 900.
  deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for Ray dashboard agent. Default value is 300.
  serveConfigV2: |
      applications:
      - name: DeployApp
        import_path: deploy:app
        route_prefix: /app
        runtime_env: {}
        deployments:
        - name: MyDeployment
          max_concurrent_queries: 100
          autoscaling_config:
            metrics_interval_s: 0.1
            min_replicas: 1
            max_replicas: 5
            upscale_delay_s: 1
            downscale_delay_s: 2
            look_back_period_s: 2
            target_num_ongoing_requests_per_replica: 5
          ray_actor_options:
            num_cpus: 0.1

  rayClusterConfig:
    rayVersion: '2.32.0' # should match the Ray version in the image of the containers
    ## raycluster autoscaling config
    enableInTreeAutoscaling: true
    autoscalerOptions:
      upscalingMode: Default
      resources:
        limits:
          cpu: 1
          memory: "1000Mi"
        requests:
          cpu: 1
          memory: "1000Mi"
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
        num-cpus: "0"
        # Include the dashboard
        include-dashboard: "true"
        # Set the metrics export port
        metrics-export-port: "9080"
      #pod template
      template:
        metadata:
          annotations:
            prometheus.io/scrape: "true"
            prometheus.io/port: "9080"
            prometheus.io/path: "/metrics"
        spec:
          imagePullSecrets:
            - name: test-docker
          containers:
            - name: ray-head
              image: ray-base-image:0.0.2-SNAPSHOT-a18d3e13
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: 1
                  memory: 2Gi
                requests:
                  cpu: 1
                  memory: 2Gi
              env:
                - name: RAY_memory_usage_threshold
                  value: "0.90"  # Adjust threshold as needed
                - name: RAY_memory_monitor_refresh_ms
                  value: "0"  # Disable memory monitoring
                - name: RAY_GRAFANA_IFRAME_HOST
                  value: http://127.0.0.1:3000
                - name: RAY_GRAFANA_HOST
                  value: http://prometheus-grafana.prometheus-system.svc:80
                - name: RAY_PROMETHEUS_HOST
                  value: http://prometheus-kube-prometheus-prometheus.prometheus-system.svc:9090
                - name: RAY_LOG_LEVEL
                  value: debug
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
                - containerPort: 9080
                  name: metrics
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # logical group name, for this called small-group, also can be functional
        groupName: worker
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams:
          metrics-export-port: "9080"
        #pod template
        template:
          metadata:
            annotations:
              prometheus.io/scrape: "true"
              prometheus.io/port: "9080"
              prometheus.io/path: "/metrics"
          spec:
            volumes:
              - name: data
                emptyDir: {}
            imagePullSecrets:
              - name: test-docker
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: ray-base-image:0.0.2-SNAPSHOT-a18d3e13 #edfd0115 # #b6a89258
                imagePullPolicy: IfNotPresent
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh","-c","ray stop"]
                resources:
                  limits:
                    cpu: 2
                    memory: 3Gi
                  requests:
                    cpu: 2
                    memory: 3Gi
                env:
                  - name: RAY_LOG_LEVEL
                    value: debug
                volumeMounts:
                  - name: data
                    mountPath: /data
                ports:
                  - containerPort: 9080
                    name: metrics

serverun.txt

Reproduction script

I created the custom resource from this manifest; see rayservicedesc.txt.

After port-forwarding port 9080 of the rayservice-customer1-head-svc service in the customer1 namespace, the custom metric is not available, although the Ray and system metrics are. Please find the output attached in Rayservice.txt.

I am not sure what is missing here. Initially I tried the default port 8080, and later changed to port 9080 to check whether metrics-export-port is functional. Please provide your inputs to debug further.
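The port-forward check described above can be scripted so that it is easy to repeat against the head pod and each worker pod. The following is a minimal sketch, not part of the original report: the helper names fetch_metrics and metric_samples are hypothetical, and it assumes the endpoint serves the standard Prometheus text format and that port 9080 has already been forwarded to 127.0.0.1.

```python
import urllib.request


def fetch_metrics(url, timeout=5.0):
    """Download the Prometheus text exposition served at `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_samples(exposition, metric_name):
    """Return the sample lines whose metric name equals `metric_name`.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    """
    samples = []
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or whitespace (value).
        name = line.split("{", 1)[0].split(None, 1)[0]
        if name == metric_name:
            samples.append(line)
    return samples


# Example usage (assumes `kubectl port-forward ... 9080:9080` is running):
#   text = fetch_metrics("http://127.0.0.1:9080/metrics")
#   print(metric_samples(text, "ray_memory_usage_bytes") or "metric not found")
```

Running the same check once per pod shows which Ray node actually exports the custom gauge.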

Anything else

I tried multiple times

rajendra-avesha commented 4 days ago

Please find further analysis of this issue. My custom metric is available on the worker pod at localhost:9080/metrics (verified by running curl against http://127.0.0.1:9080/metrics). I then looked at the services created by the RayService:

kubectl get svc -n customer1 -owide
NAME  TYPE  CLUSTER-IP  EXTERNAL-IP  PORT(S)  AGE  SELECTOR
rayservice-customer1-head-svc  ClusterIP  10.0.253.179  <none>  10001/TCP,8265/TCP,6379/TCP,9080/TCP,8000/TCP  8m33s  app.kubernetes.io/created-by=kuberay-operator,app.kubernetes.io/name=kuberay,ray.io/cluster=rayservice-customer1-raycluster-mj72p,ray.io/identifier=rayservice-customer1-raycluster-mj72p-head,ray.io/node-type=head
rayservice-customer1-raycluster-mj72p-head-svc  ClusterIP  10.0.234.131  <none>  10001/TCP,8265/TCP,6379/TCP,9080/TCP,8000/TCP  9m13s  app.kubernetes.io/created-by=kuberay-operator,app.kubernetes.io/name=kuberay,ray.io/cluster=rayservice-customer1-raycluster-mj72p,ray.io/identifier=rayservice-customer1-raycluster-mj72p-head,ray.io/node-type=head
rayservice-customer1-serve-svc  ClusterIP  10.0.17.9  <none>  8000/TCP  8m33s  ray.io/cluster=rayservice-customer1-raycluster-mj72p,ray.io/serve=true

When I checked the service rayservice-customer1-raycluster-mj72p-head-svc on port 9080, I couldn't find the metric. I tried the other service too.

Are both of these services tied to the head pod, given that they have the selector ray.io/node-type=head?

Is my RayService configuration correct? Can you please review it?
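On the selector question, a Kubernetes Service with an equality-based selector routes only to pods whose labels contain every key/value pair in that selector. The sketch below illustrates this with the selectors from the kubectl output above. The pod label sets are assumptions for illustration (in particular ray.io/node-type=worker on the worker pod, which the output above does not show), and selector_matches is a hypothetical helper, not a Kubernetes or KubeRay API.

```python
def selector_matches(selector, pod_labels):
    """True if every key/value pair in the selector is present in the pod labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


CLUSTER = "rayservice-customer1-raycluster-mj72p"

# Selector shared by both *-head-svc services (copied from the kubectl output).
head_svc_selector = {
    "app.kubernetes.io/created-by": "kuberay-operator",
    "app.kubernetes.io/name": "kuberay",
    "ray.io/cluster": CLUSTER,
    "ray.io/identifier": CLUSTER + "-head",
    "ray.io/node-type": "head",
}

# Assumed pod labels: hypothetical values, not taken from the issue.
head_pod_labels = dict(head_svc_selector)
worker_pod_labels = {
    "app.kubernetes.io/created-by": "kuberay-operator",
    "app.kubernetes.io/name": "kuberay",
    "ray.io/cluster": CLUSTER,
    "ray.io/node-type": "worker",
}

print(selector_matches(head_svc_selector, head_pod_labels))    # True
print(selector_matches(head_svc_selector, worker_pod_labels))  # False
```

Under these assumptions, port 9080 on either head service reaches only the head pod's metrics endpoint. That would be consistent with the gauge showing up only on the worker pod, where the replica presumably runs because the head is started with num-cpus: "0".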

kevin85421 commented 4 days ago

Hi @rajendra-avesha, this thread https://ray.slack.com/archives/CNCKBBRJL/p1730741501573559 might be useful. If you still have the issue, feel free to reach out to us on the KubeRay Slack or reply to this issue.