ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.03k stars 5.59k forks

[Observability] Need managed Prometheus/Grafana support: Ray is tightly coupled with in-cluster Prometheus/Grafana #42232

Open KeyOfSpectator opened 8 months ago

KeyOfSpectator commented 8 months ago

Description

When we deploy Ray / KubeRay to a large cluster with a large volume of data, we need better performance and higher availability from Prometheus and Grafana, which managed services can provide.

For example, Alibaba Cloud managed Prometheus and managed Grafana: https://www.alibabacloud.com/product/prometheus

And AWS managed Prometheus and managed Grafana: https://aws.amazon.com/cn/prometheus/

I found that the implementation in the Ray project hard-codes a health check: the dashboard must judge Prometheus/Grafana to be healthy before our Grafana host and iframe address are shown in the Ray dashboard. https://github.com/ray-project/ray/blob/master/dashboard/modules/metrics/metrics_head.py#L119C39-L119C39

path = f"{self.grafana_host}/{GRAFANA_HEALTHCHECK_PATH}"
try:
    async with self._session.get(path) as resp:
        if resp.status != 200:
            return dashboard_optional_utils.rest_response(
                success=False,
                message="Grafana healtcheck failed",
                status=resp.status,
            )
        json = await resp.json()
        # Check if the required grafana services are running.
        if json["database"] != "ok":
            return dashboard_optional_utils.rest_response(
                success=False,
                message="Grafana healtcheck failed. Database not ok.",
                status=resp.status,
                json=json,
            )

        return dashboard_optional_utils.rest_response(
            success=True,
            message="Grafana running",
            grafana_host=grafana_iframe_host,
            session_name=self._session_name,
            dashboard_uids=self._dashboard_uids,
            dashboard_datasource=self._prometheus_name,
        )

Is it possible to add a config option to enable or disable the health check of Grafana/Prometheus?
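One shape such an option could take is sketched below. This is only an illustration under stated assumptions, not Ray's implementation: the environment variable name `RAY_GRAFANA_SKIP_HEALTHCHECK`, the helper `skip_healthcheck_enabled`, and the `probe` callable (standing in for the HTTP GET against the health endpoint) are all hypothetical, and plain dicts replace `dashboard_optional_utils.rest_response`.

```python
import os
from typing import Callable, Mapping, Optional, Tuple

# Hypothetical flag name, chosen for illustration; Ray does not define it.
SKIP_ENV_VAR = "RAY_GRAFANA_SKIP_HEALTHCHECK"


def skip_healthcheck_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the hypothetical opt-out flag holds a truthy value."""
    env = os.environ if env is None else env
    return str(env.get(SKIP_ENV_VAR, "0")).strip().lower() in {"1", "true", "yes"}


def grafana_health(
    iframe_host: str,
    probe: Callable[[], Tuple[int, dict]],
    skip_check: bool = False,
) -> dict:
    """Follow the branches of the quoted handler, with an opt-out.

    `probe` returns (status_code, parsed_json) from the health endpoint.
    """
    if skip_check:
        # Trust the configured host and never contact Grafana.
        return {
            "success": True,
            "message": "Grafana health check skipped",
            "grafana_host": iframe_host,
        }

    status, body = probe()
    if status != 200:
        return {"success": False, "message": "Grafana healthcheck failed", "status": status}
    if body.get("database") != "ok":
        return {
            "success": False,
            "message": "Grafana healthcheck failed. Database not ok.",
            "status": status,
        }
    return {"success": True, "message": "Grafana running", "grafana_host": iframe_host}
```

With the flag set, a managed Grafana whose health endpoint is unreachable from the head node, or answers in a different shape, would still have its iframe host handed to the dashboard. The trade-off is that a genuinely broken Grafana would no longer be detected.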

Use case

When we deploy Ray / KubeRay to a large cluster with a large volume of data, we need better performance and higher availability from Prometheus and Grafana, which managed services can provide.

For example, Alibaba Cloud managed Prometheus and managed Grafana: https://www.alibabacloud.com/product/prometheus

And AWS managed Prometheus and managed Grafana: https://aws.amazon.com/cn/prometheus/

anyscalesam commented 8 months ago

@KeyOfSpectator we actually offer a managed Ray Dashboard as part of the Anyscale Platform. See https://www.anyscale.com/platform for more details.

We also offer managed products specific to LLM serving: Anyscale Endpoints and Anyscale Private Endpoints.

scottsun94 commented 8 months ago

This is not something we can commit to for now because of our large backlog. Contributions are welcome.

In the meantime, as @anyscalesam mentioned, you can try the managed Ray product so that you don't need to manage Grafana/Prometheus yourself.

KeyOfSpectator commented 8 months ago

OK, thanks. I will try the managed Ray product first, but our infrastructure may already be settled. If a contribution is needed, maybe I can submit some commits.