ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

[Core][Metrics] Metrics cannot be reported in the Driver process. #36813

Open MissiontoMars opened 1 year ago

MissiontoMars commented 1 year ago

What happened + What you expected to happen

We use the metrics API to report metrics, with the backend replaced by an in-house implementation. While investigating why metrics failed to be reported, I noticed something very strange.

I reproduced the problem in Ray 2.3.1 (without any code changes, only a few debug logs).

  1. Start Ray: RAY_enable_metrics_collection=true ray start --head --include-dashboard=true --dashboard-host="0.0.0.0"
  2. Submit the following test script: RAY_ADDRESS=http://127.0.0.1:8265 ray job submit --runtime-env-json='{"working_dir": "./"}' -- python3.7 metric.py
import time

import ray
from ray.util.metrics import Gauge


class TestSimpleMetric:
    def __init__(self):
        self.gauge = Gauge(
            "my_test_gauge_name",
            description="my test gauge.",
            tag_keys=("submission_id",),
        )

    def report(self):
        self.gauge.set(200, tags={"submission_id": "default_submission_id"})


metric = TestSimpleMetric()


@ray.remote
class MyActor:
    def __init__(self):
        pass


actor = MyActor.remote()

metric.report()

# Keep the driver alive so the metric has time to be exported.
time.sleep(1000)

To check whether the metric is reported, a debug log was added to metrics_agent.py: https://github.com/MissiontoMars/ray/blob/releases_2.3.1/python/ray/_private/metrics_agent.py#L466C1-L466C1

When the job script is running, I can see the debug logs about the metric in dashboard_agent.log, which indicates the metric is being reported.

But if I comment out the line actor = MyActor.remote(), there are no such logs at all. This is very strange.

Versions / Dependencies

ray 2.3

Reproduction script

None

Issue Severity

High: It blocks me from completing my task.

rkooo567 commented 1 year ago

I think it is because you didn't call ray.init, and it is only triggered when actor = MyActor.remote() is called. We should probably add a better warning (or auto-init) when metrics APIs are used without ray.init.
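In the meantime, a minimal sketch of the workaround on the user side (assuming the driver calls ray.init() explicitly before creating any metric; with no arguments it attaches to the running cluster when RAY_ADDRESS is set in the environment, as it should be for a submitted job):

import ray
from ray.util.metrics import Gauge

# Explicitly initialize Ray so the connection to the metrics agent
# exists before any metric is recorded, instead of relying on the
# implicit initialization triggered by the first .remote() call.
ray.init()

gauge = Gauge(
    "my_test_gauge_name",
    description="my test gauge.",
    tag_keys=("submission_id",),
)
gauge.set(200, tags={"submission_id": "default_submission_id"})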

@MissiontoMars do you have some bandwidth to take the issue?

MissiontoMars commented 1 year ago

> I think it is because you didn't call ray.init, and it is only triggered when actor = MyActor.remote() is called. We should probably add a better warning (or auto-init) when metrics APIs are used without ray.init.
>
> @MissiontoMars do you have some bandwidth to take the issue?

That's exactly the reason!

I prefer to auto-init when the metrics API is used. What's the appropriate way to auto-init? Just call ray.init()?
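Something like the following lazy check is what I have in mind (just a sketch; _ensure_ray_initialized is a made-up helper name, and where such a check should live inside the metrics code is exactly my question):

import ray

def _ensure_ray_initialized():
    # Hypothetical helper, not an existing Ray internal: lazily
    # initialize Ray the first time a metric is created or recorded,
    # instead of silently dropping the data points.
    if not ray.is_initialized():
        ray.init()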

MissiontoMars commented 1 year ago

I also noticed another problem: the metric object must be kept alive for the duration of the job. If I create the gauge object inside a Python function, the metric will not be reported unless I also keep a reference to it, e.g. metrics_set = []; metrics_set.append(gauge). Is this because the temporary object is garbage collected by Python?

import os

import ray
from ray.util.metrics import Gauge

ray.init()

metrics_set = []

def test_simple_metric():
    gauge = Gauge(
        "wanxing.test.gauge",
        description="wanxing test gauge.",
        tag_keys=("submission_id",),
    )
    # Keeping an extra reference here; without it the metric is not
    # reported (possibly because the gauge is garbage collected when
    # the function returns).
    metrics_set.append(gauge)

    gauge.set(
        100,
        tags={
            "submission_id": os.getenv(
                "BYTED_SUBMISSION_ID", "default_submission_id"
            ),
        },
    )
    print("wanxing test gauge")

test_simple_metric()
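If garbage collection really is the cause, another sketch of the workaround is to keep the Gauge alive at module level instead of stashing it in a list:

import os

import ray
from ray.util.metrics import Gauge

ray.init()

# A module-level reference keeps the gauge alive for the whole job,
# so it cannot be garbage collected when a function returns.
GAUGE = Gauge(
    "wanxing.test.gauge",
    description="wanxing test gauge.",
    tag_keys=("submission_id",),
)

def report():
    GAUGE.set(
        100,
        tags={
            "submission_id": os.getenv(
                "BYTED_SUBMISSION_ID", "default_submission_id"
            ),
        },
    )

report()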

@rkooo567