Open MissiontoMars opened 1 year ago
I think it is because you didn't call ray.init, and it is only triggered when actor = MyActor.remote() is called. We should probably add a better warning (or auto-init) when metrics APIs are used without ray.init.
@MissiontoMars do you have some bandwidth to take the issue?
That's exactly the reason!
I prefer to auto-init when the metrics APIs are used. What's the appropriate way to auto-init? Just call ray.init()?
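For illustration only, here is a minimal sketch of what such an auto-init guard could look like. It relies on ray.is_initialized() and ray.init(), which are public Ray calls, but the helper name ensure_ray_initialized and the fake module used to exercise it are hypothetical, and this is not how Ray itself handles the case. The module is passed in as an argument so the sketch runs without a cluster.

```python
from types import SimpleNamespace


def ensure_ray_initialized(ray_module):
    """Call init() only if the given Ray-like module is not initialized yet.

    Returns True if this call performed the initialization.
    `ensure_ray_initialized` is a hypothetical helper, not a Ray API.
    """
    if ray_module.is_initialized():
        return False
    ray_module.init()
    return True


def make_fake_ray():
    """Build a stand-in exposing only is_initialized() and init() (assumption)."""
    state = {"started": False}
    return SimpleNamespace(
        is_initialized=lambda: state["started"],
        init=lambda: state.update(started=True),
    )


fake_ray = make_fake_ray()
first_call = ensure_ray_initialized(fake_ray)   # performs the init
second_call = ensure_ray_initialized(fake_ray)  # already initialized, no-op
```

In a real script one would pass the imported ray module itself, i.e. ensure_ray_initialized(ray), before creating any metric.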
I also noticed another problem: the metric object must be kept alive while the job is running. If I create the gauge object inside a Python function, the metric is not reported unless I also store it in a module-level list (metrics_set = [] and metrics_set.append(gauge)). Is this because the temporary object is garbage collected by Python?
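import os

import ray
from ray.util.metrics import Gauge

ray.init()

# Module-level list that keeps the gauge referenced after the function returns.
metrics_set = []


def test_simple_metric():
    gauge = Gauge(
        "wanxing.test.gauge",
        description="wanxing test gauge.",
        tag_keys=("submission_id",),
    )
    # Without this append, the gauge is only referenced inside this function.
    metrics_set.append(gauge)
    gauge.set(
        100,
        tags={
            "submission_id": os.getenv(
                "BYTED_SUBMISSION_ID", "default_submission_id"
            ),
        },
    )
    print("wanxing test gauge")


test_simple_metric()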
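The lifetime difference behind the garbage-collection question can be shown without Ray. This is only a sketch of Python's reference semantics: FakeGauge is a hypothetical stand-in class, not Ray's Gauge, and the thread does not confirm that collection of the gauge is what stops the metric from being reported.

```python
import gc
import weakref


class FakeGauge:
    """Hypothetical stand-in for a metric object; not Ray's Gauge."""

    def __init__(self, name):
        self.name = name


# Module-level list, mirroring metrics_set in the script above.
held = []


def create_temporary():
    # The only strong reference is the local variable, which goes away on return.
    gauge = FakeGauge("temporary")
    return weakref.ref(gauge)


def create_held():
    # Appending to the module-level list keeps a strong reference alive.
    gauge = FakeGauge("held")
    held.append(gauge)
    return weakref.ref(gauge)


temporary_ref = create_temporary()
held_ref = create_held()
gc.collect()  # force a collection pass for interpreters without refcounting

temporary_alive = temporary_ref() is not None
held_alive = held_ref() is not None
print("temporary alive:", temporary_alive)
print("held alive:", held_alive)
```

The weak references only observe whether each object still exists; they do not keep it alive themselves.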
@rkooo567
What happened + What you expected to happen
We use the metrics API to report metrics, and the backend is replaced by an in-house implementation. While investigating why some metrics failed to be reported, I noticed something very strange.
I reproduced the problem in Ray 2.3.1 (without any code changes, only a few debug logs).
RAY_enable_metrics_collection=true ray start --head --include-dashboard=true --dashboard-host="0.0.0.0"
RAY_ADDRESS=http://127.0.0.1:8265 ray job submit --runtime-env-json='{"working_dir": "./"}' -- python3.7 metric.py
To check whether the metric is reported, I added a debug log to metrics_agent.py: https://github.com/MissiontoMars/ray/blob/releases_2.3.1/python/ray/_private/metrics_agent.py#L466C1-L466C1
While the job script is running, I can see the debug logs for the metric in dashboard_agent.log, which indicates that the metric is being reported.
But if I comment out this line of code:
actor = MyActor.remote()
there are no logs. This is very strange.
Versions / Dependencies
ray 2.3
Reproduction script
None
Issue Severity
High: It blocks me from completing my task.