ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Make Ray Core tasks/actors metrics counters (accumulators) #47522

Open alexeykudinkin opened 2 months ago

alexeykudinkin commented 2 months ago

What happened + What you expected to happen

Currently, many critical Ray Core metrics, such as the number of tasks (ray_tasks), are produced as gauges, which makes it impossible to precisely track task state transitions.

This happens because gauges are meant to track point-in-time values, not incremental changes. A gauge keeps only the last value written within a reporting period, so each time bucket shows a snapshot of that last write instead of every incremental change.
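To illustrate the difference, here is a minimal, library-free sketch. It does not use Ray's metrics API; the `Gauge` and `Counter` classes, the `simulate` helper, and the tick-based scrape interval are all hypothetical names made up for this example. It runs a handful of short tasks that each start and finish between two scrapes, then compares what a gauge and a pair of counters would report.

```python
from dataclasses import dataclass


@dataclass
class Gauge:
    """Toy gauge: keeps only the most recently written value."""
    value: int = 0

    def set(self, value: int) -> None:
        self.value = value


@dataclass
class Counter:
    """Toy counter: accumulates every increment and never decreases."""
    total: int = 0

    def inc(self, amount: int = 1) -> None:
        self.total += amount


def simulate(num_tasks: int, scrape_every: int):
    """Run short tasks one after another; scrape metrics every N ticks.

    Each task takes two ticks: one to start, one to finish.
    """
    running_gauge = Gauge()        # "tasks currently running"
    started = Counter()            # "tasks started so far"
    finished = Counter()           # "tasks finished so far"
    gauge_samples, counter_samples = [], []
    running = 0
    tick = 0

    def advance():
        nonlocal tick
        tick += 1
        if tick % scrape_every == 0:
            # A scrape only sees whatever the metric holds right now.
            gauge_samples.append(running_gauge.value)
            counter_samples.append((started.total, finished.total))

    for _ in range(num_tasks):
        running += 1
        running_gauge.set(running)
        started.inc()
        advance()

        running -= 1
        running_gauge.set(running)
        finished.inc()
        advance()

    return gauge_samples, counter_samples


# Hypothetical scenario: 5 tasks complete before the first scrape fires.
gauge_samples, counter_samples = simulate(num_tasks=5, scrape_every=10)
print("gauge samples:  ", gauge_samples)
print("counter samples:", counter_samples)
```

In this scenario the only gauge sample is 0, as if nothing had ever run, while the counter sample records that 5 tasks started and 5 finished. Taking differences between successive counter samples recovers how many transitions happened in each interval, regardless of how fast they occurred.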

Instead, we should add the following metrics (keeping the existing ones for backward compatibility):

Apply a similar scheme to actors.

Versions / Dependencies

2.35

Reproduction script

NA

Issue Severity

Medium: It is a significant difficulty but I can work around it.

anyscalesam commented 2 months ago

ACK - bring up for next planning cycle... cool @alexeykudinkin ?

alexeykudinkin commented 2 months ago

Yes, I'm going to bump this to P0 after Ray Summit, as this is causing considerable confusion in troubleshooting: we can't see the exact state transitions for tasks when transitions happen faster than our current gauge sampling frequency.

anyscalesam commented 2 months ago

Added to JIRA