[Core] Make Ray Core tasks/actors metrics counters (accumulators)

alexeykudinkin commented 2 months ago

What happened + What you expected to happen

Currently, many critical Ray Core metrics, such as the number of tasks (ray_tasks), are produced as gauges, which makes it impossible to precisely track task state transitions.

This happens because gauges are meant to track point-in-time values, not incremental changes. Since a gauge keeps only the last value written within a time bucket, it shows a snapshot of that last value rather than every incremental change.

Instead, we'd add the following metrics (keeping the existing ones for backward compatibility):

ray_active_tasks -- a gauge tracking currently active tasks
ray_tasks_total -- accumulator (counter) tracking all state transitions for every task (i.e., when a task changes state, we increment its corresponding accumulator)
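
To illustrate the difference, here is a minimal pure-Python sketch (not Ray's actual metrics code) that replays a made-up event log and samples it at two scrape times. The event tuples, the `scrape` helper, and the state names are all hypothetical and only serve to show why a sampled gauge can miss a transition that a counter still records.

```python
from collections import Counter

# Hypothetical task lifecycle events: (timestamp_seconds, task_id, new_state).
# Task t1 goes PENDING -> RUNNING -> FINISHED entirely before the first scrape.
events = [
    (0.1, "t1", "PENDING"),
    (0.2, "t1", "RUNNING"),
    (0.3, "t1", "FINISHED"),
    (0.4, "t2", "PENDING"),
    (1.5, "t2", "RUNNING"),
]


def scrape(events, scrape_times):
    """Replay events and return (gauge_samples, counter_samples), one per scrape."""
    current_state = {}             # task_id -> latest state (what a gauge is built from)
    transitions_total = Counter()  # state -> cumulative number of transitions into it
    gauge_samples, counter_samples = [], []
    ordered = sorted(events)
    i = 0
    for t in scrape_times:
        # Apply every event that happened up to and including this scrape time.
        while i < len(ordered) and ordered[i][0] <= t:
            _, task_id, state = ordered[i]
            current_state[task_id] = state
            transitions_total[state] += 1
            i += 1
        # Gauge: number of tasks currently in each state (point-in-time snapshot).
        gauge_samples.append(dict(Counter(current_state.values())))
        # Counter: monotonically increasing totals, never reset between scrapes.
        counter_samples.append(dict(transitions_total))
    return gauge_samples, counter_samples


gauges, counters = scrape(events, scrape_times=[1.0, 2.0])
for g, c in zip(gauges, counters):
    print("gauge:", g, "| counter:", c)
```

In this sketch, t1 passes through RUNNING between scrapes, so the first gauge sample has no RUNNING entry at all, while the counter sample still records one transition into RUNNING. Differences between consecutive counter samples give the number of transitions per interval.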

Apply similar scheme to actors.

Versions / Dependencies

2.35

Reproduction script

NA

Issue Severity

Medium: It is a significant difficulty but I can work around it.

anyscalesam commented 2 months ago

ACK - bring up for next planning cycle... cool @alexeykudinkin ?

alexeykudinkin commented 2 months ago

Yes, going to bump this to P0 after Ray Summit, as this is causing considerable confusion in troubleshooting due to the inability to understand the exact state transitions for tasks (in cases when transitions happen faster than our current gauging frequency)

anyscalesam commented 2 months ago

Added to JIRA

ray-project / ray