Open alexeykudinkin opened 2 months ago
ACK - bring up for next planning cycle... cool @alexeykudinkin ?
Yes, going to bump this to P0 after Ray Summit, as this is causing considerable confusion in troubleshooting: we can't see the exact state transitions for tasks when the transitions happen faster than our current gauge sampling frequency.
Added to JIRA
What happened + What you expected to happen
Currently, many critical Ray Core metrics, such as the number of tasks (ray_tasks), are produced as gauges, which makes it impossible to precisely track task state transitions. This happens because gauges are meant to track point-in-time values, not incremental changes: a gauge keeps only the last value written within each time bucket, so it shows a snapshot of that last value instead of every incremental change.
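To illustrate the problem described above, here is a minimal, self-contained sketch in plain Python (it does not use Ray or any metrics library). The event log, the state names, and the 10-second scrape interval are all hypothetical values chosen for illustration. It shows a task that passes through its whole lifecycle between two scrapes, so a gauge-style snapshot never observes it as running.

```python
from collections import Counter

# Hypothetical event log: (timestamp_seconds, task_id, new_state).
# State names are illustrative, not necessarily Ray's exact state labels.
events = [
    (1.0, "t1", "PENDING"),
    (2.0, "t1", "RUNNING"),
    (3.0, "t1", "FINISHED"),
]

# Hypothetical scrape times: the gauge is only observed at these instants.
scrape_times = [0.0, 10.0]


def gauge_snapshots(events, scrape_times):
    """Return, for each scrape, the number of tasks currently in each state."""
    snapshots = []
    for scrape in scrape_times:
        current_state = {}
        # Events are time-ordered, so the last one at or before the scrape wins.
        for ts, task_id, state in events:
            if ts <= scrape:
                current_state[task_id] = state
        snapshots.append(dict(Counter(current_state.values())))
    return snapshots


snapshots = gauge_snapshots(events, scrape_times)
print(snapshots)
```

In this sketch the task is never seen as PENDING or RUNNING in any snapshot, even though it was in both states; only the final state is visible at the second scrape.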
Instead, we'd add the following metrics (keeping the existing ones for backward compatibility):
ray_active_tasks -- a gauge tracking currently active tasks
ray_tasks_total -- an accumulator (counter) tracking all state transitions for every task (i.e., when a task changes state, we increment its corresponding accumulator)
Apply a similar scheme to actors.
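A rough sketch of how the proposed pair of metrics could behave, again in plain Python rather than Ray's actual metrics code. The `TaskMetrics` class, the `on_transition` hook, the state names, and the choice of which states count as "active" are all assumptions made for illustration; only the metric names `ray_active_tasks` and `ray_tasks_total` come from the proposal.

```python
from collections import Counter

# Assumption: which states count as "active" is not defined in the proposal.
ACTIVE_STATES = {"PENDING", "RUNNING"}


class TaskMetrics:
    """Hypothetical in-memory stand-in for the two proposed metrics."""

    def __init__(self):
        self._state_of = {}                # task_id -> current state
        self.ray_active_tasks = 0          # gauge: point-in-time value
        self.ray_tasks_total = Counter()   # counter per state: only ever grows

    def on_transition(self, task_id, new_state):
        old_state = self._state_of.get(task_id)
        self._state_of[task_id] = new_state
        # Counter: record every transition, so none is lost between scrapes.
        self.ray_tasks_total[new_state] += 1
        # Gauge: adjust by whether the task entered or left the active set.
        self.ray_active_tasks += (new_state in ACTIVE_STATES) - (
            old_state in ACTIVE_STATES
        )


metrics = TaskMetrics()
for state in ("PENDING", "RUNNING", "FINISHED"):
    metrics.on_transition("t1", state)

print(metrics.ray_active_tasks)
print(dict(metrics.ray_tasks_total))
```

Even if a scrape happens only after the task has finished, the per-state counters still show that it passed through each state, while the gauge reports that nothing is active at that moment.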
Versions / Dependencies
2.35
Reproduction script
NA
Issue Severity
Medium: It is a significant difficulty but I can work around it.