ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
32.19k stars 5.48k forks source link

[core][telemetry] add real resource cluster utilization to telemetry #32499

Open clarng opened 1 year ago

clarng commented 1 year ago

What happened + What you expected to happen

Add telemetry for cluster utilization (cpu, memory)

Possibly gpu as well

Versions / Dependencies

master

Reproduction script

n/a

Issue Severity

None

hora-anyscale commented 1 year ago

@clarng - what do you feel is the right priority for this enhancement?

rickyyx commented 1 year ago

I did this as part of my ray doctor hackathon - feel free to let me know if you want some pointers. I did it with the prometheus client to query metrics.