ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] dump the info and analyze the data offline #28496

Open hellofinch opened 2 years ago

hellofinch commented 2 years ago

Description

Ray should dump logs that can be visualized in the dashboard or in another tool. System information should be recorded as well, such as CPU usage, network bandwidth, and disk and memory usage.

Use case

If Ray dumped all logs, they could be used to show how the program behaves over time. I could then analyze the performance of my distributed program and optimize it.

scottsun94 commented 2 years ago

@hellofinch Could you be more specific? In Ray 2.0, the dashboard visualizes CPU, disk, and memory usage and includes the logs. What else do you expect to see or use?

scottsun94 commented 2 years ago

Another note: Ray already supports exporting metrics in Prometheus format: https://docs.ray.io/en/latest/ray-observability/ray-metrics.html
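For readers who want to keep such metrics for later, here is a minimal sketch of turning Prometheus-format text into a JSON-lines file that can be read offline. It is not part of Ray. The function names, the output fields, and the sample metric names are hypothetical. The parser is simplified: it assumes no trailing timestamps on sample lines.

```python
import json
import time


def parse_prometheus_text(text):
    """Parse Prometheus text exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # The value is the last space-separated field (no timestamp assumed).
        series, _, value = line.rpartition(" ")
        name, _, label_part = series.partition("{")
        try:
            samples.append((name, label_part.rstrip("}"), float(value)))
        except ValueError:
            continue  # not a sample line this simplified parser understands
    return samples


def append_snapshot(text, path, now=None):
    """Append one timestamped JSON line per sample; return the number written."""
    now = time.time() if now is None else now
    samples = parse_prometheus_text(text)
    with open(path, "a", encoding="utf-8") as f:
        for name, labels, value in samples:
            record = {"ts": now, "name": name, "labels": labels, "value": value}
            f.write(json.dumps(record) + "\n")
    return len(samples)
```

The text could be fetched with `urllib.request.urlopen` from wherever the metrics endpoint is reachable, then passed to `append_snapshot` periodically. The endpoint URL depends on the cluster configuration and is not assumed here.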

hellofinch commented 2 years ago

@scottsun94 Thanks for your response! I read the link you gave, but I think this is not what I need. I use Ray on a cluster where the compute nodes and the login node are separated. I submit a script that sets up the Ray cluster and runs my program, and I have no access to the ports that are opened on the compute nodes. If Ray could save a log file, I could check the info after my program is done and the Ray cluster is torn down. As far as I know, the dashboard only visualizes CPU, disk, and memory usage, which only shows the nodes' resource usage. I'm interested in each task's resource usage. In this way, I can analyze my program more vividly.
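Until something like this exists in Ray, one possible workaround is to record usage from inside the task itself. The sketch below is not a Ray feature. `log_usage` and its record fields are hypothetical names, and it uses only the Python standard library (`resource` is Unix-only). The CPU time and peak RSS are for the whole worker process, so they only approximate a single task, and only when the process runs one task at a time.

```python
import functools
import json
import os
import resource  # Unix only
import time


def log_usage(log_path):
    """Decorator: append wall time, CPU time and peak RSS of each call to log_path."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            wall_start = time.perf_counter()
            cpu_start = time.process_time()
            try:
                return fn(*args, **kwargs)
            finally:
                record = {
                    "task": fn.__name__,
                    "pid": os.getpid(),
                    "wall_s": time.perf_counter() - wall_start,
                    "cpu_s": time.process_time() - cpu_start,
                    # Peak RSS of the process so far (KiB on Linux, bytes on macOS).
                    "max_rss": resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,
                }
                # One JSON object per line, so the file can be read after the job ends.
                with open(log_path, "a", encoding="utf-8") as f:
                    f.write(json.dumps(record) + "\n")
        return wrapper
    return decorator
```

With Ray, the decorator would presumably sit underneath `@ray.remote` on the task function, with `log_path` pointing at storage that survives cluster teardown. That placement is an assumption and has not been tested here. Writing one file per process (for example, including the PID in the file name) avoids interleaved appends on network filesystems.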

scottsun94 commented 2 years ago

RE: "each task's resource usage". What do you mean by "resource usage"? Do you mean the physical CPU/disk/memory usage of each task?

cc: @rkooo567 @ericl @rickyyx on saving the metrics as log files.

hellofinch commented 2 years ago

Yes, that is what I mean. It would help me analyze each part of my program and show where its bottleneck is.
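To illustrate the kind of offline analysis meant here, the following sketch aggregates per-task records and ranks tasks by total wall time. The record format (one JSON object per line with `task`, `wall_s`, and `cpu_s` fields) and the `summarize` function are hypothetical. Ray does not produce this format itself.

```python
import json
from collections import defaultdict


def summarize(lines):
    """Aggregate per-task JSON records into totals per task name."""
    totals = defaultdict(lambda: {"calls": 0, "wall_s": 0.0, "cpu_s": 0.0})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        rec = json.loads(line)
        entry = totals[rec["task"]]
        entry["calls"] += 1
        entry["wall_s"] += rec["wall_s"]
        entry["cpu_s"] += rec["cpu_s"]
    # Largest total wall time first: the likeliest bottleneck candidates.
    return sorted(totals.items(), key=lambda kv: kv[1]["wall_s"], reverse=True)
```

A task whose `cpu_s` is much smaller than its `wall_s` is likely waiting on something other than computation, such as I/O or data transfer, which narrows down where to look.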

rkooo567 commented 2 years ago

I think the information currently available from the dashboard is not sufficient for this: "I'm interested in each task's resource usage. In this way, I can analyze my program more vividly." We are planning to improve this in the short term (next 3~4 months), and after that we will consider allowing the dashboard state to be persisted (you can probably achieve this once we are working on that part).

scottsun94 commented 1 year ago

Not fixed. Keeping it open for tracking.