ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Observability Roadmap #30097

Open alanwguo opened 2 years ago

alanwguo commented 2 years ago

Observability Roadmap

A large part of developing applications successfully on top of Ray is being able to debug and optimize them. To do that, developers must understand the behavior of their Ray applications so they can address any bugs or issues that break or slow them down. The goal of our observability efforts is to provide all the information needed to effectively write, debug, optimize, and monitor Ray applications.

Since the Ray runtime handles much of the low-level system behavior of a Ray application, we’re also in a unique position to provide data about Ray applications out of the box using our State API and Dashboard UI. Ultimately, we believe we can add a lot of value to the Ray experience by providing the most relevant data when you need it, great visualizations to understand that data, and the right set of tools to dig deeper into problems. We’re not alone in that thinking. In fact, one of the most popular talks at Ray Summit 2022 was Ray Observability: Present and Future.

For the observability roadmap, the high-level prioritization is as follows: we build out valuable content first (low-hanging fruit), then make significant usability improvements to our UI, and finally introduce advanced visualizations.

Help us shape the roadmap!

Before we begin, we highly encourage you to provide feedback on our roadmap! Please message us in the Ray Slack in the #dashboard channel or in the dashboard forum at https://discuss.ray.io/c/dashboard/9.

Delivered features

Features from Ray 2.2

Features from Ray 2.3

Ray 2.4

State API Beta

Since the alpha release of the State API in 2.0, we have been collecting feedback from Ray developers. In the beta releases, we continue to improve the State API based on that feedback by exposing the most useful states of Ray resources like actors, tasks, and nodes. We are also stabilizing much of the CLI and the output schema so that Ray developers can build their own observability tools on top of the State API without worrying about breaking changes.
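As a rough illustration of the kind of small tool that could sit on top of State API output, here is a minimal sketch. It does not call Ray at all: the `ACTORS` records, their field names (`actor_id`, `class_name`, `state`), and the helpers `count_by` and `select` are all hypothetical stand-ins, assuming the listing has already been fetched and parsed into plain dictionaries.

```python
from collections import Counter

# Hypothetical records. The field names and state strings are assumptions
# made for this sketch, not a guaranteed State API schema.
ACTORS = [
    {"actor_id": "a1", "class_name": "Trainer", "state": "ALIVE"},
    {"actor_id": "a2", "class_name": "Trainer", "state": "DEAD"},
    {"actor_id": "a3", "class_name": "Replay", "state": "ALIVE"},
]


def count_by(records, key):
    """Count records by the value stored under `key`."""
    return Counter(record[key] for record in records)


def select(records, **equals):
    """Keep records whose fields equal every given keyword value."""
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in equals.items())
    ]


if __name__ == "__main__":
    print(count_by(ACTORS, "state"))
    alive_trainers = select(ACTORS, class_name="Trainer", state="ALIVE")
    print([record["actor_id"] for record in alive_trainers])
```

A stable output schema is what makes helpers like these safe to write: if field names changed between releases, every such tool would need to be updated.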

Please take 5-8 minutes to help us improve the Ray State API by filling out this survey! If you are interested in chatting more, there is also a link at the end of the survey to choose a time slot to chat with one of us!

Beyond

Some of these items are in the early stages of the design process. Things may change before the final feature is released, but we want you all to know what’s coming so you can provide feedback earlier in the process.

Advanced task drill down visualizations

We are also planning to further improve the advanced task visualization.

The tracing view lets you view the hierarchy of dependencies for your tasks so you can drill down and understand why the application is behaving as it is. For example, you can see that some tasks are running serially because they depend on another task.


The DAG view displays the relationship between tasks/actors and the execution state over time.


Data visualizations

With distributed applications, the usage, storage, and transfer of data are often a critical part of the application. We believe visualizations that help you understand these aspects will make it easier to debug memory crashes or optimize data transfer.


Advanced profiling

We are planning to make it easy to run other advanced profilers, such as memory profilers, GPU profilers, or framework profilers (e.g., PyTorch), against Ray actors, tasks, and workers.

dmatrix commented 2 years ago

This is fabulous!

tianlinzx commented 1 year ago

This is fabulous!

rkooo567 commented 1 year ago

We released Ray 2.2, and the following features have been delivered.

Ray 2.2

Metrics improvements

Metrics give an at-a-glance view of the cluster, which helps users detect problems effectively. Ray 2.1 introduced the default metrics graph integration in the dashboard. We’re adding more metrics and improvements to the Dashboard UI, including debugging breakdowns for object store memory allocations, actor state breakdowns, and heap memory usage by Ray component!

Profiling tool

Profiling Python programs is necessary to debug performance or memory leak issues. However, it has been difficult to profile Ray programs that have hundreds of workers running concurrently.

In Ray 2.2, users can easily run py-spy against all running workers through the Ray dashboard.


Task visualization improvements

Observability starts with understanding what’s going on in the program.

We are adding task-based breakdowns for your Ray jobs. This view lets you see at a glance the tasks with the most errors or the ones that are hanging.
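The idea behind a task-based breakdown can be sketched in a few lines. This is only an illustration of the grouping logic, not how the dashboard computes it: the `TASKS` records, their field names, the state strings, and the `hang_after_s` threshold are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical task records. Field names and state strings are assumptions.
TASKS = [
    {"name": "load", "state": "FINISHED", "runtime_s": 2.0},
    {"name": "load", "state": "FAILED", "runtime_s": 1.0},
    {"name": "load", "state": "FAILED", "runtime_s": 0.5},
    {"name": "train", "state": "RUNNING", "runtime_s": 900.0},
    {"name": "train", "state": "RUNNING", "runtime_s": 3.0},
]


def breakdown(tasks, hang_after_s=600.0):
    """Group tasks by name and count failures and possibly hanging tasks.

    A task is flagged as possibly hanging when it is still RUNNING after
    `hang_after_s` seconds. That threshold is an arbitrary choice here.
    """
    rows = defaultdict(lambda: {"total": 0, "failed": 0, "maybe_hanging": 0})
    for task in tasks:
        row = rows[task["name"]]
        row["total"] += 1
        if task["state"] == "FAILED":
            row["failed"] += 1
        elif task["state"] == "RUNNING" and task["runtime_s"] > hang_after_s:
            row["maybe_hanging"] += 1
    # Put the task names with the most failures (then most hangs) first.
    return sorted(
        rows.items(),
        key=lambda item: (item[1]["failed"], item[1]["maybe_hanging"]),
        reverse=True,
    )


if __name__ == "__main__":
    for name, row in breakdown(TASKS):
        print(name, row)
```

Sorting by failure count first means the most error-prone task name appears at the top, which is the at-a-glance property the view is meant to offer.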


Dashboard stability improvements

We continue to improve the stability and scalability of the dashboard. We aim to guarantee stable latency for Dashboard APIs on large-scale clusters while minimizing the performance impact on workloads running in the cluster.

itamarst commented 1 year ago

I work on a profiler for Python data processing applications (https://sciagraph.com), including profiling in production. It is currently designed only for jobs with subprocesses; aggregating from a cluster is not possible yet. Perhaps a reasonable integration would be per graph item? I would be happy to talk about that if it's interesting to you.

alanwguo commented 1 year ago

@itamarst, that sounds interesting. I'll send you an email and we can continue the conversation there.

TUB-hasib commented 1 year ago

Hi, where can I get information about the difference between Ray Serve version 2 and version 3? Also, when will version 3 be available as a stable release?

richardliaw commented 1 year ago

When you see v3.0.0, this means you are on the bleeding edge nightly wheels. 3.0.0 won't be released for a long time, but we will release 2.4 and 2.5 next, which are cut off of the 3.0 (master) branch -- you should instead use the stable latest version (2.x).

rkooo567 commented 1 year ago

We released Ray 2.3, and the following features have been delivered.

See the Ray 2.3 release blog for more information!

Ray 2.3 also includes other features beyond the two major ones described below.

Dashboard usability improvements

We’re also looking into a revamp of the dashboard UI to improve the information hierarchy and usability. We are taking a user-journey-driven approach to organizing the dashboard so that developers and infra engineers alike can quickly get to the information they need. This means organizing the dashboard by top-level concepts like jobs, cluster (nodes and autoscaler), and logs; improving navigability so you can quickly click through to the information you need; and adding more visualizations and content so you can dig into more details of your application.


Ray timeline and advanced progress bar

We wish to build out more advanced visualizations of the tasks that ran in a Ray application. In particular, we want these visualizations to be valuable after a Ray job has finished (either successfully or with an error).


The timeline view is a higher level view that lets you optimize or debug errors in your job. You can quickly see how long tasks are taking to run in your application and how well the workload is distributed across all the workers in your cluster.
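To make the two questions the timeline answers more concrete (how long tasks take, and how evenly work is spread across workers), here is a small sketch over made-up data. The `EVENTS` tuples, the worker names, and the `imbalance` metric are assumptions for illustration; the actual timeline is a visual tool and may derive these differently.

```python
from collections import defaultdict

# Hypothetical timeline events: (task_name, worker_id, start_s, end_s).
EVENTS = [
    ("load", "w1", 0.0, 4.0),
    ("load", "w2", 0.0, 1.0),
    ("train", "w1", 4.0, 10.0),
    ("train", "w2", 1.0, 2.0),
]


def mean_duration_per_task(events):
    """Average run time in seconds for each task name."""
    durations = defaultdict(list)
    for name, _worker, start, end in events:
        durations[name].append(end - start)
    return {name: sum(values) / len(values) for name, values in durations.items()}


def busy_time_per_worker(events):
    """Total seconds each worker spent running tasks.

    Assumes tasks on the same worker do not overlap in time.
    """
    busy = defaultdict(float)
    for _name, worker, start, end in events:
        busy[worker] += end - start
    return dict(busy)


def imbalance(busy):
    """Busiest worker's time divided by the mean; 1.0 means perfectly even."""
    mean = sum(busy.values()) / len(busy)
    return max(busy.values()) / mean


if __name__ == "__main__":
    busy = busy_time_per_worker(EVENTS)
    print(mean_duration_per_task(EVENTS))
    print(busy, round(imbalance(busy), 2))
```

In this sample, worker `w1` does most of the work, so the imbalance ratio is well above 1.0. A timeline would show the same thing as one long row of tasks next to a mostly idle one.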

We also want to improve the progress bar, for example by adding conceptual task groups so that progress can be viewed in terms of high-level steps. We also want to make it easier to determine whether an error occurred within the task itself or because a downstream dependency errored.
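Telling a task's own error apart from one inherited through a dependency can be sketched as a walk over the task graph. The heuristic below is an assumption for illustration, not the dashboard's logic: a failed task is attributed to a dependency whenever one of the tasks it depends on also failed. The `DEPS` graph, the `FAILED` set, and both function names are hypothetical.

```python
# Hypothetical task graph: each task maps to the tasks it depends on.
DEPS = {
    "load": [],
    "clean": ["load"],
    "train": ["clean"],
    "report": ["train"],
    "audit": [],
}
# Hypothetical set of tasks that ended in an error.
FAILED = {"clean", "train", "report", "audit"}


def classify_failures(deps, failed):
    """Label each failed task as failing on its own or via a dependency."""
    causes = {}
    for task in failed:
        has_failed_dep = any(dep in failed for dep in deps.get(task, []))
        causes[task] = "dependency" if has_failed_dep else "own_error"
    return causes


def root_causes(deps, failed, task):
    """Follow failed dependencies back to the tasks where the error began.

    Assumes the graph has no cycles.
    """
    failed_deps = [dep for dep in deps.get(task, []) if dep in failed]
    if not failed_deps:
        return {task}
    roots = set()
    for dep in failed_deps:
        roots |= root_causes(deps, failed, dep)
    return roots


if __name__ == "__main__":
    print(classify_failures(DEPS, FAILED))
    print(root_causes(DEPS, FAILED, "report"))
```

Here `report` and `train` are attributed to a dependency, and tracing `report` leads back to `clean`, which is the only task in that chain that failed on its own.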


scottsun94 commented 1 year ago

Here is the Public PRD for Ray Logging which will guide the future improvements to Ray Logging.

Please take a look and leave your feedback.

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public Slack channel.