ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.18k stars 5.61k forks

[Doc] Document ray.exceptions.LocalRayletDiedError and common causes in observability doc #33200

Open scottsun94 opened 1 year ago

scottsun94 commented 1 year ago

Description

Users want to know the common causes of errors. Here is an example: https://ray-distributed.slack.com/archives/C01DLHZHRBJ/p1678139301678139

Link

No response

scottsun94 commented 1 year ago

cc: @angelinalg

rickyyx commented 1 year ago

Is this something helpful? https://docs.ray.io/en/latest/ray-core/api/exceptions.html

Maybe we should have surfaced this better.

scottsun94 commented 1 year ago

> Is this something helpful? https://docs.ray.io/en/latest/ray-core/api/exceptions.html
>
> Maybe we should have surfaced this better.

Yeah, but I feel that few people will check this out.

Ideally, we would print the possible causes together with the error, along with a link to the documentation page with more details.
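To make the idea concrete, here is a rough sketch of what "print the possible causes together with the error" could look like. This is not Ray's actual implementation: the `HINTS` registry, the `ErrorHint` type, and the `describe` helper are hypothetical names, the cause strings are placeholders rather than a verified list, and a stand-in exception class is used so the snippet runs without Ray installed.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ErrorHint:
    """Hypothetical record: likely causes plus a docs link for one error type."""
    causes: Tuple[str, ...]
    doc_url: str


# Hypothetical registry keyed by exception class name.
# The cause strings are placeholders, not an authoritative list.
HINTS: Dict[str, ErrorHint] = {
    "LocalRayletDiedError": ErrorHint(
        causes=(
            "<placeholder: common cause 1>",
            "<placeholder: common cause 2>",
        ),
        doc_url="https://docs.ray.io/en/latest/ray-core/api/exceptions.html",
    ),
}


def describe(exc: BaseException) -> str:
    """Return the error text, followed by known causes and a docs link if any."""
    name = type(exc).__name__
    lines = [f"{name}: {exc}"]
    hint = HINTS.get(name)
    if hint is not None:
        lines.append("Possible causes:")
        lines.extend(f"  - {cause}" for cause in hint.causes)
        lines.append(f"More details: {hint.doc_url}")
    return "\n".join(lines)


class LocalRayletDiedError(Exception):
    """Stand-in for ray.exceptions.LocalRayletDiedError (assumption: Ray not installed)."""


if __name__ == "__main__":
    try:
        raise LocalRayletDiedError("the task's local raylet died")
    except LocalRayletDiedError as err:
        print(describe(err))
```

Errors without a registry entry fall through unchanged, so a lookup like this could be rolled out one exception type at a time.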

angelinalg commented 1 year ago

It sounds like you're describing a Troubleshooting Guide. Is there an existing doc of common errors? If not, one way to seed a guide like this is to scrape Slack channels for common questions.

scottsun94 commented 1 year ago

We do have a Troubleshooting Guide:

[Screenshot, 2023-03-10]

Maybe we just need to include more common errors and their causes here.

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

scottsun94 commented 1 year ago

[Screenshot, 2023-08-14]

After the doc refactoring in 2.5, we have this page.

We should add common errors here.

Concretely, we should add ray.exceptions.LocalRayletDiedError first.

stale[bot] commented 11 months ago

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!