ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.89k stars 5.76k forks

[Dashboard] ability to search all the logs, all at once from a single place #29156

Open Joshuaalbert opened 2 years ago

Joshuaalbert commented 2 years ago

Description

When you have lots of actors and a complicated system in production, and you want to respond quickly to an error, it is really hard to open every single actor log file and search each one from the dashboard. In addition, searching through the log files obscures the actor name, which makes it unclear which actor a given log belongs to.

As a dev ops engineer, I would like:

Use case

When a bug happens in production and triggers an alert, we have to find it immediately. Sometimes the bug is hard to locate, as it might be in the middleware, frontend, backend, and so on. However, in all cases we need to inspect the Ray logs. Sometimes it is easy to know which actor to look in, but sometimes it's not, and there's just not enough time to open all the log files.
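For illustration, here is a minimal sketch of what "search all the logs at once" could look like outside the dashboard. It assumes the logs are plain text files under a single directory on one node (by default Ray writes them under /tmp/ray/session_latest/logs). The function name `search_logs` and its parameters are hypothetical and not part of Ray.

```python
import re
from pathlib import Path
from typing import Iterator, Tuple


def search_logs(log_dir: str, pattern: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (file name, line number, line) for each line matching `pattern`.

    Hypothetical helper, not a Ray API: it only walks a local directory.
    """
    regex = re.compile(pattern)
    # Walk the directory recursively so nested log files are included.
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file has odd bytes.
        with path.open("r", errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield path.name, lineno, line.rstrip("\n")
```

Calling `search_logs("/tmp/ray/session_latest/logs", "Traceback")` would then list every matching line together with the file it came from. This only covers one node and says nothing about which actor wrote the line, which is the gap this request is about.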

scottsun94 commented 2 years ago

@Joshuaalbert Thanks for the feedback!

Here are some questions to help us better understand your use case

  1. How do you currently collect, view, and persist the Ray logs in general?
    • Do you use the Ray dashboard, the Ray State API, etc.?
    • Do you use any logging products like Loki, Datadog, etc.?
  2. As a dev ops engineer, how do you find and search logs? What tools do you use?
  3. What products/services trigger the alerts in production?
  4. Will any other users in your org need to view and inspect Ray logs?

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.