ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[jobs] Support notifications for ray job and actor lifecycle events #39508

Open jhasm-ck opened 1 year ago

jhasm-ck commented 1 year ago

Description

Currently, the only ways to know whether a job has failed or succeeded are through metrics or by manually checking the dashboard. As a user, I would like to get a notification when my job completes, whether it fails or succeeds.

Metrics are not fine-grained enough, since there might be a large number of jobs running and I might only be interested in a specific one. Manual checking is time consuming and unreliable. A way of getting notified for the jobs I am interested in is essential, especially given the long-running nature of ML jobs.

While building every kind of notification mechanism is out of scope for Ray, Ray could provide a way to register a listener. People could implement listeners that send notifications to the system of their choice and share popular listeners with the community (see the sketch after the use cases below).

Use case

  1. Getting a notification when my job fails.
  2. Getting a notification when my job completes.
  3. Being able to build notifications to any system or platform of my choice.
  4. Being able to decide which notifications to send out based on the job in the event.
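
To make the ask concrete, here is a minimal sketch of what such a listener hook could look like. Everything in it (`JobEvent`, `register_job_listener`, the notification logic) is hypothetical and not an existing Ray API; it only illustrates the shape of the interface being requested.

```python
# Hypothetical listener interface -- NOT an existing Ray API, just an
# illustration of the hook this issue is asking for.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class JobEvent:
    job_id: str
    status: str          # e.g. "SUCCEEDED", "FAILED", "STOPPED"
    message: str = ""


_listeners: List[Callable[[JobEvent], None]] = []


def register_job_listener(listener: Callable[[JobEvent], None]) -> None:
    """Register a callback to be invoked on every job lifecycle event."""
    _listeners.append(listener)


def _emit(event: JobEvent) -> None:
    """What Ray would do internally whenever a job changes state."""
    for listener in _listeners:
        listener(event)


# Example user-defined listener: only notify for the job I care about.
def notify_on_failure(event: JobEvent) -> None:
    if event.job_id == "my-training-job" and event.status == "FAILED":
        print(f"notify: job {event.job_id} failed: {event.message}")


register_job_listener(notify_on_failure)
```
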
jhasm-ck commented 1 year ago

cc @richardliaw

scottsun94 commented 1 year ago

Would job events be enough? Could users scrape the events and trigger downstream notification/alert systems?

jhasm-ck commented 1 year ago

Events are what we need, but scraping (pull) is not efficient, and building this scraping infrastructure across hundreds of Ray clusters is a major challenge. A more fundamental limitation of that approach is that you cannot scrape the events after the Ray cluster is gone. If we are recommending that folks use Ray clusters as ephemeral units, getting the timing right for the scraping will never be perfect.

We need a mechanism to persist events as they occur and make them more durable than the Ray cluster itself. Writing to a Kafka topic is one example. More importantly, providing a way for folks to write their own listeners is the central piece and a more general solution.
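
As a rough illustration of the Kafka idea, a user-provided listener could publish each event to a topic so it outlives the cluster. The sketch below assumes the kafka-python client and reuses the hypothetical `JobEvent` / `register_job_listener` shape from the issue description; none of this is existing Ray functionality, and the broker address and topic name are placeholders.

```python
# Sketch of a listener that persists job events to Kafka, assuming the
# kafka-python client. The broker address and topic name are placeholders.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def kafka_listener(event) -> None:
    # Durable record of the event, independent of the Ray cluster's lifetime.
    producer.send(
        "ray-job-events",
        value={"job_id": event.job_id, "status": event.status, "message": event.message},
    )
    producer.flush()


# register_job_listener(kafka_listener)  # hypothetical registration hook
```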

shrekris-anyscale commented 1 year ago

cc @akshay-anyscale

han-steve commented 11 months ago

We have a similar request. Something like Argo Workflows' onExit step would be amazing. For example, we could specify a Docker image, and KubeRay could launch it with relevant information passed as arguments when a job finishes.
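
Until something like that exists, one client-side approximation is to wrap job submission with the Ray Jobs SDK and run an exit handler once the job reaches a terminal state. In the sketch below the `JobSubmissionClient` calls are real Ray Jobs SDK APIs, while the entrypoint, dashboard address, and `notify.sh` command are placeholders.

```python
# Client-side stand-in for an Argo-style onExit handler, built on the Ray
# Jobs SDK. The notification command is a placeholder for your own system.
import subprocess
import time

from ray.job_submission import JobSubmissionClient, JobStatus

client = JobSubmissionClient("http://127.0.0.1:8265")
submission_id = client.submit_job(entrypoint="python train.py")

# Wait for the job to reach a terminal state.
while True:
    status = client.get_job_status(submission_id)
    if status in {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}:
        break
    time.sleep(30)

# "onExit" step: launch anything you like with the job's outcome as arguments.
subprocess.run(["./notify.sh", submission_id, status.value], check=True)
```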

songole commented 6 months ago

Any thoughts or updates on this request? We have long-running Hydra jobs and are using the Ray launcher for HPO tasks. Some of them fail randomly, and we would like to figure out what is causing the failures.

scottsun94 commented 6 months ago

cc: @anyscalesam

prd-tuong-nguyen commented 6 months ago

Any update on this?

ammarar commented 5 months ago

Are there any updates on this? We want to ingest the job events into an external system.

alanwguo commented 5 months ago

We're currently working with the community on producing a REP for an export API. This API would enable exporting metadata, including job events, via a REST API.

We are aiming for the API to be event-based so that it can be polled at variable rates, and events will not be missed as long as the polling rate is reasonably frequent.

CC: @MissiontoMars
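
Until that export API lands, one interim illustration of the pull model described above is a small watcher that polls the existing Ray Jobs SDK and reacts whenever a job's status changes. The dashboard address, polling interval, and notification call below are placeholders.

```python
# Interim pull-based watcher using the existing Ray Jobs SDK: poll at a
# chosen rate and emit a notification whenever a job's status changes.
import time

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
last_seen = {}  # submission_id -> last observed status

while True:
    for job in client.list_jobs():
        prev = last_seen.get(job.submission_id)
        if prev != job.status:
            last_seen[job.submission_id] = job.status
            # Placeholder: push to Slack, PagerDuty, Kafka, etc.
            print(f"{job.submission_id}: {prev} -> {job.status}")
    time.sleep(60)  # poll frequently enough that transitions are not missed
```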

MissiontoMars commented 5 months ago

> We're currently working with the community on producing a REP for an export API. This API would enable exporting metadata, including job events, via a REST API.
>
> We are aiming for the API to be event-based so that it can be polled at variable rates, and events will not be missed as long as the polling rate is reasonably frequent.

Got it. I am working on it.