Open · jhasm-ck opened this issue 1 year ago
cc @richardliaw
Will job events be enough? Users could scrape the events themselves and trigger downstream notification/alert systems.
Events are what we need, but scraping (pull) is not efficient, and building this scraping infra across hundreds of Ray clusters is a major challenge. A more fundamental limitation of that approach is that you cannot scrape for events after the Ray cluster is gone. If we are recommending that folks use Ray clusters as ephemeral units, getting the timing right for scraping will never be perfect.
We need a mechanism to persist each event as it occurs and make the events more durable than the Ray cluster itself. Writing to a Kafka topic is one example. More importantly, a way for folks to write their own listeners is the central piece and the more general solution.
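For illustration, a minimal sketch of what such a user-supplied listener could look like. The `on_job_event` hook and the event dict schema are hypothetical (Ray does not expose this today); the Kafka producer calls use the real kafka-python API:

```python
# Sketch only: the listener hook and event schema are assumptions,
# not an existing Ray API. KafkaProducer is from kafka-python.
import json
from kafka import KafkaProducer


class KafkaJobEventListener:
    """Persists job events to a Kafka topic so they outlive the Ray cluster."""

    def __init__(self, bootstrap_servers: str, topic: str = "ray-job-events"):
        self._topic = topic
        self._producer = KafkaProducer(
            bootstrap_servers=bootstrap_servers,
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )

    def on_job_event(self, event: dict) -> None:
        # `event` is assumed to carry fields like job_id, status, timestamp.
        self._producer.send(self._topic, value=event)

    def close(self) -> None:
        # Flush buffered events before the (ephemeral) cluster shuts down.
        self._producer.flush()
        self._producer.close()
```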
cc @akshay-anyscale
We have a similar request. Something like Argo Workflows' onExit step would be amazing. For example, we could specify a Docker image that KubeRay launches with relevant information passed as arguments when a job finishes.
Any thoughts or updates on this request? We have long-running Hydra jobs and are using the Ray launcher for HPO tasks. Some of them fail randomly, and we would like to figure out what is causing the failures.
cc: @anyscalesam
Any update on this?
Are there any updates on this? We want to ingest the jobs into an external system.
We're currently working with the community on producing a REP for an export API. This API would enable exporting metadata, including job events, via a REST API.
We are aiming for the API to be event-based so that it can be polled at variable rates, and events will not be missed as long as the polling rate is reasonably frequent.
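As a rough sketch of how polling could work once the export API lands. The endpoint path, query parameters, and response schema below are assumptions, since the actual API is still being defined in the REP:

```python
# Sketch only: the export endpoint, cursor parameter, and response fields
# are hypothetical; the real API is not finalized.
import time

import requests

DASHBOARD_URL = "http://head-node:8265"                    # Ray dashboard address (assumption)
EXPORT_ENDPOINT = f"{DASHBOARD_URL}/api/v0/events/export"  # hypothetical path


def handle_event(event: dict) -> None:
    print(event)  # placeholder: forward to Kafka, Slack, a database, etc.


def poll_job_events(poll_interval_s: float = 30.0) -> None:
    cursor = None  # resume token so events are not missed between polls
    while True:
        params = {"cursor": cursor} if cursor else {}
        resp = requests.get(EXPORT_ENDPOINT, params=params, timeout=10)
        resp.raise_for_status()
        payload = resp.json()
        for event in payload.get("events", []):
            handle_event(event)
        cursor = payload.get("next_cursor", cursor)
        time.sleep(poll_interval_s)
```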
CC: @MissiontoMars
Got it. I am working on it.
Description
Currently, the only ways to know whether a job failed or succeeded are through metrics or by manually looking at the dashboard. As a user, I would like to get a notification when my job completes, whether it fails or succeeds.
Metrics are not fine-grained enough, since there might be a large number of jobs running and I might only be interested in a specific one. Manual checking is time-consuming and unreliable. A way of getting notified for the jobs I am interested in is essential, especially given the long-running nature of ML jobs.
While building every kind of notification mechanism is out of scope for Ray, Ray could provide a way to register a listener. People could then implement listeners that send notifications to the system of their choice and share popular listeners with the community, as sketched below.
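Purely for illustration, one possible shape for such a notification listener. The registration call and event fields are hypothetical, and the Slack webhook URL is a placeholder; nothing here is an existing Ray API:

```python
# Sketch only: `register_job_listener` and the event fields are assumptions;
# this shows the kind of interface the feature request asks for.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder


class SlackNotificationListener:
    """Sends a Slack message when a job reaches a terminal state."""

    TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}

    def on_job_event(self, event: dict) -> None:
        status = event.get("status")
        if status not in self.TERMINAL_STATES:
            return
        text = f"Ray job {event.get('job_id')} finished with status {status}"
        requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)


# Hypothetical registration call a future Ray API might expose:
# ray.job_submission.register_job_listener(SlackNotificationListener())
```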
Use case