ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core|Dashboard] Support custom tags for jobs. #34187

Closed jhasm closed 1 year ago

jhasm commented 1 year ago

Description

It would be very helpful to let users add custom tags to Ray jobs at submission time and see those tags in the job metadata on the dashboard. If this data could also be exported or stored in external durable storage, it would become a great resource for fault, latency, and resource analysis and reporting.

The tags can be used for grouping, filtering and reporting purposes. This can also be used for enriching ML experiment tracking and model lineage, as well as cost attribution.

Use case

We run multi-tenant Ray clusters to balance the number of idle Ray clusters against the latency of creating a new cluster on demand. This means multiple users from various teams and projects submit jobs to a given Ray cluster. In this situation, the only definite information users have is the Ray job_id, so they have to record their jobs in a separate database to track their work for the team or project.

The following challenges are hard to address without any metadata on the jobs:

  1. Group jobs by users, teams or projects. Report an aggregated view of jobs.
  2. Attribute jobs to a user, team or model.
  3. Attribute cost to a user, team or model.
  4. Estimate resource requirements or plan budgets at team or project level.
  5. Identify common patterns of job failures or delays, or offenders of compute resources across jobs.
  6. Analyze the job data at user, team or project level to get insights and make recommendations.
  7. Enrich the data and model lineage graphs with ray jobs and associated metadata for better experiment tracking and reproducibility.
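
As a rough illustration of items 1 and 5, here is a minimal sketch of grouping job records by a tag and counting failures per group. The record shape (a `job_id`, a `status`, and a `metadata` dictionary of tags) and the helper name `group_jobs_by_tag` are assumptions made for this example, not an existing Ray API.

```python
from collections import defaultdict


def group_jobs_by_tag(jobs, tag, default="untagged"):
    """Group job records by the value of one tag in their metadata.

    `jobs` is assumed to be a list of dicts with an optional "metadata"
    dict of string tags (a hypothetical shape used for illustration).
    """
    groups = defaultdict(list)
    for job in jobs:
        # Jobs submitted without tags may have no metadata at all.
        metadata = job.get("metadata") or {}
        groups[metadata.get(tag, default)].append(job)
    return dict(groups)


# Hypothetical job records, as they might be exported from a cluster.
jobs = [
    {"job_id": "01000000", "status": "SUCCEEDED", "metadata": {"team": "search"}},
    {"job_id": "02000000", "status": "FAILED", "metadata": {"team": "ads"}},
    {"job_id": "03000000", "status": "FAILED", "metadata": {"team": "search"}},
    {"job_id": "04000000", "status": "RUNNING", "metadata": None},
]

by_team = group_jobs_by_tag(jobs, "team")

# Aggregate view: number of failed jobs per team.
failures_per_team = {
    team: sum(1 for job in group if job["status"] == "FAILED")
    for team, group in by_team.items()
}
```

The same grouping could be keyed on a user, project, or model tag to support cost attribution, provided those tags are attached at submission time.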

rkooo567 commented 1 year ago

cc @edoakes @architkulkarni

edoakes commented 1 year ago

There is already a metadata field where you can pass an arbitrary JSON dictionary and it's returned in the GET job endpoint. This just needs to be exposed in the dashboard.
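
For reference, a minimal sketch of attaching tags through that metadata field when submitting over the Jobs REST API. It assumes the dashboard is reachable at `http://127.0.0.1:8265` and that jobs are submitted with a POST to `/api/jobs/`; the helper name `build_submit_request`, the entrypoint, and the tag names are made up for illustration. The request is only built here, not sent.

```python
import json
import urllib.request


def build_submit_request(address, entrypoint, tags):
    """Build (but do not send) a job submission request carrying tags.

    The tags travel in the "metadata" field of the JSON body. Values are
    converted to strings here as a conservative assumption.
    """
    payload = {
        "entrypoint": entrypoint,
        "metadata": {str(key): str(value) for key, value in tags.items()},
    }
    return urllib.request.Request(
        url=f"{address.rstrip('/')}/api/jobs/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_submit_request(
    "http://127.0.0.1:8265",   # assumed dashboard address
    "python train.py",         # hypothetical entrypoint
    {"team": "search", "project": "ranker", "owner": "alice"},
)

# To actually submit against a running cluster:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

The Python SDK's `JobSubmissionClient.submit_job` accepts a `metadata` argument for the same purpose, which is usually more convenient than hand-built requests.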

rkooo567 commented 1 year ago

Added to the polish item

alanwguo commented 1 year ago

@edoakes, @scottsun94 noticed that this metadata cannot be passed in via the CLI today. Is this something that can be added?

edoakes commented 1 year ago

certainly can be

rkooo567 commented 1 year ago

Maybe cc @architkulkarni? seems like an easy fix (I assume it would take 10m to finish it). If he's busy I can also take a look

architkulkarni commented 1 year ago

Addressed here https://github.com/ray-project/ray/pull/34586 for the CLI.