ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Job] Job API and Dashboard issues arise when a single job submission contains multiple jobs. #44730

Open liuxsh9 opened 6 months ago

liuxsh9 commented 6 months ago

What happened + What you expected to happen

It appears that Ray assumes a one-to-one correspondence between the submission ID and Job ID, meaning that a single submission only submits one job. However, under certain conditions, a submission ID may correspond to multiple Job IDs, such as when a job script contains multiple sub-jobs.

This is known to cause the following issues:

  1. Job ID fluctuations in the Dashboard Job List, likely due to filtering based on submission ID. Jobs with the same submission ID are only displayed once, which also leads to issues when viewing job logs.
  2. When GCS fault tolerance is configured with Redis, the job list fails to display and the job management APIs behave abnormally, likely because the code does not account for multiple values being returned when the submission ID is used as the key.
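
To illustrate the suspected cause behind both symptoms, here is a minimal sketch in plain Python. It is not Ray's actual implementation: the job records, the ID values, and both function names are hypothetical. It only shows how an index keyed by submission ID keeps a single job per submission, while a one-to-many index keeps all of them.

```python
from collections import defaultdict

# Hypothetical (job_id, submission_id) records; the ID values are made up.
JOBS = [
    ("02000000", "raysubmit_abc"),
    ("03000000", "raysubmit_abc"),
]


def index_one_to_one(records):
    """Assume one job per submission: later records overwrite earlier ones."""
    by_submission = {}
    for job_id, submission_id in records:
        by_submission[submission_id] = job_id
    return by_submission


def index_one_to_many(records):
    """Allow several jobs per submission by collecting job IDs in a list."""
    by_submission = defaultdict(list)
    for job_id, submission_id in records:
        by_submission[submission_id].append(job_id)
    return dict(by_submission)


if __name__ == "__main__":
    print(index_one_to_one(JOBS))
    print(index_one_to_many(JOBS))
```

With the one-to-one index, which job "wins" depends on the order in which records arrive, which would be consistent with the Job ID changing between refreshes of the job list.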

[Screenshot: job API issue]

We would like to know if it is allowed to have multiple jobs in a single submission. If it is allowed, my partners @Bye-legumes @nemo9cby and I can assist in resolving these issues.

Versions / Dependencies

Ray 2.9.3, KubeRay 1.0.0

Reproduction script

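# script.py
import ray

# First ray.init() block: connects to the cluster as one job.
with ray.init() as ctx:
    pass

# Second ray.init() block: connects again under the same submission.
with ray.init() as ctx:
    pass

Submit the script with:

ray job submit -- python script.py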

Issue Severity

Low: It annoys or frustrates me.

anyscalesam commented 5 months ago

Thanks @liuxsh9 and sorry for the delay; we'll triage this and discuss this internally Tue next week. cc @jjyao

jjyao commented 4 months ago

@liuxsh9 I think you are right that Ray currently assumes the 1:1 mapping. I think it's non-trivial to change it.

Could you tell me why you want to have 1:N mapping? Could you just submit those sub-jobs individually?

liuxsh9 commented 4 months ago

@jjyao Thanks for your response! Users may use a Bash script to package a set of jobs. Before execution, they can select and configure jobs based on environment variables, such as cluster location, driver version, or start time. When the Bash script is used as the entrypoint, it triggers multiple jobs. This may not be the optimal approach, but some users are intuitively accustomed to it, and Ray does not prevent them from doing so. We believe that supporting this use case in the dashboard and job management would enhance the flexibility of using Ray clusters.
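
For comparison, the alternative raised earlier in the thread (submitting each sub-job individually) could be sketched roughly as follows. This is an illustration under assumptions, not an existing tool: the `CLUSTER_LOCATION` variable, the job catalog, and the script names are all hypothetical. The only Ray-specific piece is the `ray job submit -- <entrypoint>` command form used in the reproduction above. The sketch only builds the command strings; it does not run them.

```python
import shlex

# Hypothetical catalog mapping a cluster location to the entrypoints to run.
JOB_CATALOG = {
    "us-east": ["python train.py", "python eval.py"],
    "eu-west": ["python train.py"],
}


def select_entrypoints(env):
    """Pick entrypoints from a hypothetical CLUSTER_LOCATION variable."""
    location = env.get("CLUSTER_LOCATION", "")
    return JOB_CATALOG.get(location, [])


def build_submit_commands(env):
    """Build one `ray job submit` command per entrypoint (1:1 mapping)."""
    commands = []
    for entrypoint in select_entrypoints(env):
        argv = ["ray", "job", "submit", "--", *shlex.split(entrypoint)]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    for command in build_submit_commands({"CLUSTER_LOCATION": "us-east"}):
        print(command)
```

Each generated command is its own submission, so every submission ID maps to exactly one Job ID. The trade-off is that the selection logic moves out of the entrypoint script and into whatever runs before submission.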