Core: Can the ray core's scheduling mechanism support customized extensions?

ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

https://ray.io

Apache License 2.0


Core: Can the ray core's scheduling mechanism support customized extensions? #33735

Open EnjianGong opened 1 year ago

EnjianGong commented 1 year ago

Description

Most of our company's workloads are offline AP tasks, so resource utilization across the whole Ray cluster is very low, and there are many idle resource fragments in both time and space. I would therefore like to use DQN to learn a model of resource fluctuation, predict resource occupancy for a period into the future, and share the idle resources to reduce costs. To do this, we want to customize Ray Core's scheduling mechanism so that it adapts to workload fluctuations.

Use case

Workload scenario: There are many idle resources in both space and time, and resource occupancy is strongly periodic.

rkooo567 commented 1 year ago

cc @jjyao

EnjianGong commented 1 year ago

Is there a solution to this issue, @clarng?

rkooo567 commented 1 year ago

Currently this is not possible

clarng commented 1 year ago

Do you have more details on the proposal? Depending on what it is, perhaps it is possible to modularize the code base.

jjyao commented 1 year ago

@EnjianGong I'd like to know more about the scheduling mechanism you want. Could you elaborate?

EnjianGong commented 1 year ago

Currently, I have only one simple idea: train a resource prediction model and use it for inference inside the scheduler.

Observation: CPU/MEM/IO load status
Reward: resource utilization rate, task satisfaction rate

There are several challenges:

How to coordinate the rule-based scheduler and the model-based scheduler, and determine their priority relationship.
Over time, the workload distribution will drift, so the optimal scheduling policy will change as well.
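One possible way to handle both challenges, sketched under stated assumptions: let the rules decide which nodes are feasible (hard constraints), let the model only rank the feasible nodes, and fall back to the rule-based choice whenever the model's recent prediction error suggests drift. Everything here is hypothetical and is not Ray's scheduler or API: `Node`, `Task`, `hybrid_pick`, the best-fit rule, and the `max_error` threshold of 0.2 are illustrative choices.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    node_id: str
    free_cpu: float
    free_mem_gb: float


@dataclass
class Task:
    cpu: float
    mem_gb: float


def feasible(task: Task, nodes: List[Node]) -> List[Node]:
    """Rule-based hard constraints: the node must have enough free resources."""
    return [n for n in nodes
            if n.free_cpu >= task.cpu and n.free_mem_gb >= task.mem_gb]


def rule_based_pick(task: Task, candidates: List[Node]) -> Node:
    """Best fit: choose the node with the least CPU left over after placement."""
    return min(candidates, key=lambda n: n.free_cpu - task.cpu)


def hybrid_pick(task: Task,
                nodes: List[Node],
                model_score: Optional[Callable[[Task, Node], float]],
                recent_error: float,
                max_error: float = 0.2) -> Optional[Node]:
    """Rules filter, model ranks; rules take over if the model looks stale."""
    candidates = feasible(task, nodes)
    if not candidates:
        return None
    # A high recent prediction error is treated as a sign of distribution drift.
    if model_score is None or recent_error > max_error:
        return rule_based_pick(task, candidates)
    return max(candidates, key=lambda n: model_score(task, n))


nodes = [Node("a", 2.0, 4.0), Node("b", 8.0, 16.0), Node("c", 1.0, 1.0)]
task = Task(cpu=2.0, mem_gb=2.0)
prefer_roomy = lambda t, n: n.free_cpu  # stand-in for a trained model

trusted = hybrid_pick(task, nodes, prefer_roomy, recent_error=0.05)
drifted = hybrid_pick(task, nodes, prefer_roomy, recent_error=0.5)
```

With this arrangement the priority relationship is explicit: the model can never place a task on a node the rules reject, and the error threshold decides when the model's ranking is trusted at all.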

jjyao commented 1 year ago

So the Ray scheduler is basically a function that maps an input task plus the cluster status to an output node. Are you proposing to train an ML model to be that function?
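As a rough illustration of that framing (not Ray's actual internals), the scheduler can be written as a plain function type, so that a rule-based policy and a prediction-driven policy share one signature and can be swapped. The `Scheduler` alias, `first_fit`, `make_learned_scheduler`, and the idea of reducing cluster status to free CPUs per node are all assumptions made for this sketch.

```python
from typing import Callable, Dict, Optional

# Hypothetical simplification: cluster status is node id -> currently free CPUs,
# and a task is described only by the number of CPUs it needs.
ClusterStatus = Dict[str, float]
Scheduler = Callable[[float, ClusterStatus], Optional[str]]


def first_fit(task_cpus: float, status: ClusterStatus) -> Optional[str]:
    """Rule-based scheduler: first node (by id) with enough free CPUs."""
    for node_id in sorted(status):
        if status[node_id] >= task_cpus:
            return node_id
    return None


def make_learned_scheduler(
    predict_free_cpus: Callable[[str, float], float]
) -> Scheduler:
    """Build a scheduler from a predictor: (node id, free CPUs now) -> free CPUs soon."""
    def schedule(task_cpus: float, status: ClusterStatus) -> Optional[str]:
        predicted = {n: predict_free_cpus(n, free) for n, free in status.items()}
        # The task must fit both now and according to the prediction.
        fitting = [n for n in status
                   if status[n] >= task_cpus and predicted[n] >= task_cpus]
        if not fitting:
            return None
        return max(fitting, key=lambda n: predicted[n])
    return schedule


status = {"a": 4.0, "b": 6.0}

# Stand-in for a trained model: node "b" is about to receive 5 CPUs of load.
busy_b = make_learned_scheduler(lambda n, free: free - (5.0 if n == "b" else 0.0))

rule_choice = first_fit(2.0, status)
model_choice = busy_b(2.0, status)
```

In this sketch the trained model is not the scheduler itself; it supplies a resource prediction that a small wrapper turns into a node choice.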

EnjianGong commented 1 year ago

Yes, use this function to obtain a resource prediction.