Open EnjianGong opened 1 year ago
cc @jjyao
Is there a solution to this issue, @clarng?
Currently this is not possible.
Do you have more details on the proposal? Depending on what it is, it may be possible to modularize the code base.
@EnjianGong I'd like to know more about the scheduling mechanism you want. Could you elaborate?
Currently, there is only one simple idea: train a resource prediction model and use its predictions in the scheduler's decision process.
There are several challenges:
So the Ray scheduler is basically a function that maps an input task plus the cluster status to an output node. Are you saying you want to train an ML model to be that function?
Yes, use this function to obtain a resource prediction.
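To make the exchange above concrete, here is a minimal sketch of a scheduler written as a function from (task, cluster status) to a node, with a pluggable resource predictor. This is not Ray's scheduler or any Ray API: `Task`, `NodeState`, `Predictor`, and `schedule` are hypothetical names used only for illustration, and the predictor could be any trained model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Task:
    cpus: float  # CPUs requested by the task


@dataclass
class NodeState:
    total_cpus: float
    used_cpus: float  # CPUs in use right now


# Hypothetical predictor: (node_id, current state) -> CPUs expected to be in use
# over the next scheduling window. A trained model would sit behind this.
Predictor = Callable[[str, NodeState], float]


def schedule(task: Task, cluster: Dict[str, NodeState], predict: Predictor) -> Optional[str]:
    """Return the node with the most predicted free CPUs that fits the task, or None."""
    best_node, best_free = None, -1.0
    for node_id, state in cluster.items():
        # Be conservative: never assume less usage than what is observed now.
        predicted_used = max(state.used_cpus, predict(node_id, state))
        free = state.total_cpus - predicted_used
        if free >= task.cpus and free > best_free:
            best_node, best_free = node_id, free
    return best_node


# Usage: node "b" looks emptier right now, but the forecast says it will fill up.
cluster = {"a": NodeState(total_cpus=8, used_cpus=2), "b": NodeState(total_cpus=8, used_cpus=1)}
forecast = {"a": 3.0, "b": 7.0}  # made-up predictions for illustration
chosen = schedule(Task(cpus=2), cluster, lambda node_id, state: forecast[node_id])
print(chosen)
```

Keeping the predictor behind a plain callable means the placement logic can stay unchanged while the prediction model is swapped or retrained.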
Description
Most of our company's workload consists of AP offline tasks, so the resource utilization of the entire Ray cluster is very low, and there are many idle resource fragments in both time and space. We would therefore like to use DQN to learn a model of the resource fluctuations, predict resource occupancy for a period of time in the future, and then share the idle resources to reduce costs. To do this, we want to customize Ray Core's scheduling mechanism so that it adapts to workload fluctuations.
Use case
Workload scenario: there are many idle resources in space and time, and resource occupancy is strongly periodic.
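Because the occupancy is described as strongly periodic, a simple baseline can illustrate what "predict occupancy, then share the idle part" might look like. The sketch below is a seasonal-naive forecast (repeat the last observed cycle), not the DQN approach proposed above. The function names, the period, and the sample numbers are all assumptions made for illustration.

```python
from typing import List


def seasonal_naive_forecast(history: List[float], period: int, horizon: int) -> List[float]:
    """Forecast future usage by repeating the most recent full cycle."""
    if period <= 0 or len(history) < period:
        raise ValueError("need at least one full period of history")
    last_cycle = history[-period:]
    return [last_cycle[i % period] for i in range(horizon)]


def idle_capacity(total: float, forecast: List[float]) -> List[float]:
    """Capacity expected to be idle at each future step (never negative)."""
    return [max(0.0, total - used) for used in forecast]


# Made-up CPU usage samples covering two cycles of length 4.
history = [2.0, 6.0, 7.0, 3.0, 2.0, 6.0, 8.0, 3.0]
forecast = seasonal_naive_forecast(history, period=4, horizon=6)
shareable = idle_capacity(total=8.0, forecast=forecast)
print(forecast)
print(shareable)
```

A learned model would replace `seasonal_naive_forecast`; a baseline like this is mainly useful as a reference point for judging whether the learned model actually improves the prediction.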