kubedl-io / kubedl

Run your deep learning workloads on Kubernetes more easily and efficiently.
https://kubedl.io/
Apache License 2.0
509 stars, 79 forks

[feature request] job level scheduling and orchestration #241

Open SimonCqk opened 2 years ago

SimonCqk commented 2 years ago

What would you like to be added:

  1. A job-level queuing and orchestration module that admits jobs for scheduling according to a chosen strategy.
  2. A refactor of the skeleton code to make the control flow cleaner.

Why is this needed:

As of now, KubeDL can manage the lifecycle of job workloads and manipulate pods from the time a job starts until it succeeds or fails. In production, however, users often submit jobs that together request more resources than the cluster has capacity for, so jobs have to wait in a queue before they get a chance to be scheduled. For that reason, we propose a new module and policy to coordinate the order in which jobs are scheduled.
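
To make the idea concrete, here is a minimal sketch of job-level admission in Go. It is only an illustration, not KubeDL code: the `Job` type, the `Strategy` function type, `ByPriority`, and `Admit` are all hypothetical names, and capacity is reduced to a single GPU count. Pending jobs are ordered by a pluggable strategy, and a job is admitted only if its request still fits in the free capacity.

```go
package main

import (
	"fmt"
	"sort"
)

// Job is a hypothetical, simplified view of a queued job.
// Real jobs would carry full resource requests, not just a GPU count.
type Job struct {
	Name     string
	Priority int
	GPUs     int
}

// Strategy reports whether job a should be considered before job b.
// It is an assumed extension point, not an existing KubeDL interface.
type Strategy func(a, b Job) bool

// ByPriority considers higher-priority jobs first.
func ByPriority(a, b Job) bool { return a.Priority > b.Priority }

// Admit orders the pending jobs with the given strategy and admits every job
// whose request fits in the remaining free capacity. Jobs that do not fit are
// skipped and stay queued, so smaller jobs behind them may still be admitted.
func Admit(pending []Job, freeGPUs int, less Strategy) (admitted, queued []Job) {
	// Copy the slice so the caller's ordering is left untouched.
	ordered := append([]Job(nil), pending...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return less(ordered[i], ordered[j])
	})
	for _, job := range ordered {
		if job.GPUs <= freeGPUs {
			freeGPUs -= job.GPUs
			admitted = append(admitted, job)
		} else {
			queued = append(queued, job)
		}
	}
	return admitted, queued
}

func main() {
	pending := []Job{
		{Name: "train-a", Priority: 1, GPUs: 4},
		{Name: "train-b", Priority: 5, GPUs: 6},
		{Name: "train-c", Priority: 3, GPUs: 2},
	}
	admitted, queued := Admit(pending, 8, ByPriority)
	fmt.Println("admitted:", admitted)
	fmt.Println("queued:", queued)
}
```

With 8 free GPUs, `train-b` and `train-c` are admitted and `train-a` stays in the queue. Other orderings, such as first-in-first-out, could be plugged in by passing a different `Strategy`.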

jian-he commented 2 years ago

@SimonCqk could you clarify this?

SimonCqk commented 2 years ago

@jian-he I have updated the description.