microsoft / DLWorkspace

Deep Learning Workspace

support preemptable inference jobs #1219

Closed leigaoms closed 4 years ago

leigaoms commented 4 years ago
  1. RestAPI:
    • SubmitJob(): parse "mingpu" and "maxgpu" in jobParams, while staying compatible with the legacy "resourcegpu" and "gpulimit" fields (see the parsing sketch after this list).
    • ScaleJob(): change "resourcegpu" to "mingpu" and "maxgpu". No compatibility issue here.
  2. JobManager:
    • Add "job_preemptable_resource" and "allowed_resource" to job_info. If an inference job is scheduling/running, subtract the "mingpu"-related resource and keep the job around for preemptable GPU allocation later.
    • Scheduling logic, in priority order:
      1) Mark non-preemptable training jobs: job status is "queued".
      2) Mark the non-preemptable part of inference jobs: job status is "queued". Allocate only if all "mingpu"-related resource can be satisfied.
      3) Mark preemptable training jobs: job status is "queued/scheduling/running".
      4) Mark the preemptable part of inference jobs: job status is "queued/scheduling/running". Allocate partial resource if not all "maxgpu"-related resource can be satisfied. Assume CPU/memory are more plentiful than GPU, so allocate GPU first and scale CPU/memory in proportion to the allocated GPU (see the allocation sketch after this list).
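
A minimal sketch of the SubmitJob() compatibility parsing, in Python; the helper name parse_gpu_range() and the exact defaulting rules are assumptions, not the actual DLWorkspace code:

```python
def parse_gpu_range(job_params):
    """Return (min_gpu, max_gpu) from jobParams, accepting both the new
    "mingpu"/"maxgpu" fields and the legacy "resourcegpu"/"gpulimit" ones."""
    if "mingpu" in job_params or "maxgpu" in job_params:
        min_gpu = int(job_params.get("mingpu", 0))
        max_gpu = int(job_params.get("maxgpu", min_gpu))
    else:
        # Legacy fields: a fixed-size job has min == max unless a limit is set.
        min_gpu = int(job_params.get("resourcegpu", 0))
        max_gpu = int(job_params.get("gpulimit", min_gpu))
    if max_gpu < min_gpu:
        raise ValueError("maxgpu must be >= mingpu")
    return min_gpu, max_gpu
```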
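And a minimal sketch of the step-4 rule (allocate GPU first, CPU/memory in proportion to the granted GPU); the field names and the allocate_preemptable() helper are illustrative assumptions:

```python
def allocate_preemptable(job, free_gpu):
    """Grant up to the job's preemptable GPU demand (maxgpu - mingpu) and
    scale the preemptable CPU/memory request by the fraction granted."""
    requested_gpu = job["maxgpu"] - job["mingpu"]
    if requested_gpu == 0:
        return {"gpu": 0, "cpu": 0, "memory": 0}
    granted_gpu = min(requested_gpu, free_gpu)
    ratio = granted_gpu / requested_gpu
    return {
        "gpu": granted_gpu,
        # CPU and memory follow the allocated GPU in proportion.
        "cpu": job["preemptable_cpu"] * ratio,
        "memory": job["preemptable_memory"] * ratio,
    }
```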

Refinement:

  1. Fair sharing: in step 4), share the remaining GPUs among the different inference jobs proportionally or evenly (see the sketch below).
  2. The CPU/memory assumption above does not apply to CPU clusters or CPU inference jobs.
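
A minimal sketch of the proportional fair-sharing refinement; splitting by each job's preemptable demand (maxgpu - mingpu) is an assumption about what "in proportion" means here:

```python
def fair_share(jobs, free_gpu):
    """Split free_gpu among inference jobs in proportion to each job's
    preemptable demand, rounding down and then distributing the remainder."""
    demands = {j["id"]: j["maxgpu"] - j["mingpu"] for j in jobs}
    total = sum(demands.values())
    if total == 0 or total <= free_gpu:
        return demands  # every job's preemptable demand can be fully met
    # Floor of the proportional share, so the sum never exceeds free_gpu.
    grants = {jid: free_gpu * d // total for jid, d in demands.items()}
    leftover = free_gpu - sum(grants.values())
    # Hand out leftover GPUs one at a time, largest unmet demand first.
    for jid in sorted(grants, key=lambda j: demands[j] - grants[j], reverse=True):
        if leftover == 0:
            break
        if grants[jid] < demands[jid]:
            grants[jid] += 1
            leftover -= 1
    return grants
```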
coveralls commented 4 years ago

Pull Request Test Coverage Report for Build 3582


Totals Coverage Status:
  • Change from base Build 3578: 0.0%
  • Covered Lines: 827
  • Relevant Lines: 874

💛 - Coveralls
xudifsd commented 4 years ago

Have you tested these, or is this just the behaviour you expect:

leigaoms commented 4 years ago

Have you tested these, or is this just the behaviour you expect: