Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem

Multi-Instance GPU (MIG) lets us partition one GPU into several independent instances of different sizes, e.g. one A100 into 7 instances (A100-1/7). There are still restrictions on how a GPU can be partitioned (see ch1).

Goal: partition GPUs into instances of different sizes for DNN serving, while satisfying the ops/p99 requirements of the different models (service level objectives, SLOs).

Problem Definition: Reconfigurable Machine Scheduling Problem (RMS); see ch3.1.

We have several machines. Different machines have different processing times for different jobs.
These machines are reconfigurable (some machines can be replaced with another set of machines), but only under certain rules.
Goal: find a sequence of scheduling and reconfiguration operations that minimizes/maximizes a given objective (e.g. cost/performance/...).
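
To make the three ingredients above concrete (machine-dependent processing times, rule-bound reconfiguration, an objective), here is a minimal toy sketch. All machine types, job names, numbers, and the `reconfigure`/`makespan` helpers are hypothetical illustrations, not the paper's formalization; makespan is just one possible objective.

```python
from collections import Counter

# Hypothetical processing times: PROC_TIME[machine_type][job], arbitrary units.
PROC_TIME = {
    "small": {"jobA": 4.0, "jobB": 9.0},
    "large": {"jobA": 1.5, "jobB": 2.0},
}

# Hypothetical reconfiguration rules: (old set of machines, new set of machines).
RULES = [
    (Counter({"large": 1}), Counter({"small": 2})),
    (Counter({"small": 2}), Counter({"large": 1})),
]

def reconfigure(machines, rule):
    """Apply one rule to a multiset of machines and return the new multiset."""
    old, new = rule
    if any(machines[m] < n for m, n in old.items()):
        raise ValueError("rule not applicable to the current machines")
    return machines - old + new

def makespan(assignment):
    """assignment: one (machine_type, [jobs]) entry per machine; jobs run sequentially."""
    return max(sum(PROC_TIME[m][j] for j in jobs) for m, jobs in assignment)

# One large machine runs both jobs, or it is split into two small machines.
machines = Counter({"large": 1})
before = makespan([("large", ["jobA", "jobB"])])
machines = reconfigure(machines, RULES[0])
after = makespan([("small", ["jobA"]), ("small", ["jobB"])])
```

With these made-up numbers the split is worse for makespan, which is the point of the problem: whether a reconfiguration pays off depends on the jobs and the objective.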

[see ch3.3] This paper focuses on a variant of RMS: serving DNNs on GPUs with MIG.

Here the jobs (DNN serving) are long-running.
DNN serving jobs have non-linear performance with respect to instance size.
We can reconfigure GPUs under some rules.
Goal: find the most efficient GPU partitions and service assignments, minimizing the number of GPUs used.
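
As a rough illustration of why non-linear performance matters for the GPU count, the sketch below picks an instance size per model and packs the instances into GPUs. It is a simple greedy heuristic written for these notes, not the paper's algorithm. The throughput numbers are invented, the function names are hypothetical, latency (p99) is ignored, and MIG's real partitioning restrictions are replaced by a plain 7-slice capacity check.

```python
import math

GPU_SLICES = 7  # one A100 is treated as 7 slices (A100-1/7 is one slice)

# Hypothetical throughput (requests/s) of one instance, keyed by size in slices.
# Invented numbers that scale non-linearly; they are not measurements.
THROUGHPUT = {
    "modelA": {1: 100, 2: 180, 4: 300, 7: 400},
    "modelB": {1: 20, 2: 70, 4: 160, 7: 300},
}

def instances_needed(model, required_rate):
    """Pick the size with the best throughput per slice; return (size, count)."""
    perf = THROUGHPUT[model]
    size = max(perf, key=lambda s: perf[s] / s)
    return size, math.ceil(required_rate / perf[size])

def gpus_needed(requirements):
    """First-fit-decreasing packing of instances into 7-slice GPUs.

    Simplification: only the total slice count per GPU is checked,
    not the actual MIG partitioning rules.
    """
    sizes = []
    for model, rate in requirements.items():
        size, count = instances_needed(model, rate)
        sizes += [size] * count
    free = []  # free slices left on each GPU opened so far
    for size in sorted(sizes, reverse=True):
        for i, f in enumerate(free):
            if f >= size:
                free[i] -= size
                break
        else:
            free.append(GPU_SLICES - size)  # open a new GPU
    return len(free)

n_gpus = gpus_needed({"modelA": 450, "modelB": 500})
```

In this toy data modelA is most efficient on small instances and modelB on a full GPU, so a single fixed partition for all GPUs would waste resources for one of them.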
