AntMan: Dynamic Scaling on GPU Clusters for Deep Learning

What is this paper about? 👋

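In a multi-tenant DL cluster, it is a serious problem that many jobs sit in the queue waiting for GPU resources while the GPUs themselves show low utilization. There are two reasons for this. (1) DLT (deep learning training) jobs do not fully use their GPU resources for the whole duration of training. (2) The existing reservation-based approach has no mechanism that lets a DL job use only part of a GPU's resources.
With bin-packing-style scheduling, utilization itself may rise, but there is a risk that performance drops because of problems such as interference.
AntMan is a DL system that raises GPU utilization while guaranteeing performance by preserving fairness and minimizing interference between jobs.
AntMan implements a mechanism that lends the resources left unused during DL training to other jobs.
It first guarantees sufficient performance to resource-guarantee jobs, and then allocates GPU resources to opportunistic jobs on a best-effort basis.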

Abstract (summary) 🕵🏻‍♂️

Efficiently scheduling deep learning jobs on large-scale GPU clusters is crucial for job performance, system throughput, and hardware utilization. It is getting ever more challenging as deep learning workloads become more complex. This paper presents AntMan, a deep learning infrastructure that co-designs cluster schedulers with deep learning frameworks and has been deployed in production at Alibaba to manage tens of thousands of daily deep learning jobs across thousands of GPUs. AntMan accommodates the fluctuating resource demands of deep learning training jobs. As such, it utilizes the spare GPU resources to co-execute multiple jobs on a shared GPU. AntMan exploits unique characteristics of deep learning training to introduce dynamic scaling mechanisms for memory and computation within the deep learning frameworks. This allows fine-grained coordination between jobs and prevents job interference. Evaluations show that AntMan improves the overall GPU memory utilization by 42% and computation utilization by 34% in our multi-tenant cluster without compromising fairness, presenting a new approach to efficiently utilizing GPUs at scale.

What can you learn from reading this paper? 🤔

I came to understand the exact relationship between fairness and efficiency. ("There is an inherent tension between providing fairness (e.g., to ensure SLAs of DL jobs with guaranteed resources) and achieving high resource utilization (e.g., GPU utilization), because of the constant fluctuation in both the load on a cluster and the resource needs of a job.")
For the container GPU scheduling problem, I liked that the paper clearly separates the global scheduler from the local coordinator and clearly describes the role of each.

What is the URL of the reference? 🔗

https://www.usenix.org/system/files/osdi20-xiao.pdf

If there is open-source code, give its address!

https://github.com/alibaba/GPU-scheduler-for-deep-learning

Motivation

Deep Learning Training

Deep learning training consists of millions of iterations, and each iteration processes a few samples called a mini-batch. The paper briefly describes the training process in three broad steps.

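Samples and model weights are used to compute a score; this step is called the forward pass.
Using an objective function, the difference between the score from the forward pass and the desired value, that is, the loss error, is computed.
To update the model parameters, the gradient is scaled by the learning rate.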
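
The three steps can be illustrated with a minimal sketch of one training iteration. The toy model (score = w * x), the squared-error objective, and the function name `sgd_step` are assumptions chosen for illustration; they are not taken from the paper.

```python
def sgd_step(w, batch, lr):
    """One training iteration on a mini-batch for the toy model score = w * x."""
    # Step 1, forward pass: compute scores from the samples and the model weight.
    scores = [w * x for x, _ in batch]
    # Step 2, loss: mean squared error between the scores and the desired values.
    loss = sum((s - y) ** 2 for s, (_, y) in zip(scores, batch)) / len(batch)
    # Gradient of the loss with respect to w (what a backward pass would produce).
    grad = sum(2 * (s - y) * x for s, (x, y) in zip(scores, batch)) / len(batch)
    # Step 3, update: the gradient is scaled by the learning rate.
    return w - lr * grad, loss


batch = [(1.0, 2.0), (2.0, 4.0)]  # (sample, desired) pairs; the ideal weight is 2
w = 0.0
for _ in range(50):
    w, loss = sgd_step(w, batch, lr=0.1)
print(round(w, 6), round(loss, 6))
```

Every iteration has the same shape, which is why the notes further down can treat the mini-batch as the natural unit for measuring and sharing GPU resources.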

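Large companies reportedly use multi-tenant environments by default, and the paper seems to be making the point that GPUs can be oversubscribed in such environments, so the system has to be prepared for it.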

Characterizing Production DL Cluster

Low utilization of in-use GPUs

Tracking GPU utilization and GPU memory usage on a heterogeneous GPU cluster for one week showed that only about 10% of GPUs reached a GPU utilization of 80% or higher.

Idle waiting for gang-schedule

Gang scheduling means that a job does not start until all the GPUs it needs have been gathered. The more GPUs a job requires, the longer its waiting time.
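
A minimal sketch of why gang scheduling leaves GPUs idle, assuming a hypothetical list of times at which individual GPUs become free. The function name `gang_wait` and the numbers are illustrative only.

```python
def gang_wait(free_times, gpus_needed):
    """Return (start_time, idle_gpu_time) for a gang-scheduled job.

    free_times: times at which individual GPUs become free (hypothetical input).
    The job starts only once `gpus_needed` GPUs are all free, so GPUs that
    freed up earlier sit idle until the last one arrives.
    """
    chosen = sorted(free_times)[:gpus_needed]
    if len(chosen) < gpus_needed:
        raise ValueError("not enough GPUs in the cluster")
    start = chosen[-1]
    idle = sum(start - t for t in chosen)
    return start, idle


print(gang_wait([0, 5, 12, 30], gpus_needed=2))  # (5, 5)
print(gang_wait([0, 5, 12, 30], gpus_needed=4))  # (30, 73)
```

Going from 2 to 4 GPUs moves the start time from 5 to 30 and grows the idle GPU time from 5 to 73 units.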

Dynamic resource demand

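The amount of resources a DL job uses varies widely even within a single run, which makes the required resources hard to predict. Resources therefore have to be allocated for the peak usage, and this ends up being a cause of GPU underutilization.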
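
The cost of reserving for the peak can be shown with a small arithmetic sketch. The usage trace below is made up for illustration and is not data from the paper.

```python
def reserved_utilization(usage_trace):
    """Fraction of a peak-sized reservation that the job actually uses on average."""
    peak = max(usage_trace)
    return sum(usage_trace) / (peak * len(usage_trace))


# Hypothetical GPU memory usage in GB sampled at four points of one job.
trace_gb = [2, 2, 8, 2]
print(reserved_utilization(trace_gb))  # 0.4375
```

With this trace the job needs 8 GB only briefly, so reserving 8 GB for the whole run leaves more than half of the reservation unused.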

Opportunities in DL Uniqueness

Observing deep learning jobs at mini-batch granularity, the paper finds several facts.

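Model sizes are usually small, so most of the GPU memory could be yielded to other jobs.
Mini-batch computation time is usually short, so GPU time and memory could be allocated at a fine granularity.
Mini-batches usually show similar performance within a job, so this can be used to assess whether interference is occurring.

Note: in the left figure, more than 90% of jobs use 500MB of memory or less, and in the right figure, more than 90% take less than 900ms per mini-batch.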
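
The third observation suggests a simple interference check: compare recent mini-batch times with a baseline measured while the job ran alone. The sketch below is an assumption about how such a check could look; the function names, the use of the median, and the `tolerance` threshold are hypothetical and not taken from AntMan.

```python
from statistics import median


def slowdown(baseline_ms, recent_ms):
    """Ratio of the recent median mini-batch time to the baseline median."""
    return median(recent_ms) / median(baseline_ms)


def looks_interfered(baseline_ms, recent_ms, tolerance=1.2):
    """Flag interference when mini-batches slow down beyond a chosen tolerance."""
    return slowdown(baseline_ms, recent_ms) > tolerance


alone = [100, 102, 98, 101, 99]      # ms per mini-batch while running alone
shared = [130, 128, 135, 131, 129]   # ms per mini-batch after a co-located job starts
print(slowdown(alone, shared), looks_interfered(alone, shared))
```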

Method

Dynamic Scaling

This section covers how the local scheduler handles GPU memory and computation.

Memory

To speed up training, frameworks often cache tensors in GPU memory. In current systems, however, the cache does not shrink even after unneeded tensors are freed, so that space is simply wasted and the opportunity to share it is lost.

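AntMan adjusts the cache size dynamically by changing the cache's upper limit. When GPU memory is severely short, it shrinks the upper limit as far as possible, and if that is still not enough it moves tensors to main (host) memory. When the pressure eases and the upper limit grows again, it moves those tensors back from main memory to GPU memory. The idea is modeled on paging between main memory and secondary storage.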
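
The behavior described above can be sketched as a tensor pool with an adjustable upper limit. This is a toy model under stated assumptions: the class name `ScalableTensorPool`, the choice to spill the most recently placed tensor first, and the sizes are all hypothetical and do not reflect AntMan's actual allocator.

```python
class ScalableTensorPool:
    """Toy model of a GPU tensor cache whose upper limit can change at run time."""

    def __init__(self, gpu_limit_mb):
        self.gpu_limit_mb = gpu_limit_mb
        self.gpu = {}   # tensor name -> size in MB, resident in GPU memory
        self.host = {}  # tensor name -> size in MB, spilled to host memory

    def gpu_used(self):
        return sum(self.gpu.values())

    def allocate(self, name, size_mb):
        # Place the tensor on the GPU if it fits under the limit, else on the host.
        if self.gpu_used() + size_mb <= self.gpu_limit_mb:
            self.gpu[name] = size_mb
        else:
            self.host[name] = size_mb

    def set_limit(self, new_limit_mb):
        self.gpu_limit_mb = new_limit_mb
        # Shrinking: spill tensors to host memory until usage fits the new limit.
        while self.gpu and self.gpu_used() > self.gpu_limit_mb:
            name, size = self.gpu.popitem()  # most recently placed tensor first
            self.host[name] = size
        # Growing: bring spilled tensors back while they fit.
        for name in list(self.host):
            if self.gpu_used() + self.host[name] <= self.gpu_limit_mb:
                self.gpu[name] = self.host.pop(name)


pool = ScalableTensorPool(gpu_limit_mb=1000)
for tensor in ("a", "b", "c"):
    pool.allocate(tensor, 400)
pool.set_limit(500)    # memory pressure: only "a" stays on the GPU
print(sorted(pool.gpu), sorted(pool.host))
pool.set_limit(1200)   # pressure eases: everything moves back
print(sorted(pool.gpu), sorted(pool.host))
```

The job keeps running in both states; only the placement of its tensors changes, which is the point of trading some speed for the ability to share the GPU.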

Computation

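(a) A job runs in units of GPU kernels (a set of instructions executed in parallel on the GPU), and when job-A runs alone, idle cycles occur.
(b) When another job, job-B, is run to fill them, job-B's GPU kernels interfere with job-A. Put simply, the newcomer pushes out the incumbent.
(c) AntMan places a layer called GpuOpManager in between, which controls the ordering of GPU kernels. It also handles synchronization between CPU and GPU computation.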
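
One way a layer sitting between the framework and the GPU could limit how much a co-located job disturbs another is to space out its kernel launches. The sketch below is only an illustration of that idea under assumed names (`throttle_delay_ms`, `run_throttled`, `gpu_share`); it is not GpuOpManager's actual policy or interface.

```python
def throttle_delay_ms(kernel_ms, gpu_share):
    """Pause to insert after a kernel so the job occupies at most `gpu_share` of the time."""
    if not 0 < gpu_share <= 1:
        raise ValueError("gpu_share must be in (0, 1]")
    return kernel_ms * (1.0 / gpu_share - 1.0)


def run_throttled(kernel_durations_ms, gpu_share):
    """Return (busy_ms, total_ms) for a sequence of kernels launched with throttling."""
    busy = sum(kernel_durations_ms)
    paused = sum(throttle_delay_ms(k, gpu_share) for k in kernel_durations_ms)
    return busy, busy + paused


kernels = [2, 2, 4]  # hypothetical kernel durations in ms
print(run_throttled(kernels, gpu_share=1.0))   # (8, 8.0)
print(run_throttled(kernels, gpu_share=0.5))   # (8, 16.0)
print(run_throttled(kernels, gpu_share=0.25))  # (8, 32.0)
```

Lowering `gpu_share` leaves the job's own work unchanged but stretches it over more wall-clock time, which frees the gaps for the other job's kernels.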

Schedulers

There is an inherent tension between providing fairness (e.g., to ensure SLAs of DL jobs with guaranteed resources) and achieving high resource utilization (e.g., GPU utilization), because of the constant fluctuation in both the load on a cluster and the resource needs of a job.

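In other words, because both the cluster load and the resource demands of DL jobs fluctuate heavily, there is a trade-off between fairness (meeting the SLAs of DL jobs with guaranteed resources) and high resource utilization (GPU utilization).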

AntMan is organized hierarchically into two schedulers.

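The global scheduler is responsible for job scheduling.
The local scheduler coordinates the execution of jobs using the dynamic resource scaling techniques introduced above. Since these were described in detail earlier, they are not repeated here.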

global scheduler

AntMan divides jobs into resource-guarantee jobs and opportunistic jobs. The former must be guaranteed a specific amount of GPU resources; the latter have no such guarantee.
Resource-guarantee jobs provide fairness, and opportunistic jobs provide high resource utilization.
The global scheduler keeps a resource-guarantee job queue and an opportunistic job queue. Using the proposed algorithm, it matches each resource-guarantee job to the best node, taking factors such as network communication into account. At the same time, it allocates GPU resources that are sitting idle because of gang scheduling to opportunistic jobs.
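
The two-queue idea can be sketched as follows. This is a heavily simplified illustration, not the paper's algorithm: it ignores node topology and network cost, and the function name `schedule`, the job tuples, and the GPU labels are hypothetical.

```python
from collections import deque


def schedule(free_gpus, guarantee_queue, opportunistic_queue):
    """Place jobs from two queues onto free GPUs; returns {job_name: [gpu, ...]}.

    Each queue holds (job_name, gpus_needed) tuples.
    """
    placements = {}
    free = list(free_gpus)
    # Resource-guarantee jobs go first and are gang-scheduled: all GPUs or none.
    while guarantee_queue and guarantee_queue[0][1] <= len(free):
        name, need = guarantee_queue.popleft()
        placements[name] = [free.pop(0) for _ in range(need)]
    # GPUs left over (for example, while the next guaranteed job waits for its
    # full gang) are lent to opportunistic jobs.
    while opportunistic_queue and opportunistic_queue[0][1] <= len(free):
        name, need = opportunistic_queue.popleft()
        placements[name] = [free.pop(0) for _ in range(need)]
    return placements


guaranteed = deque([("A", 2), ("B", 4)])
opportunistic = deque([("C", 1)])
print(schedule(["g0", "g1", "g2"], guaranteed, opportunistic))
# {'A': ['g0', 'g1'], 'C': ['g2']}
```

Job B stays queued because only one GPU remains, and that GPU is used by opportunistic job C in the meantime instead of sitting idle.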

local scheduler

The primary goal of the local scheduler is to guarantee resource-guarantee jobs their rightful share of resources (fairness); it then gives the remaining resources to opportunistic jobs to raise GPU utilization.
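
A minimal sketch of that priority rule for GPU memory on a single device. The function name `split_gpu_memory` and the demand values are assumptions for illustration.

```python
def split_gpu_memory(total_mb, guaranteed_demand_mb, opportunistic_demand_mb):
    """Serve the resource-guarantee job first; the opportunistic job gets what is left."""
    guaranteed = min(guaranteed_demand_mb, total_mb)
    opportunistic = min(opportunistic_demand_mb, total_mb - guaranteed)
    return guaranteed, opportunistic


total = 32 * 1024  # a 32 GB GPU, in MB
print(split_gpu_memory(total, 20000, 20000))  # (20000, 12768)
print(split_gpu_memory(total, 28000, 20000))  # (28000, 4768)
```

When the guaranteed job's demand rises, the opportunistic job's share shrinks in the same step, so the guaranteed job never has to wait for memory.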

Experiment

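Jobs are submitted on Kubernetes using Kubeflow.
The GPU has 32GB of memory, which is not enough to hold both jobs in full.

Job-A arrives first at minute 0 and starts training, and Job-B arrives at minute 26. The JCT (job completion time) is then compared across different scheduling techniques.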

GPU computation perspective

While the two jobs run together, AntMan adjusts the computation share more aggressively than Gandiva.

GPU memory perspective

Memory usage is adjusted with low overhead (on the order of milliseconds).

Critique

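In the evaluation, it would have been good to also show, for memory, how resource usage changes when the two jobs arrive together.
It suits me personally, but the evaluation focuses mainly on the local coordinator, and there seems to be no evaluation of the global scheduler.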

msyhu / paper-logs