What is this paper about? 👋
When users (tenants) of a shared GPU cluster are waiting in the cluster's queue to run their DL training jobs, two major problems can arise.
The affinity problem: when a DLT job's containers must be co-located on the same node, there is no guarantee they can actually be placed there, because free GPUs may be scattered across the cluster. Under a soft, best-effort policy, i.e., scheduling the job anyway even if its containers end up on different nodes, the job starts immediately but performance degrades. Under a hard, guaranteed policy, the job suffers queueing delay until a set of GPUs exactly matching the affinity requirement becomes available.
The sharing anomaly problem: when GPU resources are shared, hard affinity can cause "external fragmentation" of GPUs, much as it does in memory allocation. This too produces queueing delay: jobs wait in the queue even though enough resources are free in aggregate.
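The external fragmentation just described can be made concrete with a toy check. The cluster state and numbers below are hypothetical (not from the paper): a tenant whose quota fully covers a request can still be blocked when the request carries a hard node-level affinity.

```python
# Toy numbers (not from the paper) showing the sharing anomaly: the tenant's
# quota covers the request, and enough GPUs are free in total, yet no single
# node can host a job that demands all of its GPUs on one node.
free_gpus_per_node = {"node-0": 2, "node-1": 2}  # hypothetical cluster state
quota_remaining = 4   # tenant may still allocate 4 GPUs
job_request = 4       # job wants 4 GPUs, all on one node (hard affinity)

total_free = sum(free_gpus_per_node.values())
quota_satisfied = quota_remaining >= job_request and total_free >= job_request
affinity_satisfied = any(n >= job_request for n in free_gpus_per_node.values())

print(quota_satisfied)     # True: by quota alone the job should run
print(affinity_satisfied)  # False: hard affinity forces it to queue
```

Quota-based reservation only checks the first condition, which is why a job can queue indefinitely while its tenant's quota sits nominally unused.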
Both problems center on queueing delay: the situation where a DL training job's containers sit in the queue because they cannot be allocated resources matching the user-specified affinity. Since the same job can have very different completion times depending on how long it queues, queueing delay is a major threat to performance consistency.
This paper proposes HiveD, a resource reservation framework dedicated to deep learning in Kubernetes container environments. Through separation of concerns, HiveD focuses on improving cluster utilization, job completion time (JCT), and fairness.
HiveD introduces a logical layer of cells organized into virtual clusters (VCs); each tenant owns one VC. Using the VC abstraction together with buddy cell allocation, an algorithm that maps VC cells onto the physical cluster, HiveD makes intelligent scheduling decisions that account not only for simple quota (a GPU count) but also for affinity concerns such as network topology, GPU heterogeneity, and node co-location.
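The buddy cell allocation idea can be sketched as follows. This is a minimal illustration, not HiveD's actual implementation: it assumes every level-k cell splits into exactly two level-(k-1) buddies, whereas the real hierarchy follows the hardware topology (GPU, PCIe switch, CPU socket, node, rack) with varying split factors; the class and its API are invented for this sketch.

```python
from collections import defaultdict

class BuddyCellAllocator:
    """Toy buddy allocator over a multi-level cell hierarchy (illustrative only)."""

    def __init__(self, top_level, num_top_cells):
        # free[k] holds the identifiers of currently free level-k cells;
        # a cell id is the tuple path down from its top-level ancestor.
        self.free = defaultdict(list)
        self.free[top_level] = [(i,) for i in range(num_top_cells)]
        self.top = top_level

    def allocate(self, level):
        """Return a free level-`level` cell, splitting a larger cell
        only when no cell of the requested level is free."""
        if self.free[level]:
            return self.free[level].pop()
        if level == self.top:
            raise RuntimeError("no free cell at any level")
        parent = self.allocate(level + 1)           # split a bigger cell
        left, right = parent + (0,), parent + (1,)
        self.free[level].append(right)              # one buddy stays free
        return left

    def release(self, level, cell):
        """Free a cell and greedily merge it with its buddy upward."""
        if level < self.top:
            buddy = cell[:-1] + (1 - cell[-1],)
            if buddy in self.free[level]:
                self.free[level].remove(buddy)      # both buddies free:
                self.release(level + 1, cell[:-1])  # merge into parent
                return
        self.free[level].append(cell)
```

Allocating a small cell when only a large one is free splits recursively downward and leaves the unused buddies free; releasing it merges buddies back up. Keeping cells as large as possible is what preserves affinity guarantees for future multi-GPU requests.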
In conclusion, by minimizing queueing delay, HiveD reduces the spread in DL training job completion times relative to existing schedulers; in other words, it improves performance consistency.
Abstract 🕵🏻♂️
Deep learning training on a shared GPU cluster is becoming a common practice. However, we observe severe sharing anomaly in production multi-tenant clusters where jobs in some tenants experience worse queuing delay than they would have in a private cluster with their allocated shares of GPUs. This is because tenants use quota, the number of GPUs, to reserve resources, whereas deep learning jobs often use GPUs with a desirable GPU affinity, which quota cannot guarantee. HiveD is the first framework to share a GPU cluster safely, so that such anomaly would never happen by design. In HiveD, each tenant reserves resources through a Virtual Private Cluster (VC), defined in terms of multi-level cell structures corresponding to different levels of GPU affinity in a cluster. This design allows HiveD to incorporate any existing schedulers within each VC to achieve their respective design goals while sharing the cluster safely. HiveD develops an elegant buddy cell allocation algorithm to ensure sharing safety by efficiently managing the dynamic binding of cells from VCs to those in a physical cluster. A straightforward extension of buddy cell allocation can further support low-priority jobs to scavenge the unused GPU resources to improve cluster utilization. With a combination of real deployment and trace-driven simulation, we show that: (i) sharing anomaly exists in three state-of-the-art deep learning schedulers, incurring extra queuing delay of up to 1,000 minutes; (ii) HiveD can incorporate these schedulers and eliminate the sharing anomaly in all of them, achieving separation of concerns that allows the schedulers to focus on their own scheduling goals without violating sharing safety.
What can we learn from reading this paper? 🤔
Please share the reference URL! 🔗
https://www.usenix.org/system/files/osdi20-zhao_hanyu.pdf
If there is an open-source implementation, please share the link!
https://github.com/microsoft/hivedscheduler