msyhu / paper-logs

A space for managing papers to read and keeping records of the papers already read

HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees #2

Open msyhu opened 2 years ago

msyhu commented 2 years ago

What is this paper about? 👋

Abstract (Summary) 🕵🏻‍♂️

Deep learning training on a shared GPU cluster is becoming a common practice. However, we observe severe sharing anomaly in production multi-tenant clusters where jobs in some tenants experience worse queuing delay than they would have in a private cluster with their allocated shares of GPUs. This is because tenants use quota, the number of GPUs, to reserve resources, whereas deep learning jobs often use GPUs with a desirable GPU affinity, which quota cannot guarantee. HiveD is the first framework to share a GPU cluster safely, so that such anomaly would never happen by design. In HiveD, each tenant reserves resources through a Virtual Private Cluster (VC), defined in terms of multi-level cell structures corresponding to different levels of GPU affinity in a cluster. This design allows HiveD to incorporate any existing schedulers within each VC to achieve their respective design goals while sharing the cluster safely. HiveD develops an elegant buddy cell allocation algorithm to ensure sharing safety by efficiently managing the dynamic binding of cells from VCs to those in a physical cluster. A straightforward extension of buddy cell allocation can further support low-priority jobs to scavenge the unused GPU resources to improve cluster utilization. With a combination of real deployment and trace-driven simulation, we show that: (i) sharing anomaly exists in three state-of-the-art deep learning schedulers, incurring extra queuing delay of up to 1,000 minutes; (ii) HiveD can incorporate these schedulers and eliminate the sharing anomaly in all of them, achieving separation of concerns that allows the schedulers to focus on their own scheduling goals without violating sharing safety.
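
To make the "buddy cell allocation" idea in the abstract more concrete, here is a minimal Python sketch of a buddy-style allocator over multi-level cells. It is only an illustration under my own assumptions, not the paper's algorithm or the hivedscheduler code: the class name `BuddyCellAllocator`, the `fanout` parameter, and the `(level, path)` cell tuples are all hypothetical, and it leaves out the binding of VC cells to physical cells and low-priority jobs entirely.

```python
class BuddyCellAllocator:
    """Toy buddy allocator over a cell hierarchy (all names are hypothetical)."""

    def __init__(self, top_level, fanout, num_top_cells):
        # fanout[k] = number of level-(k-1) child cells inside one level-k cell.
        self.top_level = top_level
        self.fanout = fanout
        # free[k] lists the free cells at level k; a cell is a (level, path) tuple.
        self.free = {k: [] for k in range(1, top_level + 1)}
        self.free[top_level] = [(top_level, (i,)) for i in range(num_top_cells)]

    def allocate(self, level):
        """Return a free cell at `level`, splitting a higher-level cell if needed."""
        if level > self.top_level:
            raise RuntimeError("no free cell available")
        if not self.free[level]:
            # No free cell here: take one from the level above and split it
            # into buddies (sibling cells that share the same parent).
            _, path = self.allocate(level + 1)
            for i in range(self.fanout[level + 1]):
                self.free[level].append((level, path + (i,)))
        return self.free[level].pop(0)

    def release(self, cell):
        """Free `cell` and merge it with its buddies once all of them are free."""
        level, path = cell
        self.free[level].append(cell)
        if level == self.top_level:
            return
        buddies = [(level, path[:-1] + (i,))
                   for i in range(self.fanout[level + 1])]
        if all(b in self.free[level] for b in buddies):
            for b in buddies:
                self.free[level].remove(b)
            # All children are free again, so the parent cell becomes free too.
            self.release((level + 1, path[:-1]))


# Example: one 4-GPU node (level 3) made of two 2-GPU groups (level 2),
# each of which contains two single GPUs (level 1).
alloc = BuddyCellAllocator(top_level=3, fanout={3: 2, 2: 2}, num_top_cells=1)
gpu = alloc.allocate(1)    # splits the node, then one 2-GPU group
pair = alloc.allocate(2)   # takes the 2-GPU group that is still whole
alloc.release(gpu)
alloc.release(pair)        # everything merges back into the level-3 cell
```

In this sketch, a request for a small cell only breaks up a larger cell when no cell of the requested size is free, and released cells are merged back as soon as all of their buddies are free, so larger cells stay available for jobs that need higher GPU affinity.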

Tell us what can be learned from reading this paper! 🤔

Please provide the reference URL! 🔗

https://www.usenix.org/system/files/osdi20-zhao_hanyu.pdf

If there is an open-source implementation, please add its address!

https://github.com/microsoft/hivedscheduler