koordinator-sh / koordinator

A QoS-based scheduling system brings optimal layout and status to workloads such as microservices, web services, big data jobs, AI jobs, etc.
https://koordinator.sh
Apache License 2.0

[proposal] support fine-grained GPU management #332

Open jasonliu747 opened 2 years ago

jasonliu747 commented 2 years ago

This would be the parent proposal to track all related issues.

Why is this needed: With the rapid advances in the use of AI/ML, there has also been tremendous growth in the use of GPUs to support the intense amount of compute required to train models, process images, etc.

Since GPU machines are extremely expensive, we want to pack as many applications onto them as possible. However, the current NVIDIA device plugin only lets you specify how many whole GPUs you want to use, which is clearly not ideal, since most AI applications use less than half of the available GPU memory. This is an obvious bottleneck for most companies deploying AI applications.

Note that there are certain limitations in how you can specify resource requirements for GPUs:

  1. You cannot overcommit GPUs — containers and pods do not share GPUs.
  2. A container cannot request part of a GPU — each container can receive access to a full GPU or multiple GPUs.
  3. Limits must be equal to requests — requests are what a container is guaranteed to get, while limits ensure the resources it receives do not exceed a certain value. GPUs are less flexible than other resources: you may only specify limits for GPU resources, and if you specify a request, it must equal the limit (see the sketch after this list).
  4. Kubernetes is not topology aware — Kubernetes does not support topology-aware GPU scheduling. In this case, GPUs are selected at random. The training speed varies based on different combinations of GPUs.
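
To make limitations 2 and 3 concrete, here is a minimal Go sketch contrasting today's whole-GPU request with the kind of fine-grained request this proposal aims at. The koordinator.sh/gpu-core and koordinator.sh/gpu-memory-ratio names are taken from the discussion further down this thread; the exact resource names and semantics are still subject to the child proposals.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Today: a container can only ask for whole GPUs, and only via limits.
	wholeGPU := corev1.ResourceRequirements{
		Limits: corev1.ResourceList{
			corev1.ResourceName("nvidia.com/gpu"): resource.MustParse("1"),
		},
	}

	// Fine-grained (sketch): half of one card's compute and memory,
	// using the resource names discussed later in this thread.
	halfGPU := corev1.ResourceRequirements{
		Limits: corev1.ResourceList{
			corev1.ResourceName("koordinator.sh/gpu-core"):         resource.MustParse("50"),
			corev1.ResourceName("koordinator.sh/gpu-memory-ratio"): resource.MustParse("50"),
		},
	}

	fmt.Println(wholeGPU.Limits)
	fmt.Println(halfGPU.Limits)
}
```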

In this pinned proposal, we will create several child proposals to support multiple GPU-related features in future work.

What is your proposal: Note: some items are still TBD and will be added to the following list ASAP.

cheimu commented 2 years ago

In the proposal, it seems the existing GPU extended resources are only compatible with NVIDIA GPUs in a hard-coded way. However, each public cloud has its own GPU virtualization solution, and their extended resources all differ, such as qGPU (Tencent), cGPU (Alibaba Cloud), and vGPU (AWS). So how can these GPU extended resources be adapted?

buptcozy commented 2 years ago

> qgpu

Hi, in my opinion, it's easy to be compatible with the different GPU resource expressions you mentioned. When we receive the Pod, we can execute a resource-translate function to convert resources like "nvidia.com/gpu" into koordinator resources.
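
As a rough illustration of that idea, here is a minimal sketch of such a resource-translate function, assuming plain integer quantities and the nvidia.com/gpu to gpu-core / gpu-memory-ratio mapping mentioned later in this thread; the function name and signature are illustrative, not koordinator's actual API.

```go
package main

import "fmt"

// translateGPUResources maps vendor-specific GPU requests onto koordinator's
// fine-grained resources. Quantities are simplified to plain integers here.
func translateGPUResources(requests map[string]int64) map[string]int64 {
	out := map[string]int64{}
	for name, value := range requests {
		switch name {
		case "nvidia.com/gpu":
			// One whole GPU is treated as 100% of a single card's core and memory.
			out["koordinator.sh/gpu-core"] += value * 100
			out["koordinator.sh/gpu-memory-ratio"] += value * 100
		default:
			// Resources we do not recognize pass through untouched.
			out[name] += value
		}
	}
	return out
}

func main() {
	fmt.Println(translateGPUResources(map[string]int64{"nvidia.com/gpu": 1}))
	// map[koordinator.sh/gpu-core:100 koordinator.sh/gpu-memory-ratio:100]
}
```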

cheimu commented 2 years ago

> qgpu
>
> Hi, in my opinion, it's easy to be compatible with the different GPU resource expressions you mentioned. When we receive the Pod, we can execute a resource-translate function to convert resources like "nvidia.com/gpu" into koordinator resources.

Yeah, that sounds achievable, but how do we make it easy to extend? Maybe I'm misunderstanding the solution, but does that mean we have to implement a customized resource-translate function and recompile? That only suits power users. Most users who want to optimize their costs just want to configure something without modifying source code 🤔

cheimu commented 2 years ago

Perhaps add the translate rules within a CRD?

eahydra commented 2 years ago

Considering that the resource protocols of these different cloud vendors are open, we can indeed support them directly in koordinator. Implementing different translate functions through Go build constraints is one method; abstracting the different vendor protocols behind interfaces is another. For internal enterprise protocols, the methods above can be used, or the conversion can be done through a webhook, but then we need to consider how to handle existing, already-running Pods. Therefore, the resource protocol and the other protocols defined by koordinator should have clear semantics and the necessary abstractions.
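
A rough sketch of the interface-based approach, assuming plain integer quantities; the interface, its method names, and the qGPU-style resource names below are illustrative assumptions rather than an existing koordinator or vendor API.

```go
package main

import "fmt"

// DeviceResourceTranslator abstracts one vendor's GPU resource protocol:
// it recognizes the vendor's resource names and translates them into
// koordinator's fine-grained resources.
type DeviceResourceTranslator interface {
	Recognizes(resourceName string) bool
	Translate(resourceName string, value int64) map[string]int64
}

// qgpuTranslator is an illustrative implementation for a qGPU-style protocol.
type qgpuTranslator struct{}

func (qgpuTranslator) Recognizes(name string) bool {
	return name == "qgpu-core" || name == "qgpu-memory"
}

func (qgpuTranslator) Translate(name string, value int64) map[string]int64 {
	switch name {
	case "qgpu-core":
		return map[string]int64{"koordinator.sh/gpu-core": value}
	case "qgpu-memory":
		return map[string]int64{"koordinator.sh/gpu-memory": value}
	}
	return nil
}

func main() {
	var t DeviceResourceTranslator = qgpuTranslator{}
	fmt.Println(t.Translate("qgpu-core", 50))
}
```

Supporting another vendor (cGPU, vGPU, or an internal protocol) would then only require another implementation, whether it is compiled in behind a build constraint or selected at runtime.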

eahydra commented 2 years ago

> Perhaps add the translate rules within a CRD?

Really a good idea, but it also requires implementing an interpreter. Although it is possible to define some rules to simplify the implementation, that raises the problem of generality. In addition, mapping other protocols to koordinator's protocol also involves semantic conversion, which may not be solvable with rules alone.

eahydra commented 2 years ago

New users of koordinator can also consider other strategies. For example, they can set aside separate node pools to run Pods that use the koordinator protocol independently, so these do not necessarily need to be deployed alongside Pods using other protocols.

buptcozy commented 2 years ago

If it's really needed, I prefer to use configuration to guide the resource translation, like: `{ "nvidia.com/gpu": {"gpu-core": 100, "gpu-mem-ratio": 100}, "xxxgpu": {"gpu-core": 100, "gpu-mem-ratio": 100}, ... }`

cheimu commented 2 years ago

> New users of koordinator can also consider other strategies. For example, they can set aside separate node pools to run Pods that use the koordinator protocol independently, so these do not necessarily need to be deployed alongside Pods using other protocols.

I get it, but I'm worried about something else. If what is actually requested is koordinator's GPU extended resources rather than the vendor's GPU extended resources, will some of the vendor's internal functionality break, such as its scheduler and device plugin? I know qGPU may rely on qgpu-cores and qgpu-mems; how about cGPU?

I'm sure the current implementation will greatly improve the native NVIDIA GPU experience, but it's not clear to me how this will be compatible with the vendors' GPU virtualization solutions... 😞

eahydra commented 2 years ago

> New users of koordinator can also consider other strategies. For example, they can set aside separate node pools to run Pods that use the koordinator protocol independently, so these do not necessarily need to be deployed alongside Pods using other protocols.
>
> I get it, but I'm worried about something else. If what is actually requested is koordinator's GPU extended resources rather than the vendor's GPU extended resources, will some of the vendor's internal functionality break, such as its scheduler and device plugin? I know qGPU may rely on qgpu-cores and qgpu-mems; how about cGPU?

Well, as you know, if the vendor does not expose the scheduler and device-plugin details, it's easy to break the vendor's internal functionality.

> I'm sure the current implementation will greatly improve the native NVIDIA GPU experience, but it's not clear to me how this will be compatible with the vendors' GPU virtualization solutions... 😞

I understand you are looking to solve compatibility issues when mixing different schedulers and DevicePlugins. This is a complicated question.

It is complicated even in the simple scenario where NVIDIA GPUs and Koordinator GPUs coexist. Koordinator's existing proposal only defines how to stay compatible with existing Pods that use NVIDIA GPUs: nvidia.com/gpu: 1 is represented as koordinator.sh/gpu-core: 100, koordinator.sh/gpu-memory-ratio: 100. But it does not define how Pods that only use Koordinator GPUs can be supported by other schedulers that only understand NVIDIA GPUs. Suppose a Pod uses koordinator.sh/gpu-core: 50, koordinator.sh/gpu-memory: 3G; it cannot be converted to an equivalent nvidia.com/gpu.

It is obvious that Koordinator GPU resources are similar to cGPU and qGPU, which are finer-grained divisions of GPU resources. At the protocol layer, these fine-grained resource types are basically convertible to each other. What complicates matters are the limitations of the node-side GPU virtualization mechanism. By design, only one GPU virtualization mechanism should be active on a node at a time; otherwise, the different virtualization mechanisms will conflict. Since Koordinator does not currently provide a virtualization mechanism, users must use the GPU virtualization mechanism provided by the vendor when using Koordinator. This means that the Koordinator GPU resource protocol can only be converted to the vendor's GPU protocol in one direction.
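
A small sketch of why the conversion is one-way: folding koordinator's fine-grained resources back into whole nvidia.com/gpu cards only works when the request happens to be an exact multiple of a full card. The function below is purely illustrative, with quantities simplified to integer percentages.

```go
package main

import (
	"errors"
	"fmt"
)

// toNvidiaGPU tries to fold fine-grained GPU resources back into whole
// nvidia.com/gpu cards. Any fractional request has no whole-GPU equivalent.
func toNvidiaGPU(gpuCore, gpuMemoryRatio int64) (int64, error) {
	if gpuCore != gpuMemoryRatio || gpuCore%100 != 0 {
		return 0, errors.New("no whole-GPU equivalent for a fractional request")
	}
	return gpuCore / 100, nil
}

func main() {
	fmt.Println(toNvidiaGPU(100, 100)) // 1 <nil>
	fmt.Println(toNvidiaGPU(50, 50))   // 0 no whole-GPU equivalent for a fractional request
}
```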

cheimu commented 2 years ago

Okay, I got you. I have one more question:

> This means that the Koordinator GPU resource protocol can only be converted to the vendor's GPU protocol in one direction.

If so, how does the koordinator scheduler work? Users set the vendor's GPU ext-resource fields, then the webhook intercepts them and converts them to koordinator resource fields, the scheduler runs Filter, Reserve, and PreBind, and finally they are converted back to the vendor's ext-resource fields?

hormes commented 2 years ago

> Okay, I got you. I have one more question:
>
> > This means that the Koordinator GPU resource protocol can only be converted to the vendor's GPU protocol in one direction.
>
> If so, how does the koordinator scheduler work? Users set the vendor's GPU ext-resource fields, then the webhook intercepts them and converts them to koordinator resource fields, the scheduler runs Filter, Reserve, and PreBind, and finally they are converted back to the vendor's ext-resource fields?

Consider adding an extension point for this conversion in the scheduler, so that the koord-scheduler can identify the protocol on the existing Pods and schedule the new Pods correctly.
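
A hypothetical sketch of such an extension point, with a trimmed Pod type and plain integer quantities; the plugin interface and its method names are assumptions made for illustration, not the actual koord-scheduler framework API.

```go
package main

import "fmt"

// Pod is a trimmed stand-in for corev1.Pod, just enough for this sketch.
type Pod struct {
	Name      string
	Resources map[string]int64
}

// DeviceProtocolPlugin is the imagined extension point: the scheduler asks
// each registered plugin whether it recognizes the GPU protocol used by a
// Pod and normalizes it into koordinator resources before Filter/Reserve/PreBind.
type DeviceProtocolPlugin interface {
	Name() string
	Recognizes(pod *Pod) bool
	Normalize(pod *Pod) map[string]int64
}

type nvidiaProtocol struct{}

func (nvidiaProtocol) Name() string { return "nvidia" }

func (nvidiaProtocol) Recognizes(pod *Pod) bool {
	_, ok := pod.Resources["nvidia.com/gpu"]
	return ok
}

func (nvidiaProtocol) Normalize(pod *Pod) map[string]int64 {
	n := pod.Resources["nvidia.com/gpu"]
	return map[string]int64{
		"koordinator.sh/gpu-core":         n * 100,
		"koordinator.sh/gpu-memory-ratio": n * 100,
	}
}

// normalize runs the registered plugins; Pods already using koordinator's
// protocol fall through unchanged.
func normalize(pod *Pod, plugins []DeviceProtocolPlugin) map[string]int64 {
	for _, p := range plugins {
		if p.Recognizes(pod) {
			return p.Normalize(pod)
		}
	}
	return pod.Resources
}

func main() {
	pod := &Pod{Name: "demo", Resources: map[string]int64{"nvidia.com/gpu": 2}}
	fmt.Println(normalize(pod, []DeviceProtocolPlugin{nvidiaProtocol{}}))
}
```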

caohe commented 2 years ago

Hello, I am confused about which GPU-related information is in the Device CRD and which is in the NodeMetric CRD. In my understanding, the information is organized as:

Device CRD:

  • basic info (uuid, minor, model, health...)
  • resource capacity
  • resource allocated?
  • topology info
  • pods that are scheduled to this GPU

NodeMetric CRD:

  • resource usage

I am wondering:

  1. whether my understanding is correct
  2. whether GPU-related information in NodeMetric is used for overcommitment
  3. if batch GPU resources are introduced in the future, where they will be placed

zwzhang0107 commented 2 years ago

> Hello, I am confused about which GPU-related information is in the Device CRD and which is in the NodeMetric CRD. In my understanding, the information is organized as:
>
> Device CRD:
>
>   • basic info (uuid, minor, model, health...)
>   • resource capacity
>   • resource allocated?
>   • topology info
>   • pods that are scheduled to this GPU
>
> NodeMetric CRD:
>
>   • resource usage
>
> I am wondering:
>
>   1. whether my understanding is correct
>   2. whether GPU-related information in NodeMetric is used for overcommitment
>   3. if batch GPU resources are introduced in the future, where they will be placed

  1. Yes.
  2. Yes, in the next few versions.
  3. Batch GPU resources are still under design; maybe they should be defined in the Device CRD.
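
For readers following the thread, a rough Go sketch of how the fields discussed above might be grouped; the actual Device and NodeMetric CRDs in koordinator's API package may organize these differently, and the field names here are assumptions.

```go
package main

// DeviceGPUInfo groups the mostly static and allocation-related data that,
// per the answer above, belongs in the Device CRD.
type DeviceGPUInfo struct {
	UUID     string
	Minor    int32
	Model    string
	Healthy  bool
	Capacity map[string]int64 // e.g. gpu-core, gpu-memory per card
	// allocation, topology info, and the pods scheduled to this GPU
	// would also be tracked here
}

// NodeGPUMetric groups the dynamic usage data reported through the
// NodeMetric CRD, which the answer above says will feed overcommitment.
type NodeGPUMetric struct {
	Usage map[string]int64
}

func main() {}
```
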
jasonliu747 commented 2 years ago

/assign