koordinator-sh / koordinator

A QoS-based scheduling system brings optimal layout and status to workloads such as microservices, web services, big data jobs, AI jobs, etc.
https://koordinator.sh
Apache License 2.0
1.25k stars, 315 forks

[BUG] Koordinator doesn't support multiple card sharing #2097

Open ZiMengSheng opened 3 weeks ago

ZiMengSheng commented 3 weeks ago

What happened:

A node has 8 GPU cards, each with 80 Gi of GPU memory. I want to use four cards with 40 Gi of GPU memory on each card via koordinator.sh/gpu.shared, but the pod gets stuck in the Pending phase.

apiVersion: v1
kind: Pod
metadata:
  name: pod-example
  namespace: default
spec:
  schedulerName: koord-scheduler
  containers:
  - command:
    - sleep
    - 365d
    image: busybox
    imagePullPolicy: IfNotPresent
    name: curlimage
    resources:
      limits:
        cpu: 40m
        memory: 40Mi
        koordinator.sh/gpu.shared: "4"
        koordinator.sh/gpu-memory: 160Gi
      requests:
        cpu: 40m
        memory: 40Mi
        koordinator.sh/gpu.shared: "4"
        koordinator.sh/gpu-memory: 160Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  restartPolicy: Always

What you expected to happen:

Pod should be scheduled.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

AdrianMachao commented 3 weeks ago

/assign

ZiMengSheng commented 3 weeks ago

/assign

Welcome! You can refer to this proposal

AdrianMachao commented 2 weeks ago

I have started working on it, but I need some time to understand your design principles and code. I will try my best to complete it as soon as possible.

ZiMengSheng commented 2 weeks ago

I have started working on it, but I need some time to understand your design principles and code. I will try my best to complete it as soon as possible.

OK. If you need help, questions and discussions are welcome both on this GitHub issue and on DingDing.

AdrianMachao commented 5 days ago

Is the mutating and validating webhook implemented in pkg/webhook/pod/mutating/extended_resource_spec.go? I didn't see any handling of the GPU extended resource there. I am working on this task now. @ZiMengSheng

AdrianMachao commented 5 days ago

I have started working on it, but I need some time to understand your design principles and code. I will try my best to complete it as soon as possible.

OK. If you need help, questions and discussions are welcome both on this GitHub issue and on DingDing.

What is your DingDing account? Can I add you as a contact?

ZiMengSheng commented 5 days ago

is it the implement of mutate and validate webhook in the path of pkg/webhook/pod/mutating/extended_resource_spec.go? I didn't see any work of gpu exte

王建宇

ZiMengSheng commented 5 days ago

Is the mutating and validating webhook implemented in pkg/webhook/pod/mutating/extended_resource_spec.go? I didn't see any handling of the GPU extended resource there. I am working on this task now. @ZiMengSheng

The scheduler needs to calculate requestsPerCard and numGPUs according to the gpu.shared protocol.
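
For readers following along, here is a minimal sketch of that calculation in Python. It is not Koordinator's code. It assumes, based on the pod spec in this issue, that koordinator.sh/gpu.shared is the number of cards and that koordinator.sh/gpu-memory is the total to be split evenly across them. The names split_shared_request, PerCardRequest and fits are hypothetical, and requiring the total to divide evenly is also an assumption.

```python
from dataclasses import dataclass
from typing import Sequence

GI = 1024 ** 3  # bytes in one Gi


@dataclass(frozen=True)
class PerCardRequest:
    """Hypothetical result type: how many cards, and how much memory on each."""
    num_gpus: int
    memory_bytes_per_card: int


def split_shared_request(gpu_shared: int, gpu_memory_bytes: int) -> PerCardRequest:
    """Derive numGPUs and the per-card memory request from the pod-level values.

    Assumption: the pod-level gpu-memory is a total that is divided evenly
    across gpu.shared cards; uneven totals are rejected in this sketch.
    """
    if gpu_shared <= 0:
        raise ValueError("gpu.shared must be a positive integer")
    if gpu_memory_bytes <= 0:
        raise ValueError("gpu-memory must be positive")
    if gpu_memory_bytes % gpu_shared != 0:
        raise ValueError("gpu-memory must divide evenly across gpu.shared cards")
    return PerCardRequest(gpu_shared, gpu_memory_bytes // gpu_shared)


def fits(request: PerCardRequest, free_bytes_per_card: Sequence[int]) -> bool:
    """Return True if enough cards on the node each have the per-card memory free."""
    candidates = [
        free for free in free_bytes_per_card
        if free >= request.memory_bytes_per_card
    ]
    return len(candidates) >= request.num_gpus


# The pod from this issue: gpu.shared = 4, gpu-memory = 160Gi.
request = split_shared_request(4, 160 * GI)
# The node from this issue: 8 cards with 80 Gi each, all free.
schedulable = fits(request, [80 * GI] * 8)
```

Under these assumptions the pod from this issue resolves to four cards with 40 Gi each, and a node with eight free 80 Gi cards can host it.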