volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

feature: integrate with kwok to simulate mock GPU/NPU nodes #3830

Open JesseStutler opened 5 days ago

JesseStutler commented 5 days ago

What type of PR is this?

/kind feature

What this PR does / why we need it:

related issue: #3829

Verification

  1. Run `./create-fake-node.sh -n 10 -c 4 -m 8Gi -e volcano.sh/gpu-number=4,volcano.sh/gpu-memory=20` to create 10 fake nodes, each with 4 CPUs, 8Gi of memory, and the extended resources volcano.sh/gpu-number=4 and volcano.sh/gpu-memory=20. After the nodes are created, take one of them as an example: (two screenshots)

  2. Enable the deviceshare plugin with its `deviceshare.GPUNumberEnable` argument turned on, then create a fake deployment whose pod requests 1 volcano.sh/gpu-number and 1 volcano.sh/gpu-memory. The pod is scheduled successfully:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: fake-gpu-pod
      namespace: default
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: fake-gpu-pod
      template:
        metadata:
          labels:
            app: fake-gpu-pod
        spec:
          schedulerName: volcano
          tolerations:
          - key: "kwok.x-k8s.io/node"
            operator: "Exists"
            effect: "NoSchedule"
          containers:
          - name: fake-container
            image: fake-image
            resources:
              limits:
                volcano.sh/gpu-number: 1
                volcano.sh/gpu-memory: 1

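For readers who want to see roughly what a fake node carrying these extended resources looks like, here is a minimal sketch of a helper that renders one kwok-style Node manifest. It is not the PR's `create-fake-node.sh`: the function name `render_fake_node`, its argument order, and the fixed `pods: "110"` value are assumptions made for illustration. The `kwok.x-k8s.io/node: fake` annotation, the `type: kwok` label, and the `kwok.x-k8s.io/node` taint follow the node example in the kwok documentation, and the taint is what the toleration in the deployment above matches.

```shell
# Sketch only: render one kwok-managed Node manifest to stdout.
# render_fake_node is a hypothetical helper, not the script from this PR.
# Usage: render_fake_node NAME CPU MEMORY [key=value,key=value,...]
render_fake_node() {
  name=$1
  cpu=$2
  memory=$3
  extended=${4:-}

  # Base resources; pods: "110" is an assumed default, not taken from the PR.
  resources=$(printf '    cpu: "%s"\n    memory: %s\n    pods: "110"' "$cpu" "$memory")

  # Split "key=value,key=value" on commas and append one resource line per pair.
  old_ifs=$IFS
  IFS=','
  for pair in $extended; do
    resources=$(printf '%s\n    %s: "%s"' "$resources" "${pair%%=*}" "${pair#*=}")
  done
  IFS=$old_ifs

  # The same resource list is reported as both allocatable and capacity.
  cat <<EOF
apiVersion: v1
kind: Node
metadata:
  name: ${name}
  annotations:
    kwok.x-k8s.io/node: fake
  labels:
    type: kwok
spec:
  taints:
  - key: kwok.x-k8s.io/node
    value: fake
    effect: NoSchedule
status:
  allocatable:
${resources}
  capacity:
${resources}
EOF
}
```

With a kwok controller running in the cluster, the output could be piped straight into kubectl, for example `render_fake_node fake-node-0 4 8Gi "volcano.sh/gpu-number=4,volcano.sh/gpu-memory=20" | kubectl apply -f -`.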

volcano-sh-bot commented 5 days ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign hwdef.
You can assign the PR to them by writing /assign @hwdef in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/volcano-sh/volcano/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

JesseStutler commented 5 days ago

I found that benchmark already contains a script that deploys kwok and fake nodes, so it may be better to move these into benchmark/kwok.