openshift / instaslice-operator

InstaSlice Operator facilitates slicing of accelerators using stable APIs
Apache License 2.0

Unable to provision 7g.40gb slice on A100 40GB #285

Open bharathappali opened 4 days ago

bharathappali commented 4 days ago

I was trying to create dynamic slices with InstaSlice on an OpenShift cluster that has a node with 4 A100 GPUs. I found that InstaSlice creates a MIG slice for any profile smaller than 7g.40gb, but it is unable to create one for 7g.40gb.

I tried the same workload with a 7g.40gb slice and with a 4g.20gb slice; the details are below.

InstaSlice image built from the release-4.19 branch:

[abharath@abharath-thinkpadt14sgen2i instaslice-operator]$ git branch
  main
* release-4.19

Node allocatable resources:

Allocatable:
  cpu:                                             127500m
  ephemeral-storage:                               430324950326
  hugepages-1Gi:                                   0
  hugepages-2Mi:                                   0
  instaslice.redhat.com/accelerator-memory-quota:  160Gi
  instaslice.redhat.com/mig-1g.10gb:               16
  instaslice.redhat.com/mig-1g.5gb:                28
  instaslice.redhat.com/mig-1g.5gb+me:             28
  instaslice.redhat.com/mig-2g.10gb:               12
  instaslice.redhat.com/mig-3g.20gb:               8
  instaslice.redhat.com/mig-4g.20gb:               4
  instaslice.redhat.com/mig-7g.40gb:               4
  memory:                                          1055311156Ki
  nvidia.com/gpu:                                  0
  nvidia.com/mig-3g.20gb:                          0
  nvidia.com/mig-4g.20gb:                          0
  pods:                                            250
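
For reference, the same allocatable map can be dumped directly from the node object (the node name wrk-5 is taken from the pod describe further below):

oc get node wrk-5 -o jsonpath='{.status.allocatable}'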

Instaslice controller logs:

{"level":"info","ts":"2024-11-27T09:22:26.337021649Z","caller":"controller/instaslice_controller.go:443","msg":"no suitable node found in cluster for ","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-j84fc","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-j84fc","reconcileID":"4991cd22-fe6f-4832-b820-a856ea5f01da","pod":"human-eval-deployment-job-j84fc"}
{"level":"info","ts":"2024-11-27T09:22:36.337786468Z","caller":"controller/capacity.go:48","msg":"cpu request obtained ","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-j84fc","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-j84fc","reconcileID":"a1cecafe-2a40-414b-8b22-38ad0698d2ea","pod":"human-eval-deployment-job-j84fc","value":2}
{"level":"info","ts":"2024-11-27T09:22:36.337871857Z","caller":"controller/capacity.go:56","msg":"memory request obtained ","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-j84fc","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-j84fc","reconcileID":"a1cecafe-2a40-414b-8b22-38ad0698d2ea","pod":"human-eval-deployment-job-j84fc","value":4294967296}
{"level":"info","ts":"2024-11-27T09:22:36.337903101Z","caller":"controller/instaslice_controller.go:443","msg":"no suitable node found in cluster for ","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-j84fc","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-j84fc","reconcileID":"a1cecafe-2a40-414b-8b22-38ad0698d2ea","pod":"human-eval-deployment-job-j84fc"}

Workload YAML:

apiVersion: v1
kind: Namespace
metadata:
  name: kruize-gpu-rec-apply
---
kind: Job
apiVersion: batch/v1
metadata:
  name: human-eval-deployment-job
  namespace: kruize-gpu-rec-apply
spec:
  template:
    spec:
      containers:
        - name: human-eval-benchmark
          image: 'quay.io/kruizehub/human-eval-deployment:latest'
          env:
            - name: num_prompts
              value: '20000' 
          resources:
            requests:
              cpu: 2
              memory: 4Gi
              nvidia.com/mig-7g.40gb: 1
            limits:
              cpu: 2
              memory: 4Gi
              nvidia.com/mig-7g.40gb: 1 
          volumeMounts:
            - name: cache-volume
              mountPath: /.cache/huggingface
          imagePullPolicy: IfNotPresent
      restartPolicy: Never
      volumes:
        - name: cache-volume
          persistentVolumeClaim:
            claimName: cache-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cache-pvc
  namespace: kruize-gpu-rec-apply
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi

Workload status:

[abharath@abharath-thinkpadt14sgen2i nerc]$ oc get pods -n kruize-gpu-rec-apply
NAME                              READY   STATUS            RESTARTS   AGE
human-eval-deployment-job-j84fc   0/1     SchedulingGated   0          5s

Pod describe output:

[abharath@abharath-thinkpadt14sgen2i nerc]$ oc describe pod human-eval-deployment-job-j84fc -n kruize-gpu-rec-apply
Name:             human-eval-deployment-job-j84fc
Namespace:        kruize-gpu-rec-apply
Priority:         0
Service Account:  default
Node:             <none>
Labels:           batch.kubernetes.io/controller-uid=6663a528-6986-4fb6-8567-c7bb5a913ddf
                  batch.kubernetes.io/job-name=human-eval-deployment-job
                  controller-uid=6663a528-6986-4fb6-8567-c7bb5a913ddf
                  job-name=human-eval-deployment-job
Annotations:      openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Pending
SeccompProfile:   RuntimeDefault
IP:               
IPs:              <none>
Controlled By:    Job/human-eval-deployment-job
Containers:
  human-eval-benchmark:
    Image:      quay.io/kruizehub/human-eval-deployment:latest
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:                                             2
      instaslice.redhat.com/accelerator-memory-quota:  40Gi
      instaslice.redhat.com/mig-7g.40gb:               1
      memory:                                          4Gi
    Requests:
      cpu:                                             2
      instaslice.redhat.com/accelerator-memory-quota:  40Gi
      instaslice.redhat.com/mig-7g.40gb:               1
      memory:                                          4Gi
    Environment Variables from:
      982d3f10-36c6-4bca-b8a0-6b596a353e83  ConfigMap  Optional: false
    Environment:
      num_prompts:  20000
    Mounts:
      /.cache/huggingface from cache-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-smdnc (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  cache-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  cache-pvc
    ReadOnly:   false
  kube-api-access-smdnc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

Note: the workload runs if I change nvidia.com/mig-7g.40gb: 1 to nvidia.com/mig-4g.20gb: 1 in the requests and limits, as shown below.
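
For reference, the working variant differs only in the MIG resource name in the container's requests and limits:

          resources:
            requests:
              cpu: 2
              memory: 4Gi
              nvidia.com/mig-4g.20gb: 1
            limits:
              cpu: 2
              memory: 4Gi
              nvidia.com/mig-4g.20gb: 1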

Controller logs from the 4g.20gb attempt:

{"level":"info","ts":"2024-11-27T09:23:28.514660632Z","caller":"controller/instaslice_controller.go:203","msg":"finalizer deleted for failed for ","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-j84fc","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-j84fc","reconcileID":"27a11b3d-f8af-4fb2-9c2e-28d9bff3f2f8","pod":"human-eval-deployment-job-j84fc"}
{"level":"info","ts":"2024-11-27T09:24:15.329605712Z","caller":"controller/capacity.go:48","msg":"cpu request obtained ","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-xs4tz","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-xs4tz","reconcileID":"6a951761-a3e8-44ae-b15e-cff383da295d","pod":"human-eval-deployment-job-xs4tz","value":2}
{"level":"info","ts":"2024-11-27T09:24:15.329719412Z","caller":"controller/capacity.go:56","msg":"memory request obtained ","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-xs4tz","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-xs4tz","reconcileID":"6a951761-a3e8-44ae-b15e-cff383da295d","pod":"human-eval-deployment-job-xs4tz","value":4294967296}
{"level":"info","ts":"2024-11-27T09:24:15.329761712Z","caller":"controller/instaslice_controller.go:427","msg":"allocation obtained for ","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-xs4tz","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-xs4tz","reconcileID":"6a951761-a3e8-44ae-b15e-cff383da295d","pod":"human-eval-deployment-job-xs4tz"}
{"level":"error","ts":"2024-11-27T09:24:15.594604284Z","caller":"controller/instaslice_controller.go:763","msg":"error ungating pod","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-xs4tz","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-xs4tz","reconcileID":"349e816d-b02e-4ba0-ba89-dc438e619177","error":"Operation cannot be fulfilled on pods \"human-eval-deployment-job-xs4tz\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/openshift/instaslice-operator/internal/controller.(*InstasliceReconciler).addNodeSelectorAndUngatePod\n\t/workspace/internal/controller/instaslice_controller.go:763\ngithub.com/openshift/instaslice-operator/internal/controller.(*InstasliceReconciler).Reconcile\n\t/workspace/internal/controller/instaslice_controller.go:400\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:303\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:224"}
{"level":"info","ts":"2024-11-27T09:24:15.59478036Z","caller":"controller/controller.go:314","msg":"Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-xs4tz","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-xs4tz","reconcileID":"349e816d-b02e-4ba0-ba89-dc438e619177"}
{"level":"error","ts":"2024-11-27T09:24:15.594807119Z","caller":"controller/controller.go:316","msg":"Reconciler error","controller":"InstaSlice-controller","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"human-eval-deployment-job-xs4tz","namespace":"kruize-gpu-rec-apply"},"namespace":"kruize-gpu-rec-apply","name":"human-eval-deployment-job-xs4tz","reconcileID":"349e816d-b02e-4ba0-ba89-dc438e619177","error":"Operation cannot be fulfilled on pods \"human-eval-deployment-job-xs4tz\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:224"}

DaemonSet logs after applying 4g.20gb:

{"level":"info","ts":"2024-11-27T09:24:15.345122243Z","caller":"controller/instaslice_daemonset.go:162","msg":"creating allocation for ","controller":"InstaSliceDaemonSet","controllerGroup":"inference.redhat.com","controllerKind":"Instaslice","Instaslice":{"name":"wrk-5","namespace":"instaslice-system"},"namespace":"instaslice-system","name":"wrk-5","reconcileID":"ff2dd559-f963-41fd-9a6b-c8ab7761ebc5","pod":"human-eval-deployment-job-xs4tz"}
{"level":"info","ts":"2024-11-27T09:24:15.345971011Z","caller":"controller/instaslice_daemonset.go:221","msg":"The profile id is","controller":"InstaSliceDaemonSet","controllerGroup":"inference.redhat.com","controllerKind":"Instaslice","Instaslice":{"name":"wrk-5","namespace":"instaslice-system"},"namespace":"instaslice-system","name":"wrk-5","reconcileID":"ff2dd559-f963-41fd-9a6b-c8ab7761ebc5","giProfileInfo":5,"Memory":19968,"pod":"b218f0de-b0b3-43b2-ba3e-06129faae25e"}
{"level":"info","ts":"2024-11-27T09:24:15.34631395Z","caller":"controller/instaslice_daemonset.go:881","msg":"creating slice for","controller":"InstaSliceDaemonSet","controllerGroup":"inference.redhat.com","controllerKind":"Instaslice","Instaslice":{"name":"wrk-5","namespace":"instaslice-system"},"namespace":"instaslice-system","name":"wrk-5","reconcileID":"ff2dd559-f963-41fd-9a6b-c8ab7761ebc5","pod":"human-eval-deployment-job-xs4tz"}
{"level":"info","ts":"2024-11-27T09:24:15.514907103Z","caller":"controller/instaslice_daemonset.go:717","msg":"ConfigMap not found, creating for ","controller":"InstaSliceDaemonSet","controllerGroup":"inference.redhat.com","controllerKind":"Instaslice","Instaslice":{"name":"wrk-5","namespace":"instaslice-system"},"namespace":"instaslice-system","name":"wrk-5","reconcileID":"ff2dd559-f963-41fd-9a6b-c8ab7761ebc5","name":"7d9c93e0-e9b0-4562-8b12-8678019a0a5b","migGPUUUID":"MIG-ae3fdfff-d866-5cd6-a79d-94902fa9c5a0"}
{"level":"info","ts":"2024-11-27T09:24:15.534308794Z","caller":"controller/instaslice_daemonset.go:251","msg":"done creating mig slice for ","controller":"InstaSliceDaemonSet","controllerGroup":"inference.redhat.com","controllerKind":"Instaslice","Instaslice":{"name":"wrk-5","namespace":"instaslice-system"},"namespace":"instaslice-system","name":"wrk-5","reconcileID":"ff2dd559-f963-41fd-9a6b-c8ab7761ebc5","pod":"human-eval-deployment-job-xs4tz","parentgpu":"GPU-15ea50a3-01fd-b823-2c66-0e247db67a7d","miguuid":"MIG-ae3fdfff-d866-5cd6-a79d-94902fa9c5a0"}

Pod running with 4g.20gb:

[abharath@abharath-thinkpadt14sgen2i nerc]$ oc get pods -n kruize-gpu-rec-apply
NAME                              READY   STATUS    RESTARTS   AGE
human-eval-deployment-job-xs4tz   1/1     Running   0          2m15s

Pod describe:

[abharath@abharath-thinkpadt14sgen2i nerc]$ oc describe pod human-eval-deployment-job-xs4tz -n kruize-gpu-rec-apply
Name:             human-eval-deployment-job-xs4tz
Namespace:        kruize-gpu-rec-apply
Priority:         0
Service Account:  default
Node:             wrk-5/192.168.50.93
Start Time:       Wed, 27 Nov 2024 14:54:16 +0530
Labels:           batch.kubernetes.io/controller-uid=e6faf00e-7d7e-4026-bac1-ea4c42d868d8
                  batch.kubernetes.io/job-name=human-eval-deployment-job
                  controller-uid=e6faf00e-7d7e-4026-bac1-ea4c42d868d8
                  job-name=human-eval-deployment-job
Annotations:      k8s.ovn.org/pod-networks:
                    {"default":{"ip_addresses":["10.129.4.53/23"],"mac_address":"0a:58:0a:81:04:35","gateway_ips":["10.129.4.1"],"routes":[{"dest":"10.128.0.0...
                  k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "ovn-kubernetes",
                        "interface": "eth0",
                        "ips": [
                            "10.129.4.53"
                        ],
                        "mac": "0a:58:0a:81:04:35",
                        "default": true,
                        "dns": {}
                    }]
                  openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Running
SeccompProfile:   RuntimeDefault
IP:               10.129.4.53
IPs:
  IP:           10.129.4.53
Controlled By:  Job/human-eval-deployment-job
Containers:
  human-eval-benchmark:
    Container ID:   cri-o://b5747f5fc8ee06c42757f433742e6d50f9d183d32d9e868c9d71e5a528d4231c
    Image:          quay.io/kruizehub/human-eval-deployment:latest
    Image ID:       quay.io/kruizehub/human-eval-deployment@sha256:002649f767f242834c7349fd01d85f9929ef215fe7676bdf3cbc832049a130fd
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Wed, 27 Nov 2024 14:54:21 +0530
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                                             2
      instaslice.redhat.com/accelerator-memory-quota:  20Gi
      instaslice.redhat.com/mig-4g.20gb:               1
      memory:                                          4Gi
    Requests:
      cpu:                                             2
      instaslice.redhat.com/accelerator-memory-quota:  20Gi
      instaslice.redhat.com/mig-4g.20gb:               1
      memory:                                          4Gi
    Environment Variables from:
      7d9c93e0-e9b0-4562-8b12-8678019a0a5b  ConfigMap  Optional: false
    Environment:
      num_prompts:  20000
    Mounts:
      /.cache/huggingface from cache-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dz6v9 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  cache-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  cache-pvc
    ReadOnly:   false
  kube-api-access-dz6v9:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Guaranteed
Node-Selectors:              kubernetes.io/hostname=wrk-5
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age    From                     Message
  ----     ------                  ----   ----                     -------
  Warning  FailedScheduling        2m48s  default-scheduler        0/9 nodes are available: persistentvolumeclaim "cache-pvc" not found. preemption: 0/9 nodes are available: 9 Preemption is not helpful for scheduling.
  Normal   Scheduled               2m46s  default-scheduler        Successfully assigned kruize-gpu-rec-apply/human-eval-deployment-job-xs4tz to wrk-5
  Normal   SuccessfulAttachVolume  2m46s  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-ba228d4b-9bca-4b31-9d71-1f670fb54427"
  Normal   AddedInterface          2m43s  multus                   Add eth0 [10.129.4.53/23] from ovn-kubernetes
  Normal   Pulled                  2m43s  kubelet                  Container image "quay.io/kruizehub/human-eval-deployment:latest" already present on machine
  Normal   Created                 2m42s  kubelet                  Created container human-eval-benchmark
  Normal   Started                 2m42s  kubelet                  Started container human-eval-benchmark

asm582 commented 4 days ago

Thanks for this issue. Can you share the nvidia-smi -L output from before and after the slice creation?
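
One way to capture that on the GPU node, sketched under two assumptions (the node name wrk-5 from the pod describe above, and nvidia-smi being reachable from the host via the NVIDIA driver install):

oc debug node/wrk-5 -- chroot /host nvidia-smi -L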

asm582 commented 4 days ago

FYI, using the main branch on a KinD cluster, I am able to create a 7g.40gb slice:

nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-31cfe05c-ed13-cd17-d7aa-c63db5108c24)
  MIG 7g.40gb     Device  0: (UUID: MIG-bd1776d4-5118-545c-8e87-30fde4a42225)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-8d042338-e67f-9c48-92b4-5b55c7e5133c)
(base) openstack@netsres62:~/asmalvan/gpu_pack/instaslice-operator$ kubectl describe pod
Name:             cuda-vectoradd-0
Namespace:        default
Priority:         0
Service Account:  default
Node:             kind-control-plane/172.18.0.2
Start Time:       Wed, 27 Nov 2024 04:40:53 -0500
Labels:           <none>
Annotations:      <none>
Status:           Running
IP:               10.244.0.27
IPs:
  IP:  10.244.0.27
Containers:
  cuda-vectoradd-0:
    Container ID:  containerd://967df508228e456d9f83312dbf254c5e146a4c2281aff48deff886e7b3dffb5d
    Image:         quay.io/tardieu/vectoradd:0.1.0
    Image ID:      quay.io/tardieu/vectoradd@sha256:4d8d95ec884480d489056f3a8b202d4aeea744e4a0a481a20b90009614d40244
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      nvidia-smi -L; ./vectorAdd && sleep 1800
    State:          Running
      Started:      Wed, 27 Nov 2024 04:41:01 -0500
    Ready:          True
    Restart Count:  0
    Limits:
      instaslice.redhat.com/accelerator-memory-quota:  40Gi
      instaslice.redhat.com/mig-7g.40gb:               1
    Requests:
      instaslice.redhat.com/accelerator-memory-quota:  40Gi
      instaslice.redhat.com/mig-7g.40gb:               1
    Environment Variables from:
      698f3e41-8f19-46f0-82f0-bd759fcb478f  ConfigMap  Optional: false
    Environment:                            <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dprt9 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  kube-api-access-dprt9:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/hostname=kind-control-plane
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  11s   default-scheduler  Successfully assigned default/cuda-vectoradd-0 to kind-control-plane
  Normal  Pulling    10s   kubelet            Pulling image "quay.io/tardieu/vectoradd:0.1.0"
  Normal  Pulled     4s    kubelet            Successfully pulled image "quay.io/tardieu/vectoradd:0.1.0" in 6.064s (6.064s including waiting). Image size: 30691624 bytes.
  Normal  Created    3s    kubelet            Created container cuda-vectoradd-0
  Normal  Started    3s    kubelet            Started container cuda-vectoradd-0

bharathappali commented 4 days ago

Thanks @asm582, I'll try a build from the main branch.