bharathappali opened this issue 4 days ago
Thanks for this issue. Can you share the nvidia-smi -L output before and after the slice creation?
FYI, using the main branch on a KinD cluster, I am able to create a 7g.40gb slice:
nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-31cfe05c-ed13-cd17-d7aa-c63db5108c24)
  MIG 7g.40gb     Device  0: (UUID: MIG-bd1776d4-5118-545c-8e87-30fde4a42225)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-8d042338-e67f-9c48-92b4-5b55c7e5133c)
(base) openstack@netsres62:~/asmalvan/gpu_pack/instaslice-operator$ kubectl describe pod
Name:             cuda-vectoradd-0
Namespace:        default
Priority:         0
Service Account:  default
Node:             kind-control-plane/172.18.0.2
Start Time:       Wed, 27 Nov 2024 04:40:53 -0500
Labels:           <none>
Annotations:      <none>
Status:           Running
IP:               10.244.0.27
IPs:
  IP:  10.244.0.27
Containers:
  cuda-vectoradd-0:
    Container ID:   containerd://967df508228e456d9f83312dbf254c5e146a4c2281aff48deff886e7b3dffb5d
    Image:          quay.io/tardieu/vectoradd:0.1.0
    Image ID:       quay.io/tardieu/vectoradd@sha256:4d8d95ec884480d489056f3a8b202d4aeea744e4a0a481a20b90009614d40244
    Port:           <none>
    Host Port:      <none>
    Command:
      sh
      -c
      nvidia-smi -L; ./vectorAdd && sleep 1800
    State:          Running
      Started:      Wed, 27 Nov 2024 04:41:01 -0500
    Ready:          True
    Restart Count:  0
    Limits:
      instaslice.redhat.com/accelerator-memory-quota:  40Gi
      instaslice.redhat.com/mig-7g.40gb:               1
    Requests:
      instaslice.redhat.com/accelerator-memory-quota:  40Gi
      instaslice.redhat.com/mig-7g.40gb:               1
    Environment Variables from:
      698f3e41-8f19-46f0-82f0-bd759fcb478f  ConfigMap  Optional: false
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dprt9 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  kube-api-access-dprt9:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/hostname=kind-control-plane
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  11s   default-scheduler  Successfully assigned default/cuda-vectoradd-0 to kind-control-plane
  Normal  Pulling    10s   kubelet            Pulling image "quay.io/tardieu/vectoradd:0.1.0"
  Normal  Pulled     4s    kubelet            Successfully pulled image "quay.io/tardieu/vectoradd:0.1.0" in 6.064s (6.064s including waiting). Image size: 30691624 bytes.
  Normal  Created    3s    kubelet            Created container cuda-vectoradd-0
  Normal  Started    3s    kubelet            Started container cuda-vectoradd-0
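For reference, a pod spec roughly along these lines reproduces the requests and limits shown in the describe output above (this is reconstructed from that output, so it may not match the original manifest exactly, and some of the resources could have been injected by instaslice rather than set by hand):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-0
spec:
  containers:
  - name: cuda-vectoradd-0
    image: quay.io/tardieu/vectoradd:0.1.0
    command: ["sh", "-c", "nvidia-smi -L; ./vectorAdd && sleep 1800"]
    resources:
      requests:
        instaslice.redhat.com/mig-7g.40gb: 1
        instaslice.redhat.com/accelerator-memory-quota: 40Gi
      limits:
        instaslice.redhat.com/mig-7g.40gb: 1
        instaslice.redhat.com/accelerator-memory-quota: 40Gi
```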
Thanks @asm582, I'll try with the main branch build.
I was trying to create dynamic slices with instaslice on an OpenShift cluster which has a node with 4 A100 GPUs. I found that instaslice creates a MIG slice for any config smaller than 7g.40gb, but it is not able to create a MIG slice for 7g.40gb. I have tried the same workload with 7g.40gb and 4g.20gb slices and here are the details.
Instaslice image built from the branch release-4.19.
Node allocatable resources:
Instaslice controller logs:
Workload yaml:
Workload status:
Describe pod output:
Note: it works if I change nvidia.com/mig-7g.40gb: 1 in requests and limits to nvidia.com/mig-4g.20gb: 1, as sketched below.
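For illustration only (not the exact yaml used for the workload), the difference between the failing and working variants is just the MIG resource name in the container's requests and limits:

```yaml
# Failing variant: no 7g.40gb MIG slice gets created
resources:
  requests:
    nvidia.com/mig-7g.40gb: 1
  limits:
    nvidia.com/mig-7g.40gb: 1
---
# Working variant: the 4g.20gb slice is created as expected
resources:
  requests:
    nvidia.com/mig-4g.20gb: 1
  limits:
    nvidia.com/mig-4g.20gb: 1
```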
Controller logs when tried with 4g.20gb:
Daemonset logs after applying 4g.20gb:
Pod running with 4g.20gb:
Pod describe: