Add GPU to OCP4

rbo commented 3 years ago

Please add a new node with a GPU to cluster OCP4. A GPU is available on Storm2 via PCI passthrough.

- [x] Create new node: gpu.ocp4.stormshift.coe.muc.redhat.com
- [x] Add DNS A and PTR records
- [x] Add DHCP static IP config
- [x] Create VM on storm2.coe.muc.redhat.com
- [x] Attach NVIDIA GPU to VM
- [x] Add node to OpenShift cluster
- [x] Install Node Feature Discovery (NFD) Operator
- [x] Roll out cluster entitlement for NVIDIA GPU Operator
- [x] Install NVIDIA GPU Operator

github-actions[bot] commented 3 years ago

Heads up @cluster/ocp4-admin - the "cluster/ocp4" label was applied to this issue.

rbo commented 3 years ago

Adjusted the following files on ocp4support:

/etc/dhcp/dhcpd.conf
/etc/named/10.16.172.in-addr.arpa.zone
/etc/named/ocp4.stormshift.coe.muc.redhat.com.zone

rbo commented 3 years ago

Problem during NFD Operator installation:

risk of data loss updating "nodefeaturediscoveries.nfd.openshift.io": new CRD removes version v1alpha1 that is listed as a stored version on the existing CRD

OpenShift version: 4.8.17
NFD Operator version: 4.8.0-202110262219

Solution

https://access.redhat.com/solutions/6097251

Uninstall NFD Operator

Delete the old CRD

[root@ocp4support ocp4install]# oc get crd | grep node
nodefeaturediscoveries.nfd.openshift.io                                2020-08-13T13:04:00Z
[root@ocp4support ocp4install]# oc get -A nodefeaturediscoveries.nfd.openshift.io
No resources found
[root@ocp4support ocp4install]# oc delete crd/nodefeaturediscoveries.nfd.openshift.io
customresourcedefinition.apiextensions.k8s.io "nodefeaturediscoveries.nfd.openshift.io" deleted
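
The manual check above (list all resources of the CRD, then delete it) could be wrapped in a small guard. This is only a sketch: `crd_is_empty` is a hypothetical helper name, and it assumes `oc` is logged in with permission to list the resource across all namespaces.

```shell
# Hypothetical helper: succeeds (status 0) only if no custom resources of the
# given CRD exist in any namespace, so deleting the CRD discards no objects.
crd_is_empty() {
  # Return 2 if the query itself fails, so an error is not mistaken for "empty".
  out=$(oc get "$1" -A -o name 2>/dev/null) || return 2
  # "No resources found" goes to stderr, so stdout is empty when nothing exists.
  [ -z "$out" ]
}
```

Possible usage: `crd_is_empty nodefeaturediscoveries.nfd.openshift.io && oc delete crd/nodefeaturediscoveries.nfd.openshift.io`.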

Install NFD Operator

SOLVED

rbo commented 3 years ago

Surprise, surprise: compute-0 already has a GPU...

rbo commented 3 years ago

Configure cluster-wide entitlement

It looks like an old entitlement already exists:

[root@ocp4support ocp4install]# oc get mc | grep enti
50-entitlement-key-pem                                                                        2.2.0             352d
50-entitlement-pem                                                                            2.2.0             352d

Applied new entitlement based on https://docs.nvidia.com/datacenter/cloud-native/openshift/cluster-entitlement.html#obtaining-an-entitlement-certificate

rbo commented 3 years ago

We have problems with the MachineConfig rollout of the entitlement.

$ oc get -o yaml nodes/compute-0.ocp4.stormshift.coe.muc.redhat.com| grep 'machineconfiguration.openshift.io/reason'
    machineconfiguration.openshift.io/reason: 'failed to drain node : compute-0.ocp4.stormshift.coe.muc.redhat.com

Try to drain by hand:

$ oc adm drain --ignore-daemonsets --delete-emptydir-data compute-0.ocp4.stormshift.coe.muc.redhat.com
...
error when evicting pods/"rook-ceph-osd-3-7b566d7688-4297s" -n "openshift-storage" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod openshift-storage/rook-ceph-osd-3-7b566d7688-4297s

Cannot evict the OCS/ODF storage pod.
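
To see at a glance which pods are holding up a drain, the eviction errors can be filtered out of the drain output. A minimal sketch, assuming the error line format shown above; `pdb_blocked_pods` is a hypothetical helper name, not part of `oc`.

```shell
# Hypothetical helper: reads `oc adm drain` output on stdin and prints each pod
# (as namespace/name) whose eviction was refused because of a disruption budget.
pdb_blocked_pods() {
  sed -n 's|^error when evicting pods/"\([^"]*\)" -n "\([^"]*\)".*disruption budget.*$|\2/\1|p' \
    | sort -u
}
```

Possible usage: `oc adm drain --ignore-daemonsets --delete-emptydir-data <node> 2>&1 | pdb_blocked_pods`. For a reported namespace, `oc get pdb -n openshift-storage` then lists the budgets and their allowed disruptions.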

rbo commented 3 years ago

OCS/ODF does not look healthy:

rook-ceph-osd-0-56fcc8d95-9zp7n                                   1/2     CrashLoopBackOff   221        13h
rook-ceph-osd-1-7756689d78-bbztl                                  2/2     Running            0          2d16h
rook-ceph-osd-3-7b566d7688-d8pjv                                  0/2     Pending            0          64s

And the cluster is overloaded:

$ oc describe no -l node-role.kubernetes.io/worker= |grep -A 7 "Allocated resources:"
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                12582m (81%)   18050m (116%)
  memory             48224Mi (37%)  74792Mi (58%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                15193m (98%)   19700m (127%)
  memory             64547Mi (50%)  77484Mi (60%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                3594m (23%)  5 (32%)
  memory             9816Mi (7%)  9440Mi (7%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                3065m (87%)   10800m (308%)
  memory             7861Mi (52%)  17252Mi (115%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
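
The overcommitted rows in output like the above can be picked out automatically. A rough sketch, assuming the "Allocated resources" table layout shown here; `flag_overcommitted` is a hypothetical helper name.

```shell
# Hypothetical helper: reads "Allocated resources" tables on stdin and prints
# every cpu/memory row whose Limits percentage is above 100.
flag_overcommitted() {
  awk '
    $1 == "cpu" || $1 == "memory" {
      # The last field is the Limits percentage, e.g. "(116%)".
      pct = $NF
      gsub(/[()%]/, "", pct)   # strip parentheses and percent sign
      if (pct + 0 > 100) print $1, $NF
    }
  '
}
```

Possible usage: `oc describe no -l node-role.kubernetes.io/worker= | grep -A 7 "Allocated resources:" | flag_overcommitted`. On the output above this would report the CPU limits of the first, second and fourth worker, plus the memory limit of the fourth.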

rbo commented 3 years ago

Entitlement rolled out successfully.

NVIDIA Operator installed and configured successfully:

$ cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
pod/cuda-vectoradd created

$ oc logs cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

https://docs.nvidia.com/datacenter/cloud-native/openshift/install-gpu-ocp.html#running-a-sample-gpu-application

stormshift / support

Add GPU to OCP4 #50