stormshift / support

This repo should serve as a central source for reporting issues with stormshift
GNU General Public License v3.0

Add GPU to OCP4 #50

Closed rbo closed 3 years ago

rbo commented 3 years ago

Please add a new node to Cluster OCP4 with a GPU. GPU is available on Storm2 with PCI Passthrough.

github-actions[bot] commented 3 years ago

Heads up @cluster/ocp4-admin - the "cluster/ocp4" label was applied to this issue.

rbo commented 3 years ago

Adjusted

on ocp4support

image

rbo commented 3 years ago

Problem during NFD Operator installation:

risk of data loss updating "nodefeaturediscoveries.nfd.openshift.io": new CRD removes version v1alpha1 that is listed as a stored version on the existing CRD

OpenShift Version: 4.8.17
NFD Operator version: 4.8.0-202110262219

Solution

https://access.redhat.com/solutions/6097251

SOLVED
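
For anyone hitting the same error: the message says the existing CRD still lists v1alpha1 under status.storedVersions. Below is a minimal sketch for checking that before and after applying the linked solution. The helper name has_stored_version is hypothetical, and it assumes the version list arrives space-separated on stdin (as the `[*]` jsonpath form prints it). It only inspects; the actual fix is whatever the linked solution describes.

```shell
# Sketch only: succeed if the given version appears in a space-separated
# list of stored versions read from stdin. Helper name is hypothetical.
#
# Usage (assumes cluster access):
#   oc get crd nodefeaturediscoveries.nfd.openshift.io \
#     -o jsonpath='{.status.storedVersions[*]}' | has_stored_version v1alpha1
has_stored_version() {
  wanted="$1"
  # One version per line, then require an exact, literal line match.
  tr -s ' \t' '\n\n' | grep -qxF -- "$wanted"
}
```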

rbo commented 3 years ago

Surprise, surprise: compute-0 already has a GPU...

image

rbo commented 3 years ago

Configure cluster wide entitlement

It looks like an old entitlement is already present:

[root@ocp4support ocp4install]# oc get mc | grep enti
50-entitlement-key-pem                                                                        2.2.0             352d
50-entitlement-pem                                                                            2.2.0             352d

Applied a new entitlement based on https://docs.nvidia.com/datacenter/cloud-native/openshift/cluster-entitlement.html#obtaining-an-entitlement-certificate

rbo commented 3 years ago

We have problems with the MachineConfig rollout of the entitlement.

$ oc get -o yaml nodes/compute-0.ocp4.stormshift.coe.muc.redhat.com| grep 'machineconfiguration.openshift.io/reason'
    machineconfiguration.openshift.io/reason: 'failed to drain node : compute-0.ocp4.stormshift.coe.muc.redhat.com

Trying to drain by hand:

$ oc adm drain --ignore-daemonsets --delete-emptydir-data compute-0.ocp4.stormshift.coe.muc.redhat.com
...
error when evicting pods/"rook-ceph-osd-3-7b566d7688-4297s" -n "openshift-storage" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod openshift-storage/rook-ceph-osd-3-7b566d7688-4297s

Cannot evict the OCS/ODF storage pod.
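
To see which disruption budgets are blocking the eviction, something like the following could help. It is a rough sketch: the helper name blocked_pdbs is hypothetical, and it assumes a two-column listing (name, allowed disruptions) on stdin, produced here with custom-columns over .status.disruptionsAllowed.

```shell
# Sketch only: print the names of PodDisruptionBudgets that currently
# allow zero disruptions. Helper name is hypothetical.
#
# Usage (assumes cluster access):
#   oc get pdb -n openshift-storage --no-headers \
#     -o custom-columns=NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#     | blocked_pdbs
blocked_pdbs() {
  # String comparison on purpose: skips rows where the value is missing
  # or shown as "<none>".
  awk 'NF >= 2 && $2 == "0" { print $1 }'
}
```

A budget with zero allowed disruptions usually means the pods it covers are already unhealthy, so the drain will keep retrying until they recover.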

rbo commented 3 years ago

OCS/ODF does not look good:

rook-ceph-osd-0-56fcc8d95-9zp7n                                   1/2     CrashLoopBackOff   221        13h
rook-ceph-osd-1-7756689d78-bbztl                                  2/2     Running            0          2d16h
rook-ceph-osd-3-7b566d7688-d8pjv                                  0/2     Pending            0          64s

And the cluster is overloaded:

$ oc describe no -l node-role.kubernetes.io/worker= |grep -A 7 "Allocated resources:"
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                12582m (81%)   18050m (116%)
  memory             48224Mi (37%)  74792Mi (58%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                15193m (98%)   19700m (127%)
  memory             64547Mi (50%)  77484Mi (60%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                3594m (23%)  5 (32%)
  memory             9816Mi (7%)  9440Mi (7%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                3065m (87%)   10800m (308%)
  memory             7861Mi (52%)  17252Mi (115%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
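
The per-node blocks above are easier to compare when reduced to one number each. Below is a minimal sketch that pulls the CPU request percentage out of that output and marks high values. The helper name cpu_request_pct and the default threshold of 80 are assumptions for illustration, not anything the cluster defines.

```shell
# Sketch only: read `oc describe node` "Allocated resources" output on stdin
# and print the CPU request percentage per node, marking values at or above
# a threshold. Helper name and the default of 80 are illustrative.
#
# Usage (assumes cluster access):
#   oc describe no -l node-role.kubernetes.io/worker= \
#     | grep -A 7 "Allocated resources:" | cpu_request_pct 80
cpu_request_pct() {
  threshold="${1:-80}"
  awk -v limit="$threshold" '
    $1 == "cpu" && NF >= 3 {
      pct = $3
      gsub(/[()%]/, "", pct)   # "(81%)" -> "81"
      flag = (pct + 0 >= limit + 0) ? " HIGH" : ""
      print pct "%" flag
    }'
}
```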

rbo commented 3 years ago

Entitlement rolled out successfully.

NVIDIA Operator installed and configured successfully:

$ cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
pod/cuda-vectoradd created

$ oc logs cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

https://docs.nvidia.com/datacenter/cloud-native/openshift/install-gpu-ocp.html#running-a-sample-gpu-application
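
If this check needs repeating after later changes, the log can be tested for the pass line instead of being read by eye. A small sketch follows; the helper name vectoradd_passed is hypothetical, and it relies only on the "Test PASSED" line shown in the log above.

```shell
# Sketch only: succeed if the vectoradd sample log on stdin contains the
# pass line. Helper name is hypothetical.
#
# Usage (assumes the pod from the manifest above still exists):
#   oc logs cuda-vectoradd | vectoradd_passed && echo "GPU OK"
vectoradd_passed() {
  grep -q '^Test PASSED$'
}
```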