rbo closed this issue 3 years ago.
Heads up @cluster/ocp4-admin - the "cluster/ocp4" label was applied to this issue.
Adjusted the following files on ocp4support:
/etc/dhcp/dhcpd.conf
/etc/named/10.16.172.in-addr.arpa.zone
/etc/named/ocp4.stormshift.coe.muc.redhat.com.zone
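The entries added look roughly like the following; host name, MAC address and IP are placeholders rather than the values actually used:

# dhcpd.conf: static lease for the new node (placeholder MAC and IP)
host compute-gpu {
  hardware ethernet 52:54:00:xx:xx:xx;
  fixed-address 172.16.10.99;
}

# ocp4.stormshift.coe.muc.redhat.com.zone: forward record (placeholder IP)
compute-gpu  IN  A    172.16.10.99

# 10.16.172.in-addr.arpa.zone: matching reverse record
99           IN  PTR  compute-gpu.ocp4.stormshift.coe.muc.redhat.com.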
risk of data loss updating "nodefeaturediscoveries.nfd.openshift.io": new CRD removes version v1alpha1 that is listed as a stored version on the existing CRD
OpenShift version: 4.8.17
NFD Operator version: 4.8.0-202110262219
https://access.redhat.com/solutions/6097251
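Before touching the CRD it is worth confirming which versions it actually stores; per the error above, v1alpha1 should show up here:

$ oc get crd nodefeaturediscoveries.nfd.openshift.io -o jsonpath='{.status.storedVersions}'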
[root@ocp4support ocp4install]# oc get crd | grep node
nodefeaturediscoveries.nfd.openshift.io 2020-08-13T13:04:00Z
[root@ocp4support ocp4install]# oc get -A nodefeaturediscoveries.nfd.openshift.io
No resources found
[root@ocp4support ocp4install]# oc delete crd/nodefeaturediscoveries.nfd.openshift.io
customresourcedefinition.apiextensions.k8s.io "nodefeaturediscoveries.nfd.openshift.io" deleted
SOLVED: no NodeFeatureDiscovery resources existed, so deleting the CRD was safe and cleared the upgrade error.
It looks like there is an old entitlement still in place:
[root@ocp4support ocp4install]# oc get mc | grep enti
50-entitlement-key-pem 2.2.0 352d
50-entitlement-pem 2.2.0 352d
Applied new entitlement based on https://docs.nvidia.com/datacenter/cloud-native/openshift/cluster-entitlement.html#obtaining-an-entitlement-certificate
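The entitlement is delivered as MachineConfigs that write the certificate onto each worker. A sketch of the shape used by that procedure (the base64 placeholder stands in for the actual entitlement certificate; a matching 50-entitlement-key-pem MachineConfig writes entitlement-key.pem):

$ cat << EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 50-entitlement-pem
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,BASE64_ENCODED_PEM_FILE
        filesystem: root
        mode: 0644
        path: /etc/pki/entitlement/entitlement.pem
EOF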
We have problems with the MachineConfig rollout of the entitlement.
$ oc get -o yaml nodes/compute-0.ocp4.stormshift.coe.muc.redhat.com| grep 'machineconfiguration.openshift.io/reason'
machineconfiguration.openshift.io/reason: 'failed to drain node : compute-0.ocp4.stormshift.coe.muc.redhat.com
Trying to drain the node by hand:
$ oc adm drain --ignore-daemonsets --delete-emptydir-data compute-0.ocp4.stormshift.coe.muc.redhat.com
...
error when evicting pods/"rook-ceph-osd-3-7b566d7688-4297s" -n "openshift-storage" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod openshift-storage/rook-ceph-osd-3-7b566d7688-4297s
The OCS/ODF storage pod cannot be evicted because that would violate its PodDisruptionBudget.
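Listing the PodDisruptionBudgets in openshift-storage shows how much disruption Ceph currently tolerates:

$ oc get pdb -n openshift-storage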
OCS/ODF does not look healthy:
NAME                               READY   STATUS             RESTARTS   AGE
rook-ceph-osd-0-56fcc8d95-9zp7n    1/2     CrashLoopBackOff   221        13h
rook-ceph-osd-1-7756689d78-bbztl   2/2     Running            0          2d16h
rook-ceph-osd-3-7b566d7688-d8pjv   0/2     Pending            0          64s
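With one OSD in CrashLoopBackOff, Ceph itself is presumably degraded and will not release its disruption budget. If the rook-ceph-tools toolbox is enabled for this OCS/ODF install (deployment name assumed from the stock toolbox manifest), its output would confirm that:

$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph status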
And the cluster is overloaded:
$ oc describe no -l node-role.kubernetes.io/worker= |grep -A 7 "Allocated resources:"
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 12582m (81%) 18050m (116%)
memory 48224Mi (37%) 74792Mi (58%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
--
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 15193m (98%) 19700m (127%)
memory 64547Mi (50%) 77484Mi (60%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
--
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 3594m (23%) 5 (32%)
memory 9816Mi (7%) 9440Mi (7%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
--
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 3065m (87%) 10800m (308%)
memory 7861Mi (52%) 17252Mi (115%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
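CPU requests on two of the workers are close to saturation, which also explains why evicted pods cannot be rescheduled. Actual utilization can be compared against these requests via the metrics API (label selector assumed to match the one used above):

$ oc adm top nodes -l node-role.kubernetes.io/worker=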
Entitlement rolled out successfully.
NVIDIA Operator installed and configured successfully:
$ cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
pod/cuda-vectoradd created
$ oc logs cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
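The test pod can be deleted afterwards; grepping the worker node descriptions also confirms that the GPU is advertised as an allocatable resource:

$ oc delete pod cuda-vectoradd
$ oc describe nodes -l node-role.kubernetes.io/worker= | grep nvidia.com/gpu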
Done
Please add a new node with a GPU to cluster OCP4. A GPU is available on Storm2 via PCI passthrough.