Closed veluruchaithanya closed 4 months ago
Thanks for filing the issue. Is there any requirement for cert-manager on OpenShift? Would it be possible to delete the pod?
Hi, I removed the cert-manager pods on OpenShift. I got the same error as earlier when installing the SGX device plugin.
Got it. Could you check the result for this API resource:
oc api-resources | grep MutatingWebhookConfiguration
This is the result
NAME SHORTNAMES APIVERSION NAMESPACED KIND
mutatingwebhookconfigurations admissionregistration.k8s.io/v1 false MutatingWebhookConfiguration
Is cert-manager installed with an operator? If so, would it be possible to uninstall the operator? That might delete all the associated resources.
I removed the cert-manager pods earlier with the command below, and it removed all the associated resources for the cert-manager operator.
$ oc delete deployment -n cert-manager -l app.kubernetes.io/instance=cert-manager
This procedure is described in "Uninstalling the cert-manager Operator for Red Hat OpenShift" (cert-manager Operator for Red Hat OpenShift | Security and compliance | OpenShift Container Platform 4.14).
I found some CRDs and two services related to the cert-manager operator and removed them manually. I then tried to deploy the SGX device plugin and got the same error as before.
The SGX device plugin error reports: service "inteldeviceplugins-webhook-service" not found
I checked the Intel-related services running in the cluster and found only one; the webhook service doesn't exist.
oc get service --all-namespaces | grep intel
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
openshift-operators inteldeviceplugins-controller-manager-service ClusterIP xxx.xx.xx.x <none> 443/TCP
Is there a command to install inteldeviceplugins-webhook-service on openshift?
Yes, that is same.
oc get service --all-namespaces | grep intel
openshift-operators inteldeviceplugins-controller-manager-service
Usually we don't install any webhook service for the SGX device plugin on OpenShift. Could you please share the logs of the Intel device plugins controller manager pod? Is there any specific certificate setup in your cluster?
oc logs inteldeviceplugins-controller-manager-74d569c794-bfmn7
Defaulted container "manager" out of: manager, kube-rbac-proxy
I0709 22:54:14.706455 1 webhook.go:158] "controller-runtime/builder: Registering a mutating webhook" GVK="deviceplugin.intel.com/v1, Kind=SgxDevicePlugin" path="/mutate-deviceplugin-intel-com-v1-sgxdeviceplugin"
I0709 22:54:14.706558 1 server.go:183] "controller-runtime/webhook: Registering webhook" path="/mutate-deviceplugin-intel-com-v1-sgxdeviceplugin"
I0709 22:54:14.706581 1 webhook.go:189] "controller-runtime/builder: Registering a validating webhook" GVK="deviceplugin.intel.com/v1, Kind=SgxDevicePlugin" path="/validate-deviceplugin-intel-com-v1-sgxdeviceplugin"
I0709 22:54:14.706603 1 server.go:183] "controller-runtime/webhook: Registering webhook" path="/validate-deviceplugin-intel-com-v1-sgxdeviceplugin"
I0709 22:54:14.706664 1 webhook.go:158] "controller-runtime/builder: Registering a mutating webhook" GVK="deviceplugin.intel.com/v1, Kind=GpuDevicePlugin" path="/mutate-deviceplugin-intel-com-v1-gpudeviceplugin"
I0709 22:54:14.706693 1 server.go:183] "controller-runtime/webhook: Registering webhook" path="/mutate-deviceplugin-intel-com-v1-gpudeviceplugin"
I0709 22:54:14.706712 1 webhook.go:189] "controller-runtime/builder: Registering a validating webhook" GVK="deviceplugin.intel.com/v1, Kind=GpuDevicePlugin" path="/validate-deviceplugin-intel-com-v1-gpudeviceplugin"
I0709 22:54:14.706735 1 server.go:183] "controller-runtime/webhook: Registering webhook" path="/validate-deviceplugin-intel-com-v1-gpudeviceplugin"
I0709 22:54:14.706789 1 webhook.go:158] "controller-runtime/builder: Registering a mutating webhook" GVK="deviceplugin.intel.com/v1, Kind=QatDevicePlugin" path="/mutate-deviceplugin-intel-com-v1-qatdeviceplugin"
I0709 22:54:14.706811 1 server.go:183] "controller-runtime/webhook: Registering webhook" path="/mutate-deviceplugin-intel-com-v1-qatdeviceplugin"
I0709 22:54:14.706830 1 webhook.go:189] "controller-runtime/builder: Registering a validating webhook" GVK="deviceplugin.intel.com/v1, Kind=QatDevicePlugin" path="/validate-deviceplugin-intel-com-v1-qatdeviceplugin"
I0709 22:54:14.706847 1 server.go:183] "controller-runtime/webhook: Registering webhook" path="/validate-deviceplugin-intel-com-v1-qatdeviceplugin"
I0709 22:54:14.706879 1 webhook.go:158] "controller-runtime/builder: Registering a mutating webhook" GVK="/v1, Kind=Pod" path="/mutate--v1-pod"
I0709 22:54:14.706900 1 server.go:183] "controller-runtime/webhook: Registering webhook" path="/mutate--v1-pod"
I0709 22:54:14.706909 1 webhook.go:204] "controller-runtime/builder: skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called" GVK="/v1, Kind=Pod"
I0709 22:54:14.706932 1 main.go:221] "setup: starting manager"
I0709 22:54:14.707091 1 server.go:185] "controller-runtime/metrics: Starting metrics server"
I0709 22:54:14.707102 1 server.go:50] "intel-device-plugins-manager: starting server" kind="health probe" addr="[::]:8081"
I0709 22:54:14.707168 1 server.go:191] "controller-runtime/webhook: Starting webhook server"
I0709 22:54:14.707180 1 server.go:224] "controller-runtime/metrics: Serving metrics server" bindAddress="127.0.0.1:8080" secure=false
I0709 22:54:14.707399 1 certwatcher.go:161] "controller-runtime/certwatcher: Updated current TLS certificate"
I0709 22:54:14.707496 1 server.go:242] "controller-runtime/webhook: Serving webhook server" host="" port=9443
I0709 22:54:14.707505 1 certwatcher.go:115] "controller-runtime/certwatcher: Starting certificate watcher"
I0709 22:54:15.308148 1 leaderelection.go:250] attempting to acquire leader lease openshift-operators/d1c7b6d5.intel.com...
I0709 22:54:32.789343 1 leaderelection.go:260] successfully acquired lease openshift-operators/d1c7b6d5.intel.com
I0709 22:54:32.789519 1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="sgxdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="SgxDevicePlugin" source="kind source: *v1.SgxDevicePlugin"
I0709 22:54:32.789516 1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" source="kind source: *v1.QatDevicePlugin"
I0709 22:54:32.789559 1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="gpudeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="GpuDevicePlugin" source="kind source: *v1.GpuDevicePlugin"
I0709 22:54:32.789546 1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="sgxdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="SgxDevicePlugin" source="kind source: *v1.DaemonSet"
I0709 22:54:32.789569 1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="gpudeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="GpuDevicePlugin" source="kind source: *v1.DaemonSet"
I0709 22:54:32.789576 1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="gpudeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="GpuDevicePlugin" source="kind source: *v1.ClusterRoleBinding"
I0709 22:54:32.789583 1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="gpudeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="GpuDevicePlugin" source="kind source: *v1.ServiceAccount"
I0709 22:54:32.789586 1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="sgxdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="SgxDevicePlugin" source="kind source: *v1.ClusterRoleBinding"
I0709 22:54:32.789589 1 controller.go:186] "intel-device-plugins-manager: Starting Controller" controller="gpudeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="GpuDevicePlugin"
I0709 22:54:32.789591 1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" source="kind source: *v1.DaemonSet"
I0709 22:54:32.789605 1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" source="kind source: *v1.ClusterRoleBinding"
I0709 22:54:32.789606 1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="sgxdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="SgxDevicePlugin" source="kind source: *v1.ServiceAccount"
I0709 22:54:32.789612 1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" source="kind source: *v1.ServiceAccount"
I0709 22:54:32.789613 1 controller.go:186] "intel-device-plugins-manager: Starting Controller" controller="sgxdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="SgxDevicePlugin"
I0709 22:54:32.789619 1 controller.go:186] "intel-device-plugins-manager: Starting Controller" controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin"
I0709 22:54:32.892909 1 controller.go:220] "intel-device-plugins-manager: Starting workers" controller="sgxdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="SgxDevicePlugin" worker count=1
I0709 22:54:32.892915 1 controller.go:220] "intel-device-plugins-manager: Starting workers" controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" worker count=1
I0709 22:54:32.892913 1 controller.go:220] "intel-device-plugins-manager: Starting workers" controller="gpudeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="GpuDevicePlugin" worker count=1
I0709 22:57:29.103904 1 sgx.go:228] "admission: Mutated SGX Pod" webhookGroup="" webhookKind="Pod" Pod="openshift-marketplace/" namespace="openshift-marketplace" name="" resource={"group":"","version":"v1","resource":"pods"} user="system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount" requestID="67a28586-479d-45b7-bd0c-5959c0f69dec"
I0709 22:59:49.287569 1 sgx.go:228] "admission: Mutated SGX Pod" webhookGroup="" webhookKind="Pod" Pod="openshift-marketplace/" namespace="openshift-marketplace" name="" resource={"group":"","version":"v1","resource":"pods"} user="system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount" requestID="1731abb7-cc6c-4660-b750-aed0587b5066"
I0709 23:00:00.138475 1 sgx.go:228] "admission: Mutated SGX Pod" webhookGroup="" webhookKind="Pod" Pod="openshift-operator-lifecycle-manager/" namespace="openshift-operator-lifecycle-manager" name="" resource={"group":"","version":"v1","resource":"pods"} user="system:serviceaccount:kube-system:job-controller" requestID="a5af5e6f-7add-4859-8133-a869efc1d27b"
I0709 23:00:31.879466 1 sgx.go:228] "admission: Mutated SGX Pod" webhookGroup="" webhookKind="Pod" Pod="openshift-marketplace/" namespace="openshift-marketplace" name="" resource={"group":"","version":"v1","resource":"pods"} user="system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount" requestID="67d379e3-dece-4bec-aff1-01aba68ee035"
I0709 23:03:49.123162 1 sgx.go:228] "admission: Mutated SGX Pod" webhookGroup="" webhookKind="Pod" Pod="openshift-marketplace/" namespace="openshift-marketplace" name="" resource={"group":"","version":"v1","resource":"pods"} user="system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount" requestID="a6f961ad-f8c7-41b3-8cc9-bd89f63e9b56"
I0709 23:08:32.431932 1 sgx.go:228] "admission: Mutated SGX Pod" webhookGroup="" webhookKind="Pod" Pod="openshift-marketplace/" namespace="openshift-marketplace" name="" resource={"group":"","version":"v1","resource":"pods"} user="system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount" requestID="7194887f-68d0-4ec5-96f2-7c09253deb67"
I0709 23:09:51.863327 1 sgx.go:228] "admission: Mutated SGX Pod" webhookGroup="" webhookKind="Pod" Pod="openshift-marketplace/" namespace="openshift-marketplace" name="" resource={"group":"","version":"v1","resource":"pods"} user="system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount" requestID="4dad39fb-890f-4ca1-95ab-d23e50002738"
I don't remember configuring any certificate setup in the cluster.
Hi @veluruchaithanya, to get more info to debug, could you please check a few things?
Does your cluster have the API version admissionregistration.k8s.io/v1? Also make sure the corresponding apiservice is present.
[ckulkar1@jfz1r09h07 ~]$ oc api-versions | grep admissionregistration
admissionregistration.k8s.io/v1
[ckulkar1@jfz1r09h07 ~]$ oc get apiservice | grep admission
v1.admissionregistration.k8s.io Local True 3h48m
[ckulkar1@jfz1r09h07 ~]$
From the error you posted, it seems like the cluster is trying to get the MutatingWebhookConfiguration from "admissionregistration.k8s.io/v1beta1". Note that the API version is v1 and not v1beta1: v1beta1 was deprecated and is no longer served as of Kubernetes 1.22. Refer here.
Once that is figured out, can you please retry installing the Certified Intel Device Plugins Operator and make sure that the MutatingWebhookConfiguration is present?
[ckulkar1@jfz1r09h07 ~]$ oc get MutatingWebhookConfiguration | grep sgx
msgxdeviceplugin.kb.io-mwb87 1 3m19s
sgx.mutator.webhooks.intel.com-xd5xx 1 3m19s
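A note on the check above: a plain `grep admissionregistration` also matches `admissionregistration.k8s.io/v1beta1`, so it cannot by itself tell the two versions apart. A minimal sketch of an exact-match check follows; the function name `has_api_version` is made up for illustration and is not part of `oc`.

```shell
# Hypothetical helper (not an oc subcommand): succeed only if the exact
# group/version passed as $1 appears as a whole line on stdin.
# -F: treat the pattern as a fixed string (the dots are literal)
# -x: match whole lines only, so "v1" does not match "v1beta1"
# -q: print nothing, report the result via the exit status
has_api_version() {
  grep -Fqx -- "$1"
}

# Intended use on a live cluster (assumes `oc` is logged in):
#   oc api-versions | has_api_version admissionregistration.k8s.io/v1 \
#     && echo "v1 is served"
```

The whole-line match is the point of the sketch: it returns success for `v1` only when `v1` itself is listed.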
Hi @chaitanya1731, I ran the commands and got the same results as you mentioned
$ oc api-versions | grep admissionregistration
admissionregistration.k8s.io/v1
$ oc get apiservice | grep admission
v1.admissionregistration.k8s.io Local True 62d
$ oc get MutatingWebhookConfiguration | grep sgx
msgxdeviceplugin.kb.io-pktxg 1 4m11s
sgx.mutator.webhooks.intel.com-gxjqj 1 4m11s
I fixed the v1beta1 version issue by deploying the latest cert-manager version, v1.15.1, and the cert-manager pod is not failing now.
$ oc get pods -n cert-manager
NAME READY STATUS RESTARTS AGE
cert-manager-bd44d64d-lz4cd 1/1 Running 0 19m
cert-manager-cainjector-7dcddbd8b9-nw9wq 1/1 Running 0 19m
cert-manager-webhook-6cc8fdfd7-8m2tt 1/1 Running 0 19m
I still got an error while deploying the Intel SGX device plugin:
$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/v1.2.1/device_plugins/sgx_device_plugin.yaml
Error from server (InternalError): error when creating "https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/v1.2.1/device_plugins/sgx_device_plugin.yaml": Internal error occurred: failed calling webhook "msgxdeviceplugin.kb.io": failed to call webhook: Post "https://inteldeviceplugins-webhook-service.inteldeviceplugins-system.svc:443/mutate-deviceplugin-intel-com-v1-sgxdeviceplugin?timeout=10s": service "inteldeviceplugins-webhook-service" not found
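The URL in this error names the Service the API server tried to call, in the form `<service>.<namespace>.svc`. As a rough sketch, assuming the error text has the shape shown above, the following pulls out the namespace and Service name so they can be compared with where the operator is actually installed. The function name `webhook_target` is hypothetical.

```shell
# Hypothetical helper: read an admission error message on stdin and print
# "<namespace> <service>" taken from the in-cluster webhook URL
# (https://<service>.<namespace>.svc:443/...). Prints nothing if no such
# URL is found. Other URLs in the message (for example the manifest URL)
# are skipped because they do not end in ".svc".
webhook_target() {
  sed -n 's#.*https://\([^./]*\)\.\([^./]*\)\.svc[:/].*#\2 \1#p'
}

# Intended use: paste the error text, or pipe it from a saved file:
#   webhook_target < error.txt
```

For the error in this thread, the namespace in the URL is `inteldeviceplugins-system`, whereas the only Intel Service listed earlier lives in `openshift-operators`.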
Thanks. So the cert-manager and API issues are resolved.
Just to make sure the operator installation is correct, can you confirm that all the artifacts are installed in the correct namespaces? For example, the Intel Device Plugins Operator in the openshift-operators namespace, and the NFD operator and its nfd-discovery and nfd-rule YAMLs in openshift-nfd?
Hi, I can confirm that all the artifacts are installed in the correct namespaces. I am displaying the output below.
$ oc get operator
NAME AGE
intel-device-plugins-operator.openshift-operators 17h
intel-device-plugins-operator.openshiftoperators 17h
local-storage-operator.openshift-local-storage 63d
lvms-operator.openshift-storage 61d
nfd.openshift-nfd 2d1h
ptp-operator.openshift-ptp 63d
sriov-network-operator.openshift-sriov-network-operator 63d
$ oc get pods -n openshift-nfd
NAME READY STATUS RESTARTS AGE
nfd-controller-manager-dd9cfbfc-wm7xq 2/2 Running 2 (17h ago) 2d1h
nfd-master-85ccb585b-44r97 1/1 Running 0 2d1h
nfd-worker-njz7w 1/1 Running 0 2d1h
$ oc get pods -n openshift-operators
NAME READY STATUS RESTARTS AGE
inteldeviceplugins-controller-manager-6fdccbfb9f-lv2kn 2/2 Running 1 (17h ago) 18h
$ oc get pods -n cert-manager
NAME READY STATUS RESTARTS AGE
cert-manager-bd44d64d-lz4cd 1/1 Running 0 18h
cert-manager-cainjector-7dcddbd8b9-nw9wq 1/1 Running 0 18h
cert-manager-webhook-6cc8fdfd7-8m2tt 1/1 Running 0 18h
$ oc describe node <sgx_node_name> | grep sgx
feature.node.kubernetes.io/cpu-security.sgx.enabled=true
feature.node.kubernetes.io/cpu-sgx.enabled=true
intel.feature.node.kubernetes.io/sgx=true
Thanks, @veluruchaithanya. We attempted to replicate this issue on both 4.14.30 and 4.14.31 but were unsuccessful. The SGX plugin worked correctly according to the given instructions. It's worth noting that we do not use cert-manager on OpenShift. It appears that cert-manager may be causing misconfigurations on your OpenShift cluster, which could be preventing the Intel Device Plugins Operator from installing correctly. I've observed that cert-manager has its own operator for OpenShift, which is different from the one installed on your cluster. Our Intel device plugins controller manager logs align with those you've shared, indicating that the operator installation appears correct. However, due to potential misconfigurations, it isn't functioning as expected. Could you please consider re-provisioning the cluster and following the same process again (without cert-manager) to see if the issue gets resolved?
Hi @chaitanya1731, have you tried to replicate the issue on a Single Node OpenShift (SNO) cluster?
Yes. We tried on SNO.
Hi, I removed cert-manager and the intel-device-plugins-operator. After that, I checked for old webhooks under the "MutatingWebhookConfiguration" and "ValidatingWebhookConfiguration" API resources. They contained the webhooks below.
MutatingWebhookConfiguration
NAME WEBHOOKS AGE
inteldeviceplugins-mutating-webhook-configuration 9 16d
inteldeviceplugins-webhook 1 15d
mutating-webhook-configuration 9 16d
ValidatingWebhookConfiguration
NAME WEBHOOKS AGE
inteldeviceplugins-validating-webhook-configuration 7 16d
validating-webhook-configuration 7 16d
After removing the above webhooks, I tried installing the intel-device-plugins-operator and the Intel SGX device plugin again. The SGX plugin was installed successfully this time.
$ oc get pods
NAME READY STATUS RESTARTS AGE
intel-sgx-plugin-tzrqw 1/1 Running 0 7m13s
inteldeviceplugins-controller-manager-878cd6cc-f7kj8 2/2 Running 0 8m1s
$ oc get SgxDevicePlugin
NAME DESIRED READY NODE SELECTOR AGE
sgxdeviceplugin-sample 1 1 {"intel.feature.node.kubernetes.io/sgx":"true"} 7m18s
@chaitanya1731 @vbedida79 Thanks for help.
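For anyone hitting the same symptom, the fix above came down to finding webhook configurations that still point at a Service that no longer exists. A minimal sketch of that comparison follows. The function name `stale_webhooks` and the two input formats are assumptions made for illustration; the commented `oc` pipelines are one possible way to produce the inputs and assume every webhook in a configuration targets the same Service.

```shell
# Hypothetical helper: list webhook configurations whose backing Service
# is missing.
#   $1    : file with one "<namespace> <service>" pair per existing Service
#   stdin : lines of "<config-name> <namespace> <service>"
# Prints the config name for every stdin line whose pair is not in $1.
stale_webhooks() {
  awk 'FILENAME == ARGV[1] { have[$1 " " $2] = 1; next }
       !(($2 " " $3) in have) { print $1 }' "$1" -
}

# One possible way to build the inputs on a live cluster (an assumption,
# not taken from the thread; webhooks that use a URL instead of a Service
# produce empty fields and would be reported too):
#   oc get service --all-namespaces --no-headers \
#     | awk '{ print $1, $2 }' > services.txt
#   oc get mutatingwebhookconfigurations -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.webhooks[0].clientConfig.service.namespace}{" "}{.webhooks[0].clientConfig.service.name}{"\n"}{end}' \
#     | stale_webhooks services.txt
```

The helper only reports names. Deleting a configuration remains a manual decision, since a webhook may legitimately belong to another component.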
Looks like it was some misconfiguration due to cert-manager. Thanks for confirming and happy to help. Closing this now.
resolved
Could you please consider re-provisioning the cluster and following the same process again (without cert-manager) to see if the issue gets resolved?
The device plugins "upstream" docs refer to cert-manager usage to get webhooks' TLS certs provisioned. However, that is not applicable to OpenShift because OLM takes care of that.
It sounds like there is a potential problem with cert-manager and OLM coexisting when setting up the Device Plugins operator.
Hi, I am trying to enable SGX secure enclaves on a server platform. This server supports Intel SGX and contains a 4th Gen Intel Xeon Scalable Processor. The below BIOS configuration update was done to enable SGX.
The Red Hat Single Node OpenShift platform (SNO cluster version 4.14.23) was deployed on the server. After that, I installed the following versions of the dependent operators on the Red Hat SNO cluster:
NFD Operator Version: 4.14.0-202406180839
Intel Device Plugins Operator Version: 0.28.0
The dependent operators' versions were selected as per the release table.
While deploying the SGX plugin I got the below error
I checked whether any pods were failing in the existing namespaces and saw that the pod below is failing in the cert-manager namespace.
Can someone help with troubleshooting this issue?