intel / intel-technology-enabling-for-openshift

The project focuses on Intel’s enterprise AI and cloud native foundation for Red Hat OpenShift Container Platform (RHOCP) solution enablement and innovation including Intel data center hardware features, Intel technology enhanced AI platform and the referenced AI workloads provisioning for OpenShift.
https://intel.github.io/intel-technology-enabling-for-openshift/
Apache License 2.0

Installing Intel SGX plugin on Red Hat Single Node OpenShift #278

Closed veluruchaithanya closed 4 months ago

veluruchaithanya commented 4 months ago

Hi, I am trying to enable SGX secure enclaves on a server that supports Intel SGX and has a 4th Gen Intel Xeon Scalable processor. I made the following BIOS configuration changes to enable SGX:

Socket Configuration -> Processor Configuration -> Memory Encryption (TME) -> Enabled
Socket Configuration -> Processor Configuration -> Total Memory Encryption (TME) Bypass -> Disabled
Socket Configuration -> Processor Configuration -> Total Memory Encryption Multi-Tenant(TME-MT) -> Enabled
Socket Configuration -> Processor Configuration -> SW Guard Extensions (SGX) -> Enabled
Socket Configuration -> Processor Configuration -> PRM Size for SGX -> 1GB

Red Hat Single Node OpenShift (SNO, cluster version 4.14.23) was deployed on the server. I then installed the following versions of the dependent operators on the SNO cluster:

NFD Operator: 4.14.0-202406180839
Intel Device Plugins Operator: 0.28.0

The operator versions were selected according to the release table.

While deploying the SGX plugin, I got the following error:

$ oc get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
inteldeviceplugins-controller-manager-998555bf7-qgw64   2/2     Running   0          24h

$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/v1.2.1/device_plugins/sgx_device_plugin.yaml
Error from server (InternalError): error when creating "https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/v1.2.1/device_plugins/sgx_device_plugin.yaml": Internal error occurred: failed calling webhook "msgxdeviceplugin.kb.io": failed to call webhook: Post "https://inteldeviceplugins-webhook-service.inteldeviceplugins-system.svc:443/mutate-deviceplugin-intel-com-v1-sgxdeviceplugin?timeout=10s": service "inteldeviceplugins-webhook-service" not found
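
The error above names the Service that the API server tried to call. A minimal sketch for pulling that target out of the message, assuming it keeps the https://<service>.<namespace>.svc form shown here; `webhook_target` is a hypothetical helper name, not part of oc:

```shell
# webhook_target is a hypothetical helper, not an oc subcommand.
# Reads error text on stdin and prints "<namespace>/<service>" for the
# in-cluster webhook URL (https://<service>.<namespace>.svc...) it contains.
webhook_target() {
  sed -n 's#.*https://\([^./]*\)\.\([^./]*\)\.svc.*#\2/\1#p' | head -n 1
}

# Example use (manifest path, namespace and service are placeholders):
#   oc apply -f <manifest> 2>&1 | webhook_target
#   oc get service -n <namespace> <service>
```

For the message in this thread the helper prints inteldeviceplugins-system/inteldeviceplugins-webhook-service. The operator's own service is reported in the openshift-operators namespace further down, so a mismatch like this may indicate that the webhook configuration answering the request was not created by the currently installed operator.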

I checked whether any pods were failing in the existing namespaces and found that the following pod is failing in the cert-manager namespace.

$ oc get pods -n cert-manager
NAME                                       READY   STATUS             RESTARTS           AGE
cert-manager-5b9698ccf9-lfvns              1/1     Running            0                  11d
cert-manager-cainjector-69cb5b95dd-p9psx   0/1     CrashLoopBackOff   3081 (4m41s ago)   11d
cert-manager-webhook-7595fbd8bc-trh7v      1/1     Running            0                  11d

$ oc logs cert-manager-cainjector-69cb5b95dd-p9psx -n cert-manager | grep error
E0709 19:25:46.295360       1 start.go:138] cert-manager/ca-injector "msg"="manager goroutine exited" "error"=null  
E0709 19:25:50.694476       1 start.go:170] cert-manager/ca-injector "msg"="Error registering certificate based controllers. Retrying after 5 seconds." "error"="no matches for kind \"MutatingWebhookConfiguration\" in version \"admissionregistration.k8s.io/v1beta1\""  
Error: error registering secret controller: no matches for kind "MutatingWebhookConfiguration" in version "admissionregistration.k8s.io/v1beta1"
      --alsologtostderr                           log to standard error as well as files
      --logtostderr                               log to standard error instead of files (default true)
error registering secret controller: no matches for kind "MutatingWebhookConfiguration" in version "admissionregistration.k8s.io/v1beta1"

Can someone help with troubleshooting this issue?

vbedida79 commented 4 months ago

Thanks for filing the issue. Is cert-manager required on your OpenShift cluster? Would it be possible to delete the pod?

veluruchaithanya commented 4 months ago

Hi, I removed the cert-manager pods on OpenShift. I got the same error as before when installing the SGX device plugin.

vbedida79 commented 4 months ago

Got it. Could you check the output of this command?

oc api-resources | grep MutatingWebhookConfiguration

veluruchaithanya commented 4 months ago

This is the result

NAME                             SHORTNAMES               APIVERSION                                    NAMESPACED   KIND
mutatingwebhookconfigurations                            admissionregistration.k8s.io/v1               false        MutatingWebhookConfiguration

vbedida79 commented 4 months ago

Is cert-manager installed with an operator? If so, would it be possible to uninstall the operator? That might delete all the associated resources.

veluruchaithanya commented 4 months ago

I removed the cert-manager pods earlier with the command below, and it removed all the associated resources for the cert-manager operator.

$ oc delete deployment -n cert-manager -l app.kubernetes.io/instance=cert-manager

This procedure is described in the Red Hat documentation page "Uninstalling the cert-manager Operator for Red Hat OpenShift" (OpenShift Container Platform 4.14, Security and compliance).

veluruchaithanya commented 4 months ago

I found some CRDs and two services related to the cert-manager operator and removed them manually. I then tried to deploy the SGX device plugin and got the same error as before.

veluruchaithanya commented 4 months ago

The SGX device plugin error reports: service "inteldeviceplugins-webhook-service" not found

I checked the Intel-related services running in the cluster and found only one; the webhook service doesn't exist.

oc get service --all-namespaces | grep intel

NAMESPACE                                          NAME                                            TYPE           CLUSTER-IP       EXTERNAL-IP                            PORT(S)  
openshift-operators                                inteldeviceplugins-controller-manager-service   ClusterIP      xxx.xx.xx.x      <none>                                 443/TCP     

Is there a command to install inteldeviceplugins-webhook-service on OpenShift?

vbedida79 commented 4 months ago

Yes, we see the same output.

oc get service --all-namespaces | grep intel
openshift-operators                                inteldeviceplugins-controller-manager-service  

Usually we don't install any webhook service for the SGX device plugin on OpenShift. Could you please share the logs of the Intel Device Plugins controller manager pod? Is there any specific certificate setup in your cluster?

veluruchaithanya commented 4 months ago

oc logs inteldeviceplugins-controller-manager-74d569c794-bfmn7

Defaulted container "manager" out of: manager, kube-rbac-proxy
I0709 22:54:14.706455       1 webhook.go:158] "controller-runtime/builder: Registering a mutating webhook" GVK="deviceplugin.intel.com/v1, Kind=SgxDevicePlugin" path="/mutate-deviceplugin-intel-com-v1-sgxdeviceplugin"
I0709 22:54:14.706558       1 server.go:183] "controller-runtime/webhook: Registering webhook" path="/mutate-deviceplugin-intel-com-v1-sgxdeviceplugin"
I0709 22:54:14.706581       1 webhook.go:189] "controller-runtime/builder: Registering a validating webhook" GVK="deviceplugin.intel.com/v1, Kind=SgxDevicePlugin" path="/validate-deviceplugin-intel-com-v1-sgxdeviceplugin"
I0709 22:54:14.706603       1 server.go:183] "controller-runtime/webhook: Registering webhook" path="/validate-deviceplugin-intel-com-v1-sgxdeviceplugin"
I0709 22:54:14.706664       1 webhook.go:158] "controller-runtime/builder: Registering a mutating webhook" GVK="deviceplugin.intel.com/v1, Kind=GpuDevicePlugin" path="/mutate-deviceplugin-intel-com-v1-gpudeviceplugin"
I0709 22:54:14.706693       1 server.go:183] "controller-runtime/webhook: Registering webhook" path="/mutate-deviceplugin-intel-com-v1-gpudeviceplugin"
I0709 22:54:14.706712       1 webhook.go:189] "controller-runtime/builder: Registering a validating webhook" GVK="deviceplugin.intel.com/v1, Kind=GpuDevicePlugin" path="/validate-deviceplugin-intel-com-v1-gpudeviceplugin"
I0709 22:54:14.706735       1 server.go:183] "controller-runtime/webhook: Registering webhook" path="/validate-deviceplugin-intel-com-v1-gpudeviceplugin"
I0709 22:54:14.706789       1 webhook.go:158] "controller-runtime/builder: Registering a mutating webhook" GVK="deviceplugin.intel.com/v1, Kind=QatDevicePlugin" path="/mutate-deviceplugin-intel-com-v1-qatdeviceplugin"
I0709 22:54:14.706811       1 server.go:183] "controller-runtime/webhook: Registering webhook" path="/mutate-deviceplugin-intel-com-v1-qatdeviceplugin"
I0709 22:54:14.706830       1 webhook.go:189] "controller-runtime/builder: Registering a validating webhook" GVK="deviceplugin.intel.com/v1, Kind=QatDevicePlugin" path="/validate-deviceplugin-intel-com-v1-qatdeviceplugin"
I0709 22:54:14.706847       1 server.go:183] "controller-runtime/webhook: Registering webhook" path="/validate-deviceplugin-intel-com-v1-qatdeviceplugin"
I0709 22:54:14.706879       1 webhook.go:158] "controller-runtime/builder: Registering a mutating webhook" GVK="/v1, Kind=Pod" path="/mutate--v1-pod"
I0709 22:54:14.706900       1 server.go:183] "controller-runtime/webhook: Registering webhook" path="/mutate--v1-pod"
I0709 22:54:14.706909       1 webhook.go:204] "controller-runtime/builder: skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called" GVK="/v1, Kind=Pod"
I0709 22:54:14.706932       1 main.go:221] "setup: starting manager"
I0709 22:54:14.707091       1 server.go:185] "controller-runtime/metrics: Starting metrics server"
I0709 22:54:14.707102       1 server.go:50] "intel-device-plugins-manager: starting server" kind="health probe" addr="[::]:8081"
I0709 22:54:14.707168       1 server.go:191] "controller-runtime/webhook: Starting webhook server"
I0709 22:54:14.707180       1 server.go:224] "controller-runtime/metrics: Serving metrics server" bindAddress="127.0.0.1:8080" secure=false
I0709 22:54:14.707399       1 certwatcher.go:161] "controller-runtime/certwatcher: Updated current TLS certificate"
I0709 22:54:14.707496       1 server.go:242] "controller-runtime/webhook: Serving webhook server" host="" port=9443
I0709 22:54:14.707505       1 certwatcher.go:115] "controller-runtime/certwatcher: Starting certificate watcher"
I0709 22:54:15.308148       1 leaderelection.go:250] attempting to acquire leader lease openshift-operators/d1c7b6d5.intel.com...
I0709 22:54:32.789343       1 leaderelection.go:260] successfully acquired lease openshift-operators/d1c7b6d5.intel.com
I0709 22:54:32.789519       1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="sgxdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="SgxDevicePlugin" source="kind source: *v1.SgxDevicePlugin"
I0709 22:54:32.789516       1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" source="kind source: *v1.QatDevicePlugin"
I0709 22:54:32.789559       1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="gpudeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="GpuDevicePlugin" source="kind source: *v1.GpuDevicePlugin"
I0709 22:54:32.789546       1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="sgxdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="SgxDevicePlugin" source="kind source: *v1.DaemonSet"
I0709 22:54:32.789569       1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="gpudeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="GpuDevicePlugin" source="kind source: *v1.DaemonSet"
I0709 22:54:32.789576       1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="gpudeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="GpuDevicePlugin" source="kind source: *v1.ClusterRoleBinding"
I0709 22:54:32.789583       1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="gpudeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="GpuDevicePlugin" source="kind source: *v1.ServiceAccount"
I0709 22:54:32.789586       1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="sgxdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="SgxDevicePlugin" source="kind source: *v1.ClusterRoleBinding"
I0709 22:54:32.789589       1 controller.go:186] "intel-device-plugins-manager: Starting Controller" controller="gpudeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="GpuDevicePlugin"
I0709 22:54:32.789591       1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" source="kind source: *v1.DaemonSet"
I0709 22:54:32.789605       1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" source="kind source: *v1.ClusterRoleBinding"
I0709 22:54:32.789606       1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="sgxdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="SgxDevicePlugin" source="kind source: *v1.ServiceAccount"
I0709 22:54:32.789612       1 controller.go:178] "intel-device-plugins-manager: Starting EventSource" controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" source="kind source: *v1.ServiceAccount"
I0709 22:54:32.789613       1 controller.go:186] "intel-device-plugins-manager: Starting Controller" controller="sgxdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="SgxDevicePlugin"
I0709 22:54:32.789619       1 controller.go:186] "intel-device-plugins-manager: Starting Controller" controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin"
I0709 22:54:32.892909       1 controller.go:220] "intel-device-plugins-manager: Starting workers" controller="sgxdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="SgxDevicePlugin" worker count=1
I0709 22:54:32.892915       1 controller.go:220] "intel-device-plugins-manager: Starting workers" controller="qatdeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="QatDevicePlugin" worker count=1
I0709 22:54:32.892913       1 controller.go:220] "intel-device-plugins-manager: Starting workers" controller="gpudeviceplugin" controllerGroup="deviceplugin.intel.com" controllerKind="GpuDevicePlugin" worker count=1
I0709 22:57:29.103904       1 sgx.go:228] "admission: Mutated SGX Pod" webhookGroup="" webhookKind="Pod" Pod="openshift-marketplace/" namespace="openshift-marketplace" name="" resource={"group":"","version":"v1","resource":"pods"} user="system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount" requestID="67a28586-479d-45b7-bd0c-5959c0f69dec"
I0709 22:59:49.287569       1 sgx.go:228] "admission: Mutated SGX Pod" webhookGroup="" webhookKind="Pod" Pod="openshift-marketplace/" namespace="openshift-marketplace" name="" resource={"group":"","version":"v1","resource":"pods"} user="system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount" requestID="1731abb7-cc6c-4660-b750-aed0587b5066"
I0709 23:00:00.138475       1 sgx.go:228] "admission: Mutated SGX Pod" webhookGroup="" webhookKind="Pod" Pod="openshift-operator-lifecycle-manager/" namespace="openshift-operator-lifecycle-manager" name="" resource={"group":"","version":"v1","resource":"pods"} user="system:serviceaccount:kube-system:job-controller" requestID="a5af5e6f-7add-4859-8133-a869efc1d27b"
I0709 23:00:31.879466       1 sgx.go:228] "admission: Mutated SGX Pod" webhookGroup="" webhookKind="Pod" Pod="openshift-marketplace/" namespace="openshift-marketplace" name="" resource={"group":"","version":"v1","resource":"pods"} user="system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount" requestID="67d379e3-dece-4bec-aff1-01aba68ee035"
I0709 23:03:49.123162       1 sgx.go:228] "admission: Mutated SGX Pod" webhookGroup="" webhookKind="Pod" Pod="openshift-marketplace/" namespace="openshift-marketplace" name="" resource={"group":"","version":"v1","resource":"pods"} user="system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount" requestID="a6f961ad-f8c7-41b3-8cc9-bd89f63e9b56"
I0709 23:08:32.431932       1 sgx.go:228] "admission: Mutated SGX Pod" webhookGroup="" webhookKind="Pod" Pod="openshift-marketplace/" namespace="openshift-marketplace" name="" resource={"group":"","version":"v1","resource":"pods"} user="system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount" requestID="7194887f-68d0-4ec5-96f2-7c09253deb67"
I0709 23:09:51.863327       1 sgx.go:228] "admission: Mutated SGX Pod" webhookGroup="" webhookKind="Pod" Pod="openshift-marketplace/" namespace="openshift-marketplace" name="" resource={"group":"","version":"v1","resource":"pods"} user="system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount" requestID="4dad39fb-890f-4ca1-95ab-d23e50002738"

veluruchaithanya commented 4 months ago

I don't remember configuring any certificate setup in the cluster

chaitanya1731 commented 4 months ago

Hi @veluruchaithanya, to get more information for debugging, could you please check a few things?

Does your cluster have the API version admissionregistration.k8s.io/v1? Also make sure the corresponding APIService is present.

[ckulkar1@jfz1r09h07 ~]$ oc api-versions | grep admissionregistration
admissionregistration.k8s.io/v1
[ckulkar1@jfz1r09h07 ~]$ oc get apiservice | grep admission
v1.admissionregistration.k8s.io               Local                                                        True        3h48m
[ckulkar1@jfz1r09h07 ~]$

From the error you posted, it seems the cluster is trying to get MutatingWebhookConfiguration from "admissionregistration.k8s.io/v1beta1". Note that the served API version is v1, not v1beta1; v1beta1 was removed in Kubernetes 1.22. Refer here.
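
The API version a component is asking for can be read straight from its log. A minimal sketch, assuming the log uses the "no matches for kind ... in version ..." wording quoted earlier in this thread; `requested_api` is a hypothetical helper name, not part of oc or cert-manager:

```shell
# requested_api is a hypothetical helper, not an oc or cert-manager command.
# Reads log text on stdin and prints each distinct "<group/version> <Kind>"
# pair that a 'no matches for kind' error asked for.
# tr removes the backslashes that escape quotes inside structured log lines,
# so escaped and unescaped variants of the message parse the same way.
requested_api() {
  tr -d '\\' |
    sed -n 's/.*no matches for kind "\([^"]*\)" in version "\([^"]*\)".*/\2 \1/p' |
    sort -u
}

# Example use (pod name is a placeholder):
#   oc logs <cainjector-pod> -n cert-manager | requested_api
# Compare the printed version with what the cluster actually serves:
#   oc api-versions | grep admissionregistration
```

If the printed version is not in the `oc api-versions` output, the component making the request is probably older than the cluster's Kubernetes API and needs to be upgraded or removed.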

Once that is sorted out, could you please retry installing the certified Intel Device Plugins Operator and make sure that the MutatingWebhookConfiguration is present?

[ckulkar1@jfz1r09h07 ~]$ oc get MutatingWebhookConfiguration | grep sgx
msgxdeviceplugin.kb.io-mwb87            1          3m19s
sgx.mutator.webhooks.intel.com-xd5xx    1          3m19s

veluruchaithanya commented 4 months ago

Hi @chaitanya1731, I ran the commands and got the same results as you mentioned

$ oc api-versions | grep admissionregistration
admissionregistration.k8s.io/v1

$ oc get apiservice | grep admission
v1.admissionregistration.k8s.io          Local            True        62d

$ oc get MutatingWebhookConfiguration | grep sgx
msgxdeviceplugin.kb.io-pktxg                        1          4m11s
sgx.mutator.webhooks.intel.com-gxjqj                1          4m11s

I fixed the v1beta1 issue by deploying cert-manager v1.15.1 (the latest version), and the cert-manager pod is no longer failing.

$ oc get pods  -n cert-manager
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-bd44d64d-lz4cd                1/1     Running   0          19m
cert-manager-cainjector-7dcddbd8b9-nw9wq   1/1     Running   0          19m
cert-manager-webhook-6cc8fdfd7-8m2tt       1/1     Running   0          19m

I still got an error while deploying the Intel SGX device plugin:

$ oc apply -f https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/v1.2.1/device_plugins/sgx_device_plugin.yaml
Error from server (InternalError): error when creating "https://raw.githubusercontent.com/intel/intel-technology-enabling-for-openshift/v1.2.1/device_plugins/sgx_device_plugin.yaml": Internal error occurred: failed calling webhook "msgxdeviceplugin.kb.io": failed to call webhook: Post "https://inteldeviceplugins-webhook-service.inteldeviceplugins-system.svc:443/mutate-deviceplugin-intel-com-v1-sgxdeviceplugin?timeout=10s": service "inteldeviceplugins-webhook-service" not found

chaitanya1731 commented 4 months ago

Thanks. So the cert-manager and API issues are resolved. To make sure the operator installation is correct, can you confirm that all the artifacts are installed in the correct namespaces? For example, the Intel Device Plugins Operator should be in the openshift-operators namespace, and the NFD operator with its nfd-discovery and nfd-rule YAMLs should be in openshift-nfd.

veluruchaithanya commented 4 months ago

Hi, I can confirm that all the artifacts are installed in the correct namespaces. I am displaying the output below.

$ oc get operator
NAME                                                      AGE
intel-device-plugins-operator.openshift-operators         17h
intel-device-plugins-operator.openshiftoperators          17h
local-storage-operator.openshift-local-storage            63d
lvms-operator.openshift-storage                           61d
nfd.openshift-nfd                                         2d1h
ptp-operator.openshift-ptp                                63d
sriov-network-operator.openshift-sriov-network-operator   63d

$ oc get pods -n openshift-nfd
NAME                                    READY   STATUS    RESTARTS      AGE
nfd-controller-manager-dd9cfbfc-wm7xq   2/2     Running   2 (17h ago)   2d1h
nfd-master-85ccb585b-44r97              1/1     Running   0             2d1h
nfd-worker-njz7w                        1/1     Running   0             2d1h

$ oc get pods -n openshift-operators
NAME                                                     READY   STATUS    RESTARTS      AGE
inteldeviceplugins-controller-manager-6fdccbfb9f-lv2kn   2/2     Running   1 (17h ago)   18h

$ oc get pods -n cert-manager
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-bd44d64d-lz4cd                1/1     Running   0          18h
cert-manager-cainjector-7dcddbd8b9-nw9wq   1/1     Running   0          18h
cert-manager-webhook-6cc8fdfd7-8m2tt       1/1     Running   0          18h

$ oc describe node <sgx_node_name> | grep sgx
                    feature.node.kubernetes.io/cpu-security.sgx.enabled=true
                    feature.node.kubernetes.io/cpu-sgx.enabled=true
                    intel.feature.node.kubernetes.io/sgx=true

chaitanya1731 commented 4 months ago

Thanks, @veluruchaithanya. We attempted to replicate this issue on both 4.14.30 and 4.14.31 but were unsuccessful; the SGX plugin worked correctly when following the given instructions. Note that we do not use cert-manager on OpenShift. It appears that cert-manager may be misconfiguring your OpenShift cluster and interfering with the installation of the Intel Device Plugins Operator. I've observed that cert-manager has its own operator for OpenShift, which is different from the one installed on your cluster. Our Intel Device Plugins controller manager logs match those you've shared, which suggests that the operator installation itself is correct but is not functioning as expected because of those potential misconfigurations. Could you please consider re-provisioning the cluster and following the same process again (without cert-manager) to see if the issue gets resolved?

veluruchaithanya commented 4 months ago

Hi @chaitanya1731, have you tried to replicate the issue on a Single Node OpenShift (SNO) cluster?

chaitanya1731 commented 4 months ago

Yes. We tried on SNO.

veluruchaithanya commented 4 months ago

Hi, I removed cert-manager and intel-device-plugins-operator. After that, I checked for leftover webhook configurations under the MutatingWebhookConfiguration and ValidatingWebhookConfiguration API resources. The following were still present:

MutatingWebhookConfiguration

NAME                                                WEBHOOKS   AGE
inteldeviceplugins-mutating-webhook-configuration   9          16d
inteldeviceplugins-webhook                          1          15d
mutating-webhook-configuration                      9          16d

ValidatingWebhookConfiguration

NAME                                                  WEBHOOKS   AGE
inteldeviceplugins-validating-webhook-configuration   7          16d
validating-webhook-configuration                      7          16d

After removing the above webhook configurations, I reinstalled intel-device-plugins-operator and the Intel SGX device plugin. The SGX plugin was installed successfully this time.

$ oc get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
intel-sgx-plugin-tzrqw                                 1/1     Running   0          7m13s
inteldeviceplugins-controller-manager-878cd6cc-f7kj8   2/2     Running   0          8m1s

$ oc get SgxDevicePlugin
NAME                     DESIRED   READY   NODE SELECTOR                                     AGE
sgxdeviceplugin-sample   1         1       {"intel.feature.node.kubernetes.io/sgx":"true"}   7m18s
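
The cleanup described above amounts to finding webhook configurations whose backing Service no longer exists. A minimal sketch of that check, assuming jq is installed for the data-gathering step; `stale_webhooks` and the file name services.txt are hypothetical, not part of oc:

```shell
# stale_webhooks is a hypothetical helper, not an oc subcommand.
# stdin: one "<webhook-config-name> <namespace>/<service>" pair per line
# $1:    file listing existing Services as "<namespace>/<service>", one per line
# Prints the names of webhook configurations that reference a missing Service.
stale_webhooks() {
  awk 'FILENAME == ARGV[1] { exists[$1] = 1; next }
       !($2 in exists)     { print $1 }' "$1" - | sort -u
}

# One way to produce the two inputs (assumes jq is available):
#   oc get services --all-namespaces \
#     -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\n"}{end}' > services.txt
#   oc get mutatingwebhookconfigurations -o json |
#     jq -r '.items[] | .metadata.name as $n | .webhooks[]?
#            | select(.clientConfig.service != null)
#            | "\($n) \(.clientConfig.service.namespace)/\(.clientConfig.service.name)"' |
#     stale_webhooks services.txt
```

The same pipeline can be pointed at validatingwebhookconfigurations. The output is only a list of candidates; it is worth reviewing each name before removing anything with `oc delete mutatingwebhookconfiguration <name>`.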

@chaitanya1731 @vbedida79 Thanks for the help.

chaitanya1731 commented 4 months ago

Looks like it was a misconfiguration caused by cert-manager. Thanks for confirming; happy to help. Closing this now.

chaitanya1731 commented 4 months ago

resolved

mythi commented 4 months ago

Quoting chaitanya1731: "Could you please consider re-provisioning the cluster and following the same process again (without cert-manager) to see if the issues gets resolved?"

The "upstream" device plugins docs refer to using cert-manager to provision the webhooks' TLS certificates. However, that does not apply to OpenShift because OLM takes care of it.

It sounds like there is a potential issue with cert-manager and OLM coexisting when setting up the Device Plugins operator.