Failed to open Intel SGX device

intel / trusted-certificate-issuer

Trusted Certificate Service (TCS) is a K8s service to protect signing keys using Intel's SGX technology. K8s CSR and cert-manager CR APIs are both supported. TCS also contains integration samples for Istio service mesh and Key Management Reference Application (KMRA).

Apache License 2.0

Failed to open Intel SGX device #38

Closed astronaut0131 closed 2 years ago

astronaut0131 commented 2 years ago

I'm trying to deploy tcs-issuer in k8s cluster, but got the following error:

$ kubectl logs tcs-controller-79c499fb98-v5kv8
2022-07-28T14:13:05.734Z        INFO    controller-runtime.metrics      metrics server is starting to listen    {"addr": ":8082"}
[get_driver_type edmm_utility.cpp:111] Failed to open Intel SGX device.
[get_driver_type /home/sgx/jenkins/ubuntuServer2004-release-build-trunk-215/build_target/PROD/label/Builder-UbuntuSrv20/label_exp/ubuntu64/linux-trunk-opensource/psw/urts/linux/edmm_utility.cpp:111] Failed to open Intel SGX device.
2022-07-28T14:13:05.929Z        LEVEL(-2)       SGX     Failed to configure command
2022-07-28T14:13:05.929Z        ERROR   setup   SGX initialization      {"error": "failed to initialize PKCS#11 library: pkcs11: 0x30: CKR_DEVICE_ERROR", "errorVerbose": "pkcs11: 0x30: CKR_DEVICE_ERROR\nfailed to initialize PKCS#11 library"}

I think all the prerequisites are working correctly.

$ kubectl describe node zhenhui-control-plane | grep sgx.intel
                    sgx.intel.com/capable=true
                    nfd.node.kubernetes.io/extended-resources: sgx.intel.com/epc
  sgx.intel.com/enclave:    110
  sgx.intel.com/epc:        4261412864
  sgx.intel.com/provision:  110
  sgx.intel.com/enclave:    110
  sgx.intel.com/epc:        4261412864
  sgx.intel.com/provision:  110
  sgx.intel.com/enclave    1           1
  sgx.intel.com/epc        512Ki       512Ki
  sgx.intel.com/provision  0           0

$ sudo service aesmd status
● aesmd.service - Intel(R) Architectural Enclave Service Manager
     Loaded: loaded (/lib/systemd/system/aesmd.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2022-07-28 17:50:07 CST; 4h 54min ago
   Main PID: 2580841 (aesm_service)
      Tasks: 4 (limit: 304204)
     Memory: 2.3M
     CGroup: /system.slice/aesmd.service
             └─2580841 /opt/intel/sgx-aesm-service/aesm/aesm_service

Jul 28 17:50:07 i10 systemd[1]: Starting Intel(R) Architectural Enclave Service Manager...
Jul 28 17:50:07 i10 usermod[2580804]: add 'aesmd' to group 'sgx_prv'
Jul 28 17:50:07 i10 usermod[2580804]: add 'aesmd' to shadow group 'sgx_prv'
Jul 28 17:50:07 i10 aesm_service[2580834]: aesm_service: warning: Turn to daemon. Use "--no-daemon" option to execute i>
Jul 28 17:50:07 i10 systemd[1]: Started Intel(R) Architectural Enclave Service Manager.
Jul 28 17:50:07 i10 aesm_service[2580841]: The server sock is 0x5652f0dfe400

poussa commented 2 years ago

@astronaut0131 thanks for your report.

Can you list the pods you have running, especially the Intel k8s device plugin and NFD (plus their versions)?

Are you using the in-tree SGX driver? Which k8s version and which host OS?

Can you provide the output of ls -l /dev/sgx* on the host?

astronaut0131 commented 2 years ago

All related plugins are at the latest version. I followed the instructions at https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/cmd/sgx_plugin/README.md#deploying-as-a-daemonset

sgx plugin status

$ kubectl get SgxDevicePlugin
NAME                     DESIRED   READY   NODE SELECTOR                                     AGE
sgxdeviceplugin-sample   1         1       {"intel.feature.node.kubernetes.io/sgx":"true"}   3m31s

node info

$ kubectl describe node zhenhui-control-plane | grep sgx.intel.com
                    sgx.intel.com/capable=true
                    nfd.node.kubernetes.io/extended-resources: sgx.intel.com/epc
  sgx.intel.com/enclave:    110
  sgx.intel.com/epc:        4261412864
  sgx.intel.com/provision:  110
  sgx.intel.com/enclave:    110
  sgx.intel.com/epc:        4261412864
  sgx.intel.com/provision:  110
  sgx.intel.com/enclave    1           1
  sgx.intel.com/epc        512Ki       512Ki
  sgx.intel.com/provision  0           0

host os

$ uname -r
5.13.0-41-generic

$ ls -l /dev/sgx*
crw-rw-rw- 1 root root    10, 125 Jul 28 17:50 /dev/sgx_enclave
crw-rw---- 1 root sgx_prv 10, 126 Jul 28 17:50 /dev/sgx_provision

/dev/sgx:
total 0
lrwxrwxrwx 1 root root 14 Jul 28 17:50 enclave -> ../sgx_enclave
lrwxrwxrwx 1 root root 16 Jul 28 17:50 provision -> ../sgx_provision

k8s version

$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3", GitCommit:"aef86a93758dc3cb2c658dd9657ab4ad4afc21cb", GitTreeState:"clean", BuildDate:"2022-07-13T14:29:09Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}

BTW, I can successfully boot an enclave directly on the host OS, so I think the hardware device is working correctly and the problem only occurs in a container. I am using Kind to build the cluster, and I suspect the problem has something to do with Kind.
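To compare what the host and a container actually see, a short script can report which SGX device node paths are present. This is only a minimal sketch, not part of tcs-issuer: the function name `find_sgx_nodes` and its `root` parameter are hypothetical, and the path list is simply the four paths from the `ls -l /dev/sgx*` listing above.

```python
import os

# Device paths copied from the `ls -l /dev/sgx*` listing in this thread.
SGX_PATHS = [
    "dev/sgx_enclave",
    "dev/sgx_provision",
    "dev/sgx/enclave",
    "dev/sgx/provision",
]


def find_sgx_nodes(root="/"):
    """Map each known SGX path to True if it exists under root.

    os.path.lexists is used so that a symlink is reported as present
    even when its target is missing.
    """
    return {"/" + p: os.path.lexists(os.path.join(root, p)) for p in SGX_PATHS}


if __name__ == "__main__":
    for path, present in find_sgx_nodes().items():
        print(f"{path}: {'present' if present else 'missing'}")
```

Running it once on the host and once in a container that requests the SGX resources would show which of the paths differ between the two environments.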

poussa commented 2 years ago

@astronaut0131 did you get your issue resolved, or not?

astronaut0131 commented 2 years ago

@poussa Not yet. I've tried a real k8s cluster instead of Kind, but the same error still occurs.

astronaut0131 commented 2 years ago

@poussa @avalluri I finally found that the problem is related to the SDK version pinned in the tcs-issuer Dockerfile:

# git diff Dockerfile
diff --git a/Dockerfile b/Dockerfile
index 77d681f..0c8c43e 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -140,8 +140,6 @@ RUN mkdir -p /usr/local/share/package-licenses \
 ###
 FROM ubuntu:focal as runtime

-ARG SDK_VERSION="2.15.100.3"
-ARG DCAP_VERSION="1.12.100.3"

 RUN apt-get update \
   && apt-get install -y wget gnupg \
@@ -152,16 +150,16 @@ RUN apt-get update \
   && apt-get remove -y wget gnupg && apt-get autoremove -y \
   && bash -c 'set -o pipefail; apt-get install --no-install-recommends -y \
     libprotobuf17 \
-    libsgx-enclave-common=${SDK_VERSION}-focal1 \
-    libsgx-epid=${SDK_VERSION}-focal1 \
-    libsgx-quote-ex=${SDK_VERSION}-focal1 \
-    libsgx-urts=${SDK_VERSION}-focal1 \
-    libsgx-ae-epid=${SDK_VERSION}-focal1 \
-    libsgx-ae-qe3=${DCAP_VERSION}-focal1 \
-    libsgx-dcap-ql=${DCAP_VERSION}-focal1 \
-    libsgx-pce-logic=${DCAP_VERSION}-focal1 \
-    libsgx-qe3-logic=${DCAP_VERSION}-focal1 \
-    libsgx-dcap-default-qpl=${DCAP_VERSION}-focal1 \
+    libsgx-enclave-common \
+    libsgx-epid \
+    libsgx-quote-ex \
+    libsgx-urts \
+    libsgx-ae-epid \
+    libsgx-ae-qe3 \
+    libsgx-dcap-ql \
+    libsgx-pce-logic \

I changed it like this and the error is gone. It looks like the originally pinned SDK version has a problem opening /dev/sgx_enclave. Would you consider changing the version here?

avalluri commented 2 years ago

@astronaut0131 Good to hear that you figured out the issue. I guess you are using the v1.24 intel-device-plugins, which dropped support for creating the /dev/sgx_* device links that are used by the <=v2.15 SGX SDK.

Dependency upgrades are planned and will be part of the next release.
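To find pins like the ones removed in the diff above, the following sketch resolves `ARG` defaults in Dockerfile text and lists every `libsgx-*` package that is pinned to a version. The function names are hypothetical, and the 2.15 threshold used in the usage note comes only from the comment above, not from any official compatibility table.

```python
import re


def pinned_sgx_versions(dockerfile_text):
    """Map each pinned libsgx-* package to its resolved version string."""
    # Collect ARG defaults such as: ARG SDK_VERSION="2.15.100.3"
    args = dict(
        re.findall(r'^[ \t]*ARG[ \t]+(\w+)="?([^"\s]+)"?', dockerfile_text, re.M)
    )
    pins = {}
    for pkg, ver in re.findall(r"(libsgx-[\w-]+)=(\S+)", dockerfile_text):
        # Replace ${NAME} with the ARG default; leave unknown names untouched.
        ver = re.sub(
            r"\$\{(\w+)\}", lambda m: args.get(m.group(1), m.group(0)), ver
        )
        pins[pkg] = ver
    return pins


def is_at_most(version, major, minor):
    """True if a version such as '2.15.100.3-focal1' is <= major.minor."""
    m = re.match(r"(\d+)\.(\d+)", version)
    if m is None:
        return False  # unresolved or non-numeric version string
    return (int(m.group(1)), int(m.group(2))) <= (major, minor)
```

For example, `is_at_most(pins["libsgx-urts"], 2, 15)` would flag the pin from the original Dockerfile. The DCAP packages follow a separate version series, so the same threshold should not be applied to them.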

avalluri commented 2 years ago

@astronaut0131 This PR updates to the latest SDK and is supposed to fix your issue. If possible, can you give it a try?

astronaut0131 commented 2 years ago

@avalluri Sorry for the late reply; I was out of the office last week. The latest version gives the following error:

$ kubectl logs tcs-controller-6b64fcd89-fk76q -n tcs-issuer
Defaulted container "tcs-issuer" out of: tcs-issuer, init (init)
flag provided but not defined: -use-random-nonce
Usage of /tcs-issuer:
  -cert-manager-issuer
        Run it as issuer for cert-manager. (default true)
  -csr-full-cert-chain
        Return full certificate chain in Kubernetes CSR certificate.
  -health-probe-bind-address string
        The address the probe endpoint binds to. (default ":8081")
  -key-wrap-mechanism string
        CA private key wrapping mechanism to use. One of: 'aes_gcm' or 'ads_key_pad_wrap'  (default "aes_key_wrap_pad")
  -kubeconfig string
        Paths to a kubeconfig. Only required if out-of-cluster.
  -leader-elect
        Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager.
  -log-flush-frequency duration
        Maximum number of seconds between log flushes (default 5s)
  -metrics-bind-address string
        The address the metric endpoint binds to. (default ":8080")
  -so-pin string
        PKCS11 token so/admin pin.
  -token-label string
        PKCS11 label to use for the operator token. (default "SgxOperator")
  -user-pin string
        PKCS11 token user pin.
  -zap-devel
        Development Mode defaults(encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn). Production Mode defaults(encoder=jsonEncoder,logLevel=Info,stackTraceLevel=Error)
  -zap-encoder value
        Zap log encoding (one of 'json' or 'console')
  -zap-log-level value
        Zap Level to configure the verbosity of logging. Can be one of 'debug', 'info', 'error', or any integer value > 0 which corresponds to custom debug levels of increasing verbosity
  -zap-stacktrace-level value
        Zap Level at and above which stacktraces are captured (one of 'info', 'error', 'panic').
  -zap-time-encoding value
        Zap time encoding (one of 'epoch', 'millis', 'nano', 'iso8601', 'rfc3339' or 'rfc3339nano'). Defaults to 'epoch'.

It looks like this error is related to https://github.com/intel/trusted-certificate-issuer/commit/570560f3cd655e57dee1716731bc3c67ebad688f. Maybe you forgot to update the YAML configuration accordingly?

avalluri commented 2 years ago

@astronaut0131 Thanks for trying this out.

The commit you mentioned removed the use-random-nonce argument, which was not intentional, hence this error. I have now fixed this in #62. Can you please try either cherry-picking the commit(s) or removing the argument from your deployment?
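This kind of mismatch can be caught before deploying by comparing the container arguments against the flags listed in the usage text. A minimal sketch, assuming the usage format shown in the log above (each flag on its own indented line, starting with a dash); the function names and the sample argument list are hypothetical.

```python
import re


def defined_flags(usage_text):
    """Collect flag names from usage lines such as '  -so-pin string'."""
    return set(re.findall(r"^[ \t]+-([\w-]+)", usage_text, re.M))


def undefined_args(args, usage_text):
    """Return the flag names in args that the usage text does not list."""
    known = defined_flags(usage_text)
    unknown = []
    for arg in args:
        if not arg.startswith("-"):
            continue  # skip positional values
        # Go's flag package treats -name and --name the same way.
        name = arg.lstrip("-").split("=", 1)[0]
        if name not in known:
            unknown.append(name)
    return unknown
```

Given the usage text from the log and an argument list that still contains `--use-random-nonce`, `undefined_args` returns that name, which matches the "flag provided but not defined" message.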

astronaut0131 commented 2 years ago

I've tried it and it works well. I also found some minor problems, for which I opened a pull request: https://github.com/intel/trusted-certificate-issuer/pull/63.