intel / intel-device-plugins-for-kubernetes

Collection of Intel device plugins for Kubernetes
Apache License 2.0

[QAT] No devices found in container without privilege #1700

Open Kewei-Lu opened 5 months ago

Kewei-Lu commented 5 months ago

**Describe the bug**
Some processes report "No devices found" when running the `openssl speed` command in a container that is not privileged.

**To Reproduce**

  1. Build the docker image based on demo/openssl-qat-engine/Dockerfile

  2. Deploy qat device-plugin and check resources are available

    $ kubectl describe node node1
    Allocatable:
    cpu:                128
    ephemeral-storage:  67612704657
    hugepages-1Gi:      0
    hugepages-2Mi:      0
    memory:             263621052Ki
    pods:               110
    qat.intel.com/cy:   128
  3. Deploy openssl-qat-engine with below manifest

    kind: Pod
    apiVersion: v1
    metadata:
      name: openssl-qat-engine
    spec:
      containers:
      - name: openssl-qat-engine
        image: [My local registry]/qat-engine:latest
        imagePullPolicy: Always
        command: [ "/bin/bash", "-c", "--" ]
        args: [ "while true; do sleep 300000; done;" ]
        resources:
          requests:
            qat.intel.com/cy: '16'
          limits:
            qat.intel.com/cy: '16'
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_BIND_SERVICE
            - IPC_LOCK
          readOnlyRootFilesystem: false
  4. Login to the container and verify using openssl

    
    $ kubectl exec -it openssl-qat-engine bash
    $ openssl engine -c -t -v qatengine
    (qatengine) Reference implementation of QAT crypto engine(qat_hw & qat_sw) v1.4.0
    [RSA, AES-128-CBC-HMAC-SHA256, AES-256-CBC-HMAC-SHA256, ChaCha20-Poly1305, id-aes128-GCM, id-aes192-GCM, id-aes256-GCM, SHA3-256, SHA3-384, SHA3-512, TLS1-PRF, X25519, X448, SM2]
     [ available ]
     ENABLE_EXTERNAL_POLLING, POLL, SET_INSTANCE_FOR_THREAD,
     GET_NUM_OP_RETRIES, SET_MAX_RETRY_COUNT, SET_INTERNAL_POLL_INTERVAL,
     GET_EXTERNAL_POLLING_FD, ENABLE_EVENT_DRIVEN_POLLING_MODE,
     GET_NUM_CRYPTO_INSTANCES, DISABLE_EVENT_DRIVEN_POLLING_MODE,
     SET_EPOLL_TIMEOUT, SET_CRYPTO_SMALL_PACKET_OFFLOAD_THRESHOLD,
     ENABLE_INLINE_POLLING, ENABLE_HEURISTIC_POLLING,
     GET_NUM_REQUESTS_IN_FLIGHT, INIT_ENGINE, SET_CONFIGURATION_SECTION_NAME,
     ENABLE_SW_FALLBACK, HEARTBEAT_POLL, DISABLE_QAT_OFFLOAD, HW_ALGO_BITMAP,
     SW_ALGO_BITMAP
    80FBC9006E7F0000:error:1280006A:DSO support routines:dlfcn_bind_func:could not bind to the requested symbol name:../crypto/dso/dso_dlfcn.c:188:symname(EVP_PKEY_base_id): /usr/lib/x86_64-linux-gnu/engines-3/qatengine.so: undefined symbol: EVP_PKEY_base_id
    80FBC9006E7F0000:error:1280006A:DSO support routines:DSO_bind_func:could not bind to the requested symbol name:../crypto/dso/dso_lib.c:176:

Running a single process works fine:

    $ openssl speed -engine qatengine -elapsed -async_jobs 8 rsa2048
    Engine "qatengine" set.
    You have chosen to measure elapsed time instead of user CPU time.
    Doing 2048 bits private rsa's for 10s: 199083 2048 bits private RSA's in 10.00s
    Doing 2048 bits public rsa's for 10s: 959389 2048 bits public RSA's in 10.00s
    version: 3.0.2
    built on: Fri Feb 16 08:51:30 2024 UTC
    options: bn(64,64)
    compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-olCZw9/openssl-3.0.2=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_TLS_SECURITY_LEVEL=2 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
    CPUINFO: OPENSSL_ia32cap=0x7ffef3ffffebffff:0xfb417ffef3bfb7ef
                      sign    verify    sign/s verify/s
    rsa 2048 bits 0.000050s 0.000010s  19908.3  95938.9

When running multiple processes, the errors appear:

    $ openssl speed -engine qatengine -elapsed -async_jobs 8 -multi 8 rsa2048
    Forked child 0
    Forked child 1
    Forked child 2
    Forked child 3
    Forked child 4
    Forked child 5
    Forked child 6
    Forked child 7
    No devices found
    No devices found
    No devices found
    No device found
    No device found
    No device found
    Engine "qatengine" set.
    Engine "qatengine" set.
    Engine "qatengine" set.
    +DTP:2048:private:rsa:10
    +DTP:2048:private:rsa:10
    +DTP:2048:private:rsa:10
    No devices found
    No device found
    ...
    Got: +F2:2:2048:6680.100000:83645.800000 from 0
    Got: +F2:2:2048:6630.500000:83514.600000 from 1
    Don't understand line 'ADF_UIO_PROXY err: icp_adf_userProcessToStart: Failed to start SHIM' from child 2
    Got: +F2:2:2048:11644.800000:144419.200000 from 2
    Got: +F2:2:2048:6952.300000:84143.700000 from 3
    Don't understand line 'ADF_UIO_PROXY err: icp_adf_userProcessToStart: Failed to start SHIM' from child 4
    Got: +F2:2:2048:11627.200000:142830.069930 from 4
    Don't understand line 'ADF_UIO_PROXY err: icp_adf_userProcessToStart: Failed to start SHIM' from child 5
    Got: +F2:2:2048:11668.800000:143320.000000 from 5
    Don't understand line 'ADF_UIO_PROXY err: icp_adf_userProcessToStart: Failed to start SHIM' from child 6
    Got: +F2:2:2048:11376.800000:127321.378621 from 6
    Don't understand line 'ADF_UIO_PROXY err: icp_adf_userProcessToStart: Failed to start SHIM' from child 7
    Got: +F2:2:2048:11647.200000:142272.027972 from 7

The result also seems to be boosted somehow, but I am not sure whether that comes from qat_sw or qat_hw:

    compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-olCZw9/openssl-3.0.2=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_TLS_SECURITY_LEVEL=2 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
    CPUINFO: OPENSSL_ia32cap=0x7ffef3ffffebffff:0xfb417ffef3bfb7ef
                      sign    verify    sign/s verify/s
    rsa 2048 bits 0.000013s 0.000001s  78227.7 951466.8



**Expected behavior**
All 8 processes should be able to use qatengine, since 16 instances are passed through to the container.

**Screenshots**
![image](https://github.com/intel/intel-device-plugins-for-kubernetes/assets/102018874/362ad80b-1f86-46a7-a6a6-582884d11dce)
![image](https://github.com/intel/intel-device-plugins-for-kubernetes/assets/102018874/3478a463-2a79-4ced-bda9-76ddd83e1fb2)

**System (please complete the following information):**
 - OS version: CentOS Stream release 8
 - Kernel version: 6.8.1-1.el8.elrepo.x86_64
 - Device plugins version: v0.29.0
 - Hardware info: CPU: 6454S + in-tree driver

**Additional context**

As you can see from the screenshot, only some of the processes fail to acquire a QAT handle, which puzzles me.

What makes the problem trickier is that if I add `privileged: true` to the pod manifest, everything works fine (i.e., I can run 16 processes with `openssl speed` without any error output), but I don't think that is acceptable in a production environment.
![image](https://github.com/intel/intel-device-plugins-for-kubernetes/assets/102018874/af2ceac6-572c-4548-91d4-73d8d3f28c5f)
![image](https://github.com/intel/intel-device-plugins-for-kubernetes/assets/102018874/6f8b49a4-d4e6-40b3-b6f3-7950cf13d301)

mythi commented 5 months ago

    $ openssl speed -engine qatengine -elapsed -async_jobs 8 rsa2048

Can you try with 4 jobs? It could be that the qatlib allocation limitation triggers the problem you're seeing. Try setting the `QAT_POLICY=1` environment variable. If that helps, we'll need to update our docs a bit.

Ref: https://github.com/intel/qatlib/blob/ec817626e7de237b24cfb91b7cad076902df603a/INSTALL#L519-L522

Kewei-Lu commented 5 months ago

Nice catch! The error no longer appears after adding that environment variable to the container. Really appreciate it :)
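
For anyone landing here with the same symptom, the change can be sketched as an `env` entry on the container from step 3. This is a minimal fragment rather than a complete manifest: the elided fields are assumed to stay exactly as posted in the report, and the linked qatlib INSTALL section remains the reference for what `QAT_POLICY` values mean.

```yaml
# Sketch only: container entry from the step 3 manifest with QAT_POLICY added.
spec:
  containers:
  - name: openssl-qat-engine
    # image, command, args, resources and securityContext as in step 3
    env:
    - name: QAT_POLICY
      value: "1"   # env values must be strings in a pod spec, hence the quotes
```

After recreating the pod, rerunning the `-multi 8` command from the report is a quick way to check whether the "No devices found" lines are gone.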