intel / asynch_mode_nginx

QAT Engine failed: HEARTBEAT_POLL #72

Closed kkurzacz-intel closed 1 month ago

kkurzacz-intel commented 6 months ago

What is the problem

Async nginx with QAT configuration starts but is constantly logging the error:

[alert] 25453#0: QAT Engine failed: HEARTBEAT_POLL

System description

QAT configuration

[GENERAL]
ServicesEnabled = asym;dc

ConfigVersion = 2

#Default values for number of concurrent requests*/
CyNumConcurrentSymRequests = 512
CyNumConcurrentAsymRequests = 64

#Statistics, valid values: 1,0
statsGeneral = 1
statsDh = 1
statsDrbg = 1
statsDsa = 1
statsEcc = 1
statsKeyGen = 1
statsDc = 1
statsLn = 1
statsPrime = 1
statsRsa = 1
statsSym = 1

# Default heartbeat timer is 1s
HeartbeatTimer = 1000

# This flag is to enable SSF features
StorageEnabled = 0

# Disable public key crypto and prime number
# services by specifying a value of 1 (default is 0)
PkeServiceDisabled = 0

# This flag is to enable device auto reset on heartbeat error
AutoResetOnError = 0

# Default value for power management idle interrupt delay
PmIdleInterruptDelay = 512

# This flag is to enable power management idle support
PmIdleSupport = 1

# This flag is to enable key protection technology
KptEnabled = 1

# Define the maximum SWK count per function can have
# Default value is 1, the maximum value is 128
KptMaxSWKPerFn = 1

# Define the maximum SWK count per pasid can have
# Default value is 1, the maximum value is 128
KptMaxSWKPerPASID = 1

# Define the maximum SWK lifetime in second
# Default value is 0 (eternal of life)
# The maximum value is 31536000 (one year)
KptMaxSWKLifetime = 31536000

# Flag to define whether to allow SWK to be shared among processes
# Default value is 0 (shared mode is off)
KptSWKShared = 0

# Disable AT
ATEnabled = 0
##############################################
# Kernel Instances Section
##############################################
[KERNEL]
NumberCyInstances = 0
NumberDcInstances = 0

##############################################
# ADI Section for Scalable IOV
##############################################
[SIOV]
NumberAdis = 0

##############################################
# User Process Instance Section
##############################################
[SHIM]
NumberCyInstances = 1
NumberDcInstances = 1
NumProcesses = 32
LimitDevAccess = 1

# Crypto - User instance #0
Cy0Name = "UserCY0"
Cy0IsPolled = 1
# List of core affinities
Cy0CoreAffinity = 0

# Crypto - Data compression instance #0
Dc0Name = "UserDC0"
Dc0IsPolled = 1
# List of core affinities
Dc0CoreAffinity = 0

# Crypto - User instance #1
Cy1Name = "UserCY1"
Cy1IsPolled = 1
# List of core affinities
Cy1CoreAffinity = 1

# Crypto - User instance #2
Cy2Name = "UserCY2"
Cy2IsPolled = 1
# List of core affinities
Cy2CoreAffinity = 2

# Crypto - User instance #3
Cy3Name = "UserCY3"
Cy3IsPolled = 1
# List of core affinities
Cy3CoreAffinity = 3

Nginx configuration

worker_processes  224;
# TODO: possibly change workers to non-root
# This setting was made because otherwise `nobody` is worker owner
# and nginx cannot access html file due to lack of access to intermediate directory
# [~] Following line is to adjust settings from repository
user  root root;
worker_rlimit_nofile 32000;

load_module modules/ngx_http_qatzip_filter_module.so;
load_module modules/ngx_ssl_engine_qat_module.so;

events {
    use epoll;
    worker_connections 102400;
    accept_mutex off;
}

# Enable QAT engine in heuristic mode.
ssl_engine {
    use_engine qatengine;
    default_algorithms RSA,EC,DH,DSA;
    qat_engine {
        qat_offload_mode async;
        qat_notify_mode poll;
        qat_poll_mode heuristic;
        qat_sw_fallback on;
    }
}

http {
    gzip on;
    gzip_min_length     128;
    gzip_comp_level     1;
    gzip_types  text/css text/javascript text/xml text/plain text/x-component application/javascript application/json application/xml application/rss+xml font/truetype font/opentype application/vnd.ms-fontobject image/svg+xml;
    gzip_vary            on;
    gzip_disable        "msie6";
    gzip_http_version   1.0;

    qatzip_sw failover;
    qatzip_min_length 128;
    qatzip_comp_level 1;
    qatzip_buffers 16 8k;
    qatzip_types text/css text/javascript text/xml text/plain text/x-component application/javascript application/json application/xml application/rss+xml font/truetype font/opentype application/vnd.ms-fontobject image/svg+xml application/octet-stream image/jpeg;
    qatzip_chunk_size   64k;
    qatzip_stream_size  256k;
    qatzip_sw_threshold 256;

    # HTTP server with QATZip enabled.
    server {
        listen       80;
        server_name  localhost;
        location / {
            root   html;
            index  index.html index.htm;
        }
    }

    # HTTPS server with async mode.
    server {
        #If QAT Engine enabled,  `asynch` need to add to `listen` directive or just add `ssl_asynch  on;` to the context.
        listen       443 ssl asynch;
        server_name  localhost;

        ssl_protocols       TLSv1.2;
        ssl_certificate      cert.pem;
        ssl_certificate_key  cert.key;

        location / {
            root   html;
            index  index.html index.htm;
        }
    }
}

What is working

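OpenSSL partly works with QAT. The engine loads and openssl speed runs, although openssl engine reports an unresolved symbol (EVP_PKEY_base_id):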

$ openssl engine -t -c -v qatengine
(qatengine) Reference implementation of QAT crypto engine(qat_hw & qat_sw) v1.4.0
 [RSA, AES-128-CBC-HMAC-SHA256, AES-256-CBC-HMAC-SHA256, id-aes128-GCM, id-aes192-GCM, id-aes256-GCM, TLS1-PRF, X25519, X448, SM2]
     [ available ]
     ENABLE_EXTERNAL_POLLING, POLL, SET_INSTANCE_FOR_THREAD,
     GET_NUM_OP_RETRIES, SET_MAX_RETRY_COUNT, SET_INTERNAL_POLL_INTERVAL,
     GET_EXTERNAL_POLLING_FD, ENABLE_EVENT_DRIVEN_POLLING_MODE,
     GET_NUM_CRYPTO_INSTANCES, DISABLE_EVENT_DRIVEN_POLLING_MODE,
     SET_EPOLL_TIMEOUT, SET_CRYPTO_SMALL_PACKET_OFFLOAD_THRESHOLD,
     ENABLE_INLINE_POLLING, ENABLE_HEURISTIC_POLLING,
     GET_NUM_REQUESTS_IN_FLIGHT, INIT_ENGINE, SET_CONFIGURATION_SECTION_NAME,
     ENABLE_SW_FALLBACK, HEARTBEAT_POLL, DISABLE_QAT_OFFLOAD, HW_ALGO_BITMAP,
     SW_ALGO_BITMAP
803B6038F27F0000:error:1280006A:DSO support routines:dlfcn_bind_func:could not bind to the requested symbol name:../crypto/dso/dso_dlfcn.c:188:symname(EVP_PKEY_base_id): /usr/local/ssl/lib64/engines-3/qatengine.so: undefined symbol: EVP_PKEY_base_id
803B6038F27F0000:error:1280006A:DSO support routines:DSO_bind_func:could not bind to the requested symbol name:../crypto/dso/dso_lib.c:176:
$ openssl speed -engine qatengine -elapsed -async_jobs 72 rsa2048
Engine "qatengine" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 311671 2048 bits private RSA's in 10.01s
Doing 2048 bits public rsa's for 10s: 3722703 2048 bits public RSA's in 10.00s
version: 3.0.2
built on: Fri Oct 13 12:02:49 2023 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-8L8jlV/openssl-3.0.2=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_TLS_SECURITY_LEVEL=2 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
CPUINFO: OPENSSL_ia32cap=0x7ffef3ffffebffff:0xfb417ffef3bfb7ef
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.000032s 0.000003s  31136.0 372270.3

Additional information

I need to run export OPENSSL_ENGINES=/usr/local/ssl/lib64/engines-3, otherwise openssl can't find the engine.

kkurzacz-intel commented 6 months ago

I was told that software fallback with heartbeat is not supported in QAT2.0 driver v. 20.L.1.0.50-00003. So I should turn off qat_sw_fallback in the nginx.conf.

So I changed that entry:

        qat_sw_fallback off;

After that change the errors persist. nginx starts cleanly, but as soon as I send the first request I start getting the following error, and it continues even after I cancel the request:

QAT Engine failed: POLL

kkurzacz-intel commented 6 months ago

It looks like we have identified the cause of the issue: the number of workers was much greater than the number of available HW QAT instances. As far as I understand, when the pool of HW QAT instances is exhausted, the rest of the nginx workers should receive SW QAT ones. But for some reason that doesn't happen.

I enabled verbose QAT debug logs by adding --enable-qat_debug to ./configure of the QAT engine. QAT then logged everything to nginx's error.log file, and we were able to spot the lines that explain the POLL error:

[WARN][2332072.319774] PID [179324] Thread [7f0ae03ad740][e_qat.c:742:qat_engine_ctrl()] POLL failed as no instances are available
2024/01/04 16:17:42 [alert] 179324#0: QAT Engine failed: POLL

A temporary solution for now is to lower the number of workers (worker_processes in nginx.conf) to match the number of QAT HW instances. In my example, the SHIM section of the QAT driver conf (/etc/4xxx_dev0.conf) has the following lines:

[SHIM]
NumberCyInstances = 1
NumberDcInstances = 1
NumProcesses = 32
LimitDevAccess = 1

NumberCyInstances x NumProcesses = 1 x 32 = 32. On a 2-socket machine with 2 CPUs and QAT modules, we have 32 x 2 = 64. So 64 is the maximum number of workers for which there are enough HW QAT instances.
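As a minimal sketch of this workaround, the only nginx.conf change is the worker count. The value 64 is simply the figure derived above for this particular machine; a different SHIM section or device count would give a different limit.

```nginx
# Sketch only: cap workers at the number of HW QAT crypto instances.
# For the setup described in this issue:
#   NumberCyInstances (1) x NumProcesses (32) x 2 (the "32 x 2" above) = 64
# The original config used worker_processes 224, which exceeds that pool.
worker_processes 64;
```

The rest of the configuration stays as posted in the issue description.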

Yogaraj-Alamenda commented 5 months ago

@kkurzacz-intel The issue is in qatengine when it runs with external polling: it tries to poll an instance for heartbeat on behalf of a worker process that has no qat_hw instance and should only do qat_sw polling. We will fix it in the qatengine.

That being said, in addition to the workaround you mentioned, here are two other alternatives.

  1. For the external polling mode, set qat_sw_fallback to off or remove the qat_sw_fallback parameter in nginx.conf (it is off by default). This turns off heartbeat polling and disables fallback on device failure, but workers can still fall back to qat_sw when there is no instance.
  2. Set the polling mode to internal (qat_poll_mode internal;) in nginx.conf. The engine then handles polling itself by creating an internal polling thread for the available instances for all the workers. This way you don't have to reduce the number of workers.
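For illustration, the second alternative might look like the following ssl_engine block. This is a sketch based on the configuration posted in the issue description, with only the polling mode changed; it has not been verified against a particular QAT Engine or driver version.

```nginx
# Sketch of alternative 2: internal polling, all other directives
# taken unchanged from the original nginx.conf in this issue.
ssl_engine {
    use_engine qatengine;
    default_algorithms RSA,EC,DH,DSA;
    qat_engine {
        qat_offload_mode async;
        qat_notify_mode poll;
        # Was "heuristic"; the engine now runs its own polling thread.
        qat_poll_mode internal;
        # qat_sw_fallback is left out here (off by default, per the comment above).
        # Alternative 1 would instead keep "qat_poll_mode heuristic;"
        # and set "qat_sw_fallback off;".
    }
}
```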

Please let us know if that works

Yogaraj-Alamenda commented 1 month ago

The issue mentioned here is resolved by the commit below in QAT Engine, released in QAT Engine v1.6.0: https://github.com/intel/QAT_Engine/commit/3a1fca3138c96054721bebe19861b0cd6dc449af. Hence closing this.