intel / QAT_Engine

Intel QuickAssist Technology (QAT) OpenSSL Engine (an OpenSSL Plug-In Engine) which provides cryptographic acceleration for both hardware and optimized software using Intel QuickAssist Technology enabled Intel platforms. https://developer.intel.com/quickassist
BSD 3-Clause "New" or "Revised" License

Nginx worker CPU 100%, QAT_Engine code deadloop #211

Open fuchencong opened 2 years ago

fuchencong commented 2 years ago

Environment Information

  1. Nginx version
nginx version: nginx/1.17.3
built by gcc 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
built with OpenSSL 1.1.1c  28 May 2019
TLS SNI support enabled
  2. OpenSSL version: 1.1.1c
  3. QAT_ENGINE_VERSION: 0.5.42
  4. QAT_DRIVER_VERSION: 1.7.l.4.6.0-00025

Issue

Nginx worker CPU usage 100%

 pidstat -u -p 7137 1
Linux 4.15.0-140-generic (chaitin-safeline)     08/15/22    _x86_64_    (8 CPU)

11:08:43      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:08:44        0      7137   24.00   75.00    0.00    0.00   99.00     2  nginx
11:08:45        0      7137   23.00   78.00    0.00    0.00  100.00     2  nginx
11:08:46        0      7137   27.00   73.00    0.00    0.00  100.00     2  nginx
11:08:47        0      7137   27.00   73.00    0.00    0.00  100.00     2  nginx
11:08:48        0      7137   29.00   71.00    0.00    0.00  100.00     2  nginx
11:08:49        0      7137   24.00   76.00    0.00    0.00  100.00     2  nginx
11:08:50        0      7137   25.00   75.00    0.00    0.00  100.00     2  nginx
11:08:51        0      7137   19.00   81.00    0.00    0.00  100.00     2  nginx
11:08:52        0      7137   27.00   73.00    0.00    0.00  100.00     2  nginx
11:08:53        0      7137   25.00   75.00    0.00    0.00  100.00     2  nginx
11:08:54        0      7137   31.00   69.00    0.00    0.00  100.00     2  nginx
11:08:55        0      7137   22.00   77.00    0.00    0.00   99.00     1  nginx
11:08:56        0      7137   30.00   71.00    0.00    0.00  100.00     1  nginx
11:08:57        0      7137   26.00   74.00    0.00    0.00  100.00     1  nginx
11:08:58        0      7137   26.00   74.00    0.00    0.00  100.00     1  nginx

What I found

After attaching gdb to the process, I got the following stack:

(gdb) bt
#0  0x00007f1e3a0e0ef7 in sched_yield () from target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f1e39db48dc in qat_prf_tls_derive (ctx=0x7f1e1802fe70, key=0x7f1e18cc5560 "", olen=0x7ffc1e6961e0) at qat_prf.c:802
#2  0x00007f1e3ad7913b in tls1_setup_key_block () from target:/opt/ssl/lib/libssl.so.1.1
#3  0x00007f1e3ad4228b in ssl3_do_change_cipher_spec () from target:/opt/ssl/lib/libssl.so.1.1
#4  0x00007f1e3ad6eea9 in tls_process_change_cipher_spec () from target:/opt/ssl/lib/libssl.so.1.1
#5  0x00007f1e3ad6b655 in ossl_statem_client_process_message () from target:/opt/ssl/lib/libssl.so.1.1
#6  0x00007f1e3ad6519b in state_machine () from target:/opt/ssl/lib/libssl.so.1.1
#7  0x00007f1e3ad4f424 in SSL_do_handshake () from target:/opt/ssl/lib/libssl.so.1.1
#8  0x000056133748cc18 in ngx_ssl_handshake ()
#9  0x000056133748d0e8 in ?? ()
#10 0x000056133748786d in ?? ()
#11 0x000056133747bfb8 in ngx_process_events_and_timers ()
#12 0x0000561337485e08 in ?? ()
#13 0x0000561337481d72 in ngx_spawn_process ()
#14 0x0000561337486c3b in ngx_master_process_cycle ()
#15 0x0000561337459d66 in main ()
(gdb) info locals
ret = 0
job_ret = 0
prf_op_data = {tlsOp = CPA_CY_KEY_TLS_OP_KEY_MATERIAL_DERIVE, secret = {dataLenInBytes = 48,
    pData = 0x7f1e3726c440 <error: Cannot access memory at address 0x7f1e3726c440>}, seed = {dataLenInBytes = 64,
    pData = 0x7f1e3726c900 <error: Cannot access memory at address 0x7f1e3726c900>}, generatedKeyLenInBytes = 88, userLabel = {dataLenInBytes = 0,
    pData = 0x0}}
generated_key = 0x7f1e3726c540
status = <optimized out>
qat_prf_ctx = 0x7f1e2619ac80
hash_algo = CPA_CY_SYM_HASH_SHA384
key_length = 88
op_done = {flag = 0, verifyResult = CPA_FALSE, job = 0x0, status = -1}
qatPerformOpRetries = <optimized out>
iMsgRetry = 5
ulPollInterval = 10000
inst_num = <optimized out>
tlv = 0x7f1e247deed8
fallback = 0
sw_derive_fn_ptr = 0x0

(gdb) info args
ctx = 0x7f1e1802fe70
key = 0x7f1e18cc5560 ""
olen = 0x7ffc1e6961e0

I checked the code of the function qat_prf_tls_derive. In the gdb output above, op_done.job is NULL and op_done.flag is 0, so the loop below keeps calling pthread_yield() while waiting for op_done.flag to be set; if the completion callback never fires, the loop never exits:

    do {
        if(op_done.job != NULL) {
            /* If we get a failure on qat_pause_job then we will
               not flag an error here and quit because we have
               an asynchronous request in flight.
               We don't want to start cleaning up data
               structures that are still being used. If
               qat_pause_job fails we will just yield and
               loop around and try again until the request
               completes and we can continue. */
            if ((job_ret = qat_pause_job(op_done.job, ASYNC_STATUS_OK)) == 0)
                pthread_yield();
        } else {
            pthread_yield();
        }
    }
    while (!op_done.flag ||
           QAT_CHK_JOB_RESUMED_UNEXPECTEDLY(job_ret));
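
If op_done.flag is never set (for example, because the hardware instance never started and the completion callback therefore never runs), the synchronous branch of this loop has no exit path. Purely as an illustration of how the spin could be bounded, a defensive variant of the same wait pattern might look like the sketch below; the MAX_DERIVE_SPIN_RETRIES constant and the fallback handling are my own assumptions and are not taken from any QAT_Engine release.

/* Illustrative sketch only -- not code from any QAT_Engine release.
 * Assumption: a hypothetical retry cap (MAX_DERIVE_SPIN_RETRIES) around the
 * same wait pattern, so the synchronous path cannot spin forever when
 * op_done.flag is never set. */
#define MAX_DERIVE_SPIN_RETRIES 1000000

    int spins = 0;
    do {
        if (op_done.job != NULL) {
            /* Asynchronous request in flight: pause the job and let the
               completion callback resume it. */
            if ((job_ret = qat_pause_job(op_done.job, ASYNC_STATUS_OK)) == 0)
                pthread_yield();
        } else {
            /* Synchronous path: only the completion callback can set
               op_done.flag, so cap the number of yields. */
            pthread_yield();
            if (++spins > MAX_DERIVE_SPIN_RETRIES) {
                fallback = 1;   /* hypothetical: fall back to the software PRF */
                break;
            }
        }
    }
    while (!op_done.flag ||
           QAT_CHK_JOB_RESUMED_UNEXPECTEDLY(job_ret));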

From dmesg, nginx previously hit a memory allocation failure in the usdm_drv module (an order:9 request, i.e. 512 contiguous pages, about 2 MiB of physically contiguous memory). I cannot tell whether this is related to the infinite loop issue.

[Thu Aug 11 16:53:41 2022] nginx: page allocation failure: order:9, mode:0x14040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
[Thu Aug 11 16:53:41 2022] nginx cpuset=ad2dd6b01a97813632775013d27d7fba19aaa8567bb7787a3a4abe2499726d5e mems_allowed=0
[Thu Aug 11 16:53:41 2022] CPU: 4 PID: 4784 Comm: nginx Tainted: G        W  OE    4.15.0-140-generic #144-Ubuntu
[Thu Aug 11 16:53:41 2022] Hardware name: EVOC NET-1825VDN/NET-1825VDN, BIOS 5.12 05/14/2021
[Thu Aug 11 16:53:41 2022] Call Trace:
[Thu Aug 11 16:53:41 2022]  dump_stack+0x6d/0x8b
[Thu Aug 11 16:53:41 2022]  warn_alloc+0xff/0x1a0
[Thu Aug 11 16:53:41 2022]  ? __alloc_pages_direct_compact+0x51/0x100
[Thu Aug 11 16:53:41 2022]  __alloc_pages_slowpath+0xdc5/0xe00
[Thu Aug 11 16:53:41 2022]  __alloc_pages_nodemask+0x29a/0x2c0
[Thu Aug 11 16:53:41 2022]  kmalloc_large_node+0x3b/0x60
[Thu Aug 11 16:53:41 2022]  __kmalloc_node+0x21e/0x2b0
[Thu Aug 11 16:53:41 2022]  mem_ioctl+0x585/0xab0 [usdm_drv]
[Thu Aug 11 16:53:41 2022]  do_vfs_ioctl+0xa8/0x630
[Thu Aug 11 16:53:41 2022]  ? putname+0x4c/0x60
[Thu Aug 11 16:53:41 2022]  ? do_sys_open+0x13d/0x2c0
[Thu Aug 11 16:53:41 2022]  SyS_ioctl+0x79/0x90
[Thu Aug 11 16:53:41 2022]  do_syscall_64+0x73/0x130
[Thu Aug 11 16:53:41 2022]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
[Thu Aug 11 16:53:41 2022] RIP: 0033:0x7f1e3a0f3317
[Thu Aug 11 16:53:41 2022] RSP: 002b:00007ffc1e695bb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Thu Aug 11 16:53:41 2022] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f1e3a0f3317
[Thu Aug 11 16:53:41 2022] RDX: 00007ffc1e695be0 RSI: 00000000c0507100 RDI: 0000000000000007
[Thu Aug 11 16:53:41 2022] RBP: 00007ffc1e695be0 R08: 0000000000000000 R09: 0000000000000040
[Thu Aug 11 16:53:41 2022] R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
[Thu Aug 11 16:53:41 2022] R13: 0000000000000007 R14: 0000000000000009 R15: 0000000000000000
[Thu Aug 11 16:53:41 2022] usdm_drv: userMemAlloc:395 Unable to allocate memory slab or wrong alignment:           (null)
[Thu Aug 11 16:53:41 2022] usdm_drv: dev_mem_alloc:600 userMemAlloc failed

Nginx error log:

ioctl_alloc_slab:951 ioctl call for mem allocation failed, ret = -1
[error] SalCtrl_ServiceInit() - : Failed to initialise all service instances
[error] SalCtrl_ServiceEventStart() - : Private data is NULL
[error] SalCtrl_AdfServicesStartedCheck() - : Sal Ctrl failed to start in given time

[error] do_userStart() - : Failed to start services

[error] SalCtrl_AdfServicesStartedCheck() - : Sal Ctrl failed to start in given time

[error] do_userStart() - : Failed to start services

[error] SalCtrl_AdfServicesStartedCheck() - : Sal Ctrl failed to start in given time

[error] do_userStart() - : Failed to start services

[error] SalCtrl_AdfServicesStartedCheck() - : Sal Ctrl failed to start in given time

[error] do_userStart() - : Failed to start services
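
These errors suggest the QAT service instances failed to start after the memory allocation failure, which would leave any hardware request uncompleted. As a startup sanity check, something like the following could confirm whether the engine actually initializes (a minimal sketch using the standard OpenSSL ENGINE API; the engine id "qat" is an assumption for 0.5.x builds and may be "qatengine" on newer releases, and the engine must be discoverable via openssl.cnf or the engines directory):

/* Minimal sketch: check whether the QAT engine initializes at all.
 * Assumptions: engine id "qat" (0.5.x builds; newer releases use
 * "qatengine"); engine discoverable via openssl.cnf or the engines dir.
 * Build: gcc check_qat.c -lcrypto */
#include <stdio.h>
#include <openssl/engine.h>
#include <openssl/err.h>

int main(void)
{
    ENGINE_load_builtin_engines();

    ENGINE *e = ENGINE_by_id("qat");   /* assumed engine id */
    if (e == NULL) {
        fprintf(stderr, "QAT engine not found\n");
        return 1;
    }

    /* ENGINE_init() is where a failed hardware/service startup
       (e.g. the SalCtrl_ServiceInit errors above) would surface. */
    if (!ENGINE_init(e)) {
        fprintf(stderr, "QAT engine failed to initialize:\n");
        ERR_print_errors_fp(stderr);
        ENGINE_free(e);
        return 1;
    }

    printf("QAT engine initialized successfully\n");
    ENGINE_finish(e);
    ENGINE_free(e);
    return 0;
}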

Need help

  1. What conditions trigger the deadloop issue?
  2. Has the issue been fixed, and if so, in which version?
fuchencong commented 2 years ago

Can someone comment on this issue? It occurs in our production environment.

Yogaraj-Alamenda commented 2 years ago

Can you please see if you can reproduce this with the latest version of everything? Some similar issues were fixed in later versions of the engine and nginx (not sure about the exact version):

QAT Engine: v0.6.15, QAT driver: QAT.L.4.18.1-00001, async mode Nginx: v0.4.7

fuchencong commented 2 years ago

@Yogaraj-Alamenda Thank you for your reply. Can you provide more information about the bug? Under what conditions is the issue triggered? That would let me reproduce it and confirm that the latest version fixes it.

fuchencong commented 2 years ago

@Yogaraj-Alamenda BTW, we do not use async mode Nginx; we use Tengine, which also provides an asynchronous OpenSSL mode. You can refer to http://tengine.taobao.org/document/tengine_qat_ssl.html

akf00000 commented 10 months ago

I have the same problem. I found that it shows up after reloading nginx.