intel / asynch_mode_nginx

Async Mode Nginx with QAT support which improves Crypto and compression performance
https://developer.intel.com/quickassist

Errors in nginx under high load when using qatengine #62

Closed dmpetroff closed 1 year ago

dmpetroff commented 1 year ago

Hello!

I'm benchmarking asynch-nginx performance using the following software/hardware:

For SSL load I use tls-perf with 16 threads, 200 connections per thread:

./tls-perf.sh -T 10 -l 200 -t 16 --tls 1.2 -c AES256-GCM-SHA384 192.168.1.2 1443

It looks like when the device isn't fast enough to handle incoming requests, SSL connections get dropped with the following messages in the nginx error log:

2023/01/26 12:52:18 [crit] 121215#0: *184014 SSL_do_handshake() failed (SSL: error:800B5044:lib(128):qat_rsa_decrypt:internal error error:800B8044:lib(128):qat_rsa_priv_dec:internal error error:1419F093:SSL routines:tls_process_cke_rsa:decryption failed) while SSL handshaking, client: 192.168.1.1, server: 0.0.0.0:1443

And sometimes it even reports

2023/01/26 12:52:18 [crit] 121216#0: accept4() failed (24: Too many open files)

I'm using a slightly modified version of dh895xcc/multi_process_event-driven_optimized/dh895xcc_dev0.conf:

[GENERAL]
ServicesEnabled = cy
ServicesProfile = CRYPTO
ConfigVersion = 2
CyNumConcurrentSymRequests = 2048
CyNumConcurrentAsymRequests = 1024
InterruptCoalescingTimerNs = 500

statsGeneral = 1
statsDh = 1
statsDrbg = 1
statsDsa = 1
statsEcc = 1
statsKeyGen = 1
statsDc = 1
statsLn = 1
statsPrime = 1
statsRsa = 1
statsSym = 1
ProcDebug = 0
AutoResetOnError = 0

[KERNEL]
NumberCyInstances = 0
NumberDcInstances = 0

[SHIM]
NumberCyInstances = 1
NumberDcInstances = 0
NumProcesses = 32
LimitDevAccess = 1
Cy0Name = "UserCY0"
Cy0IsPolled = 2
Cy0CoreAffinity = 0-31

And the SSL engine in nginx is configured as:

ssl_engine {
        use_engine qatengine;
        default_algorithms ALL;
        qat_engine {
                qat_offload_mode async;
                qat_notify_mode poll;
                qat_poll_mode external;
                qat_external_poll_interval 4; # that yields best results for me
        }
}

As a result, I'm not able to get more than 20k handshakes per second with this config, even though the nginx workers are not fully utilizing the CPU. And if I try to add more client connections, I run into the "internal error" problem described above.

Can you suggest anything to improve performance or get rid of these errors?

ShuaiYuan21 commented 1 year ago

Hi @dmpetroff, could you please try this command to raise the file descriptor limit? It should solve the 'Too many open files' problem.

ulimit -n 102400

It is also worth trying ab for handshake testing, e.g.:

ab -n 1000 -c 100 -f TLS1.2 -Z AES128-GCM-SHA256 https://127.0.0.1:4433/index.html
This is ApacheBench, Version 2.3 <$Revision$>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests

Server Software:        nginx/1.22.1
Server Hostname:        127.0.0.1
Server Port:            4433
SSL/TLS Protocol:       TLSv1.2,AES128-GCM-SHA256,2048,128

Document Path:          /index.html
Document Length:        615 bytes

Concurrency Level:      100
Time taken for tests:   0.269 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      854784 bytes
HTML transferred:       619920 bytes
Requests per second:    3713.32 [#/sec] (mean)
Time per request:       26.930 [ms] (mean)
Time per request:       0.269 [ms] (mean, across all concurrent requests)
Transfer rate:          3099.69 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       12   19   2.9     19      28
Processing:     2    7   2.2      7      13
Waiting:        0    4   1.8      4      11
Total:         15   26   3.3     26      35

Percentage of the requests served within a certain time (ms)
  50%     26
  66%     27
  75%     28
  80%     29
  90%     30
  95%     31
  98%     33
  99%     33
 100%     35 (longest request)

dmpetroff commented 1 year ago

@ShuaiYuan21 I've specified 16384 file descriptors in the nginx config. There are 3200 (16 threads * 200 connections per thread) client connections to that nginx.

My question was more about the reason behind internal errors reported by qatengine via nginx error logs:

2023/01/26 12:52:18 [crit] 121215#0: *184014 SSL_do_handshake() failed
(SSL: error:800B5044:lib(128):qat_rsa_decrypt:internal error error:800B8044:lib(128):
qat_rsa_priv_dec:internal error error:1419F093:SSL routines:tls_process_cke_rsa:decryption failed)
while SSL handshaking, client: 192.168.1.1, server: 0.0.0.0:1443

I encounter such errors when all nginx workers are running under 100%-ish CPU usage (as reported by top)

dmpetroff commented 1 year ago

UPD: My mistake. I assumed that the worker_connections nginx parameter also sets RLIMIT_NOFILE, but there's a separate directive for that purpose. After raising the open-files limit, the errors are gone, so I assume they were caused by qatengine failing to create eventfds.
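
For anyone hitting the same symptoms, the distinction described above can be sketched as the following nginx fragment. `worker_rlimit_nofile` is the standard nginx directive that sets RLIMIT_NOFILE for worker processes, while `worker_connections` only caps the number of connections per worker. The values are illustrative assumptions that echo numbers mentioned earlier in this thread (16384 and 102400); they are not the reporter's actual final configuration.

```nginx
# Main (top-level) context: raises RLIMIT_NOFILE for every worker process.
# 102400 is an assumed value, borrowed from the ulimit suggestion in this thread.
worker_rlimit_nofile 102400;

events {
    # Caps simultaneous connections per worker.
    # This does NOT change the open-files limit of the process.
    worker_connections 16384;
}
```

It is probably wise to leave headroom above worker_connections, since in async mode qatengine may consume additional descriptors (such as the eventfds suspected above) besides the connection sockets. On Linux, the effective limit of a running worker can be checked by looking at the "Max open files" row in /proc/<worker pid>/limits.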