intel / QAT_Engine

Intel QuickAssist Technology (QAT) OpenSSL Engine (an OpenSSL plug-in engine), which provides cryptographic acceleration for both hardware and optimized software on Intel QuickAssist Technology enabled Intel platforms. https://developer.intel.com/quickassist
BSD 3-Clause "New" or "Revised" License

QAT + OPENSSL 1.1.1q performance #212

Open Kaustubh2455 opened 2 years ago

Kaustubh2455 commented 2 years ago

I have been trying to reproduce the numbers published by Intel on my setup: https://01.org/sites/default/files/downloads/intelr-quickassist-technology/337003-001-intelquickassisttechnologyandopenssl-110.pdf In that document, Table 2 ("Performance of RSA 2K with OpenSSL speed") reports around 100k sign/s and around 200k verify/s. My verify/s numbers are similar, but for sign/s I get only around 27k, and for ecdsap256 both sign/s and verify/s are lower than what the document states. Here is a document listing the steps I followed and the results; I have also added the software-only runs for both cases. I am using the C62x chipset with QuickAssist. https://docs.google.com/document/d/15w_fKrlEkGoIcu1cHmPa2VAWXhnhoyKA5Y2MK--OJuE/edit?usp=sharing

I need help figuring out why the performance is low.

Yogaraj-Alamenda commented 2 years ago

@Kaustubh2455 Can you share your driver config details? Also, does the system have 3 QAT devices in it (you can check with `lspci | grep Co-p`)? Judging by your performance numbers, it looks like you have only 1 QAT device.

Kaustubh2455 commented 2 years ago

@Yogaraj-Alamenda thanks for the reply. I am using the default driver configurations provided by QAT. I tried all 4 configurations present at QAT_Engine/qat/config/c6xx: multi_process_optimized, multi_thread_event-driven_optimized, multi_thread_optimized, and multi_process_event-driven_optimized. Do I need to make any changes to the configurations?

I do have three QAT devices, and I copy the configuration for all three of them.

    lspci | grep Co-p
    1a:00.0 Co-processor: Intel Corporation C62x Chipset QuickAssist Technology (rev 04)
    1b:00.0 Co-processor: Intel Corporation C62x Chipset QuickAssist Technology (rev 04)
    1c:00.0 Co-processor: Intel Corporation C62x Chipset QuickAssist Technology (rev 04)
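As a small aid for the device check discussed above, the following is a minimal sketch that extracts the PCI addresses of QAT co-processors from `lspci` output. The function name `qat_device_addresses` is hypothetical, and the matching rule (a line containing both "Co-processor" and "QuickAssist") is an assumption based only on the output pasted in this thread.

```python
def qat_device_addresses(lspci_output: str) -> list[str]:
    """Return the PCI address of every line that looks like a QAT co-processor.

    Assumption: such lines contain both "Co-processor" and "QuickAssist",
    as in the lspci output quoted in this thread.
    """
    addresses = []
    for line in lspci_output.splitlines():
        if "Co-processor" in line and "QuickAssist" in line:
            # The PCI address is the first whitespace-separated field.
            addresses.append(line.split()[0])
    return addresses


# Sample text copied from the lspci output in this thread.
sample = """\
1a:00.0 Co-processor: Intel Corporation C62x Chipset QuickAssist Technology (rev 04)
1b:00.0 Co-processor: Intel Corporation C62x Chipset QuickAssist Technology (rev 04)
1c:00.0 Co-processor: Intel Corporation C62x Chipset QuickAssist Technology (rev 04)
"""

print(qat_device_addresses(sample))
```

Seeing three addresses here only confirms that the devices are present on the PCI bus; it says nothing about whether all of them actually receive requests.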

Kaustubh2455 commented 2 years ago

Hi @Yogaraj-Alamenda, I looked into your comment about only having 1 QAT device. Looking at fw_counters, I found that only 1 of the devices receives requests; for the other two, the fw_counters are zero. I am not sure how to make them send/receive requests.

    # service qat_service status
    ● qat_service.service - LSB: modprobe the QAT modules, which loads dependant modules, before calling the user space utility to pass configuration parameters
       Loaded: loaded (/etc/init.d/qat_service; generated)
       Active: active (exited) since Mon 2022-09-19 03:16:47 PDT; 14min ago
         Docs: man:systemd-sysv-generator(8)
      Process: 27514 ExecStop=/etc/init.d/qat_service stop (code=exited, status=0/SUCCESS)
      Process: 27582 ExecStart=/etc/init.d/qat_service start (code=exited, status=0/SUCCESS)

    Sep 19 03:16:46 compute1.lilac.local qat_service[27582]: Restarting all devices.
    Sep 19 03:16:46 compute1.lilac.local qat_service[27582]: Processing /etc/c6xx_dev0.conf
    Sep 19 03:16:46 compute1.lilac.local qat_service[27582]: Processing /etc/c6xx_dev1.conf
    Sep 19 03:16:47 compute1.lilac.local qat_service[27582]: Processing /etc/c6xx_dev2.conf
    Sep 19 03:16:47 compute1.lilac.local qat_service[27582]: Checking status of all devices.
    Sep 19 03:16:47 compute1.lilac.local qat_service[27582]: There is 3 QAT acceleration device(s) in the system:
    Sep 19 03:16:47 compute1.lilac.local qat_service[27582]: qat_dev0 - type: c6xx, inst_id: 0, node_id: 0, bsf: 0000:1a:00.0, #accel: 5 #engines: 10 state: up
    Sep 19 03:16:47 compute1.lilac.local qat_service[27582]: qat_dev1 - type: c6xx, inst_id: 1, node_id: 0, bsf: 0000:1b:00.0, #accel: 5 #engines: 10 state: up
    Sep 19 03:16:47 compute1.lilac.local qat_service[27582]: qat_dev2 - type: c6xx, inst_id: 2, node_id: 0, bsf: 0000:1c:00.0, #accel: 5 #engines: 10 state: up
    Sep 19 03:16:47 compute1.lilac.local systemd[1]: Started LSB: modprobe the QAT modules, which loads dependant modules, before calling the user space utility to pass configuration parameters.
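The fw_counters check described above (one device busy, two at zero) can be automated. The sketch below sums the request counters in a device's fw_counters text and lists devices whose total is zero. The line layout in the sample strings is an assumption about what fw_counters prints, the counts are made up for illustration, and the function names are hypothetical; only the device names come from the service log above.

```python
import re


def total_firmware_requests(fw_counters_text: str) -> int:
    """Sum the number that follows the colon on every line mentioning "Requests".

    Assumption: request lines look like "| Firmware Requests [AE  0]:   1200 |".
    """
    total = 0
    for line in fw_counters_text.splitlines():
        if "Requests" in line:
            numbers = re.findall(r"\d+", line.split(":")[-1])
            if numbers:
                total += int(numbers[0])
    return total


def idle_devices(counters_by_device: dict[str, str]) -> list[str]:
    """Return the devices whose summed request counters are zero."""
    return [
        device
        for device, text in counters_by_device.items()
        if total_firmware_requests(text) == 0
    ]


# Illustrative text only: the numbers are invented, not taken from the thread.
busy = "| Firmware Requests [AE  0]:   1200 |\n| Firmware Responses[AE  0]:   1200 |\n"
idle = "| Firmware Requests [AE  0]:      0 |\n| Firmware Responses[AE  0]:      0 |\n"
samples = {"qat_dev0": busy, "qat_dev1": idle, "qat_dev2": idle}

print(idle_devices(samples))
```

On a real system the text for each device would be read from that device's fw_counters file before a benchmark run and again after it, so that only the delta is attributed to the run.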

Yogaraj-Alamenda commented 2 years ago

You will have to configure your driver config file to use all 3 devices, in line with the number of processes passed to the -multi option of the speed command.

For instance, the speed test command below, with -multi 3, can achieve the target of 100K op/s by utilizing all 3 devices:

    ./openssl speed -engine qatengine -async_jobs 36 -multi 3 rsa2048

    [SHIM]
    NumberCyInstances = 2
    NumberDcInstances = 0
    NumProcesses = 1
    LimitDevAccess = 1

Kaustubh2455 commented 2 years ago

Hi @Yogaraj-Alamenda Thanks for the response.

I tried the following SHIM config for all three devices:

    [SHIM]
    NumberCyInstances = 2
    NumberDcInstances = 0
    NumProcesses = 1
    LimitDevAccess = 1

This replaced the existing config, which was:

    [SHIM]
    NumberCyInstances = 1
    NumberDcInstances = 0
    NumProcesses = 16
    LimitDevAccess = 1

(With the existing config I tried using -multi 3, and still only one of the devices got all the requests.)

On changing the config and restarting the service, I see these errors in dmesg, which cause the hardware initialisation to fail:

    [ +0.107276] c6xx 0000:1a:00.0: Get core number failed with error -14
    [ +0.006441] c6xx 0000:1a:00.0: Failed to create rings for cy
    [ +0.005723] c6xx 0000:1a:00.0: Failed to process user section SHIM
    [ +0.006354] c6xx 0000:1a:00.0: Failed to config device
    [ +0.006505] c6xx 0000:1b:00.0: Get core number failed with error -14
    [ +0.006426] c6xx 0000:1b:00.0: Failed to create rings for cy
    [ +0.005705] c6xx 0000:1b:00.0: Failed to process user section SHIM
    [ +0.006257] c6xx 0000:1b:00.0: Failed to config device
    [ +0.005623] c6xx 0000:1c:00.0: Get core number failed with error -14
    [ +0.006401] c6xx 0000:1c:00.0: Failed to create rings for cy
    [ +0.005695] c6xx 0000:1c:00.0: Failed to process user section SHIM
    [ +0.006237] c6xx 0000:1c:00.0: Failed to config device

Do I need to add anything more to the config when I set NumberCyInstances = 2? I also tried changing NumberCyInstances to values other than 1 and got the same error as before.

Yogaraj-Alamenda commented 2 years ago

Sorry, I forgot to mention that you also need to add instance sections like the ones below to configure 2 instances:

    [SHIM]
    NumberCyInstances = 2
    NumberDcInstances = 0
    NumProcesses = 1
    LimitDevAccess = 1

    # Crypto - User instance #0
    Cy0Name = "UserCY0"
    Cy0IsPolled = 1
    # List of core affinities
    Cy0CoreAffinity = 0

    # Crypto - User instance #1
    Cy1Name = "UserCY1"
    Cy1IsPolled = 1
    # List of core affinities
    Cy1CoreAffinity = 1
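Because every crypto instance counted by NumberCyInstances needs its own CyN entries, writing the section by hand is error-prone (the dmesg failures earlier in the thread appeared when the count was raised without adding entries). The following is a minimal sketch that generates a [SHIM] section following the pattern shown above. The function name `shim_section` and the `first_core` parameter are hypothetical, and assigning consecutive cores simply mirrors the 0 and 1 affinities in the example; it is not a tuning recommendation.

```python
def shim_section(num_instances: int, num_processes: int = 1, first_core: int = 0) -> str:
    """Build a [SHIM] section with one CyN entry per crypto instance.

    The keys mirror the config quoted in this thread. `first_core` is a
    hypothetical convenience: instance i is pinned to core first_core + i.
    """
    lines = [
        "[SHIM]",
        f"NumberCyInstances = {num_instances}",
        "NumberDcInstances = 0",
        f"NumProcesses = {num_processes}",
        "LimitDevAccess = 1",
    ]
    for i in range(num_instances):
        lines += [
            "",
            f"# Crypto - User instance #{i}",
            f'Cy{i}Name = "UserCY{i}"',
            f"Cy{i}IsPolled = 1",
            "# List of core affinities",
            f"Cy{i}CoreAffinity = {first_core + i}",
        ]
    return "\n".join(lines) + "\n"


print(shim_section(2))
```

The generated text would replace only the [SHIM] section of each device's config file; the rest of the file is left untouched by this sketch.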

Kaustubh2455 commented 2 years ago

Hi @Yogaraj-Alamenda, thanks. With these changes I got all three QAT devices working and reached the 100k ops/sec number.

But the numbers that Intel published were obtained with a single core and without -multi. I am referring specifically to Table 2 in this doc: https://01.org/sites/default/files/downloads/intelr-quickassist-technology/337003-001-intelquickassisttechnologyandopenssl-110.pdf

Is there anything else that can be done to get the numbers published with a single core?

Yogaraj-Alamenda commented 2 years ago

AFAIK, we need at least 3 cores to reach the 100K cps. Let me check internally to see whether that was the case in the whitepaper as well.

Kaustubh2455 commented 2 years ago

Hi @Yogaraj-Alamenda, I have been trying to get the perf numbers for QAT with Nginx. What I observe is that with 1c2t my numbers are similar to what has been published, but as soon as I increase the number of cores and threads the performance seems to drop. These are the numbers that I see with RSA:

| qat | CPU % | Connections/sec |
| -- | -- | -- |
| 1c2t | 94 | 6041.2 |
| 2c4t | 58 | 3391.49 |
| 4c8t | 61 | 4338.73 |
| 8c16t | 64 | 7074.14 |
| 10c20t | 60 | 7808.76 |

I am using Nginx version 1.20.1 with the async mode nginx patch (https://github.com/intel/asynch_mode_nginx) and OpenSSL 1.1.1q. This is my Nginx config:

    load_module modules/ngx_ssl_engine_qat_module.so;
    worker_processes 20;
    worker_cpu_affinity 111111111100000000001111111111;
    events {
        worker_connections 65535;
        accept_mutex off;
        multi_accept off;
        use epoll;
    }
    ssl_engine {
        use_engine qatengine;
        default_algorithms RSA,EC,DH,DSA;
        qat_engine {
            qat_offload_mode async;
            qat_notify_mode poll;
            qat_poll_mode heuristic;
            qat_sw_fallback on;
        }
    }
    http {
        include mime.types;
        default_type application/octet-stream;
        sendfile on;
        keepalive_timeout 65;
        server {
            listen 80;
            server_name localhost;
            location / {
                root html;
                index index.html index.htm;
            }
            error_page 500 502 503 504 /50x.html;
            location = /50x.html {
                root html;
            }
        }
        server {
            listen 443 ssl reuseport asynch;
            server_name localhost;
            ssl_protocols TLSv1.2;
            ssl_certificate /home/images/ats-data/cert.crt;
            ssl_certificate_key /home/images/ats-data/cert.key;
            ssl_asynch on;
            ssl_session_timeout 5m;
            ssl_ciphers HIGH:!aNULL:!MD5;
            ssl_prefer_server_ciphers on;
            location / {
                root html;
                index index.html index.htm;
            }
        }

The QAT driver config is the same one that was used to get the OpenSSL speed numbers. I can see requests going to all three devices, and the CPU is not maxing out, so I am not sure what the problem may be here. Are there any other changes required in the QAT driver config, or do I need to tune something else differently for it to work? Thanks in advance.
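To put a number on the drop reported in the Nginx results above, the following sketch computes connections/sec per core for each row and compares it with the 1c2t row. The figures are copied from the table in this comment; the function name `scaling_report` and the choice of the first row as the baseline are assumptions made for illustration.

```python
# (label, physical cores, connections/sec) copied from the table in this comment.
results = [
    ("1c2t", 1, 6041.2),
    ("2c4t", 2, 3391.49),
    ("4c8t", 4, 4338.73),
    ("8c16t", 8, 7074.14),
    ("10c20t", 10, 7808.76),
]


def scaling_report(rows):
    """Return (label, per-core rate, efficiency relative to the first row)."""
    _, base_cores, base_cps = rows[0]
    base_per_core = base_cps / base_cores
    report = []
    for label, cores, cps in rows:
        per_core = cps / cores
        # Efficiency 1.0 means the per-core rate matches the baseline row.
        report.append((label, per_core, per_core / base_per_core))
    return report


for label, per_core, efficiency in scaling_report(results):
    print(f"{label:>7}: {per_core:8.1f} conn/s per core, {efficiency:6.1%} of the 1c2t rate")
```

With these figures, the 10c20t row works out to about 13% of the 1c2t per-core rate, which together with the 58-64% CPU utilization suggests the workers are waiting on something rather than running out of CPU. The sketch only quantifies the symptom; it does not identify the cause.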