intel / QAT_Engine

Intel QuickAssist Technology (QAT) OpenSSL Engine (an OpenSSL plug-in engine) that provides cryptographic acceleration, in both hardware and optimized software, on Intel QuickAssist Technology enabled Intel platforms. https://developer.intel.com/quickassist
BSD 3-Clause "New" or "Revised" License

multi-thread performance #138

Open jazune opened 4 years ago

jazune commented 4 years ago

I have two 8960 QAT cards on hand. With a multi-process program I can reach 20w qps of RSA_Sign 2048, but with multiple threads I can only reach 9w qps and no more. I use async job mode and take no locks in the application layer, yet perf shows raw_spin_lock as the bottleneck. Did you test multi-threaded mode? Is there a known bottleneck in multi-thread mode?

My QAT config is:

    [SHIM]
    NumberCyInstances = 12
    NumberDcInstances = 0
    NumProcesses = 4
    LimitDevAccess = 0

    # Crypto - User instance #0
    Cy0Name = "UserCY0"
    Cy0IsPolled = 1
    Cy0CoreAffinity = 0
    ...
    # Crypto - User instance #11

Yogaraj-Alamenda commented 4 years ago

Hi @jazune, could you please let us know which versions of the driver, Engine, and OpenSSL you are using, and which memory driver you are using, USDM or qat_contig_mem? Also, could you please share the profile showing the bottleneck in raw_spin_lock? The locks might be coming from the OpenSSL stack.

Also, what application are you using to measure the individual algorithm performance? I could not understand "20w qps of RSA_Sign 2048".

Thanks, Yoga

jazune commented 4 years ago

Hi @Yogaraj-Alamenda, thank you for your reply. My driver version is 4.4.0, using USDM, and my qat_engine version is v0.5.38. I have two 8960 cards. The qat_service status is:

    There are 6 QAT acceleration device(s) in the system:
     qat_dev0 - type: c6xx, inst_id: 0, node_id: 0, bsf: 0000:1a:00.0, #accel: 5 #engines: 10 state: up
     qat_dev1 - type: c6xx, inst_id: 1, node_id: 0, bsf: 0000:1b:00.0, #accel: 5 #engines: 10 state: up
     qat_dev2 - type: c6xx, inst_id: 2, node_id: 0, bsf: 0000:1c:00.0, #accel: 5 #engines: 10 state: up
     qat_dev3 - type: c6xx, inst_id: 3, node_id: 0, bsf: 0000:60:00.0, #accel: 5 #engines: 10 state: up
     qat_dev4 - type: c6xx, inst_id: 4, node_id: 0, bsf: 0000:61:00.0, #accel: 5 #engines: 10 state: up
     qat_dev5 - type: c6xx, inst_id: 5, node_id: 0, bsf: 0000:62:00.0, #accel: 5 #engines: 10 state: up

I use openssl-1.1.0h with async jobs to compute 2048-bit RSA_sign and RSA_private_decrypt. I run in multi-thread mode: the master thread sets up the engine, and the worker threads use async jobs to compute RSA.

Related code:

    ENGINE* engine;
    engine = ENGINE_by_id("qat");
    ...
    if (engine == NULL) {
        engine = ENGINE_by_id("dynamic");
        .....
    }
    if (!ENGINE_init(engine)) {
        LOG(ERROR) << "ks: call ENGINE_init() failed." << std::endl;
        goto fail;
    }
    if (ENGINE_set_default(engine, ENGINE_METHOD_ALL) == 0) {
        LOG(WARNING) << "ks: call ENGINE_set_default() failed." << std::endl;
        goto fail;
    }
    .....

    if (async_handletype == kAsyncDecrypt) {
        ret = ASYNC_start_job(&asyncjob, waitctx, &func_ret, rsa_privatedecrypt,
                              (void *)&args, sizeof(AsyncArgs));
    } else {
        ret = ASYNC_start_job(&asyncjob, waitctx, &func_ret, rsasign,
                              (void *)&args, sizeof(AsyncArgs));
    }

I tried setting an instance for each worker, but performance did not improve. I still get only 90000 qps for RSA_sign, while two 8960 cards should reach 200000 qps. The bottleneck is raw_spin_lock, but its source is unknown. The perf file is attached.

perf.tar.gz

jazune commented 4 years ago

Hi @Yogaraj-Alamenda, I compiled libqat.so with the -fno-omit-frame-pointer option, and perf now shows the function names. It seems the lock is taken in qat_rsa_priv_enc.

perf

jazune commented 4 years ago

Hi @Yogaraj-Alamenda,

From perf, I am now sure that the lock (the bottleneck) is in the QAT memory pool manager. Three files may be used by the QAT memory manager: multi_thread_qaememutils.c, qae_mem_utils.c, and cmn_mem_drv_inf.c.

I tried removing the --enable-usdm option and adding the --enable-multi_thread option so that multi_thread_qaememutils.c, which has no locks, is compiled in to manage memory, but in that case my program dumps core. When I add only the --enable-multi_thread option and leave --enable-usdm on, it still uses cmn_mem_drv_inf.c to manage memory, and that file has locks.

Since you provide the --enable-multi_thread option, how can I use it? Or is there another way to reduce locking in the QAT memory manager in multi-thread mode?

Thanks~

Yogaraj-Alamenda commented 4 years ago

Hi @jazune, sorry, I didn't find time to look into this issue. As of v0.5.38, enable-multi_thread without enable-usdm uses the qat_contig_mem driver with the multi_thread_qaememutils.c interface, which has no locks, whereas the USDM mem driver has no such interface. qat_contig_mem with enable-multi_thread will give improved performance in a multi-threaded environment. Please note, however, that qat_contig_mem is not a production-quality driver and may have issues. If you can share the issue you are facing with the qat_contig_mem driver, we will have a look and fix it.

By the way, are you testing with the entire stack doing handshakes, or individual RSA operations with a test application? It is possible that the locks come from the libSSL stack. The previous perf data doesn't have much information on the sources. Could you please send the latest perf data?
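To illustrate the difference between a memory interface with locks and one without, here is a minimal C sketch that contrasts a shared pool guarded by a mutex with a per-thread pool. It is not code from qae_mem_utils.c, cmn_mem_drv_inf.c, or multi_thread_qaememutils.c: every name in it (`pool_t`, `slot_t`, `shared_alloc`, `local_alloc`, and so on) is hypothetical, and a real QAT pool has to hand out pinned, DMA-capable memory obtained from the kernel driver, which this sketch ignores.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define SLOTS 64

/* Hypothetical fixed-size buffer; `next` only matters while the slot is free. */
typedef struct slot {
    struct slot *next;
    char payload[56];
} slot_t;

typedef struct {
    slot_t storage[SLOTS];
    slot_t *free_list;
    int ready;
} pool_t;

/* Thread every slot onto the pool's free list. */
static void pool_init(pool_t *p) {
    p->free_list = NULL;
    for (int i = 0; i < SLOTS; i++) {
        p->storage[i].next = p->free_list;
        p->free_list = &p->storage[i];
    }
    p->ready = 1;
}

static slot_t *pool_get(pool_t *p) {
    slot_t *s = p->free_list;
    if (s != NULL)
        p->free_list = s->next;
    return s; /* NULL when the pool is exhausted */
}

static void pool_put(pool_t *p, slot_t *s) {
    assert(s != NULL);
    s->next = p->free_list;
    p->free_list = s;
}

/* Variant A: one pool for the whole process, so every call takes a lock. */
static pool_t shared_pool;
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

slot_t *shared_alloc(void) {
    pthread_mutex_lock(&shared_lock);
    if (!shared_pool.ready)
        pool_init(&shared_pool);
    slot_t *s = pool_get(&shared_pool);
    pthread_mutex_unlock(&shared_lock);
    return s;
}

void shared_free(slot_t *s) {
    pthread_mutex_lock(&shared_lock);
    pool_put(&shared_pool, s);
    pthread_mutex_unlock(&shared_lock);
}

/* Variant B: one pool per thread (GCC/Clang __thread), so no lock is needed. */
static __thread pool_t local_pool;

slot_t *local_alloc(void) {
    if (!local_pool.ready)
        pool_init(&local_pool);
    return pool_get(&local_pool);
}

void local_free(slot_t *s) {
    pool_put(&local_pool, s);
}
```

In variant B a buffer must be freed on the thread that allocated it, because `local_free` returns it to the calling thread's pool. That restriction is the usual price of dropping the lock.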

jazune commented 4 years ago

Hi @Yogaraj-Alamenda, thank you for your reply.

The perf file is attached, and according to perf, I think the QAT memory manager is the bottleneck. We test only individual RSA operations, just like openssl speed with async jobs; the difference is that we use multiple threads, whereas openssl speed uses multiple processes. I have several questions:

1. You said that qat_contig_mem is not a production-quality driver and may have issues, but does the USDM mem driver have a multi-thread memory manager version without locks?

2. Do driver 1.7 and qat-engine v0.5.38 support qat_contig_mem? If I want to try qat_contig_mem, which driver version and qat-engine version do you recommend?

3. Do you have any other suggestions for our case? Our goal is to get the full performance of 2-4 8960 cards in one machine from a multi-threaded program (2048-bit RSA).

perf.tar.gz

Yogaraj-Alamenda commented 4 years ago

@jazune Again, the perf file doesn't show the symbols; only the addresses are displayed.

image

Answers to your questions are below:

  1. USDM mem driver doesn't have mem manager interface without locks.
  2. Yes, qat v0.38 itself supports qat_contig_mem. Please refer to README.md inside the repo for building and installing qat_contig_mem.
  3. I think running with qat_contig_mem will give us some idea of the locks in the engine. Also, I hope you are setting an instance for each thread.

jazune commented 4 years ago

Hi @Yogaraj-Alamenda, on my machine, perf shows the symbols.

perf

q2: The QAT engine is v0.38; what about the driver version, 1.6 or 1.7?

q3: We tried setting an instance for each thread, but it does not improve performance.

Can you test it? Your own multi-thread test program may also have this problem.

Yogaraj-Alamenda commented 4 years ago

@jazune Yes, we will check with multiple threads and let you know.

Regarding the driver configuration, I suggest you keep the same driver, OpenSSL, and QAT Engine that you are running now and use QAT contig mem instead of USDM. It will work with all versions of the driver and Engine.

jazune commented 4 years ago

Hi @Yogaraj-Alamenda, I used contig mem instead of USDM and changed one line of code in multi_thread_qaememutils.c:

    static __thread int crypto_inited = 0;

Now I can get maximum performance. Next I will try four 8960 cards in one machine and test the stability of this mode. Thank you again for your reply.
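The thread does not show what that line looked like before the change. Assuming it was a plain `static int` flag shared by all threads, the C sketch below shows why making such a flag thread-local matters when the state it guards is itself per-thread. The names (`pool_inited`, `pool_id`, `ensure_pool`, `run_workers`) are hypothetical stand-ins, not the contents of multi_thread_qaememutils.c.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4

/* Hypothetical per-thread state (GCC/Clang __thread storage). */
static __thread int pool_inited = 0; /* one flag per thread */
static __thread int pool_id = -1;    /* per-thread state the flag guards */

static int pools_created = 0;
static pthread_mutex_t created_lock = PTHREAD_MUTEX_INITIALIZER;

/* One-time setup per thread; later calls take the lock-free fast path. */
static int ensure_pool(void) {
    if (!pool_inited) {
        pthread_mutex_lock(&created_lock);
        pool_id = pools_created++;
        pthread_mutex_unlock(&created_lock);
        pool_inited = 1;
    }
    return pool_id;
}

static void *worker(void *arg) {
    int *seen = arg;
    for (int i = 0; i < 1000; i++)
        *seen = ensure_pool();
    return NULL;
}

/* Starts NUM_THREADS workers and returns how many pools were set up. */
int run_workers(int ids[NUM_THREADS]) {
    pthread_t threads[NUM_THREADS];
    pools_created = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        int rc = pthread_create(&threads[i], NULL, worker, &ids[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return pools_created;
}
```

With the thread-local flag, every worker runs the setup once and gets its own `pool_id`. If `pool_inited` were an ordinary `static int`, typically only the first worker would run the setup and the others would keep using an uninitialized `pool_id` of -1, which is one plausible way a per-thread memory interface could crash.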

Syllinia commented 2 years ago

Hi @Yogaraj-Alamenda,

> USDM mem driver doesn't have mem manager interface without locks.

Currently I'm using QAT driver QAT1.7.L.4.18.1-00001 and QAT_Engine-0.6.14. Does the USDM mem driver now have a mem manager interface without locks? My application is also multi-threaded and uses async mode. Many thanks in advance.

Regards, Allen

Yogaraj-Alamenda commented 2 years ago

@Syllinia At present there is no support in USDM for an interface without locks. We will keep you posted on updates.

Syllinia commented 11 months ago

> static __thread int crypto_inited = 0;

> USDM mem driver doesn't have mem manager interface without locks.

Hi @Yogaraj-Alamenda, I'm using QAT engine v1.4.0, driver QAT.L.4.23.0-00001, and OpenSSL 3.1.3. I notice that USDM now has this new feature:

Support for thread specific memory to avoid locks (QAT_HW Version 1.7 & 1.8 only)

The USDM thread specific memory can be enabled in QAT_HW driver using the below configure flags in driver build which is only needed for multithreaded application for performance improvement. This is supported from version 4.20 of QAT_HW Version 1.7 driver only.

./configure --enable-icp-thread-specific-usdm --enable-128k-slab

My question is whether the new USDM options above (--enable-icp-thread-specific-usdm --enable-128k-slab) give better performance than qat_contig_mem in a multithreaded application, for example by being lockless. Thanks in advance.

Yogaraj-Alamenda commented 11 months ago

@jazune The thread-specific USDM (lockless USDM) support in the USDM driver gives the same level of performance as lockless qat_contig_mem.

Syllinia commented 11 months ago

Thanks so much.

How about plock, as follows? It seems to replace pthread's rwlock in the application, not in the QAT engine. Am I right? Thanks.

image

Yogaraj-Alamenda commented 11 months ago

@Syllinia The rw_lock bottleneck is mostly in OpenSSL, and this plock replaces the rwlock coming from OpenSSL.