intel / intel-ipsec-mb

Intel(R) Multi-Buffer Crypto for IPSec

How is the performance if my application uses this library in a synchronous way? #97

Closed riveridea closed 2 years ago

riveridea commented 2 years ago

The documentation and guide recommend using this library asynchronously. Is that because parallelization is only achieved when processing multiple jobs? If so, what happens if I use it synchronously, i.e., submit a job and wait for it to complete before submitting the next one in my application? Does that work, and how does the performance compare with the OpenSSL single-buffer implementation in this situation? Thanks,

tkanteck commented 2 years ago

Yes, it is possible to submit a job and request it back (flush() API), but this is expected to be suboptimal for multi-buffer implementations. Comparing a single-buffer implementation with a multi-buffer one is possible but unfair because they are designed differently. The name of the library includes the "multi-buffer" keyword, but not all algorithms are implemented using this innovation. If parallelism can be achieved by processing multiple blocks from a single buffer, then this is the preferred method (not always possible). The README includes two tables listing the cipher and hash algorithms.

If you are interested in a specific algorithm, please check in the README tables mentioned above whether it is implemented as single-buffer or multi-buffer. If it is multi-buffer and you require synchronous mode, then using OpenSSL is probably the better option. If it is a single-buffer implementation, then check which of the two gives you better performance.
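To make the cost of synchronous use concrete, here is a minimal toy model in Python. It is not the intel-ipsec-mb API: the class, the function names and the lane count of 8 are all hypothetical, and the real scheduler is more elaborate. The sketch only assumes that a multi-buffer manager holds jobs until every lane is occupied, and that a flush forces a pass with whatever is queued.

```python
class ToyMultiBufferManager:
    """Toy model of a multi-buffer job manager (NOT the intel-ipsec-mb API)."""

    def __init__(self, lanes):
        self.lanes = lanes    # hypothetical number of parallel lanes
        self.pending = []     # jobs waiting for a full set of lanes
        self.passes = 0       # number of parallel passes executed so far

    def _run_pass(self):
        # One pass processes every queued job at once, full lanes or not.
        done, self.pending = self.pending, []
        self.passes += 1
        return done

    def submit(self, job):
        """Queue a job; compute only once every lane holds a job."""
        self.pending.append(job)
        if len(self.pending) == self.lanes:
            return self._run_pass()
        return []

    def flush(self):
        """Force processing of queued jobs, leaving spare lanes idle."""
        return self._run_pass() if self.pending else []


def run_synchronous(jobs, lanes):
    mgr = ToyMultiBufferManager(lanes)
    completed = []
    for job in jobs:
        completed += mgr.submit(job)
        completed += mgr.flush()      # wait for this job before the next one
    return completed, mgr.passes


def run_asynchronous(jobs, lanes):
    mgr = ToyMultiBufferManager(lanes)
    completed = []
    for job in jobs:
        completed += mgr.submit(job)  # may return nothing yet
    completed += mgr.flush()          # drain the tail once at the end
    return completed, mgr.passes


jobs = list(range(20))
print(run_synchronous(jobs, lanes=8)[1])   # 20 passes, one busy lane in each
print(run_asynchronous(jobs, lanes=8)[1])  # 3 passes: 8 + 8 + 4 jobs
```

In this model the synchronous loop pays for a full pass per job while using a single lane, which is why a single-buffer implementation can be the better choice in that mode.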

riveridea commented 2 years ago

Thanks for the prompt response. The algorithm I am concerned with is HMAC_SHA1. It currently runs on OpenSSL in synchronous mode, but I want to see how much the whole application can be accelerated by switching to this library. More questions:

  1. Are there any existing examples in this repository that show how the asynchronous mode actually works? I just want to confirm that my understanding is correct.
  2. I am really interested in how the multi-buffer implementation works, but I am not good at assembly code. If the actual multi-buffer implementation needs multiple jobs to carry out the parallelization, does that mean the algorithm code has to cache enough incoming jobs before starting the actual computation? If so, how many jobs does the algorithm need to wait for before computing?

Thanks,

tkanteck commented 2 years ago

> The algorithm I am concerned with is HMAC_SHA1. It currently runs on OpenSSL in synchronous mode, but I want to see how much the whole application can be accelerated by switching to this library.

What about running the openssl speed and perf/ipsec_perf applications and comparing the results using the same units (throughput or cycles per buffer)? This is the wiki page on using ipsec_perf: https://github.com/intel/intel-ipsec-mb/wiki/LibPerfApp
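As a rough sketch of how results in different units could be put side by side, the helpers below convert between throughput and cycles per buffer. The function names and the example numbers are made up for illustration, and the conversion assumes the core runs at a known, fixed frequency during both measurements.

```python
def cycles_per_buffer(throughput_bytes_per_s, core_hz, buffer_bytes):
    """Convert a throughput figure into cycles spent per buffer."""
    cycles_per_byte = core_hz / throughput_bytes_per_s
    return cycles_per_byte * buffer_bytes


def throughput_bytes_per_s(cycles_per_buf, core_hz, buffer_bytes):
    """Convert cycles per buffer back into bytes per second."""
    return buffer_bytes * core_hz / cycles_per_buf


# Hypothetical inputs: 2 GHz core, 1 GB/s measured on 1024-byte buffers.
core_hz = 2.0e9
cycles = cycles_per_buffer(1.0e9, core_hz, 1024)
print(cycles)                                         # 2048.0
print(throughput_bytes_per_s(cycles, core_hz, 1024))  # 1000000000.0
```

Both tools should be measured with the same buffer sizes and on the same core, otherwise the converted figures are not comparable.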

> 1. Are there any existing examples in this repository that show how the asynchronous mode actually works? I just want to confirm that my understanding is correct.

Example async CBCS code is here: https://github.com/intel/intel-ipsec-mb/wiki/MPEG-CENC-in-CBCS-mode#sample-application

> 2. I am really interested in how the multi-buffer implementation works, but I am not good at assembly code. If the actual multi-buffer implementation needs multiple jobs to carry out the parallelization, does that mean the algorithm code has to cache enough incoming jobs before starting the actual computation?

Yes, there is some scheduling involved that is hidden under the API. No need to do it on the application side.

> If so, how many jobs does the algorithm need to wait for before computing?

Again, please have a look into README and the two tables I already mentioned. They explain everything.

More about multi-buffer can be found in the Publications section of the project wiki: https://github.com/intel/intel-ipsec-mb/wiki/Publications

riveridea commented 2 years ago

Thanks for this great information.

tkanteck commented 2 years ago

How is the hmac-sha1 benchmarking going? Any help needed?

riveridea commented 2 years ago

> How is the hmac-sha1 benchmarking going? Any help needed?

Evaluation is ongoing. I have some more questions:

  1. Does this library have any limitations when working with hyperthreading enabled? Since two sibling logical cores run on the same physical core, is there any possible effect if the two logical cores share the same AESNI/SHANI hardware?
  2. Does this library have any limitations in supporting multiple cores? I am trying to run this library on a system with around 32 physical cores. The DPDK AESNI MB poll mode driver is used as the interface to this library. However, the DPDK AESNI MB PMD only supports a maximum of 8 queue pairs, so I am worried that this could limit the number of concurrent cores to 8 when this library is used as the crypto engine.

tkanteck commented 2 years ago

> 1. Does this library have any limitations when working with hyperthreading enabled? Since two sibling logical cores run on the same physical core, is there any possible effect if the two logical cores share the same AESNI/SHANI hardware?

Hyper-Threading is a technology that uses the resources of a single physical core to drive two logical architectural states. Consequently, if both threads on the same core run crypto operations, each thread delivers roughly 50% of the throughput of a single core running a single thread (1C1T). In aggregate, the two threads typically give slightly better throughput through slightly better core utilization.
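The arithmetic can be sketched as follows. The function name, the 10.0 baseline and the 5% aggregate gain are made-up illustration values, not measurements; the actual gain depends on the algorithm and the platform.

```python
def hyperthreading_estimate(single_thread_throughput, aggregate_gain):
    """Estimate per-thread and aggregate throughput for two sibling threads.

    single_thread_throughput: 1C1T throughput in any unit (hypothetical input)
    aggregate_gain: assumed fractional gain from better core utilization
    """
    aggregate = single_thread_throughput * (1.0 + aggregate_gain)
    per_thread = aggregate / 2.0
    return per_thread, aggregate


# Made-up example: 1C1T delivers 10.0 units and two threads gain 5% in total.
per_thread, aggregate = hyperthreading_estimate(10.0, 0.05)
print(per_thread, aggregate)  # about 5.25 per thread and 10.5 in aggregate
```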

> 2. Does this library have any limitations in supporting multiple cores? I am trying to run this library on a system with around 32 physical cores. The DPDK AESNI MB poll mode driver is used as the interface to this library. However, the DPDK AESNI MB PMD only supports a maximum of 8 queue pairs, so I am worried that this could limit the number of concurrent cores to 8 when this library is used as the crypto engine.

As of today, the library itself has no multi-core enabling. DPDK addresses multi-core usage scenarios, failovers and multi-core crypto schedulers. I'll leave it to @pablodelara to shed more light on it, but I am not aware of any limitations here. This wiki section may be helpful.

riveridea commented 2 years ago

I reported a DPDK aesni_mb PMD bug (https://bugs.dpdk.org/show_bug.cgi?id=935). With this bug, jobs submitted from the aesni_mb PMD cannot be processed in parallel by the intel-ipsec-mb library. @pablodelara Could you please have a look at this issue?

Sorry for posting this issue here, but I have seen no response to it in the DPDK Bugzilla.

pablodelara commented 2 years ago

Apologies for not replying earlier. I will look at this bug as soon as I can, thanks for reporting it!

riveridea commented 2 years ago

@tkanteck My recent evaluation shows that HMAC_SHA1 is twice as fast if I gather 16 packets before submitting them to this library. As far as I can see, the AVX512-based SHA1 implementation tries to collect 16 lanes of data before performing the SHA1 computation when flush is not used. I have a question: the single-buffer OpenSSL implementation was used as the comparison. However, it seems that OpenSSL also supports something like a multi-block implementation for SHA1. Do you know anything about it, and is it equivalent to Intel's multi-buffer implementation? Thanks,
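A simple way to reason about the burst size is lane occupancy. The sketch below is only a model with a hypothetical function name: it assumes 16 lanes, as observed above for AVX512 SHA1, and assumes each burst is flushed before the next one arrives, so a partial final pass leaves lanes idle.

```python
from math import ceil


def lane_occupancy(burst_size, lanes=16):
    """Fraction of lanes doing useful work when a burst is submitted and flushed."""
    if burst_size <= 0:
        raise ValueError("burst_size must be positive")
    passes = ceil(burst_size / lanes)       # the last pass may be partly empty
    return burst_size / (passes * lanes)


for burst in (1, 8, 16, 17):
    print(burst, lane_occupancy(burst))
# 1 0.0625
# 8 0.5
# 16 1.0
# 17 0.53125
```

Under these assumptions, bursts that are a multiple of the lane count keep every lane busy, while a burst of one packet uses a single lane out of 16.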

tkanteck commented 2 years ago

Are you referring to this implementation https://github.com/openssl/openssl/blob/master/crypto/sha/asm/sha1-mb-x86_64.pl? It looks like an AVX2 multi-buffer implementation. Sorry, but I don't know how this code path can get executed through openssl speed; it may require some recompilation. Our library also supports an AVX512 SHA1 multi-buffer implementation that delivers better throughput than the AVX2 version. What hardware platform was used in the HMAC-SHA1 test?

riveridea commented 2 years ago

Yes, that is exactly what I was talking about. I agree that the OpenSSL implementation only supports AVX2. The other difference is that I could not find a way to integrate the OpenSSL multi-block implementation into the DPDK framework, especially in asynchronous mode. The reason I am also pursuing a possible implementation based on OpenSSL is that intel-ipsec-mb has no FIPS certification. Please correct me if I am wrong.

The hardware used so far includes Skylake and Ice Lake. It is confirmed that the AVX512 version of submitjob is triggered.

tkanteck commented 2 years ago

I doubt that the DPDK PMD leverages this OpenSSL multi-buffer SHA1 implementation, but please feel free to check with them.

As to FIPS, you are correct: intel-ipsec-mb doesn't have such certification. It also doesn't fulfill the requirements for CMVP certification. We are working on CAVP validation and certification (see the new cavp test app added in v1.2); this is a work in progress.

tkanteck commented 2 years ago

Any update here?

riveridea commented 2 years ago

No update on my side. Actually, I am interested in the CMVP progress for this library at Intel. It would be good if Intel could finish the certification.

tkanteck commented 2 years ago

CMVP cannot be done on this library at the moment as it doesn't fulfill the technical requirements for it. The ACVP (or CAVP) effort is in progress.

tkanteck commented 2 years ago

If there is no update on the performance side then I'd like to close this issue.

We'll keep you informed about CAVP progress through announcements.

riveridea commented 2 years ago

@tkanteck Yes, please close it.