intel / isa-l_crypto

Poor performance in single-buffer test #45

Open debug-zhang opened 4 years ago

debug-zhang commented 4 years ago

We ran single hash jobs on one core.

OpenSSL is faster than ISA-L (the update test gives the same result). Times are in microseconds, so lower is better:

| Library | Algorithm | 100B (us) | 1KB (us) | 256KB (us) | 512KB (us) | 1MB (us) |
|---|---|---|---|---|---|---|
| OpenSSL | MD5 | 0.259577 | 1.973022 | 466.580400 | 930.553100 | 1858.246400 |
| ISA-L | MD5 | 0.510208 | 3.434382 | 788.836000 | 1580.523900 | 3185.874900 |
| OpenSSL | SHA1 | 0.243943 | 1.435872 | 329.519600 | 654.810200 | 1309.367800 |
| ISA-L | SHA1 | 0.326654 | 1.885040 | 393.532000 | 786.502300 | 1573.971600 |
| OpenSSL | SHA256 | 0.432444 | 3.031160 | 714.659800 | 1429.176200 | 2858.618900 |
| ISA-L | SHA256 | 0.597858 | 4.251419 | 997.934900 | 1993.451800 | 3989.759000 |
| OpenSSL | SHA512 | 0.337439 | 2.207246 | 477.429400 | 953.789300 | 1908.231500 |
| ISA-L | SHA512 | 0.598538 | 4.924967 | 1119.788800 | 2237.286500 | 4411.418500 |

If the test is wrong, please look through my steps above and tell me what I am doing wrong. If not, please explain why we see this result. Thanks.
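To put the table in relative terms, the sketch below divides each ISA-L time by the matching OpenSSL time for the 1MB column. The struct, array and function names are made up for illustration and belong to neither library; the numbers are copied from the table above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One algorithm's 1MB timings in microseconds (illustrative type). */
struct timing {
    const char *algorithm;
    double openssl_us;
    double isal_us;
};

/* Values copied from the 1MB(us) column of the table above. */
const struct timing timings_1mb[] = {
    { "MD5",    1858.246400, 3185.874900 },
    { "SHA1",   1309.367800, 1573.971600 },
    { "SHA256", 2858.618900, 3989.759000 },
    { "SHA512", 1908.231500, 4411.418500 },
};

/* How many times slower ISA-L is than OpenSSL (>1 means ISA-L is slower). */
double slowdown(const struct timing *t)
{
    assert(t->openssl_us > 0.0);
    return t->isal_us / t->openssl_us;
}

/* Print one ratio per algorithm. */
void print_slowdowns(void)
{
    size_t n = sizeof(timings_1mb) / sizeof(timings_1mb[0]);
    for (size_t i = 0; i < n; i++)
        printf("%-6s ISA-L/OpenSSL = %.2fx\n",
               timings_1mb[i].algorithm, slowdown(&timings_1mb[i]));
}
```

Calling `print_slowdowns()` from a `main` gives ratios of roughly 1.71x for MD5, 1.20x for SHA1, 1.40x for SHA256 and 2.31x for SHA512, so the gap in this single-buffer run varies a lot by algorithm.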

gbtucker commented 4 years ago

Hi @blackbird52. I can't say what is wrong with your test. As you may know, the biggest benefit for using the multi-buffer hashing comes from submitting multiple independent jobs at once. We do have some special single-buffer or even two-buffer optimizations for when we detect the number of lanes filled is only at 1 or 2.

Comphix commented 4 years ago

@blackbird52 , could you give more information like CPU, memory and other related system configuration?

answer3x commented 4 years ago

Have the same problem.

We tested OpenSSL and ISA-L and found that ISA-L's single-buffer SHA256 performance is worse than OpenSSL's.

debug-zhang commented 4 years ago

> @blackbird52, could you give more information like CPU, memory and other related system configuration?

Hardware & Software Ingredients

| Item | Description |
|---|---|
| CPU | Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz |
| Number of cores / threads | 12 / 48 |
| Memory | 128 GB |
| Linux kernel version | 3.10.107 |
| GCC version | 4.8.5 |
| OpenSSL version | 1.0.1e |
| ISA-L version | 2.22.0 |

guymguym commented 4 years ago

I see the same on mac and linux containers.

And this is my machine:

macOS Catalina 10.15.4 (19E287)
CPU: 2.6 GHz 6-Core Intel Core i7
Memory: 16 GB 2400 MHz DDR4
gcc: Apple clang version 11.0.3
yasm: 1.3.0
isa-l_crypto: tag v2.22.0
openssl: OpenSSL 1.1.1g  21 Apr 2020

Here is my only diff in md5_mb/md5_mb_vs_ossl_perf.c:

 // Set number of outstanding jobs
-#define TEST_BUFS 32
+#define TEST_BUFS 1

building:

make DEFINES="-I/usr/local/opt/openssl/include/" LDFLAGS="-L/usr/local/opt/openssl/lib/" -f Makefile.unx md5_mb_vs_ossl_perf

running:

md5_openssl_cold: runtime =    4260941 usecs, bandwidth 3200 MB in 4.2609 sec = 787.49 MB/s
multibinary_md5_cold: runtime =    7198752 usecs, bandwidth 3200 MB in 7.1988 sec = 466.11 MB/s
Multi-buffer md5 test complete 1 buffers of 33554432 B with 100 iterations
 multibinary_md5_ossl_perf: Pass

I can also get similar 2x slower times in linux VM.

Let me know if I can provide more information.
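For anyone checking the figures: the reported bandwidth appears to be total bytes hashed divided by the runtime, with MB/s meaning 10^6 bytes per second. That is inferred from the arithmetic, not from reading the benchmark source. A minimal sketch, with a hypothetical function name:

```c
#include <assert.h>
#include <stdint.h>

/* Bandwidth in MB/s (1 MB = 10^6 bytes) for `bufs` buffers of `buf_bytes`
 * bytes, each hashed `iterations` times in `runtime_usecs` microseconds. */
double bandwidth_mb_per_s(uint64_t bufs, uint64_t buf_bytes,
                          uint64_t iterations, uint64_t runtime_usecs)
{
    assert(runtime_usecs > 0);
    double total_bytes = (double)bufs * (double)buf_bytes * (double)iterations;
    /* Bytes per microsecond is the same as 10^6 bytes per second. */
    return total_bytes / (double)runtime_usecs;
}
```

`bandwidth_mb_per_s(1, 33554432, 100, 4260941)` reproduces the 787.49 MB/s printed for OpenSSL, and the same call with 7198752 usecs reproduces 466.11 MB/s for ISA-L, so in this run ISA-L is about 1.69x slower.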

gbtucker commented 4 years ago

@guymguym I think there is an obvious issue with your results.

> md5_openssl_cold: runtime = 4260941 usecs, bandwidth 3200 MB in 4.2609 sec = 787.49 MB/s
> multibinary_md5_cold: runtime = 7198752 usecs, bandwidth 3200 MB in 7.1988 sec = 466.11 MB/s

I would expect multiple gigabytes/s here. Could it be that the host is not passing on all the native instruction set support to the VM? If configured to pass native, the VM should see AVX2.

@blackbird52, is the included multi-buffer test showing expected results?

debug-zhang commented 4 years ago

@gbtucker , yes, multi-buffer test got expected results.

guymguym commented 4 years ago

@gbtucker @blackbird52 did you notice my change to #define TEST_BUFS 1 - I meant to test a single buffer. So I don’t expect to get multiple gigabytes, but the problem is that I do expect to get similar results to single buffer of openssl (787MB/s) and instead I get half the performance (466MB/s)...

Isn't that the same as the original issue? I just used the md5_mb_vs_ossl_perf with a small edit to single buffer to test quickly.

debug-zhang commented 4 years ago

@gbtucker I'm concerned about the performance in single-buffer test and want to know why we got these results. Is OpenSSL better than ISA-L in this case, or is there any optimization to run ISA-L effectively? Thanks!

gbtucker commented 4 years ago

Sorry @guymguym, I did miss the #define TEST_BUFS 1 meaning that you had modified the test.

As I said there are some optimizations for single and dual buffer but many of these are for later CPUs. If your system will primarily only utilize a single buffer at a time you may find the integration for multi-buffer hashing is not worth it. It may be difficult to say where the crossover point is for your integration without some experimentation.

guymguym commented 4 years ago

thanks @gbtucker. Even with 2 buffers I already get better performance:

md5_openssl_cold: runtime =    4323584 usecs, bandwidth 3200 MB in 4.3236 sec = 776.08 MB/s
multibinary_md5_cold: runtime =    3604363 usecs, bandwidth 3200 MB in 3.6044 sec = 930.94 MB/s
Multi-buffer md5 test complete 2 buffers of 16777216 B with 100 iterations
 multibinary_md5_ossl_perf: Pass

However, I am not sure how to use multi-buffer in my case; perhaps you can help me understand. My server processes multiple MD5 streams concurrently (it is an S3 endpoint). When a stream starts, it initializes an MD5 context, and whenever a buffer is read from the socket, it is submitted to the mgr with its stream's context. But then I have to call flush immediately, because I want to submit the next buffer of that stream. So to get more than one buffer into each flush, what kind of event-loop synchronization or "tick" should I implement? Is there a reference project that uses multi-buffer for data streaming?

gbtucker commented 4 years ago

@guymguym, it sounds like you want to split hash jobs into partial updates. This is entirely possible with the multi-buffer interface without having to flush after each one. Typically the updates are tracked in the job context itself: ctx.user_data is filled with whatever info you need, so that it acts like a callback for tracking the stream and resubmitting the job with the next update. Some of the examples and unit tests do this.

I would suggest having a context pool worker that manages this. Also rolling_hash/chunking_with_mb_hash.c is a good example although this one doesn't use updates.
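The resubmission idea can be sketched with a toy model. Everything below is hypothetical and does not call isa-l_crypto: `toy_submit` and `toy_flush` only stand in for the library's submit and flush calls, the lane count is arbitrary, and the model assumes a pass costs the same however many lanes are filled. The point is only to show user_data leading from a returned context back to its stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_LANES 4 /* hypothetical lane count, not taken from the library */

struct toy_ctx { void *user_data; uint32_t remaining; };
struct toy_mgr { struct toy_ctx *lane[TOY_LANES]; int used; unsigned passes; };

/* Remove and return a lane whose chunk is fully hashed, or NULL. */
struct toy_ctx *toy_take_finished(struct toy_mgr *m)
{
    for (int i = 0; i < m->used; i++) {
        if (m->lane[i]->remaining == 0) {
            struct toy_ctx *done = m->lane[i];
            m->lane[i] = m->lane[--m->used];
            return done;
        }
    }
    return NULL;
}

/* Return one finished context; run a pass over the occupied lanes only if
 * none is ready yet. Returns NULL when the manager is empty. */
struct toy_ctx *toy_flush(struct toy_mgr *m)
{
    struct toy_ctx *done = toy_take_finished(m);
    if (done != NULL || m->used == 0)
        return done;
    uint32_t min = m->lane[0]->remaining;
    for (int i = 1; i < m->used; i++)
        if (m->lane[i]->remaining < min)
            min = m->lane[i]->remaining;
    for (int i = 0; i < m->used; i++)
        m->lane[i]->remaining -= min;
    m->passes++; /* assumed: same cost however many lanes are filled */
    return toy_take_finished(m);
}

/* Queue a chunk; work only happens once every lane is occupied. */
struct toy_ctx *toy_submit(struct toy_mgr *m, struct toy_ctx *c, uint32_t len)
{
    assert(len > 0 && m->used < TOY_LANES);
    c->remaining = len;
    m->lane[m->used++] = c;
    return m->used == TOY_LANES ? toy_flush(m) : NULL;
}

struct stream { int chunks_left; uint32_t chunk_len; struct toy_ctx ctx; };

/* Whenever a context comes back, follow user_data to its stream and
 * submit that stream's next chunk, if it has one. */
void resubmit_finished(struct toy_mgr *m, struct toy_ctx *done)
{
    while (done != NULL) {
        struct stream *s = done->user_data;
        done = NULL;
        if (s->chunks_left > 0) {
            s->chunks_left--;
            done = toy_submit(m, &s->ctx, s->chunk_len);
        }
    }
}

/* Resubmit on completion; flush only to drain what is still in flight. */
unsigned passes_with_resubmit(struct stream *streams, int n)
{
    struct toy_mgr m = {0};
    struct toy_ctx *done;
    for (int i = 0; i < n; i++) {
        assert(streams[i].chunks_left > 0);
        streams[i].ctx.user_data = &streams[i];
        streams[i].chunks_left--;
        resubmit_finished(&m, toy_submit(&m, &streams[i].ctx,
                                         streams[i].chunk_len));
    }
    while ((done = toy_flush(&m)) != NULL)
        resubmit_finished(&m, done);
    return m.passes;
}

/* Baseline: flush after every submit, so each pass has one lane filled. */
unsigned passes_with_flush_each(struct stream *streams, int n)
{
    struct toy_mgr m = {0};
    for (int i = 0; i < n; i++)
        for (; streams[i].chunks_left > 0; streams[i].chunks_left--) {
            toy_submit(&m, &streams[i].ctx, streams[i].chunk_len);
            while (toy_flush(&m) != NULL) { }
        }
    return m.passes;
}
```

With four streams of three equal-sized chunks each, this toy model counts 3 passes when resubmitting on completion, against 12 when flushing after every submit. In a real server the chunks arrive from sockets at different times, so a worker would still need some policy for when to flush a partly filled manager.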

animetosho commented 3 years ago

Just a late bit of (hopefully useful) info for anyone who stumbles across this:

From my quick scan of the code, it seems like the 1 or 2 buffer optimisations @gbtucker mentions only apply to SHA1/SHA256. For MD5, which is what @guymguym is testing here, there don't appear to be any optimisations for fewer than the ideal number of buffers, so ISA-L will always use the full SIMD width with two interleaved vectors regardless of how many lanes are active.
OpenSSL's implementation is quite good (though it can be beaten), and will be much faster than ISA-L's current "max-throughput"-only approach for single-buffer scenarios.
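
One way to read this against the numbers earlier in the thread: if each pass costs the same no matter how many lanes hold data, aggregate throughput should grow roughly linearly with the number of buffers in flight until the lanes are full. The sketch below applies that assumed model to @guymguym's measurements to estimate how many concurrent buffers are needed to match OpenSSL on one core. The linear model and the function names are assumptions for illustration, not documented library behaviour.

```c
#include <assert.h>

/* Assumed model: aggregate MB/s = single-buffer MB/s * buffers in flight.
 * Only meaningful while buffers_in_flight does not exceed the lane count. */
double modeled_throughput(double single_buffer_mbps, unsigned buffers_in_flight)
{
    return single_buffer_mbps * (double)buffers_in_flight;
}

/* Smallest number of concurrent buffers whose modeled aggregate throughput
 * reaches the single-stream baseline. */
unsigned crossover_buffers(double single_buffer_mbps, double baseline_mbps)
{
    assert(single_buffer_mbps > 0.0 && baseline_mbps > 0.0);
    unsigned n = 1;
    while (modeled_throughput(single_buffer_mbps, n) < baseline_mbps)
        n++;
    return n;
}
```

With 466.11 MB/s for one ISA-L buffer and 787.49 MB/s for OpenSSL, `crossover_buffers` returns 2, and the modeled 932.22 MB/s for two buffers is close to the 930.94 MB/s measured above. Other CPUs and algorithms may well behave differently, so this is a starting point for experimentation and not a prediction.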