debug-zhang opened this issue 4 years ago
Hi @blackbird52. I can't say what is wrong with your test. As you may know, the biggest benefit of multi-buffer hashing comes from submitting multiple independent jobs at once. We do have some special single-buffer and even two-buffer optimizations for when we detect that only 1 or 2 lanes are filled.
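For reference, a minimal sketch of that whole-buffer pattern, assuming isa-l_crypto's md5_mb.h API (NJOBS, LEN, and the memset-filled buffers are just illustrative):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "md5_mb.h"   /* isa-l_crypto; adjust the include path to your install */

#define NJOBS 16
#define LEN   (1024 * 1024)

int main(void)
{
	MD5_HASH_CTX_MGR *mgr;
	static MD5_HASH_CTX ctx[NJOBS];
	static unsigned char bufs[NJOBS][LEN];

	/* The library's own tests allocate the manager 16-byte aligned. */
	if (posix_memalign((void **)&mgr, 16, sizeof(*mgr)))
		return 1;
	md5_ctx_mgr_init(mgr);

	/* Many independent whole-buffer jobs: this is what fills the SIMD
	 * lanes and produces the multi-buffer speedup. Contexts returned by
	 * submit (if any) are already complete; digests are read after the
	 * drain below. */
	for (int i = 0; i < NJOBS; i++) {
		hash_ctx_init(&ctx[i]);
		memset(bufs[i], i, LEN);
		md5_ctx_mgr_submit(mgr, &ctx[i], bufs[i], LEN, HASH_ENTIRE);
	}

	/* Drain whatever is still in flight. */
	while (md5_ctx_mgr_flush(mgr) != NULL)
		;

	for (int i = 0; i < NJOBS; i++)
		printf("job %2d digest[0] = %08x\n", i, ctx[i].job.result_digest[0]);

	free(mgr);
	return 0;
}
```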
@blackbird52, could you give more information, such as CPU, memory, and other related system configuration?
Have the same problem.
We tried to test OpenSSL and ISA-L, and found that the performance of ISA-L's single buffer for SHA256 is worse than OpenSSL's.
> @blackbird52, could you give more information, such as CPU, memory, and other related system configuration?
Hardware & Software Ingredients

Item | Description |
---|---|
CPU | Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz, 12 cores, 48 threads |
Memory | 128 GB |
Linux kernel version | 3.10.107 |
GCC version | 4.8.5 |
OpenSSL version | 1.0.1e |
ISA-L version | 2.22.0 |
I see the same on Mac and in Linux containers.
This is my machine:
macOS Catalina 10.15.4 (19E287)
CPU: 2.6 GHz 6-Core Intel Core i7
Memory: 16 GB 2400 MHz DDR4
gcc: Apple clang version 11.0.3
yasm: 1.3.0
isa-l_crypto: tag v2.22.0
openssl: OpenSSL 1.1.1g 21 Apr 2020
Here is my only diff in md5_mb/md5_mb_vs_ossl_perf.c:
// Set number of outstanding jobs
-#define TEST_BUFS 32
+#define TEST_BUFS 1
building:
make DEFINES="-I/usr/local/opt/openssl/include/" LDFLAGS="-L/usr/local/opt/openssl/lib/" -f Makefile.unx md5_mb_vs_ossl_perf
running:
md5_openssl_cold: runtime = 4260941 usecs, bandwidth 3200 MB in 4.2609 sec = 787.49 MB/s
multibinary_md5_cold: runtime = 7198752 usecs, bandwidth 3200 MB in 7.1988 sec = 466.11 MB/s
Multi-buffer md5 test complete 1 buffers of 33554432 B with 100 iterations
multibinary_md5_ossl_perf: Pass
I also get similar 2x-slower times in a Linux VM.
Let me know if I can provide more information.
@guymguym I think there is an obvious issue with your results.
> md5_openssl_cold: runtime = 4260941 usecs, bandwidth 3200 MB in 4.2609 sec = 787.49 MB/s
> multibinary_md5_cold: runtime = 7198752 usecs, bandwidth 3200 MB in 7.1988 sec = 466.11 MB/s
I would expect multiple gigabytes/s here. Could it be that the host is not passing on all the native instruction set support to the VM? If configured to pass native, the VM should see AVX2.
@blackbird52, is the included multi-buffer test showing expected results?
@gbtucker, yes, the multi-buffer test gets the expected results.
@gbtucker @blackbird52 did you notice my change to #define TEST_BUFS 1? I meant to test a single buffer, so I don't expect multiple gigabytes, but I do expect results similar to OpenSSL's single buffer (787 MB/s), and instead I get half the performance (466 MB/s)...
Isn't that the same as the original issue? I just used md5_mb_vs_ossl_perf with a small edit to a single buffer as a quick test.
@gbtucker I'm concerned about the performance in the single-buffer test and want to know why we get these results. Is OpenSSL better than ISA-L in this case, or is there some optimization that lets ISA-L run effectively here? Thanks!
Sorry @guymguym, I did miss the #define TEST_BUFS 1, meaning that you had modified the test.
As I said, there are some optimizations for single and dual buffers, but many of these are for later CPUs. If your system will primarily utilize only a single buffer at a time, you may find that the integration for multi-buffer hashing is not worth it. It may be difficult to say where the crossover point is for your integration without some experimentation.
Thanks @gbtucker. Even with 2 buffers I already get better performance:
md5_openssl_cold: runtime = 4323584 usecs, bandwidth 3200 MB in 4.3236 sec = 776.08 MB/s
multibinary_md5_cold: runtime = 3604363 usecs, bandwidth 3200 MB in 3.6044 sec = 930.94 MB/s
Multi-buffer md5 test complete 2 buffers of 16777216 B with 100 iterations
multibinary_md5_ossl_perf: Pass
However, I am not sure how to use multiple buffers in my case; perhaps you can help me understand. My server processes multiple MD5 streams concurrently (an S3 endpoint). When a stream starts, it initializes an MD5 context, and whenever a buffer is read from the socket, it is submitted to the manager with its stream's context. But then I immediately have to call flush, because I want to submit the next buffer of that stream. To use more than one buffer per flush, what kind of event-loop synchronization or "tick" should I implement? Is there a reference project that uses multi-buffer hashing for data streaming?
@guymguym, it sounds like you want to split hash jobs into partial updates. This is entirely possible with the multi-buffer interface without having to flush after each one. Typically the updates are tracked in the same job context with ctx.user_data, filled with info that acts like a callback, to track and resubmit a job with its next update. Some of the examples and unit tests do this.
I would suggest having a context pool worker that manages this. Also, rolling_hash/chunking_with_mb_hash.c is a good example, although that one doesn't use updates.
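A minimal sketch of that resubmit-on-return pattern, assuming isa-l_crypto's md5_mb.h; the stream_t bookkeeping, advance() helper, and chunk counts are hypothetical stand-ins for real socket reads. The key point is that a stream submits its next chunk only when the manager hands its context back (recovered via user_data), so flush is reserved for draining rather than called after every buffer:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "md5_mb.h"   /* isa-l_crypto; adjust the include path to your install */

#define NSTREAMS 8
#define CHUNK    (64 * 1024)
#define NCHUNKS  16   /* per stream; >= 2 so HASH_FIRST is never also the last chunk */

typedef struct {              /* hypothetical per-stream bookkeeping */
	MD5_HASH_CTX ctx;     /* must stay alive until the job completes */
	unsigned char buf[CHUNK];
	int chunks_left;      /* stand-in for "data still on the socket" */
} stream_t;

/* Whenever the manager hands a context back, its previous update is done:
 * recover the owning stream through user_data and submit its next chunk. */
static void advance(MD5_HASH_CTX_MGR *mgr, MD5_HASH_CTX *c, int *done)
{
	while (c != NULL) {
		stream_t *st = (stream_t *)hash_ctx_user_data(c);
		if (hash_ctx_complete(c)) {
			(*done)++;
			return;
		}
		/* a real server would refill st->buf from the socket here */
		st->chunks_left--;
		c = md5_ctx_mgr_submit(mgr, c, st->buf, CHUNK,
				       st->chunks_left ? HASH_UPDATE : HASH_LAST);
	}
}

int main(void)
{
	MD5_HASH_CTX_MGR *mgr;
	static stream_t s[NSTREAMS];
	int done = 0;

	if (posix_memalign((void **)&mgr, 16, sizeof(*mgr)))
		return 1;
	md5_ctx_mgr_init(mgr);

	/* Kick off every stream. Note that a submit may hand back some
	 * *other* stream's context whose previous update just finished;
	 * user_data routes it back to the right stream. */
	for (int i = 0; i < NSTREAMS; i++) {
		hash_ctx_init(&s[i].ctx);
		s[i].ctx.user_data = &s[i];
		s[i].chunks_left = NCHUNKS;
		memset(s[i].buf, i, CHUNK);
		s[i].chunks_left--;
		advance(mgr, md5_ctx_mgr_submit(mgr, &s[i].ctx, s[i].buf,
						CHUNK, HASH_FIRST), &done);
	}

	/* Only now do we flush: each flush kicks one job forward, and the
	 * returned context is resubmitted until every stream is drained. */
	while (done < NSTREAMS)
		advance(mgr, md5_ctx_mgr_flush(mgr), &done);

	for (int i = 0; i < NSTREAMS; i++)
		printf("stream %d digest[0] = %08x\n", i,
		       hash_ctx_digest(&s[i].ctx)[0]);
	free(mgr);
	return 0;
}
```

With enough concurrent streams this keeps the lanes full without any extra event-loop "tick"; where the crossover point versus single-buffer OpenSSL sits still needs measuring, as discussed above.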
Just a late bit of (hopefully useful) info for anyone who stumbles across this:
From my quick scan of the code, it seems like the 1- and 2-buffer optimisations @gbtucker mentions only apply to SHA1/SHA256. For MD5, which is what @guymguym is testing here, there don't appear to be any optimisations for fewer than the ideal number of buffers, so ISA-L will always use the full SIMD width, with two interleaved vectors, regardless of how many lanes are active.
OpenSSL's implementation is quite good (though it can be beaten), and will be much faster than ISA-L's current "max-throughput"-only approach in single-buffer scenarios.
We test-ran single hash jobs on one core:
[screenshots: ISA-L results / OpenSSL results]
OpenSSL is better than ISA-L (same result in the update test).
If the test is wrong, please look through my steps above and tell me what I am doing wrong. If not, please tell me why we get this result. Thanks