CypherGrue closed 1 year ago
Use EVP_MD_fetch(NULL, "SHA256", NULL) instead of the implicitly fetched EVP_sha256(). This should avoid the locks in EVP_DigestInit_ex(), and you'll avoid the need to copy the context, which I suppose is a slower workaround.
Anyway, it is quite possible that the low-level call will still be slightly faster, but there is no way around that.
> Use EVP_MD_fetch(NULL, "SHA256", NULL) instead of the implicitly fetched EVP_sha256(). This should avoid the locks in EVP_DigestInit_ex(), and you'll avoid the need to copy the context, which I suppose is a slower workaround.

Thanks, this is a more idiomatic workaround than D, though it runs in the same (2x slower) time as D.

> Anyway, it is quite possible that the low-level call will still be slightly faster, but there is no way around that.

So it seems that, to save one or two polar bears, I would eventually have to copy-paste the low-level OpenSSL code or expend energy integrating and testing an additional third-party crypto library.
Why not just undo the deprecation?
Is this a debug build?
This slowdown makes sense. You are doing init/update/final triads on small buffers. The underlying SHA code is the same, so by using the low-level calls you avoid the setup and teardown overheads of the EVP layer, which aren't cheap. If you use a large buffer for the SHA operation, the difference will be significantly less noticeable.
Something like this might be slightly faster as it avoids some of the overhead:
    memset(dst, 0, sizeof(dst));
    t = clock();
    EVP_MD_CTX *basectx = EVP_MD_CTX_create();
    EVP_MD_CTX *mdctx = EVP_MD_CTX_create();
    EVP_MD *md = EVP_MD_fetch(NULL, "SHA256", NULL);  // explicit fetch, done once
    EVP_DigestInit_ex(basectx, md, NULL); // ex or ex2
    for (long i = 0; i < N; i++) {
        EVP_MD_CTX_copy_ex(mdctx, basectx);  // clone the initialised state instead of re-initialising
        EVP_DigestUpdate(mdctx, src, sizeof(src));
        EVP_DigestFinal_ex(mdctx, dst, NULL);
    }
    EVP_MD_CTX_destroy(mdctx);
    EVP_MD_CTX_destroy(basectx);
    cout << "\nC: EVP_xxx " << (float)(clock()-t)/CLOCKS_PER_SEC << 's' << endl;
    cout << "check " << ((unsigned long*)dst)[0] << endl;
    EVP_MD_free(md);  // release the explicitly fetched digest
but it's still not going to be as good.
Thanks @paulidale, your use of EVP_MD_CTX_copy_ex makes test case D about 15% faster than my original use of EVP_MD_CTX_copy.
> If you use a large buffer for the SHA operation, the difference will be significantly less noticeable.

Small hashes are commonly used for signatures and Merkle trees.

> but it's still not going to be as good

So does it make sense to deprecate and remove the API that is 1.85x faster, since it will remain internally anyway?

> So does it make sense to deprecate and remove the API that is 1.85x faster, since it will remain internally anyway?

As a general rule, I suspect the deprecation makes sense. In this case, it's faster because SHA256 hooks into the underlying assembly implementation. For other digests, this doesn't happen and they'd be slower than using the EVP APIs.
Removal of the API is another matter; we've not set a timeline for that.
Also, with the move to providers and the deprecation of the low-level calls, OpenSSL's main target is not to be a low-level crypto library but a high-level crypto and secure communication library for applications.
It might be a useful project to have a low-level crypto library as a building block on top of which we would build the high-level libcrypto. This low-level library could have different API stability guarantees and a different release cycle than libcrypto. But it would require quite some effort to do this properly.
> If you use a large buffer for the SHA operation, the difference will be significantly less noticeable.

Re-benchmarking improved code for larger buffers does make the overhead look less bad.
sizeof(src) (bytes) | SHA256_xxx (s) | EVP_MD_CTX_copy_ex (s) | Regression (EVP/SHA256)
---|---|---|---
32 | 0.04521 | 0.07178 | 1.59
55 | 0.04678 | 0.07057 | 1.51
56 | 0.08251 | 0.10731 | 1.30
64 | 0.08393 | 0.10466 | 1.25
128 | 0.11414 | 0.13548 | 1.19
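
One way to read the table: SHA-256 works on 64-byte blocks, and padding needs at least 9 extra bytes (a 0x80 marker plus an 8-byte length field). Inputs of up to 55 bytes therefore fit in one block, while 56 bytes already needs two, which matches the jump between the 55 and 56 byte rows. The sketch below recomputes the block count, the absolute EVP overhead and the ratio from the figures above. The struct and function names are made up for illustration, and the timings are assumed to be seconds, as printed by the benchmark.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One row of the benchmark table above (timings copied verbatim).
struct Row {
    std::size_t bytes;  // sizeof(src)
    double low_level;   // SHA256_xxx timing
    double evp;         // EVP_MD_CTX_copy_ex timing
};

// Number of 64-byte blocks SHA-256 compresses for a message:
// message + 1 byte (0x80) + 8 bytes (bit length), rounded up to 64.
std::size_t sha256_blocks(std::size_t message_bytes) {
    return (message_bytes + 1 + 8 + 63) / 64;
}

// Absolute cost added by the EVP layer for one row.
double evp_overhead(const Row& r) {
    assert(r.evp >= r.low_level);
    return r.evp - r.low_level;
}

// Relative slowdown, i.e. the "Regression" column.
double regression(const Row& r) {
    return r.evp / r.low_level;
}

// The five measured rows from the table.
std::vector<Row> table_rows() {
    return {
        {32, 0.04521, 0.07178},
        {55, 0.04678, 0.07057},
        {56, 0.08251, 0.10731},
        {64, 0.08393, 0.10466},
        {128, 0.11414, 0.13548},
    };
}
```

On these figures the absolute overhead stays between roughly 0.020 and 0.027 per run for every size, while the low-level time grows with the block count. That is why the ratio shrinks as the buffer grows.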
128 bytes is not a large buffer.
> 128 bytes is not a large buffer

If one needs the performance, it is because one is likely either hashing large data or hashing small data very many times.
For truly large buffers actually BLAKE2/3 (or others) may be more useful than SHA256 due to parallelism.
One common reason to hash small data many times is Merkle tree construction and verification (my use case here). For an N-leaf binary Merkle tree, SHA256 is appropriate given x64 acceleration, but it can take about 2N hash operations, depending on the implementation. That is, a 4096-leaf tree needs roughly 8192 hashing operations, each performed on a ~64-byte buffer. This is why small-buffer performance is important for SHA256.
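
The hash count can be checked with a small sketch. It counts digest calls while building a binary Merkle tree, hashing each leaf once and then each pair of children. `CountingHasher` is a non-cryptographic stand-in, assumed here purely for illustration; a real build would call SHA-256 on the 64-byte concatenation of two child digests. The helper names and the choice to promote an unpaired node unchanged are also assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for SHA-256; NOT cryptographic. It only lets the
// sketch count how many digest calls a Merkle build performs.
struct CountingHasher {
    std::size_t calls = 0;
    std::uint64_t operator()(const std::string& data) {
        ++calls;
        return std::hash<std::string>{}(data);
    }
};

// Build a binary Merkle root. An unpaired node at any level is promoted
// unchanged, which is one of several conventions in use.
std::uint64_t merkle_root(const std::vector<std::string>& leaves,
                          CountingHasher& h) {
    assert(!leaves.empty());
    std::vector<std::uint64_t> level;
    level.reserve(leaves.size());
    for (const auto& leaf : leaves) {
        level.push_back(h(leaf));  // one hash per leaf
    }
    while (level.size() > 1) {
        std::vector<std::uint64_t> next;
        for (std::size_t i = 0; i + 1 < level.size(); i += 2) {
            // Parent = hash(left || right): the small-buffer case.
            next.push_back(h(std::to_string(level[i]) + ":" +
                             std::to_string(level[i + 1])));
        }
        if (level.size() % 2 == 1) {
            next.push_back(level.back());
        }
        level = std::move(next);
    }
    return level.front();
}

// Number of digest calls needed for a tree with n leaves.
std::size_t count_hashes(std::size_t n) {
    std::vector<std::string> leaves;
    for (std::size_t i = 0; i < n; i++) {
        leaves.push_back("leaf-" + std::to_string(i));
    }
    CountingHasher h;
    merkle_root(leaves, h);
    return h.calls;
}
```

With this convention `count_hashes(4096)` is 8191: 4096 leaf hashes plus 4095 internal nodes, in line with the roughly 2N figure above.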
Another possibility would be to call EVP_DigestInit_ex(mdctx, NULL, NULL) inside the loop after setting the CTX up outside. This avoids some of the repetitive setup work.
> Another possibility would be to call EVP_DigestInit_ex(mdctx, NULL, NULL) inside the loop after setting the CTX up outside. This avoids some of the repetitive setup work.

This runs slower for me, like test case C.
With EVP_MD_CTX_copy_ex the performance hit for 64 byte buffers is only ~25%. Previously available documentation (e.g. https://wiki.openssl.org/index.php/EVP_Message_Digests) showed only non-repetitive use of the EVP API, and this issue has documented the performant repetitive use.
I set out to replace the deprecated SHA256_xxx calls in my code with their EVP_Digestxxx equivalents. However, the EVP code seems slow, so I did a quick test (test case B vs C below), and it is indeed about 5x slower.
Hashing is such a basic, common use case that someone else would surely have noticed, so I wonder if I am doing something wrong?
Valgrind tells me the slowness is because EVP_DigestInit_ex takes thread locks internally with CRYPTO_THREAD_read_lock, though I did not request thread safety. It also appears that the non-deprecated SHA256() uses EVP internally (test case A) and is just as slow.
As a workaround, I have constructed test case D using a second context and EVP_MD_CTX_copy to avoid calling EVP_DigestInit_ex, and this approach is only 2x slower than SHA256_xxx. I cannot find a documented way to use EVP to get close to the original speed of the deprecated SHA256_xxx.
Is this a performance regression?