openssl / openssl

TLS/SSL and crypto library
https://www.openssl.org
Apache License 2.0

EVP_Digest hashing slower than deprecated SHA256_xxx #19612

Closed CypherGrue closed 1 year ago

CypherGrue commented 1 year ago

I set out to replace the deprecated SHA256_xxx calls in my code with their EVP_Digestxxx equivalents. However, the EVP code seems slow, so I ran a quick test (test case B vs C below), and it is indeed about 5x slower.

Hashing is such a basic, common use case that someone else would have noticed, so I wonder if I am doing something wrong.

Valgrind tells me the slowness is because EVP_DigestInit_ex takes thread locks internally with CRYPTO_THREAD_read_lock, though I did not request thread safety. It also appears that the non-deprecated SHA256() uses EVP internally (test case A) and is just as slow.

As a workaround, I have constructed test case D, which uses a second context and EVP_MD_CTX_copy to avoid calling EVP_DigestInit_ex; this approach is only 2x slower than SHA256_xxx. I cannot find a documented way to use EVP to get close to the original speed of the deprecated SHA256_xxx.

Is this a performance regression?

#include <openssl/sha.h>
#include <openssl/evp.h>
#include <string.h>
#include <time.h>
#include <iostream>
using namespace std;

int main()
{
    unsigned char src[32], dst[32];
    clock_t t;
    long N = 1000000;
    memset(src, 0x33, sizeof(src));
    memset(dst, 0, sizeof(dst));

    ////////////////////////////////////////////////////
    // A
    ////////////////////////////////////////////////////
    t = clock();
    for (long i = 0; i < N; i++)
        SHA256(src, sizeof(src), dst);
    cout << "A: SHA256 " << (float)(clock()-t)/CLOCKS_PER_SEC << 's' << endl;
    cout << "check " << ((unsigned long*)dst)[0] << endl;

    ////////////////////////////////////////////////////
    // B
    ////////////////////////////////////////////////////
    memset(dst, 0, sizeof(dst));
    t = clock();
    for (long i = 0; i < N; i++) {
        SHA256_CTX ctx;

        SHA256_Init(&ctx);
        SHA256_Update(&ctx, src, sizeof(src));
        SHA256_Final(dst, &ctx);
    }
    cout << "\nB: SHA256_xxx " << (float)(clock()-t)/CLOCKS_PER_SEC << 's' << endl;
    cout << "check " << ((unsigned long*)dst)[0] << endl;

    ////////////////////////////////////////////////////
    // C
    ////////////////////////////////////////////////////
    memset(dst, 0, sizeof(dst));
    t = clock();
    EVP_MD_CTX *mdctx = EVP_MD_CTX_create();
    const EVP_MD *md = EVP_sha256();

    for (long i = 0; i < N; i++) {
        EVP_DigestInit_ex(mdctx, md, NULL); // ex or ex2
        EVP_DigestUpdate(mdctx, src, sizeof(src));
        EVP_DigestFinal_ex(mdctx, dst, 0);
    }

    EVP_MD_CTX_destroy(mdctx);
    cout << "\nC: EVP_xxx " << (float)(clock()-t)/CLOCKS_PER_SEC << 's' << endl;
    cout << "check " << ((unsigned long*)dst)[0] << endl;

    ////////////////////////////////////////////////////
    // D
    ////////////////////////////////////////////////////
    memset(dst, 0, sizeof(dst));
    t = clock();
    mdctx = EVP_MD_CTX_create();
    md = EVP_sha256();

    EVP_MD_CTX *mdctx2 = EVP_MD_CTX_create();
    EVP_DigestInit_ex(mdctx, md, NULL);
    EVP_DigestInit_ex(mdctx2, md, NULL);
    for (long i = 0; i < N; i++) {
        EVP_MD_CTX_copy(mdctx, mdctx2);
        EVP_DigestUpdate(mdctx, src, sizeof(src));
        EVP_DigestFinal_ex(mdctx, dst, 0);
    }

    EVP_MD_CTX_destroy(mdctx);
    EVP_MD_CTX_destroy(mdctx2);
    cout << "\nD: EVP_xxx clobber " << (float)(clock()-t)/CLOCKS_PER_SEC << 's' << endl;
    cout << "check " << ((unsigned long*)dst)[0] << endl;
}
] uname -a
Linux system-name-here 5.18.13-051813-generic #202207220940 SMP PREEMPT_DYNAMIC Fri Jul 22 09:44:12 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

] openssl version
OpenSSL 3.0.2 15 Mar 2022 (Library: OpenSSL 3.0.2 15 Mar 2022)

] g++ -g -O2 shabench.cpp -lcrypto && time ./a.out 
shabench.cpp: In function ‘int main()’:
shabench.cpp:33:20: warning: ‘int SHA256_Init(SHA256_CTX*)’ is deprecated: Since OpenSSL 3.0 [-Wdeprecated-declarations]
   33 |         SHA256_Init(&ctx);
      |         ~~~~~~~~~~~^~~~~~
In file included from shabench.cpp:1:
/usr/include/openssl/sha.h:73:27: note: declared here
   73 | OSSL_DEPRECATEDIN_3_0 int SHA256_Init(SHA256_CTX *c);
      |                           ^~~~~~~~~~~
shabench.cpp:34:22: warning: ‘int SHA256_Update(SHA256_CTX*, const void*, size_t)’ is deprecated: Since OpenSSL 3.0 [-Wdeprecated-declarations]
   34 |         SHA256_Update(&ctx, src, sizeof(src));
      |         ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
In file included from shabench.cpp:1:
/usr/include/openssl/sha.h:74:27: note: declared here
   74 | OSSL_DEPRECATEDIN_3_0 int SHA256_Update(SHA256_CTX *c,
      |                           ^~~~~~~~~~~~~
shabench.cpp:35:21: warning: ‘int SHA256_Final(unsigned char*, SHA256_CTX*)’ is deprecated: Since OpenSSL 3.0 [-Wdeprecated-declarations]
   35 |         SHA256_Final(dst, &ctx);
      |         ~~~~~~~~~~~~^~~~~~~~~~~
In file included from shabench.cpp:1:
/usr/include/openssl/sha.h:76:27: note: declared here
   76 | OSSL_DEPRECATEDIN_3_0 int SHA256_Final(unsigned char *md, SHA256_CTX *c);
      |                           ^~~~~~~~~~~~
A: SHA256 0.257387s
check 16015115755526009054

B: SHA256_xxx 0.045193s
check 16015115755526009054

C: EVP_xxx 0.225154s
check 16015115755526009054

D: EVP_xxx clobber 0.089181s
check 16015115755526009054

real    0m0.619s
user    0m0.614s
sys     0m0.004s

t8m commented 1 year ago

Use EVP_MD_fetch(NULL, "SHA256", NULL) instead of the implicitly fetched EVP_sha256(). This should avoid the locks in EVP_DigestInit_ex(), and you'll avoid the need to copy the context, which I suppose is a slower workaround.

Anyway it is quite possible that the low-level call will still be slightly faster but there is no way around that.

CypherGrue commented 1 year ago

> Use EVP_MD_fetch(NULL, "SHA256", NULL) instead of the implicitly fetched EVP_sha256(). This should avoid the locks in EVP_DigestInit_ex(), and you'll avoid the need to copy the context, which I suppose is a slower workaround.

Thanks, this is a better idiomatic workaround than D, though it runs in the same (2x slower) time as D.

> Anyway it is quite possible that the low-level call will still be slightly faster but there is no way around that.

So it seems that, to save one of two polar bears, I would eventually have to copy-paste the low-level OpenSSL code or expend energy integrating and testing some additional third-party crypto library.

Why not just undo the deprecation?

slontis commented 1 year ago

Is this a debug build?

paulidale commented 1 year ago

This slowdown makes sense. You are doing init/update/final triads on small buffers. The underlying SHA code is the same, so with the low-level calls you are avoiding the setup and teardown overheads associated with the EVP layer, which aren't cheap. If you use a large buffer for the SHA operation, the difference will be significantly less noticeable.

Something like this might be slightly faster as it avoids some of the overhead:

    memset(dst, 0, sizeof(dst));
    t = clock();
    EVP_MD_CTX *basectx = EVP_MD_CTX_create();
    EVP_MD_CTX *mdctx = EVP_MD_CTX_create();
    // Explicit fetch; the result must be released with EVP_MD_free()
    const EVP_MD *md = EVP_MD_fetch(NULL, "SHA256", NULL);

    // Initialise the base context once, outside the loop
    EVP_DigestInit_ex(basectx, md, NULL); // ex or ex2

    for (long i = 0; i < N; i++) {
        // Copy the initialised state instead of re-running the init
        EVP_MD_CTX_copy_ex(mdctx, basectx);
        EVP_DigestUpdate(mdctx, src, sizeof(src));
        EVP_DigestFinal_ex(mdctx, dst, 0);
    }

    EVP_MD_CTX_destroy(mdctx);
    EVP_MD_CTX_destroy(basectx);
    EVP_MD_free(const_cast<EVP_MD *>(md)); // release the fetched digest
    cout << "\nC: EVP_xxx " << (float)(clock()-t)/CLOCKS_PER_SEC << 's' << endl;
    cout << "check " << ((unsigned long*)dst)[0] << endl;

but it is still not going to be as good.

CypherGrue commented 1 year ago

Thanks @paulidale, your use of EVP_MD_CTX_copy_ex makes test case D about 15% faster than my original use of EVP_MD_CTX_copy.

> If you use a large buffer for the SHA operation, the difference will be significantly less noticeable.

Small hashes are commonly used for signatures and Merkle trees.

> but it is still not going to be as good

So does it make sense to deprecate and remove the API that is 1.85x faster, since it will remain internally anyway?

paulidale commented 1 year ago

> So does it make sense to deprecate and remove the API that is 1.85x faster, since it will remain internally anyway?

As a general rule, I suspect the deprecation makes sense. In this case, it's faster because SHA256 hooks into the underlying assembly implementation. For other digests, this doesn't happen and they'd be slower than using the EVP APIs.

Removal of the API is another matter; we've not set a timeline for that.

t8m commented 1 year ago

Also, with the move to providers and the deprecation of the low-level calls, OpenSSL's main target is not to be a low-level crypto library but a high-level application crypto and secure communication library.

It might be a useful project to have a low-level crypto library as a building block on top of which we would build the high-level libcrypto. This low-level library could have different API stability guarantees and different release cycles than libcrypto. But it would require quite some effort to do this properly.

CypherGrue commented 1 year ago

> If you use a large buffer for the SHA operation, the difference will be significantly less noticeable.

Re-benchmarking improved code for larger buffers does make the overhead look less bad.

| sizeof(src) (bytes) | SHA256_xxx (s) | EVP_MD_CTX_copy_ex (s) | Regression (EVP/SHA256) |
|---|---|---|---|
| 32 | 0.04521 | 0.07178 | 1.59 |
| 55 | 0.04678 | 0.07057 | 1.51 |
| 56 | 0.08251 | 0.10731 | 1.30 |
| 64 | 0.08393 | 0.10466 | 1.25 |
| 128 | 0.11414 | 0.13548 | 1.19 |
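
The step in both timing columns between 55 and 56 bytes is consistent with SHA-256 padding: the message is followed by one 0x80 byte and an 8-byte length field, then rounded up to whole 64-byte blocks. A small sketch of that arithmetic follows; the function name `sha256_blocks` is hypothetical and not part of OpenSSL.

```cpp
#include <cstddef>

// Number of 64-byte compression blocks SHA-256 processes for a message of
// `len` bytes: message + 1 padding byte (0x80) + 8-byte length field,
// rounded up to a multiple of 64.
constexpr std::size_t sha256_blocks(std::size_t len)
{
    return (len + 1 + 8 + 63) / 64;
}

// The buffer sizes from the table above.
static_assert(sha256_blocks(32) == 1, "32 bytes fit in one block");
static_assert(sha256_blocks(55) == 1, "55 bytes is the largest one-block message");
static_assert(sha256_blocks(56) == 2, "56 bytes spill into a second block");
static_assert(sha256_blocks(64) == 2, "64 bytes need two blocks");
static_assert(sha256_blocks(128) == 3, "128 bytes need three blocks");
```

The block counts 1, 1, 2, 2, 3 line up with the steps in the SHA256_xxx column, while the EVP column adds a roughly constant per-hash overhead on top, which is why the ratio shrinks as the buffer grows.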
paulidale commented 1 year ago

128 bytes is not a large buffer.

CypherGrue commented 1 year ago

> 128 bytes is not a large buffer

If one needs the performance, it is because one is likely either hashing large data or hashing small data very many times.

For truly large buffers actually BLAKE2/3 (or others) may be more useful than SHA256 due to parallelism.

One common reason to hash small data many times is Merkle tree construction and verification (my use case here). For an N-leaf binary Merkle tree, SHA256 is appropriate given x64 acceleration, but it can take about 2N hash operations, depending on the implementation. That is, a 4096-leaf tree needs about 8192 hashing operations, each performed on ~64-byte buffers. This is why small-buffer performance matters for SHA256.

paulidale commented 1 year ago

Another possibility would be to call EVP_DigestInit_ex(mdctx, NULL, NULL) inside the loop after setting the CTX up outside. This avoids some of the repetitive setup work.

CypherGrue commented 1 year ago

> Another possibility would be to call EVP_DigestInit_ex(mdctx, NULL, NULL) inside the loop after setting the CTX up outside. This avoids some of the repetitive setup work.

This runs slower for me, like test case C.

With EVP_MD_CTX_copy_ex the performance hit for 64 byte buffers is only ~25%. Previously available documentation (e.g. https://wiki.openssl.org/index.php/EVP_Message_Digests) showed only non-repetitive use of the EVP API, and this issue has documented the performant repetitive use.