hlandau opened 9 months ago
This is a separate task from designing a solution, but I do wonder if we could have some kind of stack-local cache.
void do_foo(OSSL_LIB_CTX *libctx)
{
    NAME_CACHE cache;

    NAME_CACHE_init(&cache, libctx);
    // any subsidiary calls to ossl_namemap_name2num() or similar check the cache first
    foo(libctx, &cache);
    bar(libctx, &cache);
    baz(libctx, &cache);
    NAME_CACHE_cleanup(&cache);
}
The idea being that coarse operations like a TLS handshake use one cache.
We should also decide whether more fundamental rethinks, or performance optimisations (RCU, lockless, and so on), for the namemap are desirable.
Re: the above sketch: this could even be combined with RCU (e.g. an entire TLS handshake could take only one RCU read lock, to fill its cache, then use the NAME_CACHE). The NAME_CACHE doesn't have to strictly be on the stack if the handshake is resumable (e.g. it could live in the SSL object). This would reduce RCU read-lock cache contention. One option might be to generate a list of name lookups we want in libssl and do all of those lookups in a batch in one lookup call, meaning a single RCU read-lock acquisition for the namemap. We would need to define that loading/unloading providers during a handshake results in an unspecified set of algorithms being used (already the case, I imagine).
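To make the batch-lookup idea concrete, here is a self-contained toy sketch. This is not OpenSSL's API: the name_map/name_cache types are made up, and a pthread rwlock stands in for the RCU read side. It only shows resolving a whole list of names under a single read-side acquisition and answering later queries from a caller-owned cache.

#include <pthread.h>
#include <string.h>
#include <stdio.h>

#define MAP_SIZE 4

struct name_map {
    pthread_rwlock_t lock;           /* stand-in for the RCU read side */
    const char *names[MAP_SIZE];
    int nums[MAP_SIZE];
};

struct name_cache {
    const char *names[MAP_SIZE];
    int nums[MAP_SIZE];
    size_t count;
};

/* Resolve all requested names under a single read-lock acquisition. */
static void name_map_lookup_batch(struct name_map *map,
                                  const char *const *wanted, size_t n,
                                  struct name_cache *cache)
{
    cache->count = 0;
    pthread_rwlock_rdlock(&map->lock);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < MAP_SIZE; j++) {
            if (strcmp(wanted[i], map->names[j]) == 0) {
                cache->names[cache->count] = map->names[j];
                cache->nums[cache->count++] = map->nums[j];
                break;
            }
        }
    }
    pthread_rwlock_unlock(&map->lock);
}

/* Later lookups hit only the private cache; no shared lock is needed. */
static int name_cache_num(const struct name_cache *cache, const char *name)
{
    for (size_t i = 0; i < cache->count; i++)
        if (strcmp(cache->names[i], name) == 0)
            return cache->nums[i];
    return 0;
}

int main(void)
{
    struct name_map map = {
        PTHREAD_RWLOCK_INITIALIZER,
        { "SHA2-256", "SHA2-384", "AES-128-GCM", "HKDF" },
        { 1, 2, 3, 4 }
    };
    const char *wanted[] = { "SHA2-256", "HKDF" };
    struct name_cache cache;

    name_map_lookup_batch(&map, wanted, 2, &cache);
    printf("HKDF -> %d\n", name_cache_num(&cache, "HKDF"));
    return 0;
}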
Aggregated into epic https://github.com/openssl/project/issues/441
some interesting results for all to ponder:
I've re-implemented the hash table and replaced the old hash table implementation here: https://github.com/nhorman/openssl/tree/new-lhash
The new hash table is optionally reference counted and uses an internal RCU lock to protect its data structures. It also allows for several new features (see the commit log) that allow us to do lookups faster (for instance, we can now store multiple data types safely, allowing for reverse lookups without having to resort to a doall_names iteration over every entry in the namemap hash).
Performance is greatly improved when running the lhash_test in our test suite:
- 666.16 us to insert and delete 2.5 million entries using the old hash table
- 223.75 us to insert and delete 2.5 million entries using the new hash table
That's a significant improvement! There are some caveats, however:
1) The new hash table is configured to not do reference counting, but is not configured to be lockless
2) The new hash table is able to pre-configure the size of the bucket list at creation time, so as to avoid the periodic expansion and re-hashing of the list as it grows
I consider these to be acceptable in performance measurement, as for many of our hash tables, we can roughly predict how many entries (especially in the namemap) we will have. Note that this pre-configuration does not preclude growth and rehashing in the event that the number of entries inserted exceeds the initial preconfigured size.
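For illustration only, one way such creation-time sizing might work (a hypothetical helper, not the actual API): pick an initial bucket count from the expected entry count and target load factor, while still allowing growth if the estimate is exceeded.

#include <stddef.h>
#include <stdio.h>

/* Round the bucket count up to a power of two large enough to keep the
 * table under max_load with the expected number of entries. */
static size_t initial_buckets(size_t expected_entries, double max_load)
{
    size_t want = (size_t)(expected_entries / max_load) + 1;
    size_t buckets = 1;

    while (buckets < want)
        buckets <<= 1;
    return buckets;
}

int main(void)
{
    /* e.g. the namemap typically holds a few hundred names */
    printf("%zu buckets for 300 expected entries\n",
           initial_buckets(300, 0.75));
    return 0;
}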
Interestingly, when this new hash table was inserted, the handshake test's usage of rwlocks was eliminated from the performance profile (which is good), resulting in about a 3% performance improvement (as concluded from the fact that perf showed 3% of our time being spent in CRYPTO_THREAD_read_lock/unlock, which is not present in the new hash table). However, overall performance remained at parity (ranging from the previously mentioned 3% performance gain to parity in performance).
Doing some more profiling, I can see that when the performance is at parity, we lose that gain again, as the code execution spends more time in other rwlocks:
With the old hash table:
- 62.43% 0.01% handshake.oldlh handshake.oldlhash [.] perflib_create_bare_ssl_connection
  - 62.43% perflib_create_bare_ssl_connection
    - 31.01% 0x7fc8c6dbf37f
      - 20.69% 0x7fc8c6dc38e2
        - 20.68% ASN1_item_d2i_ex
          - 20.63% 0x7fc8c6ec9eed
            - 20.63% 0x7fc8c6ecba52
              - 20.60% 0x7fc8c6ec9eed
                - 20.41% 0x7fc8c6ecba52
                  - 20.03% 0x7fc8c6ec9ad3
                    - 18.17% 0x7fc8c7044921
                      - 18.16% OSSL_DECODER_CTX_new_for_pkey
                        - 16.65% OSSL_DECODER_CTX_add_extra
                          - 14.37% 0x7fc8c6f5b072
                            - 5.57% 0x7fc8c6fbd4f5
                              + 5.56% CRYPTO_THREAD_unlock
And the new hash table:
- 63.66% perflib_create_bare_ssl_connection
  - 32.36% 0x7f3185dbf37f
    - 21.97% 0x7f3185dc38e2
      - 21.97% ASN1_item_d2i_ex
        - 21.92% 0x7f3185ec9eed
          - 21.91% 0x7f3185ecba52
            - 21.88% 0x7f3185ec9eed
              - 21.70% 0x7f3185ecba52
                - 21.33% 0x7f3185ec9ad3
                  - 19.44% 0x7f3186044921
                    - 19.43% OSSL_DECODER_CTX_new_for_pkey
                      - 17.96% OSSL_DECODER_CTX_add_extra
                        - 15.57% 0x7f3185f5b072
                          - 5.86% 0x7f3185fbd4f5
                            + 5.85% CRYPTO_THREAD_unlock
This is just one example; there are other instances of other APIs that spend more time in CRYPTO_THREAD_[read|write]_lock/unlock as well.
I believe what this is telling me is that I should continue down this path iteratively, to see if there are other places where an RCU lock may provide more permanent performance gains.
Great work.
Eventually it would be nice to create a new hash table for internal use which offers a more modern design than LHASH. But not right now
- 19.43% OSSL_DECODER_CTX_new_for_pkey
  - 17.96% OSSL_DECODER_CTX_add_extra
    - 15.57% 0x7f3185f5b072
      - 5.86% 0x7f3185fbd4f5
        + 5.85% CRYPTO_THREAD_unlock
Hmm. This read lock occurs in the event of a cache miss while looking for a suitable template OSSL_DECODER_CTX. I would not expect cache misses to occur after the first iteration of the handshake test has completed - everything should be in the cache by that point. So this is an unexpected result. There is a read lock on the cache itself (in OSSL_DECODER_CTX_new_for_pkey) - which could be optimised to use RCU. But that isn't what this is showing us.
@nhorman may I ask you to compare your implementation with the current 3.2 for the case from https://github.com/openssl/openssl/issues/20286?
@hlandau the hash table in the branch I referenced above is exactly such a new implementation
@beldmit yes, of course. Will do so today
@mattcaswell yeah, scratch my previous comment, I wasn't able to reproduce it, apologies.
However I have noticed something odd about successive runs in my environment. I've seen increases in CPU cycles in my hash table implementation in sha512_block_data_order_avx and some of the evp_md code. On a lark I profiled the handshake app to record L1 cache misses in the CPU, and found a large number of cache misses in both of those code paths. Running pahole on libcrypto revealed two things: 1) struct evp_md_st had an 8-byte hole in the middle of it, which IIRC was spilling it into an extra cache line 2) SHA512_CTX has two arrays which were 8-byte aligned. While that's fine, several of the SIMD instructions our asm uses prefer 32-byte alignment; it's not required, but if it's not present, an exception is generated to fix up the load, slowing things down.
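As a generic illustration of those two findings (these are not the real OpenSSL structs), the following shows how member ordering creates a hole on a typical 64-bit ABI and how 32-byte alignment can be requested for a SIMD-friendly buffer:

#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>

struct holey {            /* 4-byte hole after a, 4 bytes of tail padding: 24 bytes */
    int a;
    void *p;
    int b;
};

struct packed_members {   /* same members, pointer first: no holes, 16 bytes */
    void *p;
    int a;
    int b;
};

struct simd_friendly {
    alignas(32) uint64_t h[8];   /* 32-byte aligned, AVX-friendly loads */
};

int main(void)
{
    printf("holey=%zu packed=%zu simd alignment=%zu\n",
           sizeof(struct holey), sizeof(struct packed_members),
           alignof(struct simd_friendly));
    return 0;
}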
I'm starting to wonder if we haven't been paying much attention to alignment, and as such, our performance changes are partially impacted by cache misses introduced by various changes to data structure layout.
I'm retesting now with some changes to (1) and (2) above to fix up the cache-line holes and alignment. Will report here shortly.
That's a bit more like it.
Making those small changes to the evp_md_st and SHA512_CTX structs resulted in a consistent performance gain, ranging from 0.1% - 0.7%. That's not a great deal of improvement, but it's consistent (multiple averaged runs of the handshake test produce the same results). There are other data structures that could use this treatment. In addition to the hash table replacement, I'd like to see how much we can optimize our data structures.
| Number of cores | master us/handshake | master handshakes | updates us/handshake | updates handshakes |
|---|---|---|---|---|
| 2 | 1371 | 1448 | 1337 | 1450 |
| 4 | 1426 | 2764 | 1417 | 2800 |
| 8 | 1537 | 5159 | 1534 | 5153 |
| 16 | 1864 | 8352 | 1854 | 8370 |
| 32 | 2646 | 11313 | 2610 | 11400 |
| 64 | 5397 | 11515 | 5375 | 11559 |
I would not expect cache misses to occur after the first iteration of the handshake test has completed - everything should be in the cache by that point.
It depends if the lock is stored in the same cache line as the data it protects (common mistake). It's faster when you have a single thread, and gets immediately worse when facing concurrency because other threads hammering the lock are in fact taking ownership of the cache line and causing L1 misses to the other thread that was expecting to touch the data.
I think @mattcaswell was referring to the software cache that openssl uses to store algorithms fetched from providers, not the CPU L1 cache, but you make a good point: it's entirely possible that the lock which the decoder uses to protect said cache isn't in the same line that stores any of the data it protects, let me look at that.
OK. But it's really preferable that locks are not in the same cache line as the data they protect, for the reason explained above.
Given @wtarreau's comments, I did some perf tracing on cache misses, and noted that EVP_MD_up_ref and EVP_MD_free are high on the list for cache contention in the handshake test. Moving the refcount variable in that struct from the middle to the end, so that it is on its own cache line, results in a consistent 1% performance increase. With the new hash table that nets us about a 2-3% increase.
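To illustrate the layout change being described (this is not the actual evp_md_st definition, just the general pattern of isolating a contended reference count on its own cache line so up_ref/free traffic doesn't evict the read-mostly fields):

#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

#define CACHE_LINE 64

struct alg {
    /* read-mostly data, shared by all threads doing lookups */
    int name_id;
    const char *description;
    const void *ops;

    /* contended counter isolated on its own cache line */
    alignas(CACHE_LINE) _Atomic int refcount;
};

int main(void)
{
    printf("refcount offset: %zu, struct size: %zu\n",
           offsetof(struct alg, refcount), sizeof(struct alg));
    return 0;
}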
@t8m taking your suggestion I hacked up openssl to just not do any locking. Doing so nets us no real performance increase, suggesting to me that locks alone are not really responsible for any significant performance impact
That said, I have noted that we spend lots of cycles operating inside loops iterating over hash tables (i.e. xxx_doall_names). Looking at some of the callers, this is often used not to modify each entry in a hash table, but rather to do a reverse-style lookup of an element (i.e. map a number to a name in the namemap). Of particular note was the EVP_KEYMGMT_do_all_names call, which searches for a keymgmt alg that matches a given name. Augmenting the new hash table to end the iteration when such a match is found reduces our iteration count significantly, and nets us another 1% increase.
Let me remind about this comment: https://github.com/openssl/openssl/issues/20286#issuecomment-1436570944
We have too many lookup calls - ossl_decoder_is_a is called 153 million times over 65536 key iterations and consumes ~20% of the time in itself and its descendants, including ossl_namemap_name2num (13% with descendants). Locks are invisible on the call graph (attached to the comment). We just call relatively cheap functions too many times...
There is a separate issue for investigation of the doall calls: https://github.com/openssl/project/issues/442
These are highly suspicious to me. Any design where we are calling a namemap's doall function in a "hot" (really, any post-provider-load/post-fetch) path is deeply suspicious to me and should probably be fixed.
Of course we know that there are serious problems with the current decoder architecture. https://github.com/openssl/project/issues/100 Anything that can improve on our performance there would be great.
FWIW, I've experimented in the new hash table with 2 things:
1) A doall_until and filter function that aborts the iteration when the callback returns 0 (a sketch of the callback shape follows below). We have several cases in which we iterate over the hash table looking for a single value (or small set of values). This reduces our iteration count by about 50%, and helps, but not as much as I would like.
2) Support for multiple data types, i.e. we can translate do_all lookups into lookups using a reverse key. This I think will help a lot, but not until we eliminate all/most of the do_all code, as adding in the additional elements for a reverse lookup just expands the exec time of the remaining doall calls.
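To make the shape of (1) concrete, here is a minimal, self-contained sketch of the doall_until idea; the names (table_doall_until, match_name) are illustrative only, not the actual hashtable API:

#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct entry { const char *name; int num; };

/* The callback returns 0 to stop the walk, non-zero to continue. */
typedef int (*until_cb)(struct entry *e, void *arg);

/* Iterate until the callback asks to stop; return 1 if stopped early. */
static int table_doall_until(struct entry *tbl, size_t n,
                             until_cb cb, void *arg)
{
    for (size_t i = 0; i < n; i++)
        if (cb(&tbl[i], arg) == 0)
            return 1;
    return 0;
}

struct find { const char *want; int num; };

static int match_name(struct entry *e, void *arg)
{
    struct find *f = arg;

    if (strcmp(e->name, f->want) != 0)
        return 1;        /* keep going */
    f->num = e->num;
    return 0;            /* found it, stop the walk */
}

int main(void)
{
    struct entry tbl[] = { { "RSA", 6 }, { "EC", 408 }, { "ED25519", 1087 } };
    struct find f = { "EC", -1 };

    table_doall_until(tbl, 3, match_name, &f);
    printf("EC -> %d\n", f.num);
    return 0;
}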
Some more performance info. Currently, in addition to the new hashtable implementation, I've expanded it to include a doall_until api call, which behaves identically to the old doall function, except that the callback returns an integer indicating that the iterator can terminate. We have several API points that effectively do a reverse lookup in a hash table by scanning the whole table looking for a single entry. With the _until variant, we can reduce the number of iterations significantly for these calls. Results of the handshake test
master (baseline):

| threads | usec/iteration | total handshakes |
|---|---|---|
| 2 | 1361 | 1462 |
| 4 | 1420 | 2767 |
| 8 | 1535 | 5153 |
| 16 | 1866 | 8356 |
| 32 | 2655 | 11340 |
| 64 | 5415 | 11496 |
With improvements:

| threads | usec/iteration | total handshakes |
|---|---|---|
| 2 | 1368 | 1454 |
| 4 | 1403 | 2803 |
| 8 | 1528 | 5167 |
| 16 | 1836 | 8504 |
| 32 | 2609 | 11541 |
| 64 | 5228 | 11881 |
This represents an improvement of 0% (parity) at low thread counts up to 3.3% at higher thread counts.
I think I found something interesting:
In trying to focus on the high number of ossl_namemap_name2num calls during the handshake test, I sampled some of the backtraces during the middle of a run using a high thread count, and noted that several of them were coming from a path headed by EVP_KDF_fetch. It appears that during an SSL handshake we frequently (very frequently) call EVP_KDF_fetch, which calls through ossl_namemap_name2num to get the KDF object, after which we initialize it within a context, then immediately free it, when we could be reusing it. I hacked up the code to cache the KDF pointer in the SSL structure, freeing it only when SSL_free was called, and immediately saw a performance benefit of approximately 0.6%. This appears to be a usage pattern in SSL that we might be able to take advantage of with other objects.
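As a rough illustration of that fetch-once pattern (the struct below is a hypothetical stand-in for SSL_CONNECTION; only the EVP_KDF_* calls are the real public OpenSSL 3.x API, and error checking is omitted):

#include <openssl/evp.h>
#include <openssl/kdf.h>

struct conn {
    OSSL_LIB_CTX *libctx;
    const char *propq;
    EVP_KDF *cached_hkdf;     /* fetched once, reused for every expand */
};

/* Return the HKDF implementation, fetching it only on first use. */
static EVP_KDF *conn_get_hkdf(struct conn *c)
{
    if (c->cached_hkdf == NULL)
        c->cached_hkdf = EVP_KDF_fetch(c->libctx, "HKDF", c->propq);
    return c->cached_hkdf;
}

/* Called from the teardown path (the SSL_free analogue). */
static void conn_cleanup(struct conn *c)
{
    EVP_KDF_free(c->cached_hkdf);
    c->cached_hkdf = NULL;
}

int main(void)
{
    struct conn c = { NULL, NULL, NULL };   /* default libctx, default propq */
    EVP_KDF_CTX *kctx = EVP_KDF_CTX_new(conn_get_hkdf(&c));

    /* ... set params and derive as usual ... */
    EVP_KDF_CTX_free(kctx);
    conn_cleanup(&c);
    return 0;
}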
OK, I can confirm that caching the EVP_KDF allocated in tls13_hkdf_expand_ex in the SSL_CONNECTION object saves us 3.35 million calls to EVP_KDF_fetch (and transitively that many calls to ossl_namemap_name2num): 13906001 calls without the caching vs 10549836 with it. This was determined using the handshake test running 32 threads.
Continuing on this, if I hack up the hkdf provider to save the last used digest and reuse it if the DIGEST parameter is the same, I save a total of 6857312 calls to ossl_namemap_name2num! (13943693 vs 7086381 calls), and at high thread counts I achieve a 10% performance improvement!
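A minimal sketch of what that provider-side reuse could look like, assuming a hypothetical hkdf_ctx_sketch structure; the real provider context and parameter handling are more involved:

#include <stdio.h>
#include <string.h>
#include <openssl/evp.h>

struct hkdf_ctx_sketch {
    OSSL_LIB_CTX *libctx;
    char last_md_name[64];
    EVP_MD *last_md;
};

static EVP_MD *hkdf_get_md(struct hkdf_ctx_sketch *ctx, const char *mdname)
{
    /* fast path: same digest as the previous call, no namemap lookup */
    if (ctx->last_md != NULL && strcmp(ctx->last_md_name, mdname) == 0)
        return ctx->last_md;

    EVP_MD_free(ctx->last_md);                       /* drop the stale one */
    ctx->last_md = EVP_MD_fetch(ctx->libctx, mdname, NULL);
    if (ctx->last_md != NULL)
        snprintf(ctx->last_md_name, sizeof(ctx->last_md_name), "%s", mdname);
    return ctx->last_md;
}

int main(void)
{
    struct hkdf_ctx_sketch ctx = { NULL, "", NULL };
    EVP_MD *md1 = hkdf_get_md(&ctx, "SHA2-256");
    EVP_MD *md2 = hkdf_get_md(&ctx, "SHA2-256");   /* reused, no new fetch */

    printf("reused: %d\n", md1 == md2);
    EVP_MD_free(ctx.last_md);
    return 0;
}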
I think we need to focus on locations in the code where evp objects are fetched and not reused.
For other algs we cache in the SSL_CTX rather than the SSL, which means we don't even need to do the fetches for each connection - we can just do them once.
This has the unfortunate side effect of making SSL_CTX creation more expensive, which we've discussed before. That was originally considered ok, but now my thinking is that we should try to move these caches to the OSSL_LIB_CTX instead.
This has the unfortunate side effect of making SSL_CTX creation more expensive, which we've discussed before. That was originally considered ok, but now my thinking is that we should try to move these caches to the OSSL_LIB_CTX instead.
Some of the cached fetches are inside the provider - i.e. the digest used by hkdf - we would have to duplicate the EVP_KDF_CTX with the digest set and cache that.
FWIW, about a year ago or so, I tried something a bit dirty on 3.0. I noticed that many lookup functions were almost always called with the same argument, so I just implemented a thread-local static cache in them to be able to return the previous entry if the content was the same, something like:
foo *bar_lookup(const bar *key)
{
    /* per-thread memo of the previous lookup */
    static __thread const bar *last_lookup;
    static __thread foo *last_return;

    if (key == last_lookup)
        return last_return;
    ...
}
This managed to significantly reduce the amount of locking, indicating that it was much less often going down the lookup chain. Of course this cannot be used just like this; the looked-up key's content needs to be checked as well, but you get the idea: looking up "RSA" 1 million times a second is not reasonable, for example, and it'd better be cached most of the time (that's an example, I don't remember what was being looked up). I also noticed that some functions were involved in a lot of lookups of various keys and did not sufficiently benefit from this.
Unfortunately I don't remember anymore which functions were concerned, and I didn't dig very far, but it's just to illustrate some inexpensive principles that can sometimes help a lot when there's excess of lookups.
That's effectively what I did in the digest code: I used a global LRU list, which is not going to work long term; I need to clean this all up.
OK, good results. I lost a little performance as my hack took some shortcuts that were decidedly going to raise the suspicions of ASan and ThreadSanitizer, but it's still quite good.
I've got a new working branch here, based off my new hash table: https://github.com/nhorman/openssl/tree/cache-algs
This branch does the following:
1) Introduces the new RCU-based hash table (as the prior new-lhash branch did)
2) Converts the ossl_namemap code to use (1)
3) Does some optimizations in various lookup paths to reduce the iteration count when ossl_namemap_doall_names is called
4) Creates a new hashtable in the OSSL_LIB_CTX called algcache
5) Adds OSSL_LIB_CTXget[md|kdf] calls to allow for the caching of KDF and MD objects returned from their respective fetchers
6) Augments the TLS 1.3 code to use (5), reducing the number of calls to EVP_KDF_fetch significantly, and the corresponding number of calls to ossl_namemap_name2num (by about 3.5 million calls in the handshake test at 32 threads)
7) Updates the ossl_prov_digest_load_from_params code to use the same context-based caching to store fetched MDs
Note: We're leaving some performance on the table here, as this can be expanded to cover other EVP objects as well. I'll be looking at that going forward
Results with these changes:

| threads | old usecs | old handshakes | new usecs | new handshakes | performance improvement |
|---|---|---|---|---|---|
| 2 | 1360 | 1457 | 1356 | 1469 | 0% |
| 4 | 1406 | 2792 | 1381 | 2843 | 1.8% |
| 8 | 1540 | 5112 | 1492 | 5310 | 3.2% |
| 16 | 1872 | 8341 | 1717 | 9009 | 9.0% |
| 32 | 2661 | 11307 | 2448 | 12304 | 8.7% |
| 64 | 5371 | 11596 | 4841 | 12856 | 10.9% |
Note: The performance improvement is measured using the usecs value, i.e. (old usecs - new usecs) / new usecs. The improvement is larger if you measure the total number of handshakes completed.
Note: old (baseline) values are based on the current head of the master branch
At lower thread counts we get no benefit, which isn't overly surprising, as we're spending most of our time constructing the SSL connection anyway, but as our thread counts increase we very quickly approach an 8-10% gain in performance.
I think this is probably a good time to take stock, and start considering inclusion of some of this code. I'd like to propose it in three steps:
1) Integrate the new hash table
2) Convert the namemap to use (1)
3) Introduce the context algorithm cache
Thoughts?
Sounds like a good plan. I'll take a look at some of this in more detail later.
Adds OSSL_LIB_CTXget[md|kdf] calls to allow for the caching of KDF and MD objects returned from their respective fetchers
Augments the tls1.3 code to use (5) , reducing the number of calls to EVP_KDF_fetch significantly, and the corresponding number of calls to ossl_namemap_name2num (by about 3.5 million calls in the handshake test at 32 threads)
I do not think these are necessary. This should be something that is built inside the EVP_*_fetch() functions.
Is that going to work as well though? I.e. when built into a libctx, the cache gets freed when the context is freed, so if we load a provider that changes the origin of a given EVP_*, deleting and recreating a context gets the implementation from the new provider, whereas storing these caches with the EVP libraries implies they will stick around in perpetuity.
No, it would be still stored in libctx. There is no such thing as "EVP libraries" - it is just a (large) set of similarly named functions that are all implemented in crypto/evp but they all still operate within a libctx.
Ah, you're saying that the inserts and lookups for cached objects should be done within the EVP_*_fetch call? yeah, that could work. Once I get the hash table PR sorted, I'll revisit this and make that change.
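For illustration only, a toy sketch of that direction: the fetch path itself consults a small per-libctx cache keyed by algorithm name and property query, so callers never need their own caching layer. All names here are hypothetical, and a real implementation would need locking, reference counting, and invalidation on provider load/unload:

#include <stdio.h>
#include <string.h>

#define SLOTS 8

/* Assumes the name/propq strings outlive the cache (true for literals here). */
struct cached_impl { const char *name; const char *propq; void *impl; };

struct libctx_sketch {
    struct cached_impl cache[SLOTS];
    int used;
};

/* Stand-in for the expensive provider/namemap work a real fetch does. */
static void *real_fetch(const char *name, const char *propq)
{
    static int obj;

    (void)propq;
    printf("real fetch of %s\n", name);
    return &obj;
}

static void *fetch(struct libctx_sketch *ctx, const char *name,
                   const char *propq)
{
    for (int i = 0; i < ctx->used; i++)
        if (strcmp(ctx->cache[i].name, name) == 0
                && strcmp(ctx->cache[i].propq, propq) == 0)
            return ctx->cache[i].impl;               /* cache hit */

    void *impl = real_fetch(name, propq);            /* miss: do the real work */
    if (ctx->used < SLOTS)
        ctx->cache[ctx->used++] = (struct cached_impl){ name, propq, impl };
    return impl;
}

int main(void)
{
    struct libctx_sketch ctx = { .used = 0 };

    fetch(&ctx, "HKDF", "");
    fetch(&ctx, "HKDF", "");     /* second call is answered from the cache */
    return 0;
}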
FWIW, I've made some significant adjustments to the hashtable implementation (using a hopscotch lookup rather than a chained lookup, with neighborhoods aligned on cacheline boundaries). I've also reworked the locking approach based on some comments received (not published yet), but with just the new updated hashtable being used in the namemap, I see this performance profile:

| threads | old usec | old handshakes | new usec | new handshakes | perf delta |
|---|---|---|---|---|---|
| 2 | 676 | 1463 | 659 | 1496 | 2.5700% |
| 4 | 1057 | 2783 | 1033 | 2839 | 2.3200% |
| 8 | 1350 | 5107 | 1318 | 5245 | 2.4200% |
| 16 | 1765 | 8236 | 1679 | 8699 | 5.1200% |
| 32 | 2585 | 11172 | 2463 | 11645 | 4.9500% |
| 64 | 5317 | 11537 | 5091 | 12036 | 4.4300% |
So with just the table replacement alone I see between 2-5% performance benefit. With that alongside a context based alg cache, I think we will see some significant gains
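For readers unfamiliar with the term, a rough sketch of the layout idea behind cache-line-aligned neighborhoods; this shows only the probe, not hopscotch's insertion/displacement logic, and is not the actual implementation:

#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64

struct slot { uint64_t hash; void *value; };           /* 16 bytes per slot */

/* One neighborhood = one cache line's worth of slots. */
struct neighborhood {
    alignas(CACHE_LINE) struct slot slots[CACHE_LINE / sizeof(struct slot)];
};

/* Probe only the key's home neighborhood: a single cache-line touch.
 * nbuckets must be a power of two so the mask works. */
static void *nh_lookup(const struct neighborhood *tbl, size_t nbuckets,
                       uint64_t h)
{
    const struct neighborhood *nh = &tbl[h & (nbuckets - 1)];
    size_t width = sizeof(nh->slots) / sizeof(nh->slots[0]);

    for (size_t i = 0; i < width; i++)
        if (nh->slots[i].hash == h && nh->slots[i].value != NULL)
            return nh->slots[i].value;
    return NULL;
}

int main(void)
{
    static struct neighborhood table[8];    /* 8 buckets, power of two */
    int payload = 42;

    table[7].slots[0].hash = 7;             /* pretend hash(key) == 7 */
    table[7].slots[0].value = &payload;
    printf("found: %d\n", *(int *)nh_lookup(table, 8, 7));
    return 0;
}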
So, some more numbers, and they're quite startling. I've been focused on the handshake test primarily, as that's what we started this issue with, but looking at the other perf tests that we have, the performance improvement seems... very much significantly better.
Test methodology: On my system here, I'm running the pemread, handshake, rsasign, providerdoall and sslnew tests using this script:
#!/bin/bash
#
CMD=$1
BASELIB=$2
NEWLIB=$3
ITERATIONS=$4

EXTRARG=--terse
if [ "$CMD" == "./handshake" ]
then
    EXTRARG=/home/nhorman/git/openssl/test/certs/
fi

BASERESULTS=$(mktemp /tmp/base.XXXXXX)
NEWRESULTS=$(mktemp /tmp/new.XXXXXX)

finish() {
    rm -f $BASERESULTS
    rm -f $NEWRESULTS
}

trap finish EXIT

# Run $CMD ITERATIONS times for each thread count against the library in
# $libpath and record the per-thread-count averages in $results
do_test() {
    local libpath=$1
    local results=$2

    export LD_LIBRARY_PATH=$libpath
    for i in 2 4 8 16 32 64
    do
        let tusecs=0
        let thands=0
        for j in `seq 1 1 $ITERATIONS`
        do
            $CMD $EXTRARG $i > /tmp/results
            if [ "$CMD" == "./handshake" ]
            then
                USEC=$(awk '/Average/ {print $5}' /tmp/results | sed -e"s/us//")
                HAND=$(awk '/Handshake/ {print $4}' /tmp/results)
                tusecs=$(dc -e "$tusecs $USEC + p")
                thands=$(dc -e "$thands $HAND + p")
            else
                USEC=$(cat /tmp/results)
                tusecs=$(dc -e "$tusecs $USEC + p")
            fi
        done
        if [ "$CMD" == "./handshake" ]
        then
            ausec=$(dc -e "$tusecs $ITERATIONS / p")
            ahand=$(dc -e "$thands $ITERATIONS / p")
            echo "$i $ausec $ahand" >> $results
        else
            ausec=$(dc -e "$tusecs $ITERATIONS / p")
            echo "$i $ausec" >> $results
        fi
    done
}

do_test $BASELIB $BASERESULTS
do_test $NEWLIB $NEWRESULTS

# Emit a markdown table comparing the two runs; delta = (old - new) / new
if [ "$CMD" == "./handshake" ]
then
    echo "|threads|old usec|old handshakes|new usec|new handshakes| perf delta|" > ./$CMD-results.dat
    echo "|-------|--------|--------------|--------|--------------|-----------|" >> ./$CMD-results.dat
    for i in 2 4 8 16 32 64
    do
        OLD_USEC=$(awk "/^$i / {print \$2}" $BASERESULTS)
        OLD_HAND=$(awk "/^$i / {print \$3}" $BASERESULTS)
        NEW_USEC=$(awk "/^$i / {print \$2}" $NEWRESULTS)
        NEW_HAND=$(awk "/^$i / {print \$3}" $NEWRESULTS)
        DELTA=$(dc -e "4 k $OLD_USEC $NEW_USEC - $NEW_USEC / 100 * p")
        echo "|$i|$OLD_USEC|$OLD_HAND|$NEW_USEC|$NEW_HAND|$DELTA%" >> ./$CMD-results.dat
    done
else
    echo "|threads|old usec|new usec| perf delta|" > ./$CMD-results.dat
    echo "|-------|--------|--------|-----------|" >> ./$CMD-results.dat
    for i in 2 4 8 16 32 64
    do
        OLD_USEC=$(awk "/^$i / {print \$2}" $BASERESULTS)
        NEW_USEC=$(awk "/^$i / {print \$2}" $NEWRESULTS)
        DELTA=$(dc -e "4 k $OLD_USEC $NEW_USEC - $NEW_USEC / 100 * p")
        echo "|$i|$OLD_USEC|$NEW_USEC|$DELTA%" >> ./$CMD-results.dat
    done
fi
It runs the requested command ITERATIONS times, for 2, 4, 8, 16, 32 and 64 threads, collecting the average run time for each thread count, and outputs a table of the results, including the performance delta (positive values indicating an improvement in run time).
I'm using the master branch as a baseline, and this branch to hold my improvements: https://github.com/nhorman/openssl/tree/handshake-test
The improvements amount to the new hash table, plus a replacement of the existing hash table in the namemap code with the new implementation. Other improvements there include:
1) Using the new hashtable locklessly
2) Storing multiple data types to allow for reverse mappings, avoiding table iterations in ossl_namemap_doall_names and ossl_namemap_num2name (sketched below)
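As a toy illustration of point (2): store the reverse mapping at insert time so num2name becomes a direct lookup instead of an iteration. These are hypothetical structures, not the real namemap:

#include <stdio.h>
#include <string.h>

#define MAX 16

struct two_way_map {
    const char *by_num[MAX + 1];   /* num -> name (direct index) */
    const char *names[MAX];        /* name -> num (linear here for brevity) */
    int nums[MAX];
    int count;
};

static int map_add(struct two_way_map *m, const char *name, int num)
{
    if (m->count >= MAX || num < 1 || num > MAX)
        return 0;
    m->names[m->count] = name;
    m->nums[m->count++] = num;
    m->by_num[num] = name;          /* reverse entry written at insert time */
    return 1;
}

static const char *map_num2name(const struct two_way_map *m, int num)
{
    return (num >= 1 && num <= MAX) ? m->by_num[num] : NULL;
}

int main(void)
{
    struct two_way_map m = { 0 };

    map_add(&m, "SHA2-256", 1);
    map_add(&m, "AES-128-GCM", 2);
    printf("2 -> %s\n", map_num2name(&m, 2));
    return 0;
}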
The results for the handshake test are above; the remaining results are here:

pemread:

| threads | old usec | new usec | perf delta |
|---|---|---|---|
| 2 | 13980 | 11915 | 17.3300% |
| 4 | 10198 | 6825 | 49.4200% |
| 8 | 11009 | 3869 | 184.5400% |
| 16 | 11499 | 3040 | 278.2500% |
| 32 | 9882 | 2661 | 271.3600% |
| 64 | 9037 | 2596 | 248.1100% |
sslnew:

| threads | old usec | new usec | perf delta |
|---|---|---|---|
| 2 | 2111 | 1892 | 11.5700% |
| 4 | 1342 | 1284 | 4.5100% |
| 8 | 1305 | 1207 | 8.1100% |
| 16 | 1193 | 1149 | 3.8200% |
| 32 | 978 | 970 | 0.8200% |
| 64 | 937 | 948 | -1.1600% |
rsasign:

| threads | old usec | new usec | perf delta |
|---|---|---|---|
| 2 | 46574 | 46274 | 0.6400% |
| 4 | 52035 | 51647 | 0.7500% |
| 8 | 65875 | 66244 | -0.5500% |
| 16 | 308334 | 302090 | 2.0600% |
| 32 | 781380 | 786436 | -0.6400% |
| 64 | 1592784 | 1586288 | 0.4000% |
providerdoall:

| threads | old usec | new usec | perf delta |
|---|---|---|---|
| 2 | 1018 | 1430 | -28.8100% |
| 4 | 2306 | 2942 | -21.6100% |
| 8 | 5081 | 5956 | -14.6900% |
| 16 | 9882 | 7529 | 31.2500% |
| 32 | 14473 | 15308 | -5.4500% |
| 64 | 19157 | 20193 | -5.1300% |
I need to figure out what's going on with the providerdoall test, as it doesn't appear to hit the namemap at all, so I'm not sure why there are such wild swings in performance there, but pemread and sslnew, if those numbers are to be relied on, show significant improvements.
May I also ask you to add 1 thread?
@beldmit as requested, performance numbers including single-threaded behavior:

handshake:

| threads | old usec | old handshakes | new usec | new handshakes | perf delta |
|---|---|---|---|---|---|
| 1 | 1353 | 738 | 1376 | 727 | -1.6700% |
| 2 | 1385 | 1433 | 1351 | 1474 | 2.5100% |
| 4 | 1452 | 2724 | 1391 | 2835 | 4.3800% |
| 8 | 1567 | 5038 | 1522 | 5174 | 2.9500% |
| 16 | 1929 | 7977 | 1837 | 8509 | 5.0000% |
| 32 | 2847 | 10878 | 2602 | 11508 | 9.4100% |
| 64 | 5642 | 11002 | 5264 | 11813 | 7.1800% |
newrawkey:

| threads | old usec | new usec | perf delta |
|---|---|---|---|
| 1 | 2154 | 1372 | 56.9900% |
| 2 | 1674 | 1630 | 2.6900% |
| 4 | 1388 | 1067 | 30.0800% |
| 8 | 1129 | 934 | 20.8700% |
| 16 | 1461 | 837 | 74.5500% |
| 32 | 3761 | 854 | 340.3900% |
| 64 | 2618 | 843 | 210.5500% |
pemread:

| threads | old usec | new usec | perf delta |
|---|---|---|---|
| 1 | 16085 | 15682 | 2.5600% |
| 2 | 14617 | 11544 | 26.6100% |
| 4 | 10173 | 6855 | 48.4000% |
| 8 | 10783 | 3877 | 178.1200% |
| 16 | 11216 | 3200 | 250.5000% |
| 32 | 9226 | 2650 | 248.1500% |
| 64 | 9268 | 2607 | 255.5000% |
providerdoall:

| threads | old usec | new usec | perf delta |
|---|---|---|---|
| 1 | 185 | 175 | 5.7100% |
| 2 | 885 | 1421 | -37.7100% |
| 4 | 1980 | 2823 | -29.8600% |
| 8 | 4276 | 5660 | -24.4500% |
| 16 | 8496 | 8268 | 2.7500% |
| 32 | 12414 | 14618 | -15.0700% |
| 64 | 20799 | 19623 | 5.9900% |
rsasign:

| threads | old usec | new usec | perf delta |
|---|---|---|---|
| 1 | 43523 | 43742 | -0.5000% |
| 2 | 47295 | 45811 | 3.2300% |
| 4 | 52773 | 51809 | 1.8600% |
| 8 | 65140 | 64826 | 0.4800% |
| 16 | 300849 | 295378 | 1.8500% |
| 32 | 778707 | 781526 | -0.3600% |
| 64 | 1570353 | 1563878 | 0.4100% |
sslnew:

| threads | old usec | new usec | perf delta |
|---|---|---|---|
| 1 | 2930 | 2773 | 5.6600% |
| 2 | 2040 | 1872 | 8.9700% |
| 4 | 1310 | 1250 | 4.8000% |
| 8 | 1215 | 1222 | -0.5700% |
| 16 | 1129 | 1137 | -0.7000% |
| 32 | 962 | 970 | -0.8200% |
| 64 | 968 | 946 | 2.3200% |
Many thanks!
@beldmit no problem. FWIW, in looking at the odd performance profile of the providerdoall test, I reran the test 20 times for each thread count (1, 2, 4, 8, 16, 32, 64), recorded the output, and then graphed the results with a best-fit line:
That output suggests to me that our performance is effectively identical between the baseline master branch and the hashtable/improvements branch (which I would expect in this test, given that it makes no use of the hashtable changes). What does appear to be happening is a large variance in the individual run times of this test (possibly related to stochastic thread contention; unsure right now). In either case, I think we should probably look at this issue separately.
Quite possibly we have a severe architectural issue here. We may need to rethink how we use this infrastructure internally (stack-local cache?)
See #428.