openssl / project

Tracking of project related issues

Investigate high number of calls to ossl_namemap_name2num #440

Open hlandau opened 9 months ago

hlandau commented 9 months ago

Quite possible we have a severe architectural issue here. We may need to rethink how we use this infrastructure internally (stack-local cache?)

See #428.

hlandau commented 9 months ago

This is a separate task from designing a solution, but I do wonder if we could have some kind of stack-local cache.

void do_foo(OSSL_LIB_CTX *libctx)
{
    NAME_CACHE cache;

    NAME_CACHE_init(&cache, libctx);

    // any subsidiary calls to ossl_namemap_name2num() or similar check the cache first
    foo(libctx, &cache);
    bar(libctx, &cache);
    baz(libctx, &cache);

    NAME_CACHE_cleanup(&cache);
}

The idea being that coarse-grained operations, like a TLS handshake, use a single cache.

hlandau commented 9 months ago

We should also decide whether a more fundamental rethink, or performance optimisations (RCU, lockless structures, and so on), for the namemap is desirable.

hlandau commented 9 months ago

Re: the above sketch: this could even be combined with RCU. For example, an entire TLS handshake could take only one RCU read lock to fill its cache, then use the NAME_CACHE thereafter. The NAME_CACHE doesn't strictly have to be on the stack if the handshake is resumable (e.g. it could live in the SSL object). This would reduce RCU read lock cache contention. One option might be to generate the list of name lookups we want in libssl and do all of those lookups in a single batched call, meaning a single RCU read lock acquisition for the namemap. We would need to define that loading/unloading providers during a handshake results in an unspecified set of algorithms being used (which I imagine is already the case).
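
To make the batching idea concrete, here is a minimal sketch with toy types and names (nothing here is the real namemap API): a whole list of names is resolved under a single read-lock acquisition instead of one lock per lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins for the namemap; entries and lock are hypothetical. */
struct name_entry { const char *name; int num; };
static const struct name_entry map[] = {
    { "SHA2-256", 1 }, { "AES-128-GCM", 2 }, { "HKDF", 3 },
};

static int read_locks_taken;
static void map_read_lock(void)   { read_locks_taken++; }
static void map_read_unlock(void) { }

/* Must be called with the read lock held. */
static int lookup_locked(const char *name)
{
    for (size_t i = 0; i < sizeof(map) / sizeof(map[0]); i++)
        if (strcmp(map[i].name, name) == 0)
            return map[i].num;
    return 0;
}

/* Resolve a whole batch of names under one read-lock acquisition. */
static void lookup_batch(const char **names, int *nums, size_t n)
{
    map_read_lock();
    for (size_t i = 0; i < n; i++)
        nums[i] = lookup_locked(names[i]);
    map_read_unlock();
}
```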

beldmit commented 9 months ago

See also https://github.com/openssl/openssl/issues/20286

hlandau commented 9 months ago

Aggregated into epic https://github.com/openssl/project/issues/441

nhorman commented 9 months ago

some interesting results for all to ponder:

I've re-implemented the hash table and replaced the old hash table implementation here: https://github.com/nhorman/openssl/tree/new-lhash

The new hash table is optionally reference counted and uses an internal RCU lock to protect its data structures. It also allows for several new features (see the commit log) that let us do lookups faster; for instance, we can now store multiple data types safely, allowing for reverse lookups without having to resort to a doall_names iteration over every entry in the namemap hash.

Performance is greatly improved when running lhash_test in our test suite:

- 666.16 us to insert and delete 2.5 million entries using the old hash table
- 223.75 us to insert and delete 2.5 million entries using the new hash table

That's a significant improvement! There are some caveats, however:

1. The new hash table is configured to do no reference counting, but is also not configured to be lockless
2. The new hash table is able to pre-configure the size of the bucket list at creation time, so as to avoid the periodic expansion and re-hashing of the list as it grows

I consider these acceptable for performance measurement, as for many of our hash tables (especially the namemap) we can roughly predict how many entries we will have. Note that this pre-configuration does not preclude growth and rehashing in the event that the number of entries inserted exceeds the initial preconfigured size.

Interestingly, when this new hash table was inserted, the handshake test's usage of RWLOCKs was eliminated from the performance profile (which is good), resulting in about a 3% performance improvement (as concluded from the fact that perf showed 3% of our time was spent in CRYPTO_THREAD_read_lock/unlock, which is not present in the new hash table). However, overall performance remained near parity (ranging from the previously mentioned 3% performance gain down to parity).

Doing some more profiling, I can see that when the performance is at parity, we lose that gain again, as the code execution spends more time in other RWLOCKs:

With the old hash table:

-   62.43%     0.01%  handshake.oldlh  handshake.oldlhash                  [.] perflib_create_bare_ssl_connection                                ◆
   - 62.43% perflib_create_bare_ssl_connection                                                                                                   ▒
      - 31.01% 0x7fc8c6dbf37f                                                                                                                    ▒
         - 20.69% 0x7fc8c6dc38e2                                                                                                                 ▒
            - 20.68% ASN1_item_d2i_ex                                                                                                            ▒
               - 20.63% 0x7fc8c6ec9eed                                                                                                           ▒
                  - 20.63% 0x7fc8c6ecba52                                                                                                        ▒
                     - 20.60% 0x7fc8c6ec9eed                                                                                                     ▒
                        - 20.41% 0x7fc8c6ecba52                                                                                                  ▒
                           - 20.03% 0x7fc8c6ec9ad3                                                                                               ▒
                              - 18.17% 0x7fc8c7044921                                                                                            ▒
                                 - 18.16% OSSL_DECODER_CTX_new_for_pkey                                                                          ▒
                                    - 16.65% OSSL_DECODER_CTX_add_extra                                                                          ▒
                                       - 14.37% 0x7fc8c6f5b072                                                                                   ▒
                                          - 5.57% 0x7fc8c6fbd4f5                                                                                 ▒
                                             + 5.56% CRYPTO_THREAD_unlock  

And the new hash table

   - 63.66% perflib_create_bare_ssl_connection                                                                                                   ▒
      - 32.36% 0x7f3185dbf37f                                                                                                                    ▒
         - 21.97% 0x7f3185dc38e2                                                                                                                 ▒
            - 21.97% ASN1_item_d2i_ex                                                                                                            ▒
               - 21.92% 0x7f3185ec9eed                                                                                                           ▒
                  - 21.91% 0x7f3185ecba52                                                                                                        ▒
                     - 21.88% 0x7f3185ec9eed                                                                                                     ▒
                        - 21.70% 0x7f3185ecba52                                                                                                  ▒
                           - 21.33% 0x7f3185ec9ad3                                                                                               ▒
                              - 19.44% 0x7f3186044921                                                                                            ▒
                                 - 19.43% OSSL_DECODER_CTX_new_for_pkey                                                                          ▒
                                    - 17.96% OSSL_DECODER_CTX_add_extra                                                                          ▒
                                       - 15.57% 0x7f3185f5b072                                                                                   ▒
                                          - 5.86% 0x7f3185fbd4f5                                                                                 ▒
                                             + 5.85% CRYPTO_THREAD_unlock                      

This is exemplary; there are other instances of other APIs that spend more time in CRYPTO_THREAD_[read|write]_lock/unlock as well.

I believe what this is telling me is that I should continue down this path iteratively, to see if there are other places where an RCU lock may provide more permanent performance gains.

hlandau commented 9 months ago

Great work.

Eventually it would be nice to create a new hash table for internal use which offers a more modern design than LHASH. But not right now.

mattcaswell commented 9 months ago
                             - 19.43% OSSL_DECODER_CTX_new_for_pkey                                                                          ▒
                                - 17.96% OSSL_DECODER_CTX_add_extra                                                                          ▒
                                   - 15.57% 0x7f3185f5b072                                                                                   ▒
                                      - 5.86% 0x7f3185fbd4f5                                                                                 ▒
                                         + 5.85% CRYPTO_THREAD_unlock   

Hmm, this read lock occurs in the event of a cache miss while looking for a suitable template OSSL_DECODER_CTX. I would not expect cache misses to occur after the first iteration of the handshake test has completed - everything should be in the cache by that point. So this is an unexpected result. There is a read lock on the cache itself (in OSSL_DECODER_CTX_new_for_pkey) - which could be optimised to use RCU. But that isn't what this is showing us.

beldmit commented 9 months ago

@nhorman may I ask you to compare your implementation with the current 3.2 for the case from https://github.com/openssl/openssl/issues/20286?

nhorman commented 9 months ago

@hlandau the hash table in the branch I referenced above is exactly such a new implementation

nhorman commented 9 months ago

@beldmit yes, of course. Will do so today

nhorman commented 9 months ago

@mattcaswell yeah, scratch my previous comment, I wasn't able to reproduce it, apologies.

However, I have noticed something odd about successive runs in my environment. I've seen increases in CPU cycles in my hash table implementation in sha512_block_data_order_avx and some of the evp_md code. On a lark I profiled the handshake app to record L1 cache misses in the CPU, and found a large number of cache misses in both of those code paths. Running pahole on libcrypto revealed two things:

1. struct evp_md_st had an 8 byte hole in the middle of it, which IIRC was spilling it into an extra cache line
2. SHA512_CTX has two arrays which were 8 byte aligned. While that's fine, several of the SIMD instructions our asm uses prefer 32 byte alignment. It's not required, but if it's not present, an exception is generated to fix up the load, slowing things down.

I'm starting to wonder if we haven't been paying much attention to alignment, and as such, whether our performance changes are partially impacted by cache misses introduced by various changes to data structure layout.

I'm retesting now with some changes for (1) and (2) above to fix up the cache-line holes and alignment. Will report here shortly.
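
As an illustration of what pahole flags, here is a hedged sketch (illustrative structs, not the real evp_md_st or SHA512_CTX layouts) of a padding hole removed by reordering fields, and of forcing 32-byte alignment on digest state so 256-bit SIMD loads can use aligned forms:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* On LP64, the char after the pointer-free start forces a 7-byte hole
 * before `data`, plus tail padding after `flag`. */
struct holey {
    char  tag;     /* followed by a 7-byte hole on LP64 */
    void *data;
    char  flag;    /* followed by 7 bytes of tail padding */
};

/* Same fields, reordered largest-first: the internal hole disappears. */
struct repacked {
    void *data;
    char  tag;
    char  flag;
};

/* SHA512-style state (illustrative): alignas(32) gives the arrays the
 * 32-byte alignment that AVX loads prefer, avoiding fix-up penalties. */
struct digest_state {
    alignas(32) uint64_t h[8];
    alignas(32) uint8_t  buf[128];
};
```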

nhorman commented 9 months ago

That's a bit more like it.

Making those small changes to the evp_md_st and SHA512_CTX structs resulted in a consistent performance gain, ranging from 0.1% to 0.7%. That's not a great deal of improvement, but it's consistent (multiple averaged runs of the handshake test produce the same results). There are other data structures that could use this treatment. In addition to the hash table replacement, I'd like to see how much we can optimise our data structures.

| Number of Cores | master us/handshake | master handshakes | updates us/handshake | updates handshakes |
|-----------------|---------------------|-------------------|----------------------|--------------------|
| 2 | 1371 | 1448 | 1337 | 1450 |
| 4 | 1426 | 2764 | 1417 | 2800 |
| 8 | 1537 | 5159 | 1534 | 5153 |
| 16 | 1864 | 8352 | 1854 | 8370 |
| 32 | 2646 | 11313 | 2610 | 11400 |
| 64 | 5397 | 11515 | 5375 | 11559 |
wtarreau commented 9 months ago

> I would not expect cache misses to occur after the first iteration of the handshake test has completed - everything should be in the cache by that point.

It depends if the lock is stored in the same cache line as the data it protects (common mistake). It's faster when you have a single thread, and gets immediately worse when facing concurrency because other threads hammering the lock are in fact taking ownership of the cache line and causing L1 misses to the other thread that was expecting to touch the data.

nhorman commented 9 months ago

I think @mattcaswell was referring to the software cache that openssl uses to store algorithms fetched from providers, not the cpu L1 cache, but you make a good point. It's entirely possible that the lock which the decoder uses to protect said cache is in the same line that stores some of the data it protects; let me look at that.

wtarreau commented 9 months ago

OK. But it's really preferable that locks are not in the same cache line as the data they protect, for the reason explained above.

nhorman commented 9 months ago

Given @wtarreau's comments, I did some perf tracing on cache misses, and noted that EVP_MD_up_ref and EVP_MD_free are high on the list for cache contention in the handshake test. Moving the refcount variable in that struct from the middle to the end, so that it is on its own cache line, results in a consistent 1% performance increase. With the new hash table that nets us about a 2-3% increase.
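
A minimal sketch of the layout change described above, using a hypothetical struct rather than the real evp_md_st: the write-hot refcount is pushed onto its own 64-byte cache line so that refcount traffic from other cores doesn't evict the read-mostly fields beside it.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical algorithm descriptor, illustrating the padding trick. */
struct alg_entry {
    /* read-mostly fields, grouped together on the first line(s) */
    int type;
    int block_size;
    const void *vtable;

    /* write-hot field, forced onto its own 64-byte cache line so that
     * up_ref/free on one core don't invalidate the fields above for
     * readers on other cores (the false-sharing problem) */
    alignas(64) atomic_int refcount;
};

static void alg_up_ref(struct alg_entry *e)
{
    atomic_fetch_add_explicit(&e->refcount, 1, memory_order_relaxed);
}
```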

nhorman commented 9 months ago

@t8m taking your suggestion, I hacked up openssl to just not do any locking. Doing so nets us no real performance increase, suggesting to me that locks alone are not responsible for any significant performance impact.

That said, I have noted that we spend lots of cycles inside loops iterating over hash tables (i.e. xxx_doall_names). Looking at some of the callers, this is often used not to modify each entry in a hash table, but rather to do a reverse-style lookup of an element (i.e. map a number to a name in the namemap). Of particular note was the EVP_KEYMGMT_do_all_names call, which searches for a keymgmt alg that matches a given name. Augmenting the new hash table to end the iteration when such a match is found reduces our iteration count significantly, and nets us another 1% increase.
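
The early-terminating iteration can be sketched as follows. The doall_until name, entry layout, and callback convention (return 0 to stop) are modelled on the description in this thread, not the actual API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct entry { int num; const char *name; };

/* Callback returns 0 to terminate the iteration early. */
typedef int (*until_cb)(int num, const char *name, void *arg);

/* Visit entries until the callback says stop; returns entries visited. */
static size_t doall_until(const struct entry *e, size_t n,
                          until_cb cb, void *arg)
{
    size_t visited = 0;

    for (size_t i = 0; i < n; i++) {
        visited++;
        if (cb(e[i].num, e[i].name, arg) == 0)
            break;
    }
    return visited;
}

/* Reverse lookup: find the name for a number, stopping at the match. */
struct rev { int want; const char *found; };

static int match_num(int num, const char *name, void *arg)
{
    struct rev *r = arg;

    if (num == r->want) {
        r->found = name;
        return 0;       /* found: terminate */
    }
    return 1;           /* keep iterating */
}
```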

beldmit commented 9 months ago

Let me remind about this comment: https://github.com/openssl/openssl/issues/20286#issuecomment-1436570944

We have too many lookup calls - ossl_decoder_is_a is called 153 million times over 65536 key iterations and consumes ~20% of the time in itself and its descendants, including ossl_namemap_name2num (13% with descendants). Locks are invisible in the call graph (attached to the comment). We just call relatively cheap functions too many times...

hlandau commented 9 months ago

There is a separate issue for investigation of the doall calls: https://github.com/openssl/project/issues/442

These are highly suspicious to me. Any design where we are calling a namemap's doall function in a "hot" (really, any post-provider-load/post-fetch) path is deeply suspicious to me and should probably be fixed.

hlandau commented 9 months ago

Of course we know that there are serious problems with the current decoder architecture. https://github.com/openssl/project/issues/100 Anything that can improve on our performance there would be great.

nhorman commented 9 months ago

FWIW, I've experimented with two things in the new hash table:

1. A doall_until and filter function that aborts the iteration when the callback returns 0. We have several cases in which we iterate over the hash table looking for a single value (or small set of values). This reduces our iteration count by about 50%, and helps, but not as much as I would like.
2. Support for multiple data types - i.e. we can translate do_all lookups into lookups using a reverse key. This I think will help a lot, but not until we eliminate all/most of the do_all code, as adding in the additional elements for a reverse lookup just expands the execution time of the remaining doall calls.

nhorman commented 9 months ago

Some more performance info. Currently, in addition to the new hashtable implementation, I've expanded it to include a doall_until API call, which behaves identically to the old doall function, except that the callback returns an integer indicating whether the iterator can terminate. We have several API points that effectively do a reverse lookup in a hash table by scanning the whole table looking for a single entry. With the _until variant, we can reduce the number of iterations significantly for these calls. Results of the handshake test:

master (baseline):

| threads | usec/iteration | total handshakes |
|---------|----------------|------------------|
| 2 | 1361 | 1462 |
| 4 | 1420 | 2767 |
| 8 | 1535 | 5153 |
| 16 | 1866 | 8356 |
| 32 | 2655 | 11340 |
| 64 | 5415 | 11496 |

With improvements:

| threads | usec/iteration | total handshakes |
|---------|----------------|------------------|
| 2 | 1368 | 1454 |
| 4 | 1403 | 2803 |
| 8 | 1528 | 5167 |
| 16 | 1836 | 8504 |
| 32 | 2609 | 11541 |
| 64 | 5228 | 11881 |

This represents an improvement of 0% (parity) at low thread counts to 3.3% at higher thread counts.

nhorman commented 9 months ago

I think I found something interesting:

In trying to track down the high number of ossl_namemap_name2num calls during the handshake test, I sampled some of the backtraces during the middle of a run using a high thread count, and noted that several of them were coming from a path headed by EVP_KDF_fetch. It appears that during an SSL handshake we very frequently call EVP_KDF_fetch, which calls through ossl_namemap_name2num to get the kdf object, after which we initialise it within a context, then immediately free it, when we could be reusing it. I hacked up the code to cache the kdf pointer in the SSL structure, freeing it only when SSL_free was called, and immediately saw a performance benefit of approximately 0.6%. This appears to be a usage pattern in SSL that we might be able to take advantage of with other objects.

nhorman commented 9 months ago

OK, I can confirm that caching the EVP_KDF allocated in tls13_hkdf_expand_ex in the SSL_CONNECTION object saves us 3.35 million calls to EVP_KDF_fetch (and transitively that many calls to ossl_namemap_name2num): 13906001 calls without the caching vs 10549836 with the caching. This was determined using the handshake test running 32 threads.
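
The fetch-once pattern being described can be sketched with toy types (none of these names or structs are OpenSSL's): the first call pays for the fetch, and subsequent calls on the same connection reuse the cached object until the connection is freed.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int fetch_calls;     /* counts trips through the expensive path */

struct kdf { char name[32]; };

/* Stand-in for EVP_KDF_fetch: the increment models the namemap work. */
static struct kdf *kdf_fetch(const char *name)
{
    struct kdf *k = malloc(sizeof(*k));

    fetch_calls++;
    snprintf(k->name, sizeof(k->name), "%s", name);
    return k;
}

/* Toy connection object; the cached kdf lives as long as the conn. */
struct conn {
    struct kdf *cached_kdf;
};

static struct kdf *conn_get_kdf(struct conn *c, const char *name)
{
    /* re-fetch only on first use or if a different KDF is requested */
    if (c->cached_kdf == NULL || strcmp(c->cached_kdf->name, name) != 0) {
        free(c->cached_kdf);
        c->cached_kdf = kdf_fetch(name);
    }
    return c->cached_kdf;
}

static void conn_free(struct conn *c)
{
    free(c->cached_kdf);    /* analogous to freeing at SSL_free time */
    c->cached_kdf = NULL;
}
```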

nhorman commented 9 months ago

Continuing on this, if I hack up the hkdf provider to save the last used digest and reuse it if the DIGEST parameter is the same, I save a total of 6857312 calls to ossl_namemap_name2num! (13943693 vs 7086381 calls), and at high thread counts I achieve a 10% performance improvement!

I think we need to focus on locations in the code where evp objects are fetched and not reused.

mattcaswell commented 9 months ago

For other algs we cache in the SSL_CTX rather than the SSL, which means we don't even need to do the fetches for each connection - we can just do them once.

This has the unfortunate side effect of making SSL_CTX creation more expensive, which we've discussed before. That was originally considered ok, but now my thinking is that we should try to move these caches to the OSSL_LIB_CTX instead.

t8m commented 9 months ago

> This has the unfortunate side effect of making SSL_CTX creation more expensive, which we've discussed before. That was originally considered ok, but now my thinking is that we should try to move these caches to the OSSL_LIB_CTX instead.

Some of the cached fetches are inside the provider - i.e. the digest used by hkdf - we would have to duplicate the EVP_KDF_CTX with the digest set and cache that.

wtarreau commented 9 months ago

FWIW, about a year ago or so, I tried something a bit dirty on 3.0. I noticed that many lookup functions were almost always called with the same argument, so I just implemented a thread-local static cache in them to be able to return the previous entry if the content was the same, something like:

    foo *bar_lookup(const bar *bar)
    {
        static __thread const bar *last_lookup;
        static __thread foo *last_return;

        if (bar == last_lookup)
            return last_return;
        ...
    }

This managed to significantly reduce the amount of locking, indicating that it was going down the lookup chain much less often. Of course this cannot be used just like this; the looked-up key's content needs to be checked as well, but you get the idea: looking up "RSA" 1 million times a second is not reasonable, for example, and it had better be cached most of the time (that's an example; I don't remember what was being looked up). I also noticed that some functions were involved in a lot of lookups of various keys and did not sufficiently benefit from this.

Unfortunately I don't remember anymore which functions were concerned, and I didn't dig very far, but it's just to illustrate some inexpensive principles that can sometimes help a lot when there's excess of lookups.
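
A slightly fleshed-out version of that sketch, using standard C11 _Thread_local and adding the key-content check the comment says is required (pointer equality alone is unsafe if a key is rebuilt at the same address). slow_lookup here is a stand-in for the real lookup chain, not any OpenSSL function:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static int slow_calls;      /* counts trips down the full lookup chain */

/* Stand-in for the expensive, lock-taking lookup. */
static int slow_lookup(const char *name)
{
    int h = 0;

    slow_calls++;
    for (; *name != '\0'; name++)
        h = h * 31 + (unsigned char)*name;
    return h;
}

/* Single-entry, per-thread memo: compares key *content*, not just the
 * pointer, so a rebuilt key at the same address can't alias a hit. */
static int cached_lookup(const char *name)
{
    static _Thread_local char last_name[64];
    static _Thread_local int  last_num;
    static _Thread_local int  valid;

    if (valid && strcmp(name, last_name) == 0)
        return last_num;

    last_num = slow_lookup(name);
    snprintf(last_name, sizeof(last_name), "%s", name);
    valid = 1;
    return last_num;
}
```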

nhorman commented 9 months ago

That's effectively what I did in the digest: I used a global LRU list, which is not going to work long term. I need to clean this all up.

nhorman commented 9 months ago

OK, good results. I lost a little performance as my hack took some shortcuts that were decidedly going to raise the suspicions of ASan and TSan, but it's still quite good.

I've got a new working branch here, based off my new hash table: https://github.com/nhorman/openssl/tree/cache-algs

This branch does the following:

1. Introduces the new rcu based hash table (as the prior new-lhash branch did)
2. Converts the ossl_namemap code to use (1)
3. Does some optimisations in various lookup paths to reduce the iteration count when ossl_namemap_doall_names is called
4. Creates a new hashtable in the OSSL_LIB_CTX called algcache
5. Adds OSSL_LIB_CTX_get[md|kdf] calls to allow for the caching of KDF and MD objects returned from their respective fetchers
6. Augments the tls1.3 code to use (5), reducing the number of calls to EVP_KDF_fetch significantly, and the corresponding number of calls to ossl_namemap_name2num (by about 3.5 million calls in the handshake test at 32 threads)
7. Updates the ossl_prov_digest_load_from_params code to use the same context based caching to store fetched MDs

Note: We're leaving some performance on the table here, as this can be expanded to cover other EVP objects as well. I'll be looking at that going forward

Results with these changes:

| threads | old usecs | old handshakes | new usecs | new handshakes | performance improvement |
|---------|-----------|----------------|-----------|----------------|-------------------------|
| 2 | 1360 | 1457 | 1356 | 1469 | 0% |
| 4 | 1406 | 2792 | 1381 | 2843 | 1.8% |
| 8 | 1540 | 5112 | 1492 | 5310 | 3.2% |
| 16 | 1872 | 8341 | 1717 | 9009 | 9.0% |
| 32 | 2661 | 11307 | 2448 | 12304 | 8.7% |
| 64 | 5371 | 11596 | 4841 | 12856 | 10.9% |

Note: Performance improvement is measured using the usecs values, i.e. (old usecs - new usecs) / old usecs. Improvement is better if you measure the total number of handshakes completed.

Note: old (baseline) values are based on the current head of the master branch

At lower thread counts we get no benefit, which isn't overly surprising, as we're spending most of our time constructing the ssl connection anyway, but as our thread counts increase we very quickly approach an 8-10% gain in performance.

I think this is probably a good time to take stock, and start considering inclusion of some of this code. I'd like to propose it in 3 steps:

1. Integrate the new hash table
2. Convert the namemap to use (1)
3. Introduce the context algorithm cache

Thoughts?

hlandau commented 9 months ago

Sounds like a good plan. I'll take a look at some of this in more detail later.

t8m commented 9 months ago

> Adds OSSL_LIB_CTX_get[md|kdf] calls to allow for the caching of KDF and MD objects returned from their respective fetchers
>
> Augments the tls1.3 code to use (5), reducing the number of calls to EVP_KDF_fetch significantly, and the corresponding number of calls to ossl_namemap_name2num (by about 3.5 million calls in the handshake test at 32 threads)

I do not think these are necessary. This should be something that is built inside the EVP_*_fetch() functions.

nhorman commented 9 months ago

Is that going to work as well, though? i.e. when built into a libctx, the cache gets freed when the context is freed, so if we load a provider that changes the origin of a given EVP_* object, deleting and recreating a context gets the implementation from the new provider, whereas storing these caches with the EVP libraries implies they will stick around in perpetuity.

t8m commented 9 months ago

No, it would be still stored in libctx. There is no such thing as "EVP libraries" - it is just a (large) set of similarly named functions that are all implemented in crypto/evp but they all still operate within a libctx.

nhorman commented 9 months ago

Ah, you're saying that the inserts and lookups for cached objects should be done within the EVP_*_fetch call? yeah, that could work. Once I get the hash table PR sorted, I'll revisit this and make that change.
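
A sketch of that suggestion as I read it, with made-up types throughout: the cache lives in (a toy model of) the library context, but the lookup and insert happen inside the fetch call itself, so callers need no new API and the cache's lifetime is tied to the context's.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS 16

struct fake_alg { const char *name; };

/* Toy model of a library context owning a small name-keyed cache. */
struct fake_libctx {
    const char     *keys[CACHE_SLOTS];
    struct fake_alg vals[CACHE_SLOTS];
    int             used;
    int             misses;     /* times the full fetch path ran */
};

static struct fake_alg *fake_fetch(struct fake_libctx *ctx, const char *name)
{
    /* fast path: consult the per-context cache first */
    for (int i = 0; i < ctx->used; i++)
        if (strcmp(ctx->keys[i], name) == 0)
            return &ctx->vals[i];

    /* slow path: stands in for provider query + namemap lookups */
    ctx->misses++;
    if (ctx->used == CACHE_SLOTS)
        return NULL;            /* sketch only: no eviction policy */
    ctx->keys[ctx->used] = name;
    ctx->vals[ctx->used].name = name;
    return &ctx->vals[ctx->used++];
}
```

Freeing the context discards the cache wholesale, which addresses the provider reload concern: a fresh context fetches fresh implementations.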

nhorman commented 9 months ago

FWIW, I've made some significant adjustments to the hashtable implementation (using a hopscotch lookup rather than a chained lookup, with neighborhoods aligned on cacheline boundaries). I've also reworked the locking approach based on some comments received (not published yet), but with just the new updated hashtable being used in the namemap, I see this performance profile:

| threads | old usec | old handshakes | new usec | new handshakes | perf delta |
|---------|----------|----------------|----------|----------------|------------|
| 2 | 676 | 1463 | 659 | 1496 | 2.5700% |
| 4 | 1057 | 2783 | 1033 | 2839 | 2.3200% |
| 8 | 1350 | 5107 | 1318 | 5245 | 2.4200% |
| 16 | 1765 | 8236 | 1679 | 8699 | 5.1200% |
| 32 | 2585 | 11172 | 2463 | 11645 | 4.9500% |
| 64 | 5317 | 11537 | 5091 | 12036 | 4.4300% |

So with the table replacement alone I see a 2-5% performance benefit. With that alongside a context-based alg cache, I think we will see some significant gains.
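
For illustration, a much-simplified sketch of the neighborhood idea: each key may only live within a small, cacheline-sized window starting at its home bucket, so a lookup touches at most a couple of cache lines. This is plain bounded probing within the window, not the real hopscotch scheme (which relocates entries to keep the neighborhood invariant); all names here are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 64   /* power of two, pre-sized at creation time */
#define HOOD        8   /* 8 slots per neighborhood */

struct slot { const char *key; int value; };
static struct slot table[TABLE_SIZE];

/* FNV-1a string hash. */
static uint64_t hash_str(const char *s)
{
    uint64_t h = 1469598103934665603ull;

    for (; *s != '\0'; s++)
        h = (h ^ (uint8_t)*s) * 1099511628211ull;
    return h;
}

/* Place the key somewhere in its neighborhood, or fail. */
static int tbl_insert(const char *key, int value)
{
    size_t home = hash_str(key) & (TABLE_SIZE - 1);

    for (size_t i = 0; i < HOOD; i++) {
        struct slot *s = &table[(home + i) & (TABLE_SIZE - 1)];

        if (s->key == NULL) {
            s->key = key;
            s->value = value;
            return 1;
        }
    }
    return 0;   /* neighborhood full: real hopscotch would relocate */
}

/* Lookup scans only the bounded neighborhood, never the whole table. */
static int tbl_lookup(const char *key, int *value)
{
    size_t home = hash_str(key) & (TABLE_SIZE - 1);

    for (size_t i = 0; i < HOOD; i++) {
        const struct slot *s = &table[(home + i) & (TABLE_SIZE - 1)];

        if (s->key != NULL && strcmp(s->key, key) == 0) {
            *value = s->value;
            return 1;
        }
    }
    return 0;
}
```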

nhorman commented 8 months ago

So, some more numbers, and they're quite startling. I've been focused primarily on the handshake test, as that's what we started this issue with, but looking at the other perf tests that we have, the performance improvement appears to be significantly better.

Test methodology: On my system here, I'm running the pemread, handshake, rsasign, providerdoall and sslnew tests using this script:

#!/bin/bash
# (bash, not sh: the script uses let, local and [ ... == ... ])

CMD=$1
BASELIB=$2
NEWLIB=$3
ITERATIONS=$4

EXTRARG=--terse
if [ "$CMD" == "./handshake" ]
then
    EXTRARG=/home/nhorman/git/openssl/test/certs/
fi

BASERESULTS=$(mktemp /tmp/base.XXXXXX)
NEWRESULTS=$(mktemp /tmp/new.XXXXXX)

finish() {
    rm -f $BASERESULTS
    rm -f $NEWRESULTS
}

trap finish EXIT
do_test() {
    local libpath=$1
    local results=$2
    export LD_LIBRARY_PATH=$libpath
    for i in 2 4 8 16 32 64
    do
        let tusecs=0
        let thands=0
        for j in `seq 1 1 $ITERATIONS`
        do
            $CMD $EXTRARG $i > /tmp/results
            if [ "$CMD" == "./handshake" ]
            then
                USEC=$(awk '/Average/ {print $5}' /tmp/results | sed -e"s/us//")
                HAND=$(awk '/Handshake/ {print $4}' /tmp/results)
                tusecs=$(dc -e "$tusecs $USEC + p")
                thands=$(dc -e "$thands $HAND + p")
            else
                USEC=$(cat /tmp/results)
                tusecs=$(dc -e "$tusecs $USEC + p")
            fi
        done
        if [ "$CMD" == "./handshake" ]
        then
            ausec=$(dc -e "$tusecs $ITERATIONS / p")
            ahand=$(dc -e "$thands $ITERATIONS / p")
            echo "$i $ausec $ahand" >> $results
        else
            ausec=$(dc -e "$tusecs $ITERATIONS / p")
            echo "$i $ausec" >> $results
        fi
    done
}
do_test $BASELIB $BASERESULTS
do_test $NEWLIB $NEWRESULTS

if [ "$CMD" == "./handshake" ]
then
    echo "|threads|old usec|old handshakes|new usec|new handshakes| perf delta|" > ./$CMD-results.dat
    echo "|-------|--------|--------------|--------|--------------|-----------|" >> ./$CMD-results.dat
    for i in 2 4 8 16 32 64
    do
        OLD_USEC=$(awk "/^$i / {print \$2}" $BASERESULTS)
        OLD_HAND=$(awk "/^$i / {print \$3}" $BASERESULTS)

        NEW_USEC=$(awk "/^$i / {print \$2}" $NEWRESULTS)
        NEW_HAND=$(awk "/^$i / {print \$3}" $NEWRESULTS)

        DELTA=$(dc -e "4 k $OLD_USEC $NEW_USEC - $NEW_USEC / 100 * p")
        echo "|$i|$OLD_USEC|$OLD_HAND|$NEW_USEC|$NEW_HAND|$DELTA%" >> ./$CMD-results.dat
    done
else
    echo "|threads|old usec|new usec| perf delta|" > ./$CMD-results.dat
    echo "|-------|--------|--------|-----------|" >> ./$CMD-results.dat
    for i in 2 4 8 16 32 64
    do
        OLD_USEC=$(awk "/^$i / {print \$2}" $BASERESULTS)
        NEW_USEC=$(awk "/^$i / {print \$2}" $NEWRESULTS)

        DELTA=$(dc -e "4 k $OLD_USEC $NEW_USEC - $NEW_USEC / 100 * p")
        echo "|$i|$OLD_USEC|$NEW_USEC|$DELTA%" >> ./$CMD-results.dat
    done
fi

It runs the requested command ITERATIONS times, for 2, 4, 8, 16, 32 and 64 threads, collecting the average run time for each thread count, and outputs a table of the results, including the performance delta (positive values indicating improvement in run time).

I'm using the master branch as a baseline, and this branch: https://github.com/nhorman/openssl/tree/handshake-test

as the branch to hold my improvements, which amount to the new hash table, plus a replacement of the existing hash table in the namemap code with the new implementation. Other improvements there include:

1. Using the new hashtable locklessly
2. Storing multiple data types to allow for reverse mappings to avoid table iterations in ossl_namemap_doall_names and ossl_namemap_num2name

The results for the handshake test are above; the remaining results are here:

pemread

| threads | old usec | new usec | perf delta |
|---------|----------|----------|------------|
| 2 | 13980 | 11915 | 17.3300% |
| 4 | 10198 | 6825 | 49.4200% |
| 8 | 11009 | 3869 | 184.5400% |
| 16 | 11499 | 3040 | 278.2500% |
| 32 | 9882 | 2661 | 271.3600% |
| 64 | 9037 | 2596 | 248.1100% |

sslnew

| threads | old usec | new usec | perf delta |
|---------|----------|----------|------------|
| 2 | 2111 | 1892 | 11.5700% |
| 4 | 1342 | 1284 | 4.5100% |
| 8 | 1305 | 1207 | 8.1100% |
| 16 | 1193 | 1149 | 3.8200% |
| 32 | 978 | 970 | .8200% |
| 64 | 937 | 948 | -1.1600% |

rsasign

| threads | old usec | new usec | perf delta |
|---------|----------|----------|------------|
| 2 | 46574 | 46274 | .6400% |
| 4 | 52035 | 51647 | .7500% |
| 8 | 65875 | 66244 | -.5500% |
| 16 | 308334 | 302090 | 2.0600% |
| 32 | 781380 | 786436 | -.6400% |
| 64 | 1592784 | 1586288 | .4000% |

providerdoall

| threads | old usec | new usec | perf delta |
|---------|----------|----------|------------|
| 2 | 1018 | 1430 | -28.8100% |
| 4 | 2306 | 2942 | -21.6100% |
| 8 | 5081 | 5956 | -14.6900% |
| 16 | 9882 | 7529 | 31.2500% |
| 32 | 14473 | 15308 | -5.4500% |
| 64 | 19157 | 20193 | -5.1300% |

I need to figure out what's going on with the providerdoall test, as it doesn't appear to hit the namemap at all, so I'm not sure why there are such wild swings in performance there, but pemread and sslnew, if those numbers are to be relied on, show significant improvements.

beldmit commented 8 months ago

May I also ask you to add 1 thread?

nhorman commented 8 months ago

@beldmit as requested, performance numbers including single threaded behavior:

handshake

| threads | old usec | old handshakes | new usec | new handshakes | perf delta |
|---------|----------|----------------|----------|----------------|------------|
| 1 | 1353 | 738 | 1376 | 727 | -1.6700% |
| 2 | 1385 | 1433 | 1351 | 1474 | 2.5100% |
| 4 | 1452 | 2724 | 1391 | 2835 | 4.3800% |
| 8 | 1567 | 5038 | 1522 | 5174 | 2.9500% |
| 16 | 1929 | 7977 | 1837 | 8509 | 5.0000% |
| 32 | 2847 | 10878 | 2602 | 11508 | 9.4100% |
| 64 | 5642 | 11002 | 5264 | 11813 | 7.1800% |

newrawkey

| threads | old usec | new usec | perf delta |
|---------|----------|----------|------------|
| 1 | 2154 | 1372 | 56.9900% |
| 2 | 1674 | 1630 | 2.6900% |
| 4 | 1388 | 1067 | 30.0800% |
| 8 | 1129 | 934 | 20.8700% |
| 16 | 1461 | 837 | 74.5500% |
| 32 | 3761 | 854 | 340.3900% |
| 64 | 2618 | 843 | 210.5500% |

pemread

| threads | old usec | new usec | perf delta |
|---------|----------|----------|------------|
| 1 | 16085 | 15682 | 2.5600% |
| 2 | 14617 | 11544 | 26.6100% |
| 4 | 10173 | 6855 | 48.4000% |
| 8 | 10783 | 3877 | 178.1200% |
| 16 | 11216 | 3200 | 250.5000% |
| 32 | 9226 | 2650 | 248.1500% |
| 64 | 9268 | 2607 | 255.5000% |

providerdoall

| threads | old usec | new usec | perf delta |
|---------|----------|----------|------------|
| 1 | 185 | 175 | 5.7100% |
| 2 | 885 | 1421 | -37.7100% |
| 4 | 1980 | 2823 | -29.8600% |
| 8 | 4276 | 5660 | -24.4500% |
| 16 | 8496 | 8268 | 2.7500% |
| 32 | 12414 | 14618 | -15.0700% |
| 64 | 20799 | 19623 | 5.9900% |

rsasign

| threads | old usec | new usec | perf delta |
|---------|----------|----------|------------|
| 1 | 43523 | 43742 | -.5000% |
| 2 | 47295 | 45811 | 3.2300% |
| 4 | 52773 | 51809 | 1.8600% |
| 8 | 65140 | 64826 | .4800% |
| 16 | 300849 | 295378 | 1.8500% |
| 32 | 778707 | 781526 | -.3600% |
| 64 | 1570353 | 1563878 | .4100% |

sslnew

| threads | old usec | new usec | perf delta |
|---------|----------|----------|------------|
| 1 | 2930 | 2773 | 5.6600% |
| 2 | 2040 | 1872 | 8.9700% |
| 4 | 1310 | 1250 | 4.8000% |
| 8 | 1215 | 1222 | -.5700% |
| 16 | 1129 | 1137 | -.7000% |
| 32 | 962 | 970 | -.8200% |
| 64 | 968 | 946 | 2.3200% |
beldmit commented 8 months ago

Many thanks!

nhorman commented 8 months ago

@beldmit no problem. FWIW, in looking at the odd performance profile of the providerdoall test, I reran the test 20 times for each thread count (1, 2, 4, 8, 16, 32, 64) and recorded the output, then graphed the results with a best-fit line:

(image: graph of providerdoall run times, baseline vs improvements branch, with best-fit lines)

That output suggests to me that our performance is effectively identical between the baseline master branch and the hashtable/improvements branch (which I would expect in this test, given that it makes no use of the hashtable changes). What does appear to be happening is a large variance in individual run times of this test (possibly related to stochastic thread contention; unsure right now). In either case, I think we should probably look at this issue separately.