hlandau opened 9 months ago
This is a separate task from designing a solution, but I do wonder if we could have some kind of stack-local cache.
void do_foo(OSSL_LIB_CTX *libctx)
{
    NAME_CACHE cache;

    NAME_CACHE_init(&cache, libctx);
    // any subsidiary calls to ossl_namemap_name2num() or similar check the cache first
    foo(libctx, &cache);
    bar(libctx, &cache);
    baz(libctx, &cache);
    NAME_CACHE_cleanup(&cache);
}
The idea being that coarse operations like a TLS handshake use one cache.
We should also decide whether more fundamental rethinks, or performance optimisations (RCU, lockless, and so on), for the namemap are desirable.
Re: the above sketch: this could even be combined with RCU (e.g. an entire TLS handshake could take only one RCU read lock, to fill its cache, then use the NAME_CACHE). The NAME_CACHE doesn't have to strictly be on the stack if the handshake is resumable (e.g. it could live in the SSL object). This would reduce RCU read-lock cache contention. One option might be to generate a list of name lookups we want in libssl and do all of those lookups in a batch in one lookup call, meaning a single RCU read-lock acquisition for the namemap. We would need to define that loading/unloading providers during a handshake results in an unspecified set of algorithms being used (already the case, I imagine).
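To make the batch-lookup idea concrete, here is a self-contained toy sketch. This is not OpenSSL's API: the name_map/name_cache types are made up, and a pthread rwlock stands in for the RCU read side. It only shows resolving a whole list of names under a single read-side acquisition and answering later queries from a caller-owned cache.

#include <pthread.h>
#include <string.h>
#include <stdio.h>

#define MAP_SIZE 4

struct name_map {
    pthread_rwlock_t lock;           /* stand-in for the RCU read side */
    const char *names[MAP_SIZE];
    int nums[MAP_SIZE];
};

struct name_cache {
    const char *names[MAP_SIZE];
    int nums[MAP_SIZE];
    size_t count;
};

/* Resolve all requested names under a single read-lock acquisition. */
static void name_map_lookup_batch(struct name_map *map,
                                  const char *const *wanted, size_t n,
                                  struct name_cache *cache)
{
    cache->count = 0;
    pthread_rwlock_rdlock(&map->lock);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < MAP_SIZE; j++) {
            if (strcmp(wanted[i], map->names[j]) == 0) {
                cache->names[cache->count] = map->names[j];
                cache->nums[cache->count++] = map->nums[j];
                break;
            }
        }
    }
    pthread_rwlock_unlock(&map->lock);
}

/* Later lookups hit only the private cache; no shared lock is needed. */
static int name_cache_num(const struct name_cache *cache, const char *name)
{
    for (size_t i = 0; i < cache->count; i++)
        if (strcmp(cache->names[i], name) == 0)
            return cache->nums[i];
    return 0;
}

int main(void)
{
    struct name_map map = {
        PTHREAD_RWLOCK_INITIALIZER,
        { "SHA2-256", "SHA2-384", "AES-128-GCM", "HKDF" },
        { 1, 2, 3, 4 }
    };
    const char *wanted[] = { "SHA2-256", "HKDF" };
    struct name_cache cache;

    name_map_lookup_batch(&map, wanted, 2, &cache);
    printf("HKDF -> %d\n", name_cache_num(&cache, "HKDF"));
    return 0;
}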
Aggregated into epic https://github.com/openssl/project/issues/441
some interesting results for all to ponder:
I've re-implemented the hash table and replaced the old hash table implementation here: https://github.com/nhorman/openssl/tree/new-lhash
The new hash table is optionally reference counted and uses an internal RCU lock to protect its data structures. It also allows for several new features (see the commit log) that allow us to do lookups faster (for instance, we can now store multiple data types safely, allowing for reverse lookups without having to resort to a doall_names iteration over every entry in the namemap hash).
Performance is greatly improved when running the lhash_test in our test suite:
- 666.16 us to insert and delete 2.5 million entries using the old hash table
- 223.75 us to insert and delete 2.5 million entries using the new hash table
That's a significant improvement! There are some caveats, however:
1) The new hash table is configured to not do reference counting, but is not configured to be lockless
2) The new hash table is able to pre-configure the size of the bucket list at creation time, so as to avoid the periodic expansion and re-hashing of the list as it grows
I consider these to be acceptable in performance measurement, as for many of our hash tables, we can roughly predict how many entries (especially in the namemap) we will have. Note that this pre-configuration does not preclude growth and rehashing in the event that the number of entries inserted exceeds the initial preconfigured size.
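For illustration only, one way such creation-time sizing might work (a hypothetical helper, not the actual API): pick an initial bucket count from the expected entry count and target load factor, while still allowing growth if the estimate is exceeded.

#include <stddef.h>
#include <stdio.h>

/* Round the bucket count up to a power of two large enough to keep the
 * table under max_load with the expected number of entries. */
static size_t initial_buckets(size_t expected_entries, double max_load)
{
    size_t want = (size_t)(expected_entries / max_load) + 1;
    size_t buckets = 1;

    while (buckets < want)
        buckets <<= 1;
    return buckets;
}

int main(void)
{
    /* e.g. the namemap typically holds a few hundred names */
    printf("%zu buckets for 300 expected entries\n",
           initial_buckets(300, 0.75));
    return 0;
}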
Interestingly, when this new hash table was inserted, the handshake test's usage of rwlocks was eliminated from the performance profile (which is good), resulting in about a 3% performance improvement (as concluded from the fact that perf showed 3% of our time being spent in CRYPTO_THREAD_read_lock/unlock, which is not present in the new hash table). However, overall performance remained at parity (ranging from the previously mentioned 3% performance gain to parity in performance).
Doing some more profiling, I can see that when the performance is at parity, we lose that gain again, as the code execution spends more time in other rwlocks:
With the old hash table:
- 62.43% 0.01% handshake.oldlh handshake.oldlhash [.] perflib_create_bare_ssl_connection
  - 62.43% perflib_create_bare_ssl_connection
    - 31.01% 0x7fc8c6dbf37f
      - 20.69% 0x7fc8c6dc38e2
        - 20.68% ASN1_item_d2i_ex
          - 20.63% 0x7fc8c6ec9eed
            - 20.63% 0x7fc8c6ecba52
              - 20.60% 0x7fc8c6ec9eed
                - 20.41% 0x7fc8c6ecba52
                  - 20.03% 0x7fc8c6ec9ad3
                    - 18.17% 0x7fc8c7044921
                      - 18.16% OSSL_DECODER_CTX_new_for_pkey
                        - 16.65% OSSL_DECODER_CTX_add_extra
                          - 14.37% 0x7fc8c6f5b072
                            - 5.57% 0x7fc8c6fbd4f5
                              + 5.56% CRYPTO_THREAD_unlock
And the new hash table:
- 63.66% perflib_create_bare_ssl_connection
  - 32.36% 0x7f3185dbf37f
    - 21.97% 0x7f3185dc38e2
      - 21.97% ASN1_item_d2i_ex
        - 21.92% 0x7f3185ec9eed
          - 21.91% 0x7f3185ecba52
            - 21.88% 0x7f3185ec9eed
              - 21.70% 0x7f3185ecba52
                - 21.33% 0x7f3185ec9ad3
                  - 19.44% 0x7f3186044921
                    - 19.43% OSSL_DECODER_CTX_new_for_pkey
                      - 17.96% OSSL_DECODER_CTX_add_extra
                        - 15.57% 0x7f3185f5b072
                          - 5.86% 0x7f3185fbd4f5
                            + 5.85% CRYPTO_THREAD_unlock
This is just one example; there are other instances of other APIs that spend more time in CRYPTO_THREAD_[read|write]_lock/unlock as well.
I believe what this is telling me is that I should continue down this path iteratively, to see if there are other places where an RCU lock may provide more permanent performance gains.
Great work.
Eventually it would be nice to create a new hash table for internal use which offers a more modern design than LHASH. But not right now
- 19.43% OSSL_DECODER_CTX_new_for_pkey
  - 17.96% OSSL_DECODER_CTX_add_extra
    - 15.57% 0x7f3185f5b072
      - 5.86% 0x7f3185fbd4f5
        + 5.85% CRYPTO_THREAD_unlock
Hmm. This read lock occurs in the event of a cache miss while looking for a suitable template OSSL_DECODER_CTX. I would not expect cache misses to occur after the first iteration of the handshake test has completed - everything should be in the cache by that point. So this is an unexpected result. There is a read lock on the cache itself (in OSSL_DECODER_CTX_new_for_pkey) - which could be optimised to use RCU. But that isn't what this is showing us.
@nhorman may I ask you to compare your implementation with the current 3.2 for the case from https://github.com/openssl/openssl/issues/20286?
@hlandau the hash table in the branch I referenced above is exactly such a new implementation
@beldmit yes, of course. Will do so today
@mattcaswell yeah, scratch my previous comment, I wasn't able to reproduce it, apologies.
However I have noticed something odd about successive runs in my environment. I've seen increases in CPU cycles in my hash table implementation in sha512_block_data_order_avx and some of the evp_md code. On a lark I profiled the handshake app to record L1 cache misses in the CPU, and found a large number of cache misses in both of those code paths. Running pahole on libcrypto revealed two things: 1) struct evp_md_st had an 8-byte hole in the middle of it, which IIRC was spilling it into an extra cache line 2) SHA512_CTX has two arrays which were 8-byte aligned. While that's fine, several of the SIMD instructions our asm uses prefer 32-byte alignment; it's not required, but if it's not present, an exception is generated to fix up the load, slowing things down.
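As a generic illustration of those two findings (these are not the real OpenSSL structs), the following shows how member ordering creates a hole on a typical 64-bit ABI and how 32-byte alignment can be requested for a SIMD-friendly buffer:

#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>

struct holey {            /* 4-byte hole after a, 4 bytes of tail padding: 24 bytes */
    int a;
    void *p;
    int b;
};

struct packed_members {   /* same members, pointer first: no holes, 16 bytes */
    void *p;
    int a;
    int b;
};

struct simd_friendly {
    alignas(32) uint64_t h[8];   /* 32-byte aligned, AVX-friendly loads */
};

int main(void)
{
    printf("holey=%zu packed=%zu simd alignment=%zu\n",
           sizeof(struct holey), sizeof(struct packed_members),
           alignof(struct simd_friendly));
    return 0;
}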
I'm starting to wonder if we haven't been paying much attention to alignment, and as such, our performance changes are partially impacted by cache misses introduced by various changes to data structure layout.
I'm retesting now with some changes to (1) and (2) above to fix up the cache-line holes and alignment. Will report here shortly.
That's a bit more like it.
Making those small changes to the evp_md_st and SHA512_CTX structs resulted in a consistent performance gain, ranging from 0.1% - 0.7%. That's not a great deal of improvement, but it's consistent (multiple averaged runs of the handshake test produce the same results). There are other data structures that could use this treatment. In addition to the hash table replacement, I'd like to see how much we can optimize our data structures.
| Number of cores | master us/handshake | master handshakes | updates us/handshake | updates handshakes |
|---|---|---|---|---|
| 2 | 1371 | 1448 | 1337 | 1450 |
| 4 | 1426 | 2764 | 1417 | 2800 |
| 8 | 1537 | 5159 | 1534 | 5153 |
| 16 | 1864 | 8352 | 1854 | 8370 |
| 32 | 2646 | 11313 | 2610 | 11400 |
| 64 | 5397 | 11515 | 5375 | 11559 |
I would not expect cache misses to occur after the first iteration of the handshake test has completed - everything should be in the cache by that point.
It depends if the lock is stored in the same cache line as the data it protects (common mistake). It's faster when you have a single thread, and gets immediately worse when facing concurrency because other threads hammering the lock are in fact taking ownership of the cache line and causing L1 misses to the other thread that was expecting to touch the data.
I think @mattcaswell was referring to the software cache that openssl uses to store algorithms fetched from providers, not the CPU L1 cache, but you make a good point: it's entirely possible that the lock which the decoder uses to protect said cache isn't in the same line that stores any of the data it protects, let me look at that.
OK. But it's really preferable that locks are not in the same cache line as the data they protect, for the reason explained above.
Given @wtarreau's comments, I did some perf tracing on cache misses, and noted that EVP_MD_up_ref and EVP_MD_free are high on the list for cache contention in the handshake test. Moving the refcount variable in that struct from the middle to the end, so that it is on its own cache line, results in a consistent 1% performance increase. With the new hash table that nets us about a 2-3% increase.
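To illustrate the layout change being described (this is not the actual evp_md_st definition, just the general pattern of isolating a contended reference count on its own cache line so up_ref/free traffic doesn't evict the read-mostly fields):

#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

#define CACHE_LINE 64

struct alg {
    /* read-mostly data, shared by all threads doing lookups */
    int name_id;
    const char *description;
    const void *ops;

    /* contended counter isolated on its own cache line */
    alignas(CACHE_LINE) _Atomic int refcount;
};

int main(void)
{
    printf("refcount offset: %zu, struct size: %zu\n",
           offsetof(struct alg, refcount), sizeof(struct alg));
    return 0;
}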
@t8m taking your suggestion I hacked up openssl to just not do any locking. Doing so nets us no real performance increase, suggesting to me that locks alone are not really responsible for any significant performance impact
That said, I have noted that we spend lots of cycles operating inside loops iterating over hash tables (i.e. xxx_doall_names). Looking at some of the callers, this is often used not to modify each entry in a hash table, but rather to do a reverse-style lookup of an element (i.e. map a number to a name in the namemap). Of particular note was the EVP_KEYMGMT_do_all_names call, which searches for a keymgmt alg that matches a given name. Augmenting the new hash table to end the iteration when such a match is found reduces our iteration count significantly, and nets us another 1% increase.
Let me remind about this comment: https://github.com/openssl/openssl/issues/20286#issuecomment-1436570944
We have too many lookup calls - ossl_decoder_is_a is called 153 million times over 65536 key iterations and consumes ~20% of the time in itself and its descendants, including ossl_namemap_name2num (13% with descendants). Locks are invisible on the call graph (attached to the comment). We just call relatively cheap functions too many times...
There is a separate issue for investigation of the doall calls: https://github.com/openssl/project/issues/442
These are highly suspicious to me. Any design where we are calling a namemap's doall function in a "hot" (really, any post-provider-load/post-fetch) path is deeply suspicious to me and should probably be fixed.
Of course we know that there are serious problems with the current decoder architecture. https://github.com/openssl/project/issues/100 Anything that can improve on our performance there would be great.
FWIW, I've experimented in the new hash table with 2 things:
1) A doall_until and filter function that aborts the iteration when the callback returns 0 (a sketch of the callback shape follows below). We have several cases in which we iterate over the hash table looking for a single value (or small set of values). This reduces our iteration count by about 50%, and helps, but not as much as I would like.
2) Support for multiple data types, i.e. we can translate do_all lookups into lookups using a reverse key. This I think will help a lot, but not until we eliminate all/most of the do_all code, as adding in the additional elements for a reverse lookup just expands the exec time of the remaining doall calls.
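To make the shape of (1) concrete, here is a minimal, self-contained sketch of the doall_until idea; the names (table_doall_until, match_name) are illustrative only, not the actual hashtable API:

#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct entry { const char *name; int num; };

/* The callback returns 0 to stop the walk, non-zero to continue. */
typedef int (*until_cb)(struct entry *e, void *arg);

/* Iterate until the callback asks to stop; return 1 if stopped early. */
static int table_doall_until(struct entry *tbl, size_t n,
                             until_cb cb, void *arg)
{
    for (size_t i = 0; i < n; i++)
        if (cb(&tbl[i], arg) == 0)
            return 1;
    return 0;
}

struct find { const char *want; int num; };

static int match_name(struct entry *e, void *arg)
{
    struct find *f = arg;

    if (strcmp(e->name, f->want) != 0)
        return 1;        /* keep going */
    f->num = e->num;
    return 0;            /* found it, stop the walk */
}

int main(void)
{
    struct entry tbl[] = { { "RSA", 6 }, { "EC", 408 }, { "ED25519", 1087 } };
    struct find f = { "EC", -1 };

    table_doall_until(tbl, 3, match_name, &f);
    printf("EC -> %d\n", f.num);
    return 0;
}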
Some more performance info. Currently, in addition to the new hashtable implementation, I've expanded it to include a doall_until api call, which behaves identically to the old doall function, except that the callback returns an integer indicating that the iterator can terminate. We have several API points that effectively do a reverse lookup in a hash table by scanning the whole table looking for a single entry. With the _until variant, we can reduce the number of iterations significantly for these calls. Results of the handshake test
master (baseline):

| threads | usec/iteration | total handshakes |
|---|---|---|
| 2 | 1361 | 1462 |
| 4 | 1420 | 2767 |
| 8 | 1535 | 5153 |
| 16 | 1866 | 8356 |
| 32 | 2655 | 11340 |
| 64 | 5415 | 11496 |
With improvements:

| threads | usec/iteration | total handshakes |
|---|---|---|
| 2 | 1368 | 1454 |
| 4 | 1403 | 2803 |
| 8 | 1528 | 5167 |
| 16 | 1836 | 8504 |
| 32 | 2609 | 11541 |
| 64 | 5228 | 11881 |
This represents an improvement of 0% (parity) at low thread counts up to 3.3% at higher thread counts.
I think I found something interesting:
In trying to focus on the high number of ossl_namemap_name2num calls during the handshake test, I sampled some of the backtraces during the middle of a run using a high thread count, and noted that several of them were coming from a path headed by EVP_KDF_fetch. It appears that during an SSL handshake we frequently (very frequently) call EVP_KDF_fetch, which calls through ossl_namemap_name2num to get the KDF object, after which we initialize it within a context, then immediately free it, when we could be reusing it. I hacked up the code to cache the KDF pointer in the SSL structure, freeing it only when SSL_free was called, and immediately saw a performance benefit of approximately 0.6%. This appears to be a usage pattern in SSL that we might be able to take advantage of with other objects.
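As a rough illustration of that fetch-once pattern (the struct below is a hypothetical stand-in for SSL_CONNECTION; only the EVP_KDF_* calls are the real public OpenSSL 3.x API, and error checking is omitted):

#include <openssl/evp.h>
#include <openssl/kdf.h>

struct conn {
    OSSL_LIB_CTX *libctx;
    const char *propq;
    EVP_KDF *cached_hkdf;     /* fetched once, reused for every expand */
};

/* Return the HKDF implementation, fetching it only on first use. */
static EVP_KDF *conn_get_hkdf(struct conn *c)
{
    if (c->cached_hkdf == NULL)
        c->cached_hkdf = EVP_KDF_fetch(c->libctx, "HKDF", c->propq);
    return c->cached_hkdf;
}

/* Called from the teardown path (the SSL_free analogue). */
static void conn_cleanup(struct conn *c)
{
    EVP_KDF_free(c->cached_hkdf);
    c->cached_hkdf = NULL;
}

int main(void)
{
    struct conn c = { NULL, NULL, NULL };   /* default libctx, default propq */
    EVP_KDF_CTX *kctx = EVP_KDF_CTX_new(conn_get_hkdf(&c));

    /* ... set params and derive as usual ... */
    EVP_KDF_CTX_free(kctx);
    conn_cleanup(&c);
    return 0;
}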
OK, I can confirm that caching the EVP_KDF allocated in tls13_hkdf_expand_ex in the SSL_CONNECTION object saves us 3.35 million calls to EVP_KDF_fetch (and transitively that many calls to ossl_namemap_name2num): 13906001 calls without the caching vs 10549836 with it. This was determined using the handshake test running 32 threads.
Continuing on this, if I hack up the hkdf provider to save the last used digest and reuse it if the DIGEST parameter is the same, I save a total of 6857312 calls to ossl_namemap_name2num! (13943693 vs 7086381 calls), and at high thread counts I achieve a 10% performance improvement!
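A minimal sketch of what that provider-side reuse could look like, assuming a hypothetical hkdf_ctx_sketch structure; the real provider context and parameter handling are more involved:

#include <stdio.h>
#include <string.h>
#include <openssl/evp.h>

struct hkdf_ctx_sketch {
    OSSL_LIB_CTX *libctx;
    char last_md_name[64];
    EVP_MD *last_md;
};

static EVP_MD *hkdf_get_md(struct hkdf_ctx_sketch *ctx, const char *mdname)
{
    /* fast path: same digest as the previous call, no namemap lookup */
    if (ctx->last_md != NULL && strcmp(ctx->last_md_name, mdname) == 0)
        return ctx->last_md;

    EVP_MD_free(ctx->last_md);                       /* drop the stale one */
    ctx->last_md = EVP_MD_fetch(ctx->libctx, mdname, NULL);
    if (ctx->last_md != NULL)
        snprintf(ctx->last_md_name, sizeof(ctx->last_md_name), "%s", mdname);
    return ctx->last_md;
}

int main(void)
{
    struct hkdf_ctx_sketch ctx = { NULL, "", NULL };
    EVP_MD *md1 = hkdf_get_md(&ctx, "SHA2-256");
    EVP_MD *md2 = hkdf_get_md(&ctx, "SHA2-256");   /* reused, no new fetch */

    printf("reused: %d\n", md1 == md2);
    EVP_MD_free(ctx.last_md);
    return 0;
}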
I think we need to focus on locations in the code where evp objects are fetched and not reused.
For other algs we cache in the SSL_CTX rather than the SSL, which means we don't even need to do the fetches for each connection - we can just do them once.
This has the unfortunate side effect of making SSL_CTX creation more expensive, which we've discussed before. That was originally considered ok, but now my thinking is that we should try to move these caches to the OSSL_LIB_CTX instead.
This has the unfortunate side effect of making SSL_CTX creation more expensive, which we've discussed before. That was originally considered ok, but now my thinking is that we should try to move these caches to the OSSL_LIB_CTX instead.
Some of the cached fetches are inside the provider - i.e. the digest used by hkdf - we would have to duplicate the EVP_KDF_CTX with the digest set and cache that.
FWIW, about a year ago or so, I tried something a bit dirty on 3.0. I noticed that many lookup functions were almost always called with the same argument, so I just implemented a thread-local static cache in them to be able to return the previous entry if the content was the same, something like:
foo *bar_lookup(const bar *key)
{
    /* per-thread memo of the previous lookup */
    static __thread const bar *last_lookup;
    static __thread foo *last_return;

    if (key == last_lookup)
        return last_return;
    ...
}
This managed to significantly reduce the amount of locking, indicating that it was much less often going down the lookup chain. Of course this cannot be used just like this; the looked-up key's content needs to be checked as well, but you get the idea: looking up "RSA" 1 million times a second is not reasonable, for example, and it'd better be cached most of the time (that's an example, I don't remember what was being looked up). I also noticed that some functions were involved in a lot of lookups of various keys and did not sufficiently benefit from this.
Unfortunately I don't remember anymore which functions were concerned, and I didn't dig very far, but it's just to illustrate some inexpensive principles that can sometimes help a lot when there's excess of lookups.
That's effectively what I did in the digest code: I used a global LRU list, which is not going to work long term; I need to clean this all up.
OK, good results. I lost a little performance as my hack took some shortcuts that were decidedly going to raise the suspicions of ASan and ThreadSanitizer, but it's still quite good.
I've got a new working branch here, based off my new hash table: https://github.com/nhorman/openssl/tree/cache-algs
This branch does the following:
1) Introduces the new RCU-based hash table (as the prior new-lhash branch did)
2) Converts the ossl_namemap code to use (1)
3) Does some optimizations in various lookup paths to reduce the iteration count when ossl_namemap_doall_names is called
4) Creates a new hashtable in the OSSL_LIB_CTX called algcache
5) Adds OSSL_LIB_CTXget[md|kdf] calls to allow for the caching of KDF and MD objects returned from their respective fetchers
6) Augments the TLS 1.3 code to use (5), reducing the number of calls to EVP_KDF_fetch significantly, and the corresponding number of calls to ossl_namemap_name2num (by about 3.5 million calls in the handshake test at 32 threads)
7) Updates the ossl_prov_digest_load_from_params code to use the same context-based caching to store fetched MDs
Note: We're leaving some performance on the table here, as this can be expanded to cover other EVP objects as well. I'll be looking at that going forward
Results with these changes:

| threads | old usecs | old handshakes | new usecs | new handshakes | performance improvement |
|---|---|---|---|---|---|
| 2 | 1360 | 1457 | 1356 | 1469 | 0% |
| 4 | 1406 | 2792 | 1381 | 2843 | 1.8% |
| 8 | 1540 | 5112 | 1492 | 5310 | 3.2% |
| 16 | 1872 | 8341 | 1717 | 9009 | 9.0% |
| 32 | 2661 | 11307 | 2448 | 12304 | 8.7% |
| 64 | 5371 | 11596 | 4841 | 12856 | 10.9% |
Note: The performance improvement is measured using the usecs value, i.e. (old usecs - new usecs) / new usecs. The improvement is larger if you measure the total number of handshakes completed.
Note: old (baseline) values are based on the current head of the master branch
At lower thread counts we get no benefit, which isn't overly surprising, as we're spending most of our time constructing the SSL connection anyway, but as our thread counts increase we very quickly approach an 8-10% gain in performance.
I think this is probably a good time to take stock, and start considering inclusion of some of this code. I'd like to propose it in three steps:
1) Integrate the new hash table
2) Convert the namemap to use (1)
3) Introduce the context algorithm cache
Thoughts?
Sounds like a good plan. I'll take a look at some of this in more detail later.
Adds OSSL_LIB_CTXget[md|kdf] calls to allow for the caching of KDF and MD objects returned from their respective fetchers
Augments the tls1.3 code to use (5) , reducing the number of calls to EVP_KDF_fetch significantly, and the corresponding number of calls to ossl_namemap_name2num (by about 3.5 million calls in the handshake test at 32 threads)
I do not think these are necessary. This should be something that is built inside the EVP_*_fetch() functions.
Is that going to work as well though? I.e. when built into a libctx, the cache gets freed when the context is freed, so if we load a provider that changes the origin of a given EVP_*, deleting and recreating a context gets the implementation from the new provider, whereas storing these caches with the EVP libraries implies they will stick around in perpetuity.
No, it would be still stored in libctx. There is no such thing as "EVP libraries" - it is just a (large) set of similarly named functions that are all implemented in crypto/evp but they all still operate within a libctx.
Ah, you're saying that the inserts and lookups for cached objects should be done within the EVP_*_fetch call? yeah, that could work. Once I get the hash table PR sorted, I'll revisit this and make that change.
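For illustration only, a toy sketch of that direction: the fetch path itself consults a small per-libctx cache keyed by algorithm name and property query, so callers never need their own caching layer. All names here are hypothetical, and a real implementation would need locking, reference counting, and invalidation on provider load/unload:

#include <stdio.h>
#include <string.h>

#define SLOTS 8

/* Assumes the name/propq strings outlive the cache (true for literals here). */
struct cached_impl { const char *name; const char *propq; void *impl; };

struct libctx_sketch {
    struct cached_impl cache[SLOTS];
    int used;
};

/* Stand-in for the expensive provider/namemap work a real fetch does. */
static void *real_fetch(const char *name, const char *propq)
{
    static int obj;

    (void)propq;
    printf("real fetch of %s\n", name);
    return &obj;
}

static void *fetch(struct libctx_sketch *ctx, const char *name,
                   const char *propq)
{
    for (int i = 0; i < ctx->used; i++)
        if (strcmp(ctx->cache[i].name, name) == 0
                && strcmp(ctx->cache[i].propq, propq) == 0)
            return ctx->cache[i].impl;               /* cache hit */

    void *impl = real_fetch(name, propq);            /* miss: do the real work */
    if (ctx->used < SLOTS)
        ctx->cache[ctx->used++] = (struct cached_impl){ name, propq, impl };
    return impl;
}

int main(void)
{
    struct libctx_sketch ctx = { .used = 0 };

    fetch(&ctx, "HKDF", "");
    fetch(&ctx, "HKDF", "");     /* second call is answered from the cache */
    return 0;
}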
FWIW, I've made some significant adjustments to the hashtable implementation (using a hopscotch lookup rather than a chained lookup, with neighborhoods aligned on cacheline boundaries). I've also reworked the locking approach based on some comments received (not published yet), but with just the new updated hashtable being used in the namemap, I see this performance profile:

| threads | old usec | old handshakes | new usec | new handshakes | perf delta |
|---|---|---|---|---|---|
| 2 | 676 | 1463 | 659 | 1496 | 2.5700% |
| 4 | 1057 | 2783 | 1033 | 2839 | 2.3200% |
| 8 | 1350 | 5107 | 1318 | 5245 | 2.4200% |
| 16 | 1765 | 8236 | 1679 | 8699 | 5.1200% |
| 32 | 2585 | 11172 | 2463 | 11645 | 4.9500% |
| 64 | 5317 | 11537 | 5091 | 12036 | 4.4300% |
So with just the table replacement alone I see between 2-5% performance benefit. With that alongside a context based alg cache, I think we will see some significant gains
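For readers unfamiliar with the term, a rough sketch of the layout idea behind cache-line-aligned neighborhoods; this shows only the probe, not hopscotch's insertion/displacement logic, and is not the actual implementation:

#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64

struct slot { uint64_t hash; void *value; };           /* 16 bytes per slot */

/* One neighborhood = one cache line's worth of slots. */
struct neighborhood {
    alignas(CACHE_LINE) struct slot slots[CACHE_LINE / sizeof(struct slot)];
};

/* Probe only the key's home neighborhood: a single cache-line touch.
 * nbuckets must be a power of two so the mask works. */
static void *nh_lookup(const struct neighborhood *tbl, size_t nbuckets,
                       uint64_t h)
{
    const struct neighborhood *nh = &tbl[h & (nbuckets - 1)];
    size_t width = sizeof(nh->slots) / sizeof(nh->slots[0]);

    for (size_t i = 0; i < width; i++)
        if (nh->slots[i].hash == h && nh->slots[i].value != NULL)
            return nh->slots[i].value;
    return NULL;
}

int main(void)
{
    static struct neighborhood table[8];    /* 8 buckets, power of two */
    int payload = 42;

    table[7].slots[0].hash = 7;             /* pretend hash(key) == 7 */
    table[7].slots[0].value = &payload;
    printf("found: %d\n", *(int *)nh_lookup(table, 8, 7));
    return 0;
}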
So, some more numbers, and they're quite startling. I've been focused on the handshake test primarily, as that's what we started this issue with, but looking at the other perf tests that we have, the performance improvement seems... very much significantly better.
Test methodology: On my system here, I'm running the pemread, handshake, rsasign, providerdoall and sslnew tests using this script:
#!/bin/bash
#
CMD=$1
BASELIB=$2
NEWLIB=$3
ITERATIONS=$4

EXTRARG=--terse
if [ "$CMD" == "./handshake" ]
then
    EXTRARG=/home/nhorman/git/openssl/test/certs/
fi

BASERESULTS=$(mktemp /tmp/base.XXXXXX)
NEWRESULTS=$(mktemp /tmp/new.XXXXXX)

finish() {
    rm -f $BASERESULTS
    rm -f $NEWRESULTS
}

trap finish EXIT

# Run $CMD ITERATIONS times for each thread count against the library in
# $libpath and record the per-thread-count averages in $results
do_test() {
    local libpath=$1
    local results=$2

    export LD_LIBRARY_PATH=$libpath
    for i in 2 4 8 16 32 64
    do
        let tusecs=0
        let thands=0
        for j in `seq 1 1 $ITERATIONS`
        do
            $CMD $EXTRARG $i > /tmp/results
            if [ "$CMD" == "./handshake" ]
            then
                USEC=$(awk '/Average/ {print $5}' /tmp/results | sed -e"s/us//")
                HAND=$(awk '/Handshake/ {print $4}' /tmp/results)
                tusecs=$(dc -e "$tusecs $USEC + p")
                thands=$(dc -e "$thands $HAND + p")
            else
                USEC=$(cat /tmp/results)
                tusecs=$(dc -e "$tusecs $USEC + p")
            fi
        done
        if [ "$CMD" == "./handshake" ]
        then
            ausec=$(dc -e "$tusecs $ITERATIONS / p")
            ahand=$(dc -e "$thands $ITERATIONS / p")
            echo "$i $ausec $ahand" >> $results
        else
            ausec=$(dc -e "$tusecs $ITERATIONS / p")
            echo "$i $ausec" >> $results
        fi
    done
}

do_test $BASELIB $BASERESULTS
do_test $NEWLIB $NEWRESULTS

# Emit a markdown table comparing the two runs; delta = (old - new) / new
if [ "$CMD" == "./handshake" ]
then
    echo "|threads|old usec|old handshakes|new usec|new handshakes| perf delta|" > ./$CMD-results.dat
    echo "|-------|--------|--------------|--------|--------------|-----------|" >> ./$CMD-results.dat
    for i in 2 4 8 16 32 64
    do
        OLD_USEC=$(awk "/^$i / {print \$2}" $BASERESULTS)
        OLD_HAND=$(awk "/^$i / {print \$3}" $BASERESULTS)
        NEW_USEC=$(awk "/^$i / {print \$2}" $NEWRESULTS)
        NEW_HAND=$(awk "/^$i / {print \$3}" $NEWRESULTS)
        DELTA=$(dc -e "4 k $OLD_USEC $NEW_USEC - $NEW_USEC / 100 * p")
        echo "|$i|$OLD_USEC|$OLD_HAND|$NEW_USEC|$NEW_HAND|$DELTA%" >> ./$CMD-results.dat
    done
else
    echo "|threads|old usec|new usec| perf delta|" > ./$CMD-results.dat
    echo "|-------|--------|--------|-----------|" >> ./$CMD-results.dat
    for i in 2 4 8 16 32 64
    do
        OLD_USEC=$(awk "/^$i / {print \$2}" $BASERESULTS)
        NEW_USEC=$(awk "/^$i / {print \$2}" $NEWRESULTS)
        DELTA=$(dc -e "4 k $OLD_USEC $NEW_USEC - $NEW_USEC / 100 * p")
        echo "|$i|$OLD_USEC|$NEW_USEC|$DELTA%" >> ./$CMD-results.dat
    done
fi
It runs the requested command ITERATIONS times, for 2, 4, 8, 16, 32 and 64 threads, collecting the average run time for each thread count, and outputs a table of the results, including the performance delta (positive values indicating an improvement in run time).
I'm using the master branch as a baseline, and this branch to hold my improvements: https://github.com/nhorman/openssl/tree/handshake-test
The improvements amount to the new hash table, plus a replacement of the existing hash table in the namemap code with the new implementation. Other improvements there include:
1) Using the new hashtable locklessly
2) Storing multiple data types to allow for reverse mappings, avoiding table iterations in ossl_namemap_doall_names and ossl_namemap_num2name (sketched below)
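As a toy illustration of point (2): store the reverse mapping at insert time so num2name becomes a direct lookup instead of an iteration. These are hypothetical structures, not the real namemap:

#include <stdio.h>
#include <string.h>

#define MAX 16

struct two_way_map {
    const char *by_num[MAX + 1];   /* num -> name (direct index) */
    const char *names[MAX];        /* name -> num (linear here for brevity) */
    int nums[MAX];
    int count;
};

static int map_add(struct two_way_map *m, const char *name, int num)
{
    if (m->count >= MAX || num < 1 || num > MAX)
        return 0;
    m->names[m->count] = name;
    m->nums[m->count++] = num;
    m->by_num[num] = name;          /* reverse entry written at insert time */
    return 1;
}

static const char *map_num2name(const struct two_way_map *m, int num)
{
    return (num >= 1 && num <= MAX) ? m->by_num[num] : NULL;
}

int main(void)
{
    struct two_way_map m = { 0 };

    map_add(&m, "SHA2-256", 1);
    map_add(&m, "AES-128-GCM", 2);
    printf("2 -> %s\n", map_num2name(&m, 2));
    return 0;
}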
The results for the handshake test are above; the remaining results are here:

pemread:

| threads | old usec | new usec | perf delta |
|---|---|---|---|
| 2 | 13980 | 11915 | 17.3300% |
| 4 | 10198 | 6825 | 49.4200% |
| 8 | 11009 | 3869 | 184.5400% |
| 16 | 11499 | 3040 | 278.2500% |
| 32 | 9882 | 2661 | 271.3600% |
| 64 | 9037 | 2596 | 248.1100% |
sslnew:

| threads | old usec | new usec | perf delta |
|---|---|---|---|
| 2 | 2111 | 1892 | 11.5700% |
| 4 | 1342 | 1284 | 4.5100% |
| 8 | 1305 | 1207 | 8.1100% |
| 16 | 1193 | 1149 | 3.8200% |
| 32 | 978 | 970 | 0.8200% |
| 64 | 937 | 948 | -1.1600% |
rsasign:

| threads | old usec | new usec | perf delta |
|---|---|---|---|
| 2 | 46574 | 46274 | 0.6400% |
| 4 | 52035 | 51647 | 0.7500% |
| 8 | 65875 | 66244 | -0.5500% |
| 16 | 308334 | 302090 | 2.0600% |
| 32 | 781380 | 786436 | -0.6400% |
| 64 | 1592784 | 1586288 | 0.4000% |
providerdoall:

| threads | old usec | new usec | perf delta |
|---|---|---|---|
| 2 | 1018 | 1430 | -28.8100% |
| 4 | 2306 | 2942 | -21.6100% |
| 8 | 5081 | 5956 | -14.6900% |
| 16 | 9882 | 7529 | 31.2500% |
| 32 | 14473 | 15308 | -5.4500% |
| 64 | 19157 | 20193 | -5.1300% |
I need to figure out what's going on with the providerdoall test, as it doesn't appear to hit the namemap at all, so I'm not sure why there are such wild swings in performance there, but pemread and sslnew, if those numbers are to be relied on, show significant improvements.
May I also ask you to add 1 thread?
@beldmit as requested, performance numbers including single-threaded behavior:

handshake:

| threads | old usec | old handshakes | new usec | new handshakes | perf delta |
|---|---|---|---|---|---|
| 1 | 1353 | 738 | 1376 | 727 | -1.6700% |
| 2 | 1385 | 1433 | 1351 | 1474 | 2.5100% |
| 4 | 1452 | 2724 | 1391 | 2835 | 4.3800% |
| 8 | 1567 | 5038 | 1522 | 5174 | 2.9500% |
| 16 | 1929 | 7977 | 1837 | 8509 | 5.0000% |
| 32 | 2847 | 10878 | 2602 | 11508 | 9.4100% |
| 64 | 5642 | 11002 | 5264 | 11813 | 7.1800% |
newrawkey:

| threads | old usec | new usec | perf delta |
|---|---|---|---|
| 1 | 2154 | 1372 | 56.9900% |
| 2 | 1674 | 1630 | 2.6900% |
| 4 | 1388 | 1067 | 30.0800% |
| 8 | 1129 | 934 | 20.8700% |
| 16 | 1461 | 837 | 74.5500% |
| 32 | 3761 | 854 | 340.3900% |
| 64 | 2618 | 843 | 210.5500% |
pemread:

| threads | old usec | new usec | perf delta |
|---|---|---|---|
| 1 | 16085 | 15682 | 2.5600% |
| 2 | 14617 | 11544 | 26.6100% |
| 4 | 10173 | 6855 | 48.4000% |
| 8 | 10783 | 3877 | 178.1200% |
| 16 | 11216 | 3200 | 250.5000% |
| 32 | 9226 | 2650 | 248.1500% |
| 64 | 9268 | 2607 | 255.5000% |
providerdoall:

| threads | old usec | new usec | perf delta |
|---|---|---|---|
| 1 | 185 | 175 | 5.7100% |
| 2 | 885 | 1421 | -37.7100% |
| 4 | 1980 | 2823 | -29.8600% |
| 8 | 4276 | 5660 | -24.4500% |
| 16 | 8496 | 8268 | 2.7500% |
| 32 | 12414 | 14618 | -15.0700% |
| 64 | 20799 | 19623 | 5.9900% |
rsasign:

| threads | old usec | new usec | perf delta |
|---|---|---|---|
| 1 | 43523 | 43742 | -0.5000% |
| 2 | 47295 | 45811 | 3.2300% |
| 4 | 52773 | 51809 | 1.8600% |
| 8 | 65140 | 64826 | 0.4800% |
| 16 | 300849 | 295378 | 1.8500% |
| 32 | 778707 | 781526 | -0.3600% |
| 64 | 1570353 | 1563878 | 0.4100% |
sslnew:

| threads | old usec | new usec | perf delta |
|---|---|---|---|
| 1 | 2930 | 2773 | 5.6600% |
| 2 | 2040 | 1872 | 8.9700% |
| 4 | 1310 | 1250 | 4.8000% |
| 8 | 1215 | 1222 | -0.5700% |
| 16 | 1129 | 1137 | -0.7000% |
| 32 | 962 | 970 | -0.8200% |
| 64 | 968 | 946 | 2.3200% |
Many thanks!
@beldmit no problem. FWIW, in looking at the odd performance profile of the providerdoall test, I reran the test 20 times for each thread count (1, 2, 4, 8, 16, 32, 64), recorded the output, and then graphed the results with a best-fit line:
That output suggests to me that our performance is effectively identical between the baseline master branch and the hashtable/improvements branch (which I would expect in this test, given that it makes no use of the hashtable changes). What does appear to be happening is a large variance in the individual run times of this test (possibly related to stochastic thread contention; unsure right now). In either case, I think we should probably look at this issue separately.
Quite possibly we have a severe architectural issue here. We may need to rethink how we use this infrastructure internally (stack-local cache?)
See #428.