DirtYiCE opened this issue 3 years ago
Profiling this, getrn, __pthread_rwlock_rdlock and __pthread_rwlock_unlock take unusual amounts of time (18.85%, 15.79%, 15.65%) in atomic operations. Particularly strange given this test program is single threaded.
In callgrind, ossl_lib_ctx_get_data takes 20%, so this will be helped by #17116.
Yes, the OSSL_LIB_CTX locking is particularly horrific (as mentioned in #17116) and ossl_lib_ctx_get_data will be a primary path for hitting that code, so it's not a big surprise that we're running into problems as a result.
Incorporating my recent performance fixes, as well as my libctx refactor, this drops from 1.315s in 3.0 to 0.982s (1.34x speedup). 1.1 is still 0.038s, so this remains an issue.
I don't see that change any time soon :wink:
https://github.com/openssl/openssl/pull/17881 fixes the lock contention in ossl_lib_ctx_get_data.
Any chance this will get some attention, or is there any workaround? I am surprised such a massive regression made it into a release.
I have an application where I parse a large number of private keys, and it turns out it is practically unusable in openssl 3.0 and close to unusable in 3.1. This is using python cryptography, but digging down it appears the problem is within openssl itself.
I did some rough benchmarks, and with 3.0 there's a 7000% slowdown. With 3.1 it's "better" with "only" a slowdown of 1000%.
@hannob are you able to provide a synthetic benchmark test?
key.c:
#include <stdio.h>
#include <openssl/evp.h>
#include <openssl/pem.h>

int main() {
    EVP_PKEY *key;
    FILE *f;
    int i;

    for (i = 0; i < 100000; i++) {
        f = fopen("test.key", "r");
        if (f == NULL) {
            printf("cannot open test.key\n");
            return 1;
        }
        key = PEM_read_PrivateKey(f, NULL, NULL, NULL);
        fclose(f);
        if (key == NULL) {
            printf("cannot parse test.key\n");
            return 1;
        }
        EVP_PKEY_free(key);
    }
    return 0;
}
test.key:
-----BEGIN RSA PRIVATE KEY-----
MIIBOgIBAAJBAMFcGsaxxdgiuuGmCkVImy4h99CqT7jwY3pexPGcnUFtR2Fh36Bp
oncwtkZ4cAgtvd4Qs8PkxUdp6p/DlUmObdkCAwEAAQJAUR44xX6zB3eaeyvTRzms
kHADrPCmPWnr8dxsNwiDGHzrMKLN+i/HAam+97HxIKVWNDH2ba9Mf1SA8xu9dcHZ
AQIhAOHPCLxbtQFVxlnhSyxYeb7O323c3QulPNn3bhOipElpAiEA2zZpBE8ZXVnL
74QjG4zINlDfH+EOEtjJJ3RtaYDugvECIBtsQDxXytChsRgDQ1TcXdStXPcDppie
dZhm8yhRTTBZAiAZjE/U9rsIDC0ebxIAZfn3iplWh84yGB3pgUI3J5WkoQIhAInE
HTUY5WRj5riZtkyGnbm3DvF+1eMtO2lYV+OuLcfE
-----END RSA PRIVATE KEY-----
Run:
gcc key.c -lcrypto
time ./a.out
On my laptop it's around 0.7 seconds with openssl 1.1, 40 seconds with 3.0 and 7 seconds with 3.1.
Could you also please try 3.2? Do I correctly understand that it's ARM?
With 3.2 you're referring to the current git code? I'll try. (AFAIK there is no 3.2 release or alpha/beta yet, right?) No, this is a normal Intel x86/64bit cpu.
Yes, the current master; sorry for being unclear.
Linking against current git master it's ~12 seconds (so it got worse compared to 3.1).
It's quite strange. Are you able to get any profiling data?
It is trivial to use that test case and get profiling data.
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
12.26 8.12 8.12 529000395 0.00 0.00 OPENSSL_sk_value
9.24 14.25 6.12 200000000 0.00 0.00 collect_extra_decoder
5.51 17.90 3.65 204300000 0.00 0.00 ossl_decoder_fast_is_a
4.93 21.17 3.27 50002087 0.00 0.00 ossl_lh_strcasehash
3.98 23.80 2.63 266900000 0.00 0.00 sk_OSSL_DECODER_INSTANCE_value
3.93 26.41 2.60 100000 0.00 0.00 OSSL_DECODER_CTX_add_extra
3.56 28.77 2.36 58607221 0.00 0.00 getrn
2.35 30.32 1.55 200000000 0.00 0.00 sk_OSSL_DECODER_value
2.31 31.86 1.53 208500000 0.00 0.00 OSSL_DECODER_get0_provider
2.07 33.23 1.38 204000000 0.00 0.00 OSSL_PROVIDER_get0_provider_ctx
1.89 34.48 1.25 162000000 0.00 0.00 ossl_decoder_get_number
1.81 35.68 1.20 300008 0.00 0.00 sa_doall
1.78 36.86 1.18 374800000 0.00 0.00 ossl_assert_int
1.63 37.94 1.08 48100000 0.00 0.00 resolve_name
1.52 38.95 1.01 204000000 0.00 0.00 ossl_provider_prov_ctx
1.51 39.95 1.00 50000058 0.00 0.00 ossl_namemap_name2num
1.45 40.91 0.96 58601805 0.00 0.00 OPENSSL_LH_retrieve
1.42 41.85 0.94 50400116 0.00 0.00 ossl_namemap_stored
1.42 42.79 0.94 8300040 0.00 0.00 CRYPTO_UP_REF
1.24 43.62 0.82 50400116 0.00 0.00 ossl_namemap_empty
1.17 44.39 0.78 4000000 0.00 0.00 collect_decoder
1.09 45.11 0.72 8601072 0.00 0.00 OPENSSL_LH_strhash
1.09 45.83 0.72 100000 0.00 0.00 EVP_DecodeUpdate
1.03 46.52 0.69 66832523 0.00 0.00 ossl_tolower
1.00 47.17 0.66 50000861 0.00 0.00 namemap_name2num
0.97 47.81 0.64 66800684 0.00 0.00 ossl_lib_ctx_get_data
0.92 48.42 0.61 68402450 0.00 0.00 CRYPTO_THREAD_unlock
0.83 48.98 0.55 50001129 0.00 0.00 namenum_hash
0.79 49.50 0.53 50000861 0.00 0.00 lh_NAMENUM_ENTRY_retrieve
0.75 50.00 0.50 pthread_rwlock_rdlock
0.72 50.48 0.48 7901304 0.00 0.00 OPENSSL_strcasecmp
0.72 50.96 0.48 800000 0.00 0.00 bin2bn
0.70 51.42 0.47 68100866 0.00 0.00 CRYPTO_THREAD_read_lock
0.70 51.89 0.47 100000 0.00 0.00 sk_OSSL_DECODER_new_null
0.66 52.33 0.44 OSSL_DECODER_get0_properties
0.63 52.74 0.41 OSSL_PROVIDER_get0_dispatch
0.59 53.13 0.39 86900000 0.00 0.00 conv_ascii2bin
0.59 53.52 0.39 700000 0.00 0.00 evp_decodeblock_int
0.59 53.91 0.39 pthread_rwlock_unlock
0.56 54.28 0.37 OSSL_DECODER_get0_name
0.50 54.62 0.33 100070 0.00 0.00 doall_util_fn
0.49 54.94 0.33 45000000 0.00 0.00 collect_decoder_keymgmt
0.48 55.26 0.32 8600000 0.00 0.00 ossl_bsearch
0.47 55.57 0.31 45100000 0.00 0.00 sk_EVP_KEYMGMT_value
0.45 55.87 0.30 8600434 0.00 0.00 ossl_property_string
0.44 56.16 0.29 llseek
0.37 56.41 0.24 68200843 0.00 0.00 ossl_lib_ctx_get_concrete
0.37 56.65 0.24 4300000 0.00 0.00 ossl_decoder_instance_new
0.34 56.88 0.23 26800000 0.00 0.00 do_name
0.30 57.08 0.20 8600000 0.00 0.00 ossl_property_find_property
0.29 57.27 0.19 6300000 0.00 0.00 ossl_property_str
0.27 57.45 0.18 54400098 0.00 0.00 ossl_provider_libctx
0.27 57.62 0.18 300112 0.00 0.00 OPENSSL_sk_pop_free
0.27 57.80 0.18 _int_malloc
0.26 57.98 0.17 300000 0.00 0.00 ossl_lib_ctx_is_default
0.25 58.15 0.17 13700889 0.00 0.00 CRYPTO_zalloc
0.25 58.31 0.17 301584 0.00 0.00 CRYPTO_THREAD_write_lock
0.24 58.47 0.16 1000000 0.00 0.00 asn1_item_embed_d2i
0.24 58.63 0.16 __memset_avx2_unaligned_erms
0.23 58.78 0.15 8300080 0.00 0.00 OSSL_DECODER_free
0.23 58.94 0.15 ossl_toupper
0.23 59.09 0.15 4200000 0.00 0.00 alg_do_each
0.23 59.24 0.15 300000 0.00 0.00 decoder_process
0.20 59.38 0.14 268 0.00 0.00 lh_NAMENUM_ENTRY_error
0.20 59.51 0.13 1800368 0.00 0.00 BIO_gets
0.20 59.63 0.13 100028 0.00 0.00 OPENSSL_LH_num_items
0.20 59.77 0.13 ossl_provider_teardown
0.17 59.88 0.12 9800282 0.00 0.00 OPENSSL_sk_num
0.17 59.99 0.11 15200000 0.00 0.00 property_idx_cmp
0.17 60.10 0.11 12501184 0.00 0.00 CRYPTO_THREAD_run_once
0.17 60.21 0.11 8800405 0.00 0.00 OPENSSL_sk_insert
0.17 60.32 0.11 8800405 0.00 0.00 OPENSSL_sk_push
0.17 60.43 0.11 58 0.00 0.00 ossl_namemap_name2num_n
And this call graph excerpt probably helps to see the context.
-----------------------------------------------
2.60 46.86 100000/100000 OSSL_DECODER_CTX_new_for_pkey [8]
[9] 74.6 2.60 46.86 100000 OSSL_DECODER_CTX_add_extra [9]
6.12 32.58 200000000/200000000 collect_extra_decoder [10]
1.55 3.07 200000000/200000000 sk_OSSL_DECODER_value [32]
0.01 2.83 100000/200000 OSSL_DECODER_do_all_provided [30]
0.47 0.00 100000/100000 sk_OSSL_DECODER_new_null [103]
0.05 0.08 5000000/266900000 sk_OSSL_DECODER_INSTANCE_value [25]
0.01 0.07 100000/100000 sk_OSSL_DECODER_pop_free [201]
0.01 0.00 5000000/9400000 OSSL_DECODER_INSTANCE_get_input_type [342]
0.00 0.00 100000/400000 sk_OSSL_DECODER_INSTANCE_num [396]
0.00 0.00 100000/100000 sk_OSSL_DECODER_num [463]
0.00 0.00 100000/5000000 ossl_assert_int [284]
-----------------------------------------------
6.12 32.58 200000000/200000000 OSSL_DECODER_CTX_add_extra [9]
[10] 58.4 6.12 32.58 200000000 collect_extra_decoder [10]
3.61 15.65 202100000/204300000 ossl_decoder_fast_is_a [11]
2.56 3.98 259400000/266900000 sk_OSSL_DECODER_INSTANCE_value [25]
1.35 0.99 200000000/204000000 OSSL_PROVIDER_get0_provider_ctx [42]
0.12 2.02 2100000/4300000 ossl_decoder_instance_new [33]
1.47 0.63 200000000/208500000 OSSL_DECODER_get0_provider [45]
0.04 0.09 1800000/4300000 ossl_decoder_instance_free [120]
0.01 0.03 1900000/1900000 pem2der_newctx [230]
0.01 0.02 300000/2500000 ossl_decoder_ctx_add_decoder_inst [133]
0.00 0.00 2100000/9400000 OSSL_DECODER_INSTANCE_get_input_type [342]
0.00 0.00 100000/100000 epki2pki_newctx [460]
0.00 0.00 100000/100000 spki2typespki_newctx [461]
-----------------------------------------------
0.04 0.17 2200000/204300000 decoder_process <cycle 5> [36]
3.61 15.65 202100000/204300000 collect_extra_decoder [10]
[11] 29.4 3.65 15.82 204300000 ossl_decoder_fast_is_a [11]
1.08 12.98 48100000/48100000 resolve_name [12]
1.25 0.51 162000000/162000000 ossl_decoder_get_number [50]
-----------------------------------------------
1.08 12.98 48100000/48100000 ossl_decoder_fast_is_a [11]
[12] 21.2 1.08 12.98 48100000 resolve_name [12]
0.96 9.41 48100000/50000058 ossl_namemap_name2num [13]
0.90 1.55 48100000/50400116 ossl_namemap_stored [41]
0.16 0.00 48100000/54400098 ossl_provider_libctx [136]
-----------------------------------------------
0.00 0.00 58/50000058 ossl_namemap_name2num_n [177]
0.00 0.02 100000/50000058 evp_is_a [262]
0.04 0.35 1800000/50000058 OSSL_DECODER_is_a [98]
0.96 9.41 48100000/50000058 resolve_name [12]
[13] 16.3 1.00 9.78 50000058 ossl_namemap_name2num [13]
0.66 8.33 50000058/50000861 namemap_name2num [15]
0.45 0.00 50000058/68402450 CRYPTO_THREAD_unlock [89]
0.34 0.00 50000058/68100866 CRYPTO_THREAD_read_lock [106]
-----------------------------------------------
I think @mattcaswell has optimized the decoder speed - or was it a different code path?
This profile is from current master
I remember that Matt has significantly improved the performance in decoding and it should be present in both 3.1 and master. I can't find the exact issue, unfortunately.
And I don't understand why we have a slowdown in master compared to 3.1.
callgrind shows me 30% time spent in ossl_x25519_public_from_private. It's completely weird if we are speaking about an RSA key.
Never mind, I had old files in my folder, sorry for the noise.
This reminds me that we had/have code that tries all possible formats and key types, even when we already know which format and key type it is. There are just too many calls to those functions.
I think @mattcaswell has optimized the decoder speed - or was it a different code path?
I didn't optimize the decoder code itself. There were some specific APIs which were making non-optimal use of the decoders which I fixed. But if you don't happen to call those very specific APIs then you won't notice a difference.
Does this function use optimal calls?
The reproducer code that @hannob has supplied is just calling PEM_read_PrivateKey - so that should be fine. I've not investigated this reproducer further than that.
So can we deal with a slowdown here?
Nothing obvious is springing to mind. But I'll take a look and see if I can come up with anything.
For master compared to 3.1 I'm seeing approximately a 5% slowdown.
The picture is complicated, but a significant part of the slowdown between master and 3.1 seems to be due to #18819, which adds a new decoder. Possibly due to a multiplying effect of the number of decoders being considered.
Maybe it's worth prioritizing RSA/EC/Edwards curves somehow in decoder iterations?
There may be some structural issues to address before doing that. This collect_extra_decoder function seems to be called 2000 times in a single private key parsing operation.
Poking around the code in the profile, it seems OpenSSL is quadratic in the number of decoders you have, and there are quite a lot of them. I don't fully follow what OSSL_DECODER_CTX_add_extra is doing, but it seems OpenSSL now models generic "container" formats like PEM, PrivateKeyInfo, etc., as pluggable conversion functions with input/output formats? And then, before it does anything, it looks like OSSL_DECODER_CTX_add_extra tries to build all possible paths through these conversion functions, matching inputs to outputs?
The first layer of this path-building starts with a set of leaf "decoder instances" and then, for each one (twice, actually), iterates over every "decoder" to find the ones that can chain backwards to that decoder instance's input type. That means decoders are tapped on the head 2 × decoder_instances × decoders times. (And then the next layer adds another O(decoders) iterations. I suspect this would also go exponential if the wrong series of container formats were added to the system.)
I couldn't find any documentation on why this path-builder design was chosen. It is a little odd to me because one shouldn't need to explore the whole tree of paths to parse one input. I think this comes out of OpenSSL working backwards from all possible outputs, rather than working forwards from the one input you have. The way these container formats are designed (and generally when parsing things), the forwards is the natural direction:
1. PEM_read_PrivateKey being called, we know the input is PEM. PEM is a type string (BEGIN WHATEVER) followed by base64-encoded data.
2. Suppose the type is RSA PRIVATE KEY, so you find the provider(s) that support RSA PRIVATE KEY and ask them to parse the data. They should parse an RSAPrivateKey ASN.1 structure and that's the end of it.
3. Now suppose instead the type is PRIVATE KEY. That means the type is PKCS#8 PrivateKeyInfo. That too is a generic, provider-independent format that, more-or-less, expands to another (type, data) pair. This time the type is an AlgorithmIdentifier.

At no point do you need to explore every possible chain of decoders. The only operation you need is "I have an input of format X. Call the function that parses X". Where there need only be a single implementation, that can be a direct function call. Where you wish it to be pluggable, that only needs a simple lookup. A container format decoder is just something that needs to do that dispatch a second time.
Possible patch in #21426
Regardless of my patch I think @davidben makes good points and we do need to look at the design of the decoder subsystem.
@mattcaswell I like your improvement but it adds complexity to the system - and @davidben describes something that looks to me like a simplification...
Yes, I agree but it's probably more of a 3.3 thing whereas I think we can get my patch into 3.2. It's not completely clear to me if we can do what @davidben suggests without breaking changes/new API. It's at the least a significant refactor. I do think we should do it though.
Yeah, the proposition by @davidben is definitely not something we could get into 3.2. Now the question is whether https://github.com/openssl/openssl/pull/21426 is OK for 3.2 or not, as it is not particularly trivial either.
Yes, I agree but it's probably more of a 3.3 thing whereas I think we can get my patch into 3.2.
Now that 3.2 is released and you all are working on 3.3, I may suggest taking some time to think about the structural issues in the encoder/decoder system. Working backwards and guessing formats is not a performant, secure, or well-defined way to decode an input. Start from the input type that the caller asked you to parse and work forwards.
This is already on our radar of things we might get into 3.3:
Start from the input type that the caller asked you to parse and work forwards.
Unfortunately we will also have to support the wildcard decoding as regressing this is not an option.
Unfortunately we will also have to support the wildcard decoding as regressing this is not an option.
By wildcard decoding, I assume you mean where the caller forgets to tell you the format, and OpenSSL has to try between several overlapping (i.e. ambiguous) formats? Ambiguous parsers are a thoroughly well-studied and well-understood source of security vulnerabilities, not to mention unpredictable behavior (i.e. backwards compatibility risks). Hopefully you all carefully analyzed all pairs of supported formats before exposing them to be part of a wildcard decode. E.g. all 32-byte strings are valid raw X25519 keys, but a 32-byte string may also be something else entirely, so raw X25519 can never participate in such a scheme.
Design problems aside, I see no reason why it requires this backwards parse. Even if the caller forgets to tell you the format, that problem only applies at the top level. Once you've decided to try, say, DER PKCS#8 PrivateKeyInfo, or PEM[*], all parsing from then on is unambiguous. You only need to try ambiguous cases and backtrack at the top-level. And, of course, even if you did need to backtrack on the inner layers (again, you do not), that's still no reason to build the graph ahead of time.
You really don't need to pre-explore all possible call graphs like this. It's just overhead, not flexibility.
[*] PEM and DER are not analogous, so a good parser API design would take this into account. PEM contains a type header, so the caller can actually say "I have a PEM private key" and let the library dispatch between "BEGIN RSA PRIVATE KEY" and "BEGIN PRIVATE KEY". Whereas a DER encoding is not necessarily sufficient to identify the structure, so the caller must say "I have a DER RSAPrivateKey" or "I have a DER PrivateKeyInfo".
Design problems aside, I see no reason why it requires this backwards parse.
Yeah, sure. I did not say that supporting this wildcard decode isn't possible with forward parsing. I am just saying we will have to be able to backtrack even to the topmost level.
Now OpenSSL 3.4.0-dev seems to be <2 times slower than 1.1.1w, with the OP code (on a noisy intel x86_64 machine):
hyperfine -r 100 -w 10 'LD_LIBRARY_PATH=/opt/openssl-3.4.0-dev/lib64 ./parse_pem_rsa-ossl3-dev' 'LD_LIBRARY_PATH=/opt/openssl-1.1.1w-dev/lib64 ./parse_pem_rsa-ossl1-dev'
Benchmark 1: LD_LIBRARY_PATH=/opt/openssl-3.4.0-dev/lib64 ./parse_pem_rsa-ossl3-dev
Time (mean ± σ): 100.2 ms ± 15.8 ms [User: 98.8 ms, System: 1.5 ms]
Range (min … max): 89.6 ms … 147.7 ms 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: LD_LIBRARY_PATH=/opt/openssl-1.1.1w-dev/lib64 ./parse_pem_rsa-ossl1-dev
Time (mean ± σ): 51.3 ms ± 13.6 ms [User: 49.7 ms, System: 1.8 ms]
Range (min … max): 40.6 ms … 87.0 ms 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Summary
'LD_LIBRARY_PATH=/opt/openssl-1.1.1w-dev/lib64 ./parse_pem_rsa-ossl1-dev' ran
1.95 ± 0.60 times faster than 'LD_LIBRARY_PATH=/opt/openssl-3.4.0-dev/lib64 ./parse_pem_rsa-ossl3-dev'
hyperfine -r 100 -w 10 'LD_LIBRARY_PATH=/opt/openssl-3.4.0-dev/lib64 ./parse_der-ossl3-dev' 'LD_LIBRARY_PATH=/opt/openssl-1.1.1w-dev/lib64 ./parse_der-ossl1-dev'
Benchmark 1: LD_LIBRARY_PATH=/opt/openssl-3.4.0-dev/lib64 ./parse_der-ossl3-dev
Time (mean ± σ): 37.8 ms ± 10.5 ms [User: 36.9 ms, System: 1.4 ms]
Range (min … max): 29.7 ms … 84.5 ms 100 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: LD_LIBRARY_PATH=/opt/openssl-1.1.1w-dev/lib64 ./parse_der-ossl1-dev
Time (mean ± σ): 31.4 ms ± 8.6 ms [User: 29.9 ms, System: 1.8 ms]
Range (min … max): 13.5 ms … 46.2 ms 100 runs
Summary
'LD_LIBRARY_PATH=/opt/openssl-1.1.1w-dev/lib64 ./parse_der-ossl1-dev' ran
1.20 ± 0.47 times faster than 'LD_LIBRARY_PATH=/opt/openssl-3.4.0-dev/lib64 ./parse_der-ossl3-dev'
Now OpenSSL 3.4.0-dev seems to be <2 times slower than 1.1.1w, with the OP code (on a noisy intel x86_64 machine):
I am afraid this is as fast as it can be with the current design.
Interesting that the der decoding was quite a bit faster than pem. I wonder why that might be?
I observed that base64 decoding (EVP_DecodeUpdate) was the slowest for the pem case.
I observed that base64 decoding (EVP_DecodeUpdate) was the slowest for the pem case.
You mean that base64 decoding was significantly slower in 3.4 than 1.1.1?
I haven't checked that; what I saw was that in 3.4, in the pem decoding case, the base64 decoding operation was the slowest. I can share profiler output if necessary (I don't have access to that right now).
Interesting that the der decoding was quite a bit faster than pem. I wonder why that might be?
My suspicion is the chaining - the der decoding does not chain any decoders, it directly does der2key decoding. Where in the pem decoding case there are pem2der and der2key decoders chained.
It can be an issue. And we can also have a bottleneck on base64 itself, as I don't think we have ever optimized it.
I am afraid this is as fast as it can be with the current design.
Keep in mind the notes here. The current design is parsing things in the wrong direction and is doing piles and piles of unnecessary work. https://github.com/openssl/openssl/issues/15199#issuecomment-1631038583
Setting aside whether all the design goals were wise, OpenSSL could still have met its design goals with a better implementation. These are, at the end of the day, not very complex formats.
Regarding base64, keep in mind that PEM is often used to carry private keys. We actually switched our base64 decoder to a constant-time one.
We have two sample programs to parse PKCS8 RSA private keys. First to parse one in PEM format:
With OpenSSL 1.1.1k, it takes about 0.017s to execute, while with master (6ef2f71ac70aff99da277be4a554e3b1fe739050) it takes about 1.4s.
Next one to decode a DER formatted private key:
This takes about 0.005s with 1.1.1k, while it takes about 0.44s with master. I'm using x64 linux, with Core i7-8700K.