Interesting. When we introduced `AlignmentBuffer` to the MD-hash family, @randombit reported a very slight performance degradation, which we decided to ignore because it seemed minimal.
I'll try to reproduce your measurements. In the meantime: any chance you could bisect this down to a specific release? The `AlignmentBuffer` was introduced in October 2023 with Botan 3.2.0.
That result was probably for large buffers though (likely the 1K default of `speed`), whereas S2K works by hashing short strings over and over again. So if the buffering logic became slower, that could slow down S2K.
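For context, RFC 4880's iterated+salted S2K is essentially a long stream of very short hash updates, which is why small-input buffering overhead matters here. A rough sketch of the idea (not Botan's actual implementation):

```cpp
#include <botan/hash.h>

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Rough sketch of RFC 4880 iterated+salted S2K: the hash absorbs the
// short salt||passphrase string over and over, so the runtime is
// dominated by many tiny update() calls rather than by the compression
// function itself.
std::vector<uint8_t> s2k_sketch(const std::string& passphrase,
                                const std::vector<uint8_t>& salt,
                                size_t byte_count) {
   auto hash = Botan::HashFunction::create_or_throw("SHA-1");

   std::vector<uint8_t> input(salt.begin(), salt.end());
   input.insert(input.end(), passphrase.begin(), passphrase.end());

   // RFC 4880 hashes the full salt||passphrase at least once, even if
   // the iteration count is smaller than its length
   size_t remaining = std::max(byte_count, input.size());
   while(remaining > 0) {
      const size_t take = std::min(input.size(), remaining);
      hash->update(input.data(), take); // many very short updates
      remaining -= take;
   }
   return hash->final_stdvec();
}
```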
Good point, thanks. I'll experiment a bit.
Using this test program (`botan speed` doesn't support PGP-S2K):
```cpp
#include <botan/hash.h>
#include <botan/pgp_s2k.h>
#include <botan/version.h>

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <string>

uint64_t nanos() {
   auto now = std::chrono::system_clock::now().time_since_epoch();
   return std::chrono::duration_cast<std::chrono::nanoseconds>(now).count();
}

int main() {
   using namespace Botan;

   OpenPGP_S2K s2k(HashFunction::create_or_throw("SHA-1"));

   uint8_t output[64] = { 0 };
   const std::string passphrase = "very secret and everything";
   uint8_t salt[16] = { 0 };
   const size_t iterations = 10000;
   const size_t trials = 10000;

   uint64_t time = 0;
   for(size_t i = 0; i != trials; ++i) {
      uint64_t start = nanos();
      s2k.pbkdf(output, sizeof(output),
                passphrase,
                salt, sizeof(salt),
                iterations,
                std::chrono::milliseconds(0));
      time += nanos() - start;
      salt[i % sizeof(salt)] += 1;
   }

   printf("Botan %d.%d: In %zu trials took %.02f ns per trial (%zu iterations)\n",
          version_major(), version_minor(),
          trials, static_cast<double>(time) / trials,
          iterations);
}
```
I'm seeing this:

```
Botan 3.0: In 10000 trials took 158682.69 ns per trial (10000 iterations)
Botan 3.6: In 10000 trials took 126805.26 ns per trial (10000 iterations)  [current master]
```
The output of `./botan speed --buf-size=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 SHA-1` should provide an indicator of whether we're buffering inefficiently. But I'm not seeing it. The SHA-1 compression function hasn't changed much (and by 24 bytes the performance between 3.0 and master is identical), but for very small inputs latest master actually seems better, rather than worse:
All numbers on a Sandybridge x86 with Clang 18.

3.0.0:

```
SHA-1 hash buffer size 1 bytes: 3.662 MiB/sec 563.20 cycles/byte (1.83 MiB in 500.00 ms)
SHA-1 hash buffer size 2 bytes: 7.386 MiB/sec 279.09 cycles/byte (3.69 MiB in 500.00 ms)
SHA-1 hash buffer size 3 bytes: 11.092 MiB/sec 185.83 cycles/byte (5.55 MiB in 500.00 ms)
SHA-1 hash buffer size 4 bytes: 14.795 MiB/sec 139.23 cycles/byte (7.40 MiB in 500.00 ms)
```
vs master:

```
SHA-1 hash buffer size 1 bytes: 3.990 MiB/sec 510.96 cycles/byte (1.99 MiB in 500.00 ms)
SHA-1 hash buffer size 2 bytes: 7.970 MiB/sec 256.10 cycles/byte (3.99 MiB in 500.00 ms)
SHA-1 hash buffer size 3 bytes: 11.955 MiB/sec 170.73 cycles/byte (5.98 MiB in 500.00 ms)
SHA-1 hash buffer size 4 bytes: 16.017 MiB/sec 127.35 cycles/byte (8.01 MiB in 500.00 ms)
```
@ni4 Can you provide info on the size of passphrase/salt/iterations in use for the test? Also, which version of Xcode is this? Are you seeing this regression on Linux as well? Oh, and is RNP these days still calling us via the C binding (`ffi.h`)? Or directly using `Botan::PBKDF` or `Botan::PasswordHashFamily` (and if so, which one)?
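For reference, the direct `Botan::PasswordHashFamily` route would look roughly like this (a hypothetical sketch, reusing the parameters from the test program above):

```cpp
#include <botan/pwdhash.h>

#include <cstdint>
#include <string>

void derive_example() {
   const std::string passphrase = "very secret and everything";
   const uint8_t salt[16] = { 0 };
   uint8_t output[64] = { 0 };

   // Look up the S2K family, fix the iteration count, then derive
   auto family = Botan::PasswordHashFamily::create_or_throw("OpenPGP-S2K(SHA-1)");
   auto s2k = family->from_iterations(10000);

   s2k->derive_key(output, sizeof(output),
                   passphrase.data(), passphrase.size(),
                   salt, sizeof(salt));
}
```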
From his profiler screenshot I read that the call is coming through the FFI.
```cpp
#include <botan/ffi.h>

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <string>

uint64_t nanos() {
   auto now = std::chrono::system_clock::now().time_since_epoch();
   return std::chrono::duration_cast<std::chrono::nanoseconds>(now).count();
}

int main() {
   uint8_t output[64] = { 0 };
   const std::string passphrase = "very secret and everything";
   uint8_t salt[16] = { 0 };
   const size_t iterations = 10000;
   const size_t trials = 10000;

   uint64_t time = 0;
   for(size_t i = 0; i != trials; ++i) {
      uint64_t start = nanos();
      int rc = botan_pwdhash("OpenPGP-S2K(SHA-1)", iterations, 0, 0,
                             output, sizeof(output),
                             passphrase.data(), passphrase.size(),
                             salt, sizeof(salt));
      time += nanos() - start;
      if(rc != 0) {
         printf("rc %d\n", rc);
         return 1;
      }
      salt[i % sizeof(salt)] += 1;
   }

   printf("Botan %d.%d: In %zu trials took %.02f ns per trial (%zu iterations)\n",
          botan_version_major(), botan_version_minor(),
          trials, static_cast<double>(time) / trials,
          iterations);
}
```
```
Botan 3.0: In 10000 trials took 132142.55 ns per trial (10000 iterations)
Botan 3.6: In 10000 trials took 121881.86 ns per trial (10000 iterations)
```
@ni4 Can you run the above program on your machine with 3.0.0 vs 3.6.0 and post the results? (Tuned if necessary so the inputs/params match RNP's usage)
macOS 14.5 on MacBook Air M2, Apple Clang 15:

```
Botan 3.0: In 10000 trials took 38501.70 ns per trial (10000 iterations)
Botan 3.1: In 10000 trials took 39105.30 ns per trial (10000 iterations)
Botan 3.2: In 10000 trials took 40112.70 ns per trial (10000 iterations)
Botan 3.6: In 10000 trials took 40755.60 ns per trial (10000 iterations)
```
... there does seem to be a little slowdown of about 5% in release mode. Hmm, actually not: with multiple runs, it gets lost in the jitter. @ni4 mentioned he's running it with "enabled debugging", that might inflate this?!
I'll try to bisect over the releases...
@ni4 Also, can you post the exact `configure.py` flags?
Debug mode, perhaps? (`./configure.py --build-targets=static,cli,tests --compiler-cache=ccache --werror-mode --minimized-build --enable-modules=sha1,pgp_s2k,sha1_armv8 --debug-mode`)
```
Botan 3.0: In 10000 trials took 377942.90 ns per trial (10000 iterations)
Botan 3.1: In 10000 trials took 382893.70 ns per trial (10000 iterations)
Botan 3.2: In 10000 trials took 1064952.30 ns per trial (10000 iterations)
Botan 3.6: In 10000 trials took 1071368.10 ns per trial (10000 iterations)
```
... Using `--with-debug`, this shows a three-fold slowdown. As reported...
`--debug-mode` disables all optimizations though!
Maybe the issue is that we've made changes that are "free" with optimization but which cause the compiler to emit worse code with no optimizations?
I'm half tempted to remove the `--debug-mode` and `--no-optimizations` flags as footguns.
FWIW: The signal in debug mode is big enough to bisect based on it. This commit introduces the slowdown: https://github.com/randombit/botan/commit/1e1b4316a92f1d00e00f22aed76f96ac58ebe3b1
SHA-1 starts using `MerkleDamgard_Hash<>` in this commit (before, it used the `MDx_HashFunction` base class that is now removed). And `MerkleDamgard_Hash<>` uses `AlignmentBuffer` internally.
> Maybe the issue is that we've made changes that are "free" with optimization but which cause the compiler to emit worse code with no optimizations?
Most certainly! I can totally imagine that the added abstractions are harming performance when optimizations are disabled, in contrast to the fairly verbatim implementation that was there before. In fact, https://github.com/randombit/botan/commit/0ed41ccbcdd95fad66901fb15fcb4b97590198e6 also introduces a measurable slowdown when using `--debug-mode`, I'm guessing due to the introduction of `BufferSlicer`.
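To illustrate the effect (a toy example, not Botan's actual `AlignmentBuffer`/`BufferSlicer` code): a thin buffering abstraction like the one below typically inlines down to plain stores at `-O2`, but at `-O0` every byte pays for several real, un-inlined function calls.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy buffering wrapper: essentially free when optimized, costly at -O0
template <size_t N>
class ToyAlignmentBuffer {
   public:
      // Returns true once a full N-byte block has been accumulated
      bool add_byte(uint8_t b) {
         m_buf[m_pos++] = b; // at -O0, operator[] etc. are real calls
         if(m_pos == N) {
            m_pos = 0;
            return true;
         }
         return false;
      }

      const uint8_t* block() const { return m_buf.data(); }

   private:
      std::array<uint8_t, N> m_buf{};
      size_t m_pos = 0;
};

// With -O2 this loop collapses to nearly the hand-written original;
// with -O0 each byte goes through several call frames, which is the
// kind of overhead that disappears entirely with optimization.
void absorb(ToyAlignmentBuffer<64>& buf, const uint8_t* in, size_t len,
            void (*compress)(const uint8_t*)) {
   for(size_t i = 0; i != len; ++i) {
      if(buf.add_byte(in[i])) {
         compress(buf.block());
      }
   }
}
```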
Perhaps @ni4 could use `--with-debug-info` instead of `--debug-mode`?
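I.e., taking the configure line quoted above and swapping the debug flag, something like:

```
./configure.py --build-targets=static,cli,tests --compiler-cache=ccache \
  --werror-mode --minimized-build --enable-modules=sha1,pgp_s2k,sha1_armv8 \
  --with-debug-info
```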
Oh, sorry, I got distracted from the email and lagged a bit with replies. Definitely `--debug-mode` is the reason for this; replacing it with `--with-debug-info` speeds things up dramatically. Now the same test runs in ~2 seconds instead of 10 or 30.
Thanks for your time and effort in resolving this! This can be closed now, I guess.
@ni4 No problem at all, thanks for reporting this - unexpected/undocumented performance regressions are definitely something we want to know about.
Recently I discovered that our RNP tests are running much slower than before. Digging into it, I found out that at some point after Botan 3.0, S2K key derivation based on the SHA-1 hash (and possibly others) started to work much slower, partially due to the introduction of `AlignmentBuffer`. Overall, a test which used to run in 11 seconds now runs in 31. I'm building Botan from sources, with debugging enabled, on macOS. Please see the attached sample call stack with information from Instruments.
Feel free to ask for any further information.