randombit / botan

Cryptography Toolkit
https://botan.randombit.net
BSD 2-Clause "Simplified" License
2.6k stars 570 forks

Significant slowdown of S2K key derivation/SHA1 hashing #4243

Closed: ni4 closed this issue 4 months ago

ni4 commented 4 months ago

Recently I discovered that our RNP tests are running much slower than before. Digging into it, I found that at some point after Botan 3.0, S2K key derivation based on the SHA-1 hash (and possibly others) started to work much slower, partially due to the introduction of AlignmentBuffer. Overall, a test which used to run in 11 seconds now takes 31. I'm building Botan from sources, with debugging enabled, on macOS. Please see the attached sample call stack with information from Instruments.

[Screenshot: Instruments call tree, 2024-07-22]

Feel free to ask for any further information.

reneme commented 4 months ago

Interesting. When we introduced AlignmentBuffer to the MD-hash family, @randombit reported a very slight performance degradation which we decided to ignore because it seemed minimal.

I'll try to reproduce your measurements. In the meantime: any chance you could bisect this down to a specific release? The AlignmentBuffer was introduced in October 2023 with Botan 3.2.0.

randombit commented 4 months ago

That result was probably for large buffers, though (likely the 1K default of speed), whereas S2K works by hashing short strings over and over again. So if the buffering logic became slower, that could slow down S2K.

reneme commented 4 months ago

Good point, thanks. I'll experiment a bit.

randombit commented 4 months ago

Using this test program (botan speed doesn't support PGP-S2K):

#include <botan/pgp_s2k.h>
#include <botan/version.h>

#include <chrono>
#include <cstdio>
#include <string>

uint64_t nanos() {
   auto now = std::chrono::system_clock::now().time_since_epoch();
   return std::chrono::duration_cast<std::chrono::nanoseconds>(now).count();
}

int main() {
   using namespace Botan;

   OpenPGP_S2K s2k(HashFunction::create_or_throw("SHA-1"));

   uint8_t output[64] = { 0 };
   const std::string passphrase = "very secret and everything";
   uint8_t salt[16] = { 0 };

   const size_t iterations = 10000;
   const size_t trials = 10000;

   uint64_t time = 0;
   for(size_t i = 0; i != trials; ++i) {
      uint64_t start = nanos();
      s2k.pbkdf(output, sizeof(output),
                passphrase,
                salt, sizeof(salt),
                iterations,
                std::chrono::milliseconds(0));
      time += nanos() - start;

      salt[i % sizeof(salt)] += 1;
   }

   printf("Botan %d.%d: In %zu trials took %.02f ns per trial (%zu iterations)\n",
          version_major(), version_minor(),
          trials, static_cast<double>(time) / trials,
          iterations);
}

I'm seeing this

Botan 3.0: In 10000 trials took 158682.69 ns per trial (10000 iterations)
Botan 3.6: In 10000 trials took 126805.26 ns per trial (10000 iterations) [current master]

The output of ./botan speed --buf-size=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 SHA-1 should indicate whether we're buffering inefficiently. But I'm not seeing it. The SHA-1 compression function hasn't changed much (and by 24 bytes the performance of 3.0 and master is identical), but for very small inputs latest master actually seems better, rather than worse:

All numbers on a Sandy Bridge x86 with Clang 18.

Botan 3.0.0:

SHA-1 hash buffer size 1 bytes: 3.662 MiB/sec 563.20 cycles/byte (1.83 MiB in 500.00 ms)
SHA-1 hash buffer size 2 bytes: 7.386 MiB/sec 279.09 cycles/byte (3.69 MiB in 500.00 ms)
SHA-1 hash buffer size 3 bytes: 11.092 MiB/sec 185.83 cycles/byte (5.55 MiB in 500.00 ms)
SHA-1 hash buffer size 4 bytes: 14.795 MiB/sec 139.23 cycles/byte (7.40 MiB in 500.00 ms)

vs master

SHA-1 hash buffer size 1 bytes: 3.990 MiB/sec 510.96 cycles/byte (1.99 MiB in 500.00 ms)
SHA-1 hash buffer size 2 bytes: 7.970 MiB/sec 256.10 cycles/byte (3.99 MiB in 500.00 ms)
SHA-1 hash buffer size 3 bytes: 11.955 MiB/sec 170.73 cycles/byte (5.98 MiB in 500.00 ms)
SHA-1 hash buffer size 4 bytes: 16.017 MiB/sec 127.35 cycles/byte (8.01 MiB in 500.00 ms)
randombit commented 4 months ago

@ni4 Can you provide info on the size of passphrase/salt/iterations in use for the test? Also, which version of Xcode is this? Are you seeing this regression on Linux as well? Oh, and is RNP these days still calling us via the C binding (ffi.h), or directly using Botan::PBKDF or Botan::PasswordHashFamily (and if so, which one)?

reneme commented 4 months ago

From his profiler screenshot I read that the call is coming through the FFI.

randombit commented 4 months ago

#include <botan/ffi.h>
#include <chrono>
#include <cstdio>
#include <string>

uint64_t nanos() {
   auto now = std::chrono::system_clock::now().time_since_epoch();
   return std::chrono::duration_cast<std::chrono::nanoseconds>(now).count();
   }

int main() {
   uint8_t output[64] = { 0 };
   const std::string passphrase = "very secret and everything";
   uint8_t salt[16] = { 0 };

   const size_t iterations = 10000;
   const size_t trials = 10000;
   uint64_t time = 0;

   for(size_t i = 0; i != trials; ++i) {
      uint64_t start = nanos();

      int rc = botan_pwdhash("OpenPGP-S2K(SHA-1)", iterations, 0, 0,
                             output, sizeof(output),
                             passphrase.data(), passphrase.size(),
                             salt, sizeof(salt));
      time += nanos() - start;

      if(rc != 0) {
         printf("rc %d\n", rc);
         return 1;
      }

      salt[i % sizeof(salt)] += 1;
   }

   printf("Botan %d.%d: In %zu trials took %.02f ns per trial (%zu iterations)\n",
          botan_version_major(), botan_version_minor(),
          trials, static_cast<double>(time) / trials,
          iterations);
}

Botan 3.0: In 10000 trials took 132142.55 ns per trial (10000 iterations)
Botan 3.6: In 10000 trials took 121881.86 ns per trial (10000 iterations)

randombit commented 4 months ago

@ni4 Can you run the above program on your machine with 3.0.0 vs 3.6.0 and post the results? (Tuned if necessary so the inputs/params match RNP's usage)

reneme commented 4 months ago

macOS 14.5 on MacBook Air M2, Apple clang 15

Botan 3.0: In 10000 trials took 38501.70 ns per trial (10000 iterations)
Botan 3.1: In 10000 trials took 39105.30 ns per trial (10000 iterations)
Botan 3.2: In 10000 trials took 40112.70 ns per trial (10000 iterations)
Botan 3.6: In 10000 trials took 40755.60 ns per trial (10000 iterations)

... there does seem to be a little slowdown of about 5% in release mode. Mhh, actually not: with multiple runs, it gets lost in the jitter. @ni4 mentioned he's running it with "enabled debugging", that might inflate this?!

I'll try to bisect over the releases...

randombit commented 4 months ago

@ni4 Also can you post exact configure.py flags?

reneme commented 4 months ago

Debug mode, perhaps? (./configure.py --build-targets=static,cli,tests --compiler-cache=ccache --werror-mode --minimized-build --enable-modules=sha1,pgp_s2k,sha1_armv8 --debug-mode )

Botan 3.0: In 10000 trials took 377942.90 ns per trial (10000 iterations)
Botan 3.1: In 10000 trials took 382893.70 ns per trial (10000 iterations)
Botan 3.2: In 10000 trials took 1064952.30 ns per trial (10000 iterations)
Botan 3.6: In 10000 trials took 1071368.10 ns per trial (10000 iterations)

... Using --debug-mode this shows a three-fold slowdown. As reported...

randombit commented 4 months ago

--debug-mode disables all optimizations though!

Maybe the issue is that we've made changes that are "free" with optimization but which cause the compiler to emit worse code with no optimizations?

I'm half tempted to remove the --debug-mode and --no-optimizations flags as footguns.

reneme commented 4 months ago

FWIW: The signal in debug mode is big enough to bisect based on it. This commit introduces the slowdown: https://github.com/randombit/botan/commit/1e1b4316a92f1d00e00f22aed76f96ac58ebe3b1

SHA-1 starts using MerkleDamgard_Hash<> in this commit (before it used the MDx_HashFunction base class that is now removed). And MerkleDamgard_Hash<> uses AlignmentBuffer internally.

reneme commented 4 months ago

Maybe the issue is that we've made changes that are "free" with optimization but which cause the compiler to emit worse code with no optimizations?

Most certainly! I can totally imagine that the added abstractions harm performance when optimizations are disabled, in contrast to the fairly verbatim implementation that was there before. In fact, https://github.com/randombit/botan/commit/0ed41ccbcdd95fad66901fb15fcb4b97590198e6 also introduces a measurable slowdown when using --debug-mode, I'm guessing due to the introduction of BufferSlicer.

Perhaps @ni4 could use --with-debug-info instead of --debug-mode?
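Concretely, the difference would look like this (configure lines adapted from the flags quoted above; treat this as an example, not the only valid set):

```shell
# Slow: --debug-mode disables all optimizations (-O0) on top of debug settings
./configure.py --minimized-build --enable-modules=sha1,pgp_s2k --debug-mode

# Fast and still debuggable: optimized build that keeps debug symbols (-g)
./configure.py --minimized-build --enable-modules=sha1,pgp_s2k --with-debug-info
```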

ni4 commented 4 months ago

Oh, sorry, I got distracted from the email so lagged a bit with my replies. Definitely --debug-mode is the reason for this; replacing it with --with-debug-info speeds things up dramatically. Now the same test runs in ~2 seconds instead of 10 or 30. Thanks for your time and effort in resolving this! This can be closed now, I guess.

randombit commented 4 months ago

@ni4 No problem at all, thanks for reporting this - unexpected/undocumented performance regressions are definitely something we want to know about.