keepassxreboot / keepassxc

KeePassXC is a cross-platform community-driven port of the Windows application "KeePass Password Safe".
https://keepassxc.org/

Database unlock is slow in 2.7.0-beta1 #7487

Closed · ghost closed this issue 2 years ago

ghost commented 2 years ago

Overview

Unlocking a database in KeePassXC 2.7.0-beta1 is much slower than unlocking the same database in KeePassXC 2.6.6 or KeePass 2.50.

Steps to Reproduce

  1. Create a database with Argon2d Iterations 128, Memory 64MB, Parallelism 8.
  2. Unlock the database.

Expected Behavior

Unlocking the database takes about 2 seconds.

Actual Behavior

Unlocking the database takes about 9 seconds.

Context

With KeePassXC 2.6.6 or KeePass 2.50, unlocking takes about 2 seconds. I created a database with KeePassXC 2.6.6, KeePassXC 2.7.0-beta1 and KeePass 2.50, but it happened with any database. I suspected antivirus (Kaspersky) interference and paused protection before unlocking it, but it didn't improve. It was also reproduced in Windows 11 Insider Preview Dev on VMware Player.

CPU: Core i7-7700K
RAM: DDR4-2400 16GB

KeePassXC: 2.7.0-beta1, Revision: 046e508

Operating System: Windows 10 21H2

phoerious commented 2 years ago

I suppose Botan is less optimized than libargon2. @droidmonkey Should we do some benchmarks?

droidmonkey commented 2 years ago

Going from 2 to 9 seconds is extreme and interesting. I definitely notice slower speeds with Botan, but not >4x slower.

https://github.com/microsoft/vcpkg/tree/master/ports/botan

https://github.com/msys2/MINGW-packages/tree/master/mingw-w64-libbotan

https://github.com/keepassxreboot/keepassxc/pull/6209#issuecomment-792051844

@randombit I am building Botan using vcpkg, might we be missing an optimization flag in there?

randombit commented 2 years ago

Are you sure you are not using their dbg mode? That sets --debug-mode to the library configure script, which disables optimizations completely.

Also, I have no idea about the performance of Argon2 under VC. In general I use GCC/Clang on Linux, and in my experience the VC optimizer is significantly weaker.

The iteration count for Argon2 seems really high given that RFC 9106 recommends no more than t=3. I doubt I ever checked performance with much higher iterations. I'll look into this. I remember a long time ago checking the libargon2 library vs Botan and they were about the same but I'm sure that was with a small t.
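For reference, RFC 9106 (Section 4) suggests two baseline Argon2id parameter sets, both with low pass counts. A sketch of them as a lookup table (the variant names are mine, not from the RFC or from KeePassXC):

```python
# RFC 9106 recommended Argon2id baselines (memory in KiB, as the RFC specifies).
# The keys "first_choice"/"second_choice" are illustrative labels only.
RFC9106_BASELINES = {
    "first_choice":  {"m_kib": 2 * 1024 * 1024, "t": 1, "p": 4},  # 2 GiB, 1 pass
    "second_choice": {"m_kib": 64 * 1024,       "t": 3, "p": 4},  # 64 MiB, 3 passes
}

for name, params in RFC9106_BASELINES.items():
    print(name, params)
```

Both options lean on memory cost rather than iteration count, which is why t=128 stands out as unusually high.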

ghost commented 2 years ago

When I tried Iterations 3, Memory 2047MB, Parallelism 8, it took about 1-2 seconds with KeePassXC 2.6.6, about 1.3 seconds with KeePass 2.50, and about 6-7 seconds with KeePassXC 2.7.0-beta1. When I tried again with Iterations 128, Memory 64MB, Parallelism 8, it took about 8 seconds, so I don't think the iteration count is the significant factor.

(Is it better to reduce the number of iterations? The default setting of the KeePass 2.50 is Memory 64MB, so I used it.)

phoerious commented 2 years ago

It should definitely not be slower than KeePass2. In fact, it should be much faster than that.

droidmonkey commented 2 years ago

You reduced iterations and significantly increased memory. Do one or the other. I certainly don't see the level of degradation that you do under a 1 second benchmark. My database decrypts in roughly the same time between my mobile (KeePassDX) and laptop (KPXC 2.7.0). My laptop should be much faster than mobile, but it's not.

I'll have to check on the dbg mode.

randombit commented 2 years ago

Comparing libargon2 (https://github.com/P-H-C/phc-winner-argon2, as packaged on Arch Linux) with Botan on an i7-6700K at 4.2 GHz. Both using Argon2id, 64 MB, t=128, p=8:

Botan: 4.8 s (via botan speed argon2)
libargon2: 10.4 s (using the time reported by the argon2 CLI util)

[Why this discrepancy is so large in favor of Botan I have no idea]

... oh now I see why ...

libargon2 does use threads for p > 1 (Botan does not), but the time it reports is, I guess, the sum total across all of the threads rather than the wall-clock time, which is what anyone normal cares about. So for the above example the CLI util reports 10.4 s, but it actually takes only 2.7 seconds of wall time.

OK, so for p > 1 you are going to see a fairly large perf hit. I can look into using threads (there is even a TODO in the source about this), but that won't help when building against existing releases of the library.
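The reporting artifact described above is easy to reproduce with a toy sketch (plain Python threads with SHA-256 as a stand-in workload; nothing here is Argon2- or libargon2-specific): when p lanes run concurrently, each lane's self-measured elapsed time spans roughly the whole run, so the sum of per-lane times approaches p times the wall-clock time.

```python
import hashlib
import threading
import time

def lane(buf, out, i):
    """One 'lane' of work: hash a large buffer and record its own elapsed time."""
    t0 = time.perf_counter()
    hashlib.sha256(buf).digest()  # hashlib releases the GIL for large inputs
    out[i] = time.perf_counter() - t0

p = 4                                # number of parallel lanes
buf = b"\x00" * (32 * 1024 * 1024)   # 32 MB of stand-in work per lane
lane_times = [0.0] * p

wall0 = time.perf_counter()
threads = [threading.Thread(target=lane, args=(buf, lane_times, i)) for i in range(p)]
for t in threads:
    t.start()
for t in threads:
    t.join()
wall = time.perf_counter() - wall0

# Summing per-lane times double-counts the overlap, like the argon2 CLI does.
summed = sum(lane_times)
print(f"summed lane time: {summed*1000:.0f} ms, wall time: {wall*1000:.0f} ms")
```

On a multi-core machine the summed figure lands well above the wall figure, which matches the 10.4 s reported vs. 2.7 s wall discrepancy above.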

droidmonkey commented 2 years ago

Oof that is a big deal if we aren't doing parallelism in threads. This may make me want to go back to the reference implementation.

phoerious commented 2 years ago

I am in favour of that until this is fixed. I can confirm that 2.6.6 with p > 1 decrypts much faster. When I benchmark with 128 MiB at 16 threads, I get 6 rounds in 2.7.0-beta1 and 44 rounds in 2.6.6.

But that's not all. If I set p = 1, I still get 6 rounds in 2.7.0-beta1 (unsurprisingly), but 10 rounds in 2.6.6. So while the missing thread support is the biggest issue, the whole thing seems less efficient than the reference implementation.
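The "rounds" figures here come from KeePassXC's KDF benchmark, which counts how many transform rounds complete within a fixed time budget. A minimal sketch of that measurement style (hypothetical code, with SHA-256 standing in for one Argon2 pass; this is not KeePassXC's actual benchmark code):

```python
import hashlib
import time

def rounds_per_budget(work, budget_s=1.0):
    """Count how many times `work` completes within a fixed time budget,
    the way KDF 'benchmark' buttons size their iteration counts."""
    rounds = 0
    t0 = time.perf_counter()
    while time.perf_counter() - t0 < budget_s:
        work()
        rounds += 1
    return rounds

# Stand-in workload; a real benchmark would run one Argon2 pass here.
buf = b"\x02" * (1024 * 1024)
print(rounds_per_budget(lambda: hashlib.sha256(buf).digest(), budget_s=0.2))
```

Measured this way, a slower KDF directly shows up as fewer rounds in the same budget, which is why 6 vs. 44 rounds implies a large per-hash slowdown.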

droidmonkey commented 2 years ago

Luckily it's not a big deal to go back to libargon2. Will try to get that in a PR tonight for review.

@randombit Why not just embed the reference implementation and wrap it behind the Botan API? Their license is CC0 / Apache-2.0.

randombit commented 2 years ago

@droidmonkey The way their code is written, integrating it/adapting it would be harder than just adding threading and SIMD support directly. PR adding threads is up now, SSSE3 version is written but needs some debugging still. I'll backport these in the next 2.x release. Probably too late to get into upcoming Ubuntu 22.04 though. Wish this issue had been raised sooner. :/

phoerious commented 2 years ago

Great, thanks for resolving this quickly. @droidmonkey Could we perhaps compile conditionally, depending on the installed Botan version?

randombit commented 2 years ago

Just as an FYI, I wrote a quick tool to properly benchmark the performance of libargon2 (https://gist.github.com/randombit/adaaaecfc23906af0fe8f3e7ea0257cd), since their CLI util is clearly not so helpful for this. For params M=65536, t=128, p=8, on a 4-core CPU with AVX2 and HT, I am seeing 1386 ms per hash with Botan+threads and 1375 ms per hash with libargon2, basically the same. I think libargon2 has the edge in that it has SIMD support, but Botan is making up for most of that by using a thread pool instead of creating and killing many threads.
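The thread-pool point can be illustrated with a toy comparison (a Python sketch with SHA-256 as stand-in work; this is not Botan's or libargon2's actual code): spawning and joining fresh threads for every hash pays thread start-up cost each time, while a persistent pool pays it once and reuses the workers.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib
import threading
import time

def lane_work(buf):
    """Stand-in for one Argon2 lane's work."""
    return hashlib.sha256(buf).digest()

buf = b"\x01" * (4 * 1024 * 1024)
p, n_hashes = 4, 10

# Strategy A: spawn and join p fresh threads per hash.
t0 = time.perf_counter()
for _ in range(n_hashes):
    ts = [threading.Thread(target=lane_work, args=(buf,)) for _ in range(p)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
spawn_time = time.perf_counter() - t0

# Strategy B: one persistent pool reused for every hash.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=p) as pool:
    for _ in range(n_hashes):
        list(pool.map(lane_work, [buf] * p))
pool_time = time.perf_counter() - t0

print(f"spawn-per-hash: {spawn_time*1000:.0f} ms, pooled: {pool_time*1000:.0f} ms")
```

With heavy per-lane work the start-up overhead is a small fraction of the total, which is consistent with the pool roughly closing the gap to libargon2's SIMD build in the numbers above.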

@phoerious I'm curious about the results you are seeing with p = 1. Windows or Linux? Does your CPU have AVX-512?

phoerious commented 2 years ago

> @phoerious I'm curious about the results you are seeing with p = 1. Windows or Linux? Does your CPU have AVX-512?

Windows, Ryzen 3950X, so I guess not. Also tested it on the Threadripper 2920X machine I'm working on right now (Ubuntu Linux, AVX2 only) and I get roughly the same performance there (8 vs. 7, sometimes 8).

Here's botan speed (2.17.3) vs. your libargon2 benchmark tool (t=1, m=65536, p=1) on the Threadripper:

Argon2id M=65536 t=1 p=1 15 /sec; 65.44 ms/op 229022108 cycles/op (46 ops in 3010 ms)

42 runs in 3069.218427 ms -> 73.076629 ms/run

I can test the Ryzen in the evening.

phoerious commented 2 years ago

@randombit Here are the results on my Ryzen (Windows, compiled via MSYS2):

Argon2id M=65536 t=1 p=1 16 /sec; 60.08 ms/op 600744 cycles/op (50 ops in 3004 ms)

59 runs in 3045.006700 ms -> 51.610283 ms/run