openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

Argon2: Reconsider MKPC, OMP_SCALE, benchmark hashes #5567

Open solardiz opened 2 weeks ago

solardiz commented 2 weeks ago

I notice that in 6b9f850a1654e2d176eb3e59a51a275c9f224ab7 we got:

```diff
-#define MAX_KEYS_PER_CRYPT      1
+#define MAX_KEYS_PER_CRYPT      2

-#define OMP_SCALE               16
+#define OMP_SCALE               8 // tuned w/ MKPC for core i7m
```

These values puzzle me. For something as slow as Argon2 normally is, I'd expect us to be able to set all of these to 1 without performance loss. Like we have them in the scrypt format.

We should re-test this with different values on different systems.

Also, our default benchmark hashes for the Argon2 formats are rather unusual - they use 4 MiB. In scrypt, we use 16 MiB. We may want to upgrade these to 16 MiB as well, which would also make lower MKPC and OMP_SCALE optimal.

magnumripper commented 2 weeks ago

I was actually going to complain about just that (not realizing I was the one who put those figures there). My 8-core laptop and a 4-core workstation I happen to have nearby show speeds just as good with all of these at 1-1-1, and that's obviously better for heavier costs.

As for test vectors, FWIW KeePassXC defaults to Argon2d, 64 MiB, p=2, and then aims for a t that ends up taking about a second (I think), which can be roughly 10 to 35 on vanilla hardware. During my work on the keepass formats, I created a test vector matching our current ones for Argon2 in order to see where I'm at (on GPU, I only get 10% of Alain's speed). I think I will do the reverse as well: add a keepass-matching test vector to the argon2 formats that can be benchmarked with --cost.

solardiz commented 2 weeks ago

> My 8-core laptop and some 4-core workstation I happen to have nearby show just as fine speeds with them at 1-1-1 and that's obviously better for heavier costs.

Well, I've just tested in a VM on my laptop under other load, and 1-1-1 is usually significantly slower at max threads. But when I limit threads to core count minus 1, it's fine. Indeed, having more work to do per crypt_all lets any fluctuation in threads' performance even out better before the sync point at the end of loop is reached. I think it's similar for other formats we have, so we probably shouldn't tune default OMP_SCALE for best performance under load. Besides, things will be better with real Argon2 hashes, which tend to use way more than 4 MiB.

Edit: I ran these tests in code without the pending memory allocation regression fix. So the (de)allocation was still in the loop. I don't know if fixing that would maybe affect this.
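For context, the amount of work per crypt_all call scales as the product of these knobs. A tiny arithmetic sketch (the helper name is mine, not JtR code) of why the old 2-and-8 tuning gives threads more slack to even out than 1-and-1:

```c
#include <assert.h>

/*
 * Sketch of the batch-size arithmetic (helper name is mine, not JtR's):
 * with OpenMP enabled, a format's effective keys per crypt_all call is
 * roughly MKPC * threads * OMP_SCALE. A larger batch gives threads more
 * room to even out before the implicit barrier at the end of crypt_all.
 */
static int batch_size(int mkpc, int threads, int omp_scale)
{
	return mkpc * threads * omp_scale;
}
```

For example, with 8 threads the old tuning (MKPC=2, OMP_SCALE=8) processes 128 candidates between sync points, versus only 8 for 1-1-1, which is why the latter suffers more from thread-to-thread fluctuations under other load.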

> add a keepass-matching test vector to the argon2 formats that can be benchmarked with --cost.

I wonder/worry how much it would slow down self-test. Even if not included in benchmarks by default, it would be in tests.

Also, we could need the equivalent of my 1d7397aae06808ccbdc2065cf274d63116d2dc3b that we currently only have in the Armory format. At large thread counts, having e.g. 256x 64 MiB still allocated during cracking when actual loaded hashes are e.g. 16 MiB can be quite wasteful. (It was even worse in the Armory format because it uses SIMD to handle multiple "hashes" per thread. So it was easily tens of gigabytes during self-test on a huge AWS instance.)
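The scale of that waste is easy to put numbers on. A back-of-the-envelope sketch (function name is mine, not JtR code) of total memory held as threads times per-hash region size:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Back-of-the-envelope helper (name is mine, not JtR code): total memory
 * kept allocated when each of `threads` lanes holds a region sized for
 * hashes of `m_cost_mib` MiB.
 */
static uint64_t total_region_bytes(unsigned int threads, unsigned int m_cost_mib)
{
	return (uint64_t)threads * m_cost_mib * (1024 * 1024);
}
```

With 256 threads, regions sized for 64 MiB self-test vectors hold 16 GiB, versus 4 GiB if they were re-sized for 16 MiB loaded hashes - the kind of difference the Armory-format fix addresses.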

magnumripper commented 2 weeks ago

We should add a flag field to format_tests. In this case flagging "test vectors that are to be ignored for normal self-test".
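One possible shape for that (all names here are hypothetical; the real struct fmt_tests lives in formats.h and differs):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of a flags field on test vectors; names are mine. */
#define FTF_BENCH_ONLY 0x01	/* skip in normal self-test, usable via --cost */

struct fmt_tests_ext {
	const char *ciphertext;
	const char *plaintext;
	unsigned int flags;
};

/* Example table: the last, expensive vector is benchmark-only. */
static const struct fmt_tests_ext sample_tests[] = {
	{"$argon2id$...cheap1...", "password", 0},
	{"$argon2id$...cheap2...", "php", 0},
	{"$argon2d$...64MiB-keepass-like...", "openwall", FTF_BENCH_ONLY},
	{NULL, NULL, 0}
};

/* How many vectors the normal self-test would run over such a table. */
static size_t selftest_count(const struct fmt_tests_ext *tv)
{
	size_t n = 0;

	for (; tv->ciphertext; tv++)
		if (!(tv->flags & FTF_BENCH_ONLY))
			n++;
	return n;
}
```

The self-test loop would then skip flagged vectors, keeping them loadable for --cost-selected benchmarking.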

solardiz commented 1 week ago

I was wondering why, on a system under other load where I get a speedup at --test --format=argon2 --tune=16, I am only getting an OMP_SCALE of 1 (same as our new default, and slower on this system) with --tune=auto. Looks like that's because for auto-tuning we pick the salt with the highest first tunable cost:

```c
        // Find most expensive salt, for auto-tune
        {
                struct db_salt *s = db->salts;

                tune_cost = MIN(db->max_cost[0], options.loader.max_cost[0]);

                while (s->next && s->cost[0] < tune_cost)
                        s = s->next;
                salt = s->salt;
        }
```

Why is that? Perhaps so that we don't exceed the maximum crypt_all duration even with the highest cost? But anyway, it's only the highest among test vectors, and the actual hashes' costs can be higher yet.

We could want to reconsider and instead auto-tune using one of the first two test vectors that we use for benchmarking.

solardiz commented 1 week ago

Suggested auto-tune patch:

```diff
+++ b/src/omp_autotune.c
@@ -126,7 +126,7 @@ void omp_autotune_run(struct db_main *db)
        {
                struct db_salt *s = db->salts;

-               tune_cost = MIN(db->max_cost[0], options.loader.max_cost[0]);
+               tune_cost = MIN(db->real ? db->max_cost[0] : s->cost[0], options.loader.max_cost[0]);

                while (s->next && s->cost[0] < tune_cost)
                        s = s->next;
```

which I think makes us auto-tune to the first test vector when benchmarking, but to the highest first-cost hash when cracking.

solardiz commented 1 week ago

```diff
-               tune_cost = MIN(db->max_cost[0], options.loader.max_cost[0]);
+               tune_cost = MIN(db->real ? db->max_cost[0] : s->cost[0], options.loader.max_cost[0]);
```

No, this doesn't always work right - seemed to work for Argon2, but took a 32 MiB test vector for Armory whereas its first one is 8 MiB. I think the problem is the salts are not in the same order as test vectors. Looks like we'll need an extra loop to find the salt corresponding to the first test vector.
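That extra loop could look something like this (simplified stand-in types, not the real struct db_salt; the real code would walk db->salts comparing cost[0] against the first test vector's cost):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct db_salt: only what this sketch needs. */
struct salt_sketch {
	unsigned int cost0;	/* first tunable cost, e.g. Argon2 m */
	struct salt_sketch *next;
};

/*
 * Walk the salt list for the one whose first cost matches the first test
 * vector's cost; fall back to the list head if there is no exact match.
 */
static struct salt_sketch *salt_for_cost(struct salt_sketch *head, unsigned int want)
{
	struct salt_sketch *s;

	for (s = head; s; s = s->next)
		if (s->cost0 == want)
			return s;
	return head;
}

/* Example list in arbitrary order, like salts after loading. */
static struct salt_sketch s32 = { 32768, NULL };
static struct salt_sketch s8  = { 8192, &s32 };
static struct salt_sketch s16 = { 16384, &s8 };
```

This sidesteps the ordering problem: it doesn't matter that salts aren't stored in test-vector order, since we search for the cost value rather than assume a position.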

solardiz commented 1 week ago

We've reset MKPC and OMP_SCALE to 1.

As to benchmark hashes, there doesn't appear to be one common set of Argon2 parameters in use for password hashing. 4 MiB is certainly too low, but at least it's what we've been using so far.

It would be relevant to see what NetBSD uses. But per these blog posts, they may have gone with something ridiculous for that use case (p > 1):

https://blog.netbsd.org/tnf/entry/gsoc_2019_report_incorporating_the
https://blog.netbsd.org/tnf/entry/gsoc_2019_report_update_incorporating

solardiz commented 1 week ago

Testing on https://3v4l.org I see that recent PHP uses:

```php
<?php
echo PASSWORD_ARGON2_DEFAULT_MEMORY_COST . "\n";
echo PASSWORD_ARGON2_DEFAULT_TIME_COST . "\n";
echo PASSWORD_ARGON2_DEFAULT_THREADS . "\n";
echo 'Argon2i hash: ' . password_hash('password', PASSWORD_ARGON2I) . "\n";
echo 'Argon2i hash: ' . password_hash('php', PASSWORD_ARGON2I) . "\n";
echo 'Argon2id hash: ' . password_hash('password', PASSWORD_ARGON2ID) . "\n";
echo 'Argon2id hash: ' . password_hash('php', PASSWORD_ARGON2ID) . "\n";
```
Output for 8.3.13 | released 2024-10-24 | took 461 ms, 82.45 MiB
    65536
    4
    1
    Argon2i hash: $argon2i$v=19$m=65536,t=4,p=1$PXk4GTSVyqSRp6hbK9JWEw$+W0OjHkdgsT1WGjxnz1Je0myvLqlirRW3N/Y8ZCX3mc
    Argon2i hash: $argon2i$v=19$m=65536,t=4,p=1$Z8iaCGW08RJ/cqtrnRAByA$3KDSPE/cJF0Olc+nI7lYVHv/9K94RDOud/o9ztYb7fc
    Argon2id hash: $argon2id$v=19$m=65536,t=4,p=1$7Au984bRf8QQjhKA7lwJeA$blEnuab1lTEYzRnTtOup73BM1Ca65sbMY8my0NLed/0
    Argon2id hash: $argon2id$v=19$m=65536,t=4,p=1$6wNfsywSbL+ohVxnOlPIpA$rG8ThMfszdwTe9Zprff0UFj3haeIk30oEPtcMedv27Y

(but many of their builds of PHP don't seem to support Argon2 at all - perhaps it's a compile-time option)

There doesn't appear to be a default between 2i and 2id. PASSWORD_DEFAULT still uses bcrypt.

Since Argon2id is newer, I guess if they ever want to make Argon2 the default, this is a more likely choice. So we may want to standardize on that for benchmarks.
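For reference, the m, t, p values sit right in the PHC string and are trivial to extract. A small parser sketch (helper name is mine, not JtR's real valid/split code), exercised against one of the hashes above; note that '=' cannot occur in the unpadded base64 fields, so a plain substring search for the key is safe here:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Tiny sketch (not JtR's real parser): pull one numeric cost out of a
 * PHC-style Argon2 string such as "$argon2id$v=19$m=65536,t=4,p=1$...".
 * `key` is "m=", "t=" or "p="; returns 0 if the key is not found.
 */
static unsigned int argon2_param(const char *hash, const char *key)
{
	const char *s = strstr(hash, key);
	unsigned int v = 0;

	if (s)
		sscanf(s + strlen(key), "%u", &v);
	return v;
}
```

Against the PHP output above, this yields m=65536 (KiB, i.e. 64 MiB), t=4, p=1.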

magnumripper commented 1 week ago

BTW isn't the CPU Argon2 code capable of threading (for p > 1) by itself? Would it ever be beneficial to use that instead of (or together with, depending on p) our usual one?

solardiz commented 1 week ago

> BTW isn't the CPU Argon2 code capable of threading (for p > 1) by itself? Would it ever be beneficial to use that instead of (or together with, depending on p) our usual one?

I was thinking of this too, and we could want to have a separate issue for that. Some notes and thoughts:

  1. It's easier to parallelize at JtR level rather than inside a hash computation. We need parallelization at JtR level anyway for efficient computation of hashes with insufficient p.
  2. It's difficult to support arbitrary mixes of p and total thread count efficiently - e.g., the total thread count is not always divisible by p.
  3. Yet there are memory savings and maybe speedup to be had if we do. (In OpenCL, we do already benefit from higher p.)
  4. As I recall, the original Argon2 code as submitted to PHC had multi-threading support in it, but this is not in the currently maintained code that we reuse. Also, we would probably want to unify on consistent usage of OpenMP in both Argon2 and our crypt_all, rather than use pre-existing upstream code with whatever threading model they use.
  5. We have similar concerns and potential task for (ye)scrypt, and for the various formats using scrypt and Argon2 as a primitive.
  6. My upstream yescrypt code has OpenMP support in it, which in JtR is currently disabled in favor of using OpenMP in crypt_all.
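On point 1, a minimal sketch of what "parallelize at JtR level" means in practice (toy stand-in for Argon2; not the real crypt_all):

```c
#include <assert.h>

/* Toy stand-in for one expensive Argon2 computation. */
static unsigned int slow_hash(unsigned int seed)
{
	unsigned int v = seed, i;

	for (i = 0; i < 100000; i++)
		v = v * 2654435761u + 0x9e3779b9u;
	return v;
}

/*
 * Sketch of crypt_all-level parallelism: one independent candidate per
 * iteration, with the synchronization point implied at the end of the
 * OpenMP parallel for. No coordination inside the hash is needed.
 */
static void crypt_all_sketch(const unsigned int *in, unsigned int *out, int count)
{
	int i;

#ifdef _OPENMP
#pragma omp parallel for
#endif
	for (i = 0; i < count; i++)
		out[i] = slow_hash(in[i]);
}

/* Self-check: the parallel result must match a serial recomputation. */
static int crypt_all_matches_serial(void)
{
	unsigned int in[16], par[16];
	int i;

	for (i = 0; i < 16; i++)
		in[i] = (unsigned int)i;
	crypt_all_sketch(in, par, 16);
	for (i = 0; i < 16; i++)
		if (par[i] != slow_hash(in[i]))
			return 0;
	return 1;
}
```

Threading inside one hash computation (for p > 1) would instead split lanes of a single call across threads, which is where the divisibility and mixed-p complications from points 2 and 3 come in.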