I was actually going to complain about just that (not realizing I was the one who put the figures there). My 8-core laptop and some 4-core workstation I happen to have nearby show just as fine speeds with them at 1-1-1 and that's obviously better for heavier costs.
As for test vectors, FWIW KeePassXC defaults to Argon2d, 64 MiB, p=2, and then aims for a t that ends up at about a second (I think), which can be about 10 to 35 on vanilla hardware. During my work on the keepass formats I created a test vector matching our current ones for Argon2, in order to see where I'm at (on GPU I only get 10% of Alain's speed). I think I will do the reverse as well: add a keepass-matching test vector to the argon2 formats that can be benchmarked with `--cost`.
> My 8-core laptop and some 4-core workstation I happen to have nearby show just as fine speeds with them at 1-1-1 and that's obviously better for heavier costs.
Well, I've just tested in a VM on my laptop under other load, and 1-1-1 is usually significantly slower at max threads. But when I limit threads to core count minus 1, it's fine. Indeed, having more work to do per `crypt_all` lets any fluctuation in threads' performance even out better before the sync point at the end of the loop is reached. I think it's similar for other formats we have, so we probably shouldn't tune the default OMP_SCALE for best performance under load. Besides, things will be better with real Argon2 hashes, which tend to use way more than 4 MiB.

Edit: I ran these tests on code without the pending memory allocation regression fix, so the (de)allocation was still in the loop. I don't know whether fixing that would affect this.
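For context, the pattern in question is roughly this (a simplified sketch, not the actual argon2 format source):

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/*
 * Simplified shape of a format's crypt_all() loop, for illustration only:
 * the parallel for-loop ends with an implicit barrier, so the more
 * candidates per call (i.e., a higher OMP_SCALE), the better per-thread
 * speed fluctuations average out before all threads wait at that barrier.
 */
void crypt_all_shape(int count)
{
	int index;

#ifdef _OPENMP
#pragma omp parallel for
#endif
	for (index = 0; index < count; index++) {
		/* compute Argon2 for keys[index], store result for cmp_all() */
	}
}
```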
> add a keepass-matching test vector to the argon2 formats that can be benchmarked with `--cost`.
I wonder/worry how much it would slow down self-test. Even if not included in benchmarks by default, it would be in tests.
Also, we may need the equivalent of my 1d7397aae06808ccbdc2065cf274d63116d2dc3b, which we currently only have in the Armory format. At large thread counts, having e.g. 256x 64 MiB still allocated during cracking when the actually loaded hashes need e.g. 16 MiB can be quite wasteful. (It was even worse in the Armory format because it uses SIMD to handle multiple "hashes" per thread, so it was easily tens of gigabytes during self-test on a huge AWS instance.)
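The idea, very roughly (names below are made up for illustration; this is not the Armory or argon2 format code):

```c
#include <stdlib.h>

#define MAX_THREADS 256				/* illustrative bound */

static unsigned char *scratch[MAX_THREADS];	/* per-thread Argon2 memory */
static size_t scratch_size;			/* current per-thread allocation */

/*
 * Shrink (or grow) the per-thread scratch buffers to what the hashes
 * actually loaded for cracking need, instead of keeping them sized for
 * the largest self-test vector.
 */
static void resize_scratch(size_t needed, int nthreads)
{
	int t;

	if (needed == scratch_size)
		return;
	for (t = 0; t < nthreads; t++) {
		free(scratch[t]);
		scratch[t] = malloc(needed);
	}
	scratch_size = needed;
}
```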
We should add a flags field to `format_tests`. In this case, it would flag "test vectors that are to be ignored for normal self-test".
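A rough sketch of what that could look like (hypothetical, not existing JtR code):

```c
/* Hypothetical flag; not an existing JtR definition. */
#define FMT_TV_BENCH_ONLY	0x01	/* skip in normal self-test */

/* A flags member added to the test vector struct (sketch only). */
struct fmt_tests_flagged {
	char *ciphertext;
	char *plaintext;
	char *fields[10];
	unsigned int flags;
};

/* The self-test loop would then skip benchmark-only vectors: */
static int self_test_skips(const struct fmt_tests_flagged *tv)
{
	return (tv->flags & FMT_TV_BENCH_ONLY) != 0;
}
```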
I was wondering why, on a system under other load where I get a speedup with `--test --format=argon2 --tune=16`, I am only getting an OMP_SCALE of 1 (same as our new default, and slower on this system) with `--tune=auto`. Looks like that's because for auto-tuning we pick a salt with the highest first tunable cost:
```c
	// Find most expensive salt, for auto-tune
	{
		struct db_salt *s = db->salts;
		tune_cost = MIN(db->max_cost[0], options.loader.max_cost[0]);
		while (s->next && s->cost[0] < tune_cost)
			s = s->next;
		salt = s->salt;
	}
```
Why is that? Perhaps so that we don't exceed the maximum `crypt_all` duration even with the highest cost? But anyway, it's only the highest among the test vectors, and actual hashes' costs can be higher yet.
We may want to reconsider and instead auto-tune using one of the first two test vectors (the ones we use for benchmarking).
Suggested auto-tune patch:
```diff
+++ b/src/omp_autotune.c
@@ -126,7 +126,7 @@ void omp_autotune_run(struct db_main *db)
 	{
 		struct db_salt *s = db->salts;
-		tune_cost = MIN(db->max_cost[0], options.loader.max_cost[0]);
+		tune_cost = MIN(db->real ? db->max_cost[0] : s->cost[0], options.loader.max_cost[0]);
 		while (s->next && s->cost[0] < tune_cost)
 			s = s->next;
```
I think this makes us auto-tune to the first test vector when benchmarking, but to the hash with the highest first cost when cracking.
> ```diff
> -		tune_cost = MIN(db->max_cost[0], options.loader.max_cost[0]);
> +		tune_cost = MIN(db->real ? db->max_cost[0] : s->cost[0], options.loader.max_cost[0]);
> ```
No, this doesn't always work right - it seemed to work for Argon2, but picked a 32 MiB test vector for Armory, whereas its first one is 8 MiB. I think the problem is that the salts are not in the same order as the test vectors. Looks like we'll need an extra loop to find the salt corresponding to the first test vector.
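A rough, untested sketch of what that extra loop could look like (it ignores prepare() and dyna_salt handling, and just follows the variable names from the snippet above):

```c
	/* Benchmarking: pick the salt matching the format's first test vector
	 * rather than the salt with the highest first tunable cost (sketch). */
	if (!db->real) {
		struct fmt_main *fmt = db->format;
		char *split = fmt->methods.split(fmt->params.tests[0].ciphertext, 0, fmt);
		void *tv_salt = fmt->methods.salt(split);
		struct db_salt *s = db->salts;

		while (s->next && memcmp(s->salt, tv_salt, fmt->params.salt_size))
			s = s->next;
		salt = s->salt;
		tune_cost = s->cost[0];
	}
```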
We've reset MKPC and OMP_SCALE to 1.
As to benchmark hashes, there doesn't appear to be one common set of Argon2 parameters in use for password hashing. 4 MiB is certainly too low, but at least it's what we've been using so far.
It would be relevant to see what NetBSD uses. But per these blog posts, they may have gone with something ridiculous for that use case (p > 1):
https://blog.netbsd.org/tnf/entry/gsoc_2019_report_incorporating_the
https://blog.netbsd.org/tnf/entry/gsoc_2019_report_update_incorporating
Testing on https://3v4l.org I see that recent PHP uses:
```php
<?php
echo PASSWORD_ARGON2_DEFAULT_MEMORY_COST . "\n";
echo PASSWORD_ARGON2_DEFAULT_TIME_COST . "\n";
echo PASSWORD_ARGON2_DEFAULT_THREADS . "\n";
echo 'Argon2i hash: ' . password_hash('password', PASSWORD_ARGON2I) . "\n";
echo 'Argon2i hash: ' . password_hash('php', PASSWORD_ARGON2I) . "\n";
echo 'Argon2id hash: ' . password_hash('password', PASSWORD_ARGON2ID). "\n";
echo 'Argon2id hash: ' . password_hash('php', PASSWORD_ARGON2ID). "\n";
```
Output for 8.3.13 (released 2024-10-24; took 461 ms, 82.45 MiB):

```
65536
4
1
Argon2i hash: $argon2i$v=19$m=65536,t=4,p=1$PXk4GTSVyqSRp6hbK9JWEw$+W0OjHkdgsT1WGjxnz1Je0myvLqlirRW3N/Y8ZCX3mc
Argon2i hash: $argon2i$v=19$m=65536,t=4,p=1$Z8iaCGW08RJ/cqtrnRAByA$3KDSPE/cJF0Olc+nI7lYVHv/9K94RDOud/o9ztYb7fc
Argon2id hash: $argon2id$v=19$m=65536,t=4,p=1$7Au984bRf8QQjhKA7lwJeA$blEnuab1lTEYzRnTtOup73BM1Ca65sbMY8my0NLed/0
Argon2id hash: $argon2id$v=19$m=65536,t=4,p=1$6wNfsywSbL+ohVxnOlPIpA$rG8ThMfszdwTe9Zprff0UFj3haeIk30oEPtcMedv27Y
```
(but many of their builds of PHP don't seem to support Argon2 at all - perhaps it's compile-time optional)
There doesn't appear to be a default between 2i and 2id. `PASSWORD_DEFAULT` still uses bcrypt.
Since Argon2id is newer, I guess if they ever want to make Argon2 the default, this is a more likely choice. So we may want to standardize on that for benchmarks.
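If we go that way, a benchmark vector could look roughly like this, reusing the m=65536,t=4,p=1 Argon2id hash generated above (a sketch only, with a made-up array name):

```c
/* Sketch: benchmark test vector matching PHP's Argon2id defaults
 * (m=65536, t=4, p=1); the hash is taken from the 3v4l.org output above. */
static struct fmt_tests argon2id_bench_tests[] = {
	{"$argon2id$v=19$m=65536,t=4,p=1$7Au984bRf8QQjhKA7lwJeA$blEnuab1lTEYzRnTtOup73BM1Ca65sbMY8my0NLed/0", "password"},
	{NULL}
};
```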
BTW, isn't the CPU Argon2 code capable of threading (for p > 1) by itself? Would it ever be beneficial to use that instead of (or together with, depending on `p`) our usual one?
> BTW, isn't the CPU Argon2 code capable of threading (for p > 1) by itself? Would it ever be beneficial to use that instead of (or together with, depending on `p`) our usual one?
I was thinking of this too, and we may want to have a separate issue for that. Some notes and thoughts:

- It isn't always possible to split `p` and our total thread count efficiently - e.g., the thread count is not always divisible by `p`.
- We'd probably want to implement that threading ourselves in `crypt_all`, rather than use pre-existing upstream code with whatever threading model they use.
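A trivial illustration of the thread-splitting point above (plain arithmetic, nothing JtR-specific):

```c
#include <stdio.h>

int main(void)
{
	int total_threads = 16;	/* e.g. what OpenMP gives us on this box */
	int p = 3;		/* parallelism of the loaded hashes */

	/* Running the p lanes in their own threads inside each outer slot
	 * rarely divides the thread budget evenly. */
	printf("outer slots: %d, idle hardware threads: %d\n",
	       total_threads / p, total_threads % p);
	return 0;
}
```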
I notice that in 6b9f850a1654e2d176eb3e59a51a275c9f224ab7 we got:
These values puzzle me. For something as slow as Argon2 normally is, I'd expect us to be able to set all of these to 1 without performance loss. Like we have them in the scrypt format.
We should re-test this with different values on different systems.
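For reference, the knobs in question, as they would look set to 1 (an illustrative snippet, not a quote of the actual argon2 format source):

```c
/* Per-format tuning settings, shown here at 1 as in the scrypt format. */
#define MIN_KEYS_PER_CRYPT	1
#define MAX_KEYS_PER_CRYPT	1

#ifndef OMP_SCALE
#define OMP_SCALE		1	/* scales keys per crypt_all() under OpenMP */
#endif
```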
Also, our default benchmark hashes for the Argon2 formats are rather unusual - they use 4 MiB. In scrypt, we use 16 MiB. We may want to upgrade these to 16 MiB as well, which would contribute to making lower MKPC and OMP_SCALE optimal.