openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

Look for performance regressions compared to 1.8.0-jumbo-1 #2914

Closed by magnumripper 5 years ago

magnumripper commented 6 years ago

Trivial task but takes time. All these compared to Jumbo-1 (plus whatever little patch is needed to build) and disregarding all new formats:

and obviously

and why not

solardiz commented 6 years ago

We should take care of "max_keys_per_crypt tuning" first: http://www.openwall.com/lists/john-dev/2017/11/12/1
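
(For context, a sketch of what that tuning usually amounts to: most formats scale their keys-per-crypt values in init() by the OpenMP thread count times a per-format OMP_SCALE factor. Standalone toy below, with made-up numbers rather than any particular format's values.)

/* Toy illustration of the usual max_keys_per_crypt scaling; not actual format code. */
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define OMP_SCALE 4                     /* hypothetical per-format tuning knob */

int main(void)
{
	unsigned int min_keys_per_crypt = 8;    /* hypothetical base values */
	unsigned int max_keys_per_crypt = 8;
	int threads = 1;

#ifdef _OPENMP
	threads = omp_get_max_threads();
#endif
	/* Scale both limits by thread count, and max additionally by OMP_SCALE,
	   so that each thread gets a reasonably large batch of candidates. */
	min_keys_per_crypt *= threads;
	max_keys_per_crypt *= threads * OMP_SCALE;

	printf("%d threads -> min %u, max %u keys per crypt\n",
	       threads, min_keys_per_crypt, max_keys_per_crypt);
	return 0;
}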

frank-dittrich commented 6 years ago

@magnumripper relbench is a core file, you have to convince @solardiz to rename it to relbench.pl.

benchmark-unify also might need fixes so that relbench can be used on more formats.

jfoug commented 6 years ago

We should take care of "max_keys_per_crypt tuning" first: http://www.openwall.com/lists/john-dev/2017/11/12/1

Agreed, but that does not stop someone from getting started on the relbench adjustments needed.

It may also be good to baseline things prior to working on the above tuning, to get better insight into just how much change the tuning brings.

magnumripper commented 6 years ago

Good points

solardiz commented 5 years ago

My first relbench result comparing --disable-openmp 64-bit AVX builds of 1.8.0-jumbo-1 vs. our current code with Benchmarks_1_8 = Y:

Number of benchmarks:           365
Minimum:                        0.55817 real, 0.56404 virtual
Maximum:                        135.48364 real, 135.48364 virtual
Median:                         1.06286 real, 1.06286 virtual
Median absolute deviation:      0.10187 real, 0.09938 virtual
Geometric mean:                 1.22873 real, 1.22862 virtual
Geometric standard deviation:   1.62511 real, 1.62456 virtual

This is after benchmark-unify run on both versions' outputs. There were 381 formats benchmarked with the old version, and 412 with the new one. Yet we're able to compare only 365 benchmarks as not everything is unified. There's also some unexpected weirdness:

Warning: some benchmark results are missing virtual (CPU) time data

(and lots of other warnings that I was able to make sense of already)

The above was on "super". Here's the same on "well" with --disable-openmp --enable-simd=avx:

Number of benchmarks:           365
Minimum:                        0.55704 real, 0.55704 virtual
Maximum:                        133.96128 real, 133.96128 virtual
Median:                         1.01762 real, 1.01708 virtual
Median absolute deviation:      0.07282 real, 0.07336 virtual
Geometric mean:                 1.09430 real, 1.09409 virtual
Geometric standard deviation:   1.52811 real, 1.52798 virtual

(It is possible that the 1.8.0-jumbo-1 build made some use of AVX2 here. I didn't disable AVX2 for it, and it detected AVX2 at build time, but then said only AVX about each format benchmarked.)

The good news is that overall we got some speedup. The worst slowdown is ~2x and the best speedup is 130x+. To see where they are:

-       if ($verbose == 1) {
+       if ($verbose == 1 && ($kr < 0.9 || $kr > 10)) {
                printf "Ratio:\t%.5f real, %.5f virtual\t$id\n", $kr, $kv;
        }

Using data from "super" (which I think is more reliable - guaranteed no AVX2 anywhere):

Ratio:  0.84615 real, 0.84615 virtual   Blockchain, My Wallet (x10):Raw
Ratio:  0.55817 real, 0.56404 virtual   KeePass:Raw
Ratio:  32.04659 real, 31.72982 virtual PBKDF2-HMAC-SHA512, GRUB2 / OS X 10.8+:Raw
Ratio:  0.65076 real, 0.65076 virtual   PKZIP:Many salts
Ratio:  0.83316 real, 0.83316 virtual   PKZIP:Only one salt
Ratio:  13.06640 real, 13.06640 virtual SybaseASE, Sybase ASE:Many salts
Ratio:  0.77743 real, 0.77743 virtual   dmg, Apple DMG:Raw
Ratio:  0.73261 real, 0.73992 virtual   dynamic_15:Many salts
Ratio:  0.82377 real, 0.82377 virtual   dynamic_16:Many salts
Ratio:  0.76493 real, 0.76493 virtual   dynamic_24:Many salts
Ratio:  0.76474 real, 0.76474 virtual   dynamic_25:Many salts
Ratio:  0.71870 real, 0.71870 virtual   dynamic_35:Many salts
Ratio:  0.72531 real, 0.72531 virtual   dynamic_36:Many salts
Ratio:  0.76323 real, 0.76323 virtual   dynamic_37:Many salts
Ratio:  0.81191 real, 0.81191 virtual   dynamic_40:Many salts
Ratio:  0.83398 real, 0.83398 virtual   dynamic_61:Many salts
Ratio:  0.80492 real, 0.79694 virtual   dynamic_1016:Many salts
Ratio:  0.76038 real, 0.76038 virtual   dynamic_1401:Many salts
Ratio:  0.87405 real, 0.86527 virtual   dynamic_1401:Only one salt
Ratio:  0.87461 real, 0.87461 virtual   dynamic_1501:Only one salt
Ratio:  0.75675 real, 0.75675 virtual   dynamic_1504:Many salts
Ratio:  0.84177 real, 0.84177 virtual   dynamic_2001:Many salts
Ratio:  0.79714 real, 0.79714 virtual   dynamic_2004:Many salts
Ratio:  0.72865 real, 0.72865 virtual   dynamic_2005:Many salts
Ratio:  0.87991 real, 0.87991 virtual   dynamic_2006:Many salts
Ratio:  0.83394 real, 0.82568 virtual   dynamic_2008:Many salts
Ratio:  0.79867 real, 0.79867 virtual   dynamic_2009:Many salts
Ratio:  0.82579 real, 0.82579 virtual   dynamic_2010:Many salts
Ratio:  0.82581 real, 0.82581 virtual   dynamic_2011:Many salts
Ratio:  0.73347 real, 0.73347 virtual   dynamic_2014:Many salts
Ratio:  0.79065 real, 0.79852 virtual   net-sha1, "Keyed SHA1" BFD:Many salts
Ratio:  0.79091 real, 0.79091 virtual   sha1crypt, NetBSD's sha1crypt:Raw
Ratio:  135.48364 real, 135.48364 virtual       vtp, "MD5 based authentication" VTP:Many salts

There are also many smaller speedups, e.g. 28% for md5crypt, which is great, but for now I'd like to document primarily the regressions.

solardiz commented 5 years ago

"KeePass" is the worst - 56% of original speed, @kholia @Fist0urs. Was:

[solar@super run]$ pwd
/home/solar/j/john-1.8.0-jumbo-1/run
[solar@super run]$ ./john -test -form=keepass
Benchmarking: KeePass [SHA256 AES 32/64 OpenSSL]... DONE
Raw:    85.1 c/s real, 85.1 c/s virtual

Now:

[solar@super run]$ ./john -test -form=keepass
Benchmarking: KeePass [SHA256 AES 32/64]... DONE
Speed for cost 1 (iteration count) of 50000 and 6000, cost 2 (version) of 1 and 2, cost 3 (algorithm [0=AES, 1=TwoFish, 2=ChaCha]) of 0
Raw:    47.5 c/s real, 47.5 c/s virtual

Both are non-OpenMP builds, so benchmarking a single CPU core. A lot changed about the code, but the first two test vectors have stayed the same, so the slowdown is probably for real.

Next is "PKZIP:Many salts" at 65% of original, @jfoug.

The rest are at 72%+ of original. Lots of "dynamic_*" are impacted, but also "Blockchain, My Wallet (x10)", "dmg", "net-sha1", "sha1crypt".

solardiz commented 5 years ago

The major speedups are interesting. Was:

Benchmarking: PBKDF2-HMAC-SHA512, GRUB2 / OS X 10.8+ [PBKDF2-SHA512 128/128 SSE4.1 2x]... DONE
Raw:    60.1 c/s real, 60.7 c/s virtual

Benchmarking: sybasease, Sybase ASE [SHA256 32/64 OpenSSL]... DONE
Many salts:     322583 c/s real, 322583 c/s virtual
Only one salt:  317174 c/s real, 317174 c/s virtual

Benchmarking: vtp, "MD5 based authentication" VTP [MD5 32/64]... DONE
Many salts:     5593 c/s real, 5593 c/s virtual
Only one salt:  5593 c/s real, 5593 c/s virtual

Now:

Benchmarking: PBKDF2-HMAC-SHA512, GRUB2 / OS X 10.8+ [PBKDF2-SHA512 128/128 AVX 2x]... DONE
Speed for cost 1 (iteration count) of 1000
Raw:    1926 c/s real, 1926 c/s virtual

Benchmarking: SybaseASE, Sybase ASE [SHA256 128/128 AVX 4x]... DONE
Many salts:     4215K c/s real, 4215K c/s virtual
Only one salt:  960424 c/s real, 960424 c/s virtual

Benchmarking: vtp, "MD5 based authentication" VTP [MD5 32/64]... DONE
Many salts:     757760 c/s real, 757760 c/s virtual
Only one salt:  5712 c/s real, 5712 c/s virtual

solardiz commented 5 years ago

I've just created issues #3811 #3812 #3813 #3814 #3815 #3816 #3817 for the 10%+ performance regressions.

solardiz commented 5 years ago

Regressions in the >5% to 10% range:

Ratio:  0.90076 real, 0.90076 virtual   whirlpool1:Raw
Ratio:  0.90183 real, 0.90183 virtual   whirlpool0:Raw
Ratio:  0.90584 real, 0.91475 virtual   dynamic_38:Many salts
Ratio:  0.90681 real, 0.90681 virtual   netntlmv2, NTLMv2 C/R:Only one salt
Ratio:  0.90692 real, 0.90692 virtual   EPI, EPiServer SID:Many salts
Ratio:  0.92261 real, 0.92261 virtual   dynamic_35:Only one salt
Ratio:  0.92698 real, 0.92698 virtual   dynamic_1502:Only one salt
Ratio:  0.93177 real, 0.94105 virtual   dynamic_39:Many salts
Ratio:  0.93334 real, 0.93334 virtual   dynamic_36:Only one salt
Ratio:  0.93613 real, 0.92674 virtual   dynamic_38:Only one salt
Ratio:  0.93831 real, 0.93831 virtual   sapg, SAP CODVN F/G (PASSCODE):Many salts
Ratio:  0.93902 real, 0.93902 virtual   mysqlna, MySQL Network Authentication:Raw
Ratio:  0.94161 real, 0.93225 virtual   po, Post.Office:Many salts
Ratio:  0.94406 real, 0.94406 virtual   Salted-SHA1:Many salts
Ratio:  0.94693 real, 0.94693 virtual   dynamic_15:Only one salt
Ratio:  0.94918 real, 0.94918 virtual   dynamic_2014:Only one salt
Ratio:  0.94924 real, 0.94924 virtual   oracle, Oracle 10:Raw
Ratio:  0.94988 real, 0.94988 virtual   dynamic_40:Only one salt

Regressions in the >2% to 5% range:

Ratio:  0.95000 real, 0.95714 virtual   RAR5:Raw
Ratio:  0.95089 real, 0.95089 virtual   net-md5, "Keyed MD5" RIPv2, OSPF, BGP, SNMPv2:Many salts
Ratio:  0.95448 real, 0.96408 virtual   Fortigate, FortiOS:Many salts
Ratio:  0.95529 real, 0.95529 virtual   known_hosts, HashKnownHosts HMAC-SHA1:Many salts
Ratio:  0.95646 real, 0.95646 virtual   net-sha1, "Keyed SHA1" BFD:Only one salt
Ratio:  0.95652 real, 0.96703 virtual   Django (x10000):Raw
Ratio:  0.95975 real, 0.95975 virtual   Panama:Raw
Ratio:  0.96040 real, 0.96040 virtual   dynamic_1503:Only one salt
Ratio:  0.96079 real, 0.96079 virtual   dynamic_1400:Raw
Ratio:  0.96100 real, 0.96100 virtual   ZIP, WinZip:Raw
Ratio:  0.96411 real, 0.96411 virtual   PDF:Many salts
Ratio:  0.96449 real, 0.97414 virtual   aix-ssha256, AIX LPA {ssha256}:Raw
Ratio:  0.96460 real, 0.96460 virtual   Clipperz, SRP:Raw
Ratio:  0.96526 real, 0.96526 virtual   netlmv2, LMv2 C/R:Only one salt
Ratio:  0.96963 real, 0.96963 virtual   dynamic_16:Only one salt
Ratio:  0.97130 real, 0.97130 virtual   aix-ssha1, AIX LPA {ssha1}:Raw
Ratio:  0.97144 real, 0.97144 virtual   chap, iSCSI CHAP authentication / EAP-MD5:Raw
Ratio:  0.97206 real, 0.98196 virtual   kwallet, KDE KWallet:Raw
Ratio:  0.97232 real, 0.97232 virtual   dynamic_1014:Many salts
Ratio:  0.97323 real, 0.97323 virtual   dynamic_25:Only one salt
Ratio:  0.97536 real, 0.97536 virtual   STRIP, Password Manager:Raw
Ratio:  0.97561 real, 0.98613 virtual   OpenBSD-SoftRAID (8192 iterations):Raw
Ratio:  0.97592 real, 0.97592 virtual   dynamic_61:Only one salt
Ratio:  0.97619 real, 0.97619 virtual   krb5-18, Kerberos 5 DB etype 18:Raw
Ratio:  0.97619 real, 0.97619 virtual   krb5pa-sha1, Kerberos 5 AS-REQ Pre-Auth etype 17/18:Raw
Ratio:  0.97654 real, 0.97654 virtual   agilekeychain, 1Password Agile Keychain:Raw
Ratio:  0.97703 real, 0.97703 virtual   PBKDF2-HMAC-SHA1:Raw
Ratio:  0.97710 real, 0.97710 virtual   keychain, Mac OS X Keychain:Raw
Ratio:  0.97756 real, 0.97756 virtual   xsha, Mac OS X 10.4 - 10.6:Many salts
Ratio:  0.97894 real, 0.97894 virtual   Citrix_NS10, Netscaler 10:Many salts
Ratio:  0.97900 real, 0.97900 virtual   krb5pa-md5, Kerberos 5 AS-REQ Pre-Auth etype 23:Only one salt
Ratio:  0.97912 real, 0.97912 virtual   Snefru-256:Raw
Ratio:  0.97919 real, 0.97919 virtual   oracle11, Oracle 11g:Many salts
Ratio:  0.97970 real, 0.97970 virtual   krb5pa-md5, Kerberos 5 AS-REQ Pre-Auth etype 23:Many salts

solardiz commented 5 years ago

32 OpenMP threads on "super":

1.8.0-jumbo-1:

./configure --disable-opencl --disable-cuda
[...]
GOMP_CPU_AFFINITY=0-31 time ./john -test=1 -form=cpu

vs. bleeding-jumbo:

./configure --disable-opencl --enable-openmp-for-fast-formats
GOMP_CPU_AFFINITY=0-31 time ./john -test=1 -form=cpu

Statistics:

Number of benchmarks:           365
Minimum:                        0.22161 real, 0.32920 virtual
Maximum:                        111.72218 real, 113.22292 virtual
Median:                         1.17630 real, 1.05882 virtual
Median absolute deviation:      0.18767 real, 0.08317 virtual
Geometric mean:                 1.34909 real, 1.25595 virtual
Geometric standard deviation:   1.84247 real, 1.76176 virtual

Worse than 10% slowdowns and better than 10x speedups:

Ratio:  0.22161 real, 2.31296 virtual   HAVAL-128-4:Raw
Ratio:  0.25854 real, 2.49230 virtual   HAVAL-256-3:Raw
Ratio:  0.26176 real, 0.81939 virtual   dynamic_1501:Only one salt
Ratio:  0.29110 real, 8.79082 virtual   hdaa, HTTP Digest access authentication:Many salts
Ratio:  0.31832 real, 0.98159 virtual   dynamic_1503:Only one salt
Ratio:  0.33650 real, 0.94009 virtual   dynamic_1502:Only one salt
Ratio:  0.38106 real, 0.76266 virtual   Citrix_NS10, Netscaler 10:Many salts
Ratio:  0.39682 real, 8.63849 virtual   hdaa, HTTP Digest access authentication:Only one salt
Ratio:  0.54651 real, 0.54779 virtual   KeePass:Raw
Ratio:  0.67455 real, 0.65965 virtual   PKZIP:Many salts
Ratio:  0.72499 real, 0.90016 virtual   gost, GOST R 34.11-94:Raw
Ratio:  0.74473 real, 0.97354 virtual   dmd5, DIGEST-MD5 C/R:Raw
Ratio:  0.75468 real, 1.02690 virtual   nk, Nuked-Klan CMS:Raw
Ratio:  0.76698 real, 0.76033 virtual   dmg, Apple DMG:Raw
Ratio:  0.76967 real, 1.11609 virtual   OpenVMS, Purdy:Raw
Ratio:  0.78343 real, 0.79174 virtual   sha1crypt, NetBSD's sha1crypt:Raw
Ratio:  0.79653 real, 1.60955 virtual   postgres, PostgreSQL C/R:Raw
Ratio:  0.80422 real, 0.80295 virtual   krb5pa-md5, Kerberos 5 AS-REQ Pre-Auth etype 23:Many salts
Ratio:  0.83766 real, 0.83787 virtual   aix-ssha1, AIX LPA {ssha1}:Raw
Ratio:  0.84154 real, 0.69620 virtual   netlmv2, LMv2 C/R:Many salts
Ratio:  0.84670 real, 1.08016 virtual   o5logon, Oracle O5LOGON protocol:Raw
Ratio:  0.84851 real, 1.42700 virtual   mysqlna, MySQL Network Authentication:Raw
Ratio:  0.87168 real, 0.66672 virtual   netlmv2, LMv2 C/R:Only one salt
Ratio:  0.88135 real, 0.88186 virtual   lotus85, Lotus Notes/Domino 8.5:Raw
Ratio:  0.89120 real, 0.80078 virtual   dominosec, Lotus Notes/Domino 6 More Secure Internet Password:Many salts
Ratio:  10.72210 real, 0.62853 virtual  oracle, Oracle 10:Raw
Ratio:  12.06874 real, 0.41546 virtual  EPI, EPiServer SID:Many salts
Ratio:  18.00868 real, 0.56399 virtual  SunMD5:Raw
Ratio:  31.18674 real, 30.98592 virtual PBKDF2-HMAC-SHA512, GRUB2 / OS X 10.8+:Raw
Ratio:  35.03632 real, 1.10299 virtual  hMailServer:Many salts
Ratio:  111.72218 real, 113.22292 virtual       vtp, "MD5 based authentication" VTP:Many salts

solardiz commented 5 years ago

Ratio:  0.29110 real, 8.79082 virtual   hdaa, HTTP Digest access authentication:Many salts

HDAA previously had OpenMP support and now doesn't, not even with --enable-openmp-for-fast-formats:

commit 99e3c779eccb73cc59cdacac147f6303f9582cc3
Author: magnum <john.magnum@hushmail.com>
Date:   Mon Apr 8 00:02:45 2019 +0200

    HDAA: Drop b0rken OpenMP support, since it was poor anyway. Closes #3107.

Was:

Benchmarking: hdaa, HTTP Digest access authentication [MD5 128/128 AVX 12x]... (32xOMP) DONE
Many salts:     18481K c/s real, 610947 c/s virtual
Only one salt:  12750K c/s real, 591811 c/s virtual

Now:

Benchmarking: hdaa, HTTP Digest access authentication [MD5 128/128 AVX 4x3]... DONE
Many salts:     5345K c/s real, 5345K c/s virtual
Only one salt:  5071K c/s real, 5071K c/s virtual

I guess that's OK.

solardiz commented 5 years ago

HAVAL-* need this fix (without arch.h included first, FAST_FORMATS_OMP isn't defined yet at this point, so the #if below was disabling OpenMP for these formats even in --enable-openmp-for-fast-formats builds):

+++ b/src/haval_fmt_plug.c
@@ -19,6 +19,7 @@ john_register_one(&fmt_haval_128_4);

 #include <string.h>

+#include "arch.h"
 #if !FAST_FORMATS_OMP
 #undef _OPENMP
 #endif

solardiz commented 5 years ago

dynamic_1501 changed from:

Benchmarking: dynamic_1501 [sha1($salt.sha1($pass) (Redmine) 128/128 AVX 480x4x1]... (32xOMP) DONE
Many salts:     61618K c/s real, 1945K c/s virtual
Only one salt:  13750K c/s real, 948294 c/s virtual

to:

Benchmarking: dynamic_1501 [sha1($s.sha1($p)) (Redmine) 128/128 AVX 4x1]... (32xOMP) DONE
Many salts:     78005K c/s real, 2628K c/s virtual
Only one salt:  3612K c/s real, 783038 c/s virtual

The single-salt speed became much worse. I think this is related to the changed definition of this dynamic (and many others) in dynamic.conf, which started changing with this commit:

commit e84c4659fc5bcfec9740edc5c90153a7f4c23331
Author: jfoug <jfoug@cox.net>
Date:   Thu Jan 1 21:45:37 2015 -0600

    dynamic: larger hashes. Fixed max PLAINTEXT length, as found by test suite

The equivalent command-line dynamic format has faster "Many salts", but just as slow "Only one salt":

[solar@super run]$ GOMP_CPU_AFFINITY=0-31 ./john -test -form='dynamic=sha1($salt.sha1($pass))'
Will run 32 OpenMP threads
Benchmarking: dynamic=sha1($s.sha1($p)) [128/128 AVX 4x1]... (32xOMP) DONE
Many salts:     106619K c/s real, 3721K c/s virtual
Only one salt:  3501K c/s real, 825735 c/s virtual

In 1.8.0-jumbo-1, this syntax wasn't supported, so I can't compare to that.

solardiz commented 5 years ago

"Citrix_NS10, Netscaler 10" changed from:

Benchmarking: Citrix_NS10, Netscaler 10 [SHA1 128/128 AVX 4x]... (32xOMP) DONE
Many salts:     113508K c/s real, 8671K c/s virtual
Only one salt:  26993K c/s real, 6682K c/s virtual

to:

Benchmarking: Citrix_NS10, Netscaler 10 [SHA1 128/128 AVX 4x]... (32xOMP) DONE
Many salts:     43253K c/s real, 6613K c/s virtual
Only one salt:  53346K c/s real, 7000K c/s virtual

This might have been a temporary glitch during the benchmark. Re-running just this test, I get:

Benchmarking: Citrix_NS10, Netscaler 10 [SHA1 128/128 AVX 4x]... (32xOMP) DONE
Many salts:     103546K c/s real, 7636K c/s virtual
Only one salt:  56885K c/s real, 7048K c/s virtual

Increasing OMP_SCALE from 4 to 8, I get:

Benchmarking: Citrix_NS10, Netscaler 10 [SHA1 128/128 AVX 4x]... (32xOMP) DONE
Many salts:     107479K c/s real, 8577K c/s virtual
Only one salt:  60030K c/s real, 7888K c/s virtual

OMP_SCALE=16:

Benchmarking: Citrix_NS10, Netscaler 10 [SHA1 128/128 AVX 4x]... (32xOMP) DONE
Warning: "Many salts" test limited: 211/256
Many salts:     110624K c/s real, 9030K c/s virtual
Only one salt:  61341K c/s real, 8334K c/s virtual

Still not the 113.5M we had, but close.

solardiz commented 5 years ago

"gost" is weird. It changed from:

Benchmarking: gost, GOST R 34.11-94 [64/64]... (32xOMP) DONE
Raw:    11276K c/s real, 352714 c/s virtual

to:

Benchmarking: gost, GOST R 34.11-94 [64/64]... (32xOMP) DONE
Raw:    8175K c/s real, 317499 c/s virtual

and this is reproducible. Increasing OMP_SCALE, I can get it to up to 9400K. Decreasing PLAINTEXT_LENGTH from 125 to the old setting of 64 I can get it further to 9700K. But not to the 11M+. However, the "Many salts" benchmark can reach that, if enabled:

Benchmarking: gost, GOST R 34.11-94 [64/64]... (32xOMP) DONE
Many salts:     11747K c/s real, 367563 c/s virtual
Only one salt:  8642K c/s real, 329115 c/s virtual

(this is with only BENCHMARK_LENGTH changed from 0x107 to 7). This format isn't fully salted - rather, it's two different variations of GOST, where the single-bit salt chooses which to use. I think those two are supposed to be the same speed. Also, the first two test vectors (where we take salts from for the "Many salts" benchmark) correspond to the same variation. So the speedup must be from amortizing the cost of set_key - this is puzzling since this format's set_key is pretty simple, but I have no other explanation. Hacking 1.8.0-jumbo-1 to also enable the "Many salts" benchmark, I get:

Benchmarking: gost, GOST R 34.11-94 [64/64]... (32xOMP) DONE
Many salts:     11182K c/s real, 350206 c/s virtual
Only one salt:  6449K c/s real, 241360 c/s virtual

So perhaps, when applied to a "salted" format, 1.8.0-jumbo-1's "Raw" means "Many salts" and 1.9.0's means "Only one salt". That's an unintentional change in benchmarks, perhaps coming from core.

I'm not going to switch "gost" to reporting "Many salts" - there are at most 2 different "salts" for this format, whereas the benchmark assumes 256. I only ran these tests to figure out the problem.
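
(To spell out the set_key amortization argument above with a toy model - the unit costs are invented, purely illustrative:)

/* Why "Many salts" can look faster for a fast format: set_key() runs once per
   candidate, crypt_all() once per salt per batch, so with S salts the per-hash
   share of the set_key cost is divided by S. Numbers below are made up. */
#include <stdio.h>

int main(void)
{
	double set_key_cost = 1.0;   /* arbitrary units per candidate */
	double crypt_cost = 5.0;     /* arbitrary units per candidate per salt */
	int salts[] = { 1, 256 };

	for (int i = 0; i < 2; i++) {
		int s = salts[i];
		/* total work per computed hash, with set_key amortized over salts */
		double per_hash = crypt_cost + set_key_cost / s;
		printf("%3d salt(s): %.3f units per computed hash\n", s, per_hash);
	}
	return 0;
}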

solardiz commented 5 years ago

"dmd5, DIGEST-MD5 C/R:Raw" is another victim of the inadvertent change in benchmarks, but luckily this also reminds us which formats we should switch to reporting separate "Many salts" - it's one of those.

Benchmarking: dmd5, DIGEST-MD5 C/R [MD5 32/64]... (32xOMP) DONE
Many salts:     28704K c/s real, 917378 c/s virtual
Only one salt:  21672K c/s real, 901524 c/s virtual

solardiz commented 5 years ago

"openvms" is also a victim of the inadvertent change in benchmarks, fixed:

Benchmarking: OpenVMS, Purdy [32/64]... (32xOMP) DONE
Many salts:     16171K c/s real, 623881 c/s virtual
Only one salt:  12632K c/s real, 576017 c/s virtual

solardiz commented 5 years ago

Tentative fix for "Raw" benchmarks of salted formats where we don't expect performance to vary significantly by salt count:

+++ b/src/bench.c
@@ -828,6 +828,9 @@ AGAIN:
                salts = 0;
                if (!format->params.salt_size ||
                    (format->params.benchmark_length & 0x100)) {
+                       if (format->params.salt_size &&
+                           !(format->params.benchmark_length & 0x400))
+                               salts = BENCHMARK_MANY;
                        msg_m = "Raw";
                        msg_1 = NULL;
                } else if (format->params.benchmark_length & 0x200) {

along with:

+++ b/src/gost_fmt_plug.c
@@ -47,7 +47,7 @@ john_register_one(&fmt_gost);
 #define ALGORITHM_NAME          "32/" ARCH_BITS_STR
 #endif
 #define BENCHMARK_COMMENT       ""
-#define BENCHMARK_LENGTH        0x107
+#define BENCHMARK_LENGTH        0x507

And we'll also need to enable "Many salts" vs. "Only one salt" benchmarks where appropriate.
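
(For anyone decoding the flags: this is my reading of the BENCHMARK_LENGTH bits as used above, inferred from the bench.c patch rather than from any authoritative list. A tiny standalone decoder:)

/* Hypothetical decoder for the BENCHMARK_LENGTH bit flags, per my reading above. */
#include <stdio.h>

static void decode(unsigned int bl)
{
	printf("0x%03x: benchmark plaintext length %u, ", bl, bl & 0xff);
	if (!(bl & 0x100))
		printf("\"Many salts\" / \"Only one salt\" figures\n");
	else if (bl & 0x400)
		printf("single \"Raw\" figure, first salt only\n");
	else
		printf("single \"Raw\" figure, many salts (with the fix above)\n");
}

int main(void)
{
	decode(0x007);  /* e.g. gost hacked to show the many/one salt figures */
	decode(0x107);  /* e.g. gost and OpenBSD-SoftRAID as they are now */
	decode(0x507);  /* the proposed setting: "Raw", but a single salt */
	return 0;
}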

solardiz commented 5 years ago

With the benchmarks discrepancy avoided per the above comment, I'm getting only these worse than 10% regressions and better than 10x speedups for --enable-openmp-for-fast-formats, 32 threads:

Ratio:  0.05363 real, 0.05307 virtual   OpenBSD-SoftRAID (8192 iterations):Raw
Ratio:  0.26248 real, 0.83302 virtual   dynamic_1501:Only one salt
Ratio:  0.28686 real, 8.66268 virtual   hdaa, HTTP Digest access authentication:Many salts
Ratio:  0.31898 real, 0.98354 virtual   dynamic_1503:Only one salt
Ratio:  0.34193 real, 0.96425 virtual   dynamic_1502:Only one salt
Ratio:  0.39385 real, 8.57376 virtual   hdaa, HTTP Digest access authentication:Only one salt
Ratio:  0.67551 real, 0.66041 virtual   PKZIP:Many salts
Ratio:  0.73272 real, 0.73294 virtual   ODF, OpenDocument Star/Libre/OpenOffice:Raw
Ratio:  0.74095 real, 0.91186 virtual   gost, GOST R 34.11-94:Raw
Ratio:  0.80591 real, 0.80462 virtual   krb5pa-md5, Kerberos 5 AS-REQ Pre-Auth etype 23:Many salts
Ratio:  0.80844 real, 1.09248 virtual   skey, S/Key:Raw
Ratio:  0.84456 real, 0.69871 virtual   netlmv2, LMv2 C/R:Many salts
Ratio:  111.72218 real, 113.32930 virtual       vtp, "MD5 based authentication" VTP:Many salts
Ratio:  12.42245 real, 0.42769 virtual  EPI, EPiServer SID:Many salts
Ratio:  18.00868 real, 0.56399 virtual  SunMD5:Raw
Ratio:  21.78571 real, 0.68531 virtual  oracle, Oracle 10:Raw
Ratio:  31.25746 real, 31.09155 virtual PBKDF2-HMAC-SHA512, GRUB2 / OS X 10.8+:Raw
Ratio:  35.20644 real, 1.10805 virtual  hMailServer:Many salts

The slowdown at "gost" is expected - we were benchmarking it wrongly in 1.8.0-jumbo-1 (at an impossible salt count). The rest of the slowdowns should be looked into.

solardiz commented 5 years ago

"OpenBSD-SoftRAID" is new with the benchmarks fix. Previously, it supported only one KDF type and had only one test vector. Now it has several, and the second test vector is for bcrypt-pbkdf, so is slower. With the benchmarks fix, we exposed its use, and speeds changed. No issue there.

1.8.0-jumbo-1:

Benchmarking: OpenBSD-SoftRAID (8192 iterations) [PBKDF2-SHA1 4x SSE2]... (32xOMP) DONE
Raw:    10368 c/s real, 326 c/s virtual

Now, pre-fix ("Raw" means "Only one salt", so uses the same one test vector as before):

Benchmarking: OpenBSD-SoftRAID (8192 iterations) [PBKDF2-SHA1 128/128 AVX 4x]... (32xOMP) DONE
Speed for cost 1 (kdf) of 1 and 3, cost 2 (iteration count) of 8192 and 16
Raw:    10240 c/s real, 318 c/s virtual

(The "Speed for ..." comment was wrong.)

Now, post-fix ("Raw" means "Many salts", so uses the first two test vectors' salts):

Benchmarking: OpenBSD-SoftRAID (8192 iterations) [PBKDF2-SHA1 128/128 AVX 4x]... (32xOMP) DONE
Speed for cost 1 (kdf) of 1 and 3, cost 2 (iteration count) of 8192 and 16
Raw:    556 c/s real, 17.3 c/s virtual

(The same "Speed for ..." comment is now correct.)

solardiz commented 5 years ago

There's a different issue, though: when we mix costs that are so different in one benchmark, we might not notice future performance regressions at the faster cost as the reported speed is dominated by the slower cost. Ideally, we'd report per-cost-setting speeds or have separate formats when the underlying algorithms are that different. It makes little sense to actually mix costs that are so different in an attack anyway, and having separate formats would discourage such misguided mixing. To avoid code duplication, we could have the same source file provide two format structs - or is that unsupported by our configure's formats plugins logic?
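
(A structural sketch of what that could look like, with hypothetical names; it's the same guard pattern that already lets haval_fmt_plug.c register two format structs from a single plugin file:)

/* Sketch only - hypothetical names for an OpenBSD-SoftRAID split. */
#if FMT_EXTERNS_H
extern struct fmt_main fmt_softraid_sha1;
extern struct fmt_main fmt_softraid_bcrypt;
#elif FMT_REGISTERS_H
john_register_one(&fmt_softraid_sha1);
john_register_one(&fmt_softraid_bcrypt);
#else
/* ...shared cracking code, then two struct fmt_main definitions that differ
   only in their tests[] entries and tunable cost reporting... */
#endif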

solardiz commented 5 years ago

BTW, I thought of changing "Raw" back to the equivalent of "Many salts" only for legacy benchmarks, but now I realized that our "Speed for cost ..." reporting assumes we always use the first two salts, so I'll change "Raw" back to the equivalent of "Many salts" for all benchmarks (legacy and new), like the patch above does.

solardiz commented 5 years ago

I also re-ran my tests for non-OpenMP after the benchmark fix:

Ratio:  0.03552 real, 0.03590 virtual   OpenBSD-SoftRAID (8192 iterations):Raw
Ratio:  0.65194 real, 0.65194 virtual   PKZIP:Many salts
Ratio:  0.71767 real, 0.71767 virtual   dynamic_35:Many salts
Ratio:  0.72039 real, 0.72039 virtual   dynamic_36:Many salts
Ratio:  0.72759 real, 0.72759 virtual   dynamic_2005:Many salts
Ratio:  0.73123 real, 0.73123 virtual   dynamic_2014:Many salts
Ratio:  0.73183 real, 0.73914 virtual   dynamic_15:Many salts
Ratio:  0.75081 real, 0.75081 virtual   dynamic_1504:Many salts
Ratio:  0.75725 real, 0.75725 virtual   dynamic_37:Many salts
Ratio:  0.75901 real, 0.75901 virtual   dynamic_25:Many salts
Ratio:  0.75928 real, 0.75928 virtual   dynamic_24:Many salts
Ratio:  0.76089 real, 0.76089 virtual   dynamic_1401:Many salts
Ratio:  0.78435 real, 0.78435 virtual   ODF, OpenDocument Star/Libre/OpenOffice:Raw
Ratio:  0.78830 real, 0.79614 virtual   net-sha1, "Keyed SHA1" BFD:Many salts
Ratio:  0.79592 real, 0.79592 virtual   dynamic_2004:Many salts
Ratio:  0.79621 real, 0.79621 virtual   dynamic_2009:Many salts
Ratio:  0.80745 real, 0.80745 virtual   dynamic_1016:Many salts
Ratio:  0.80892 real, 0.80892 virtual   dynamic_40:Many salts
Ratio:  0.82447 real, 0.82447 virtual   dynamic_2010:Many salts
Ratio:  0.82504 real, 0.82504 virtual   dynamic_2011:Many salts
Ratio:  0.83207 real, 0.82367 virtual   ZIP, WinZip:Raw
Ratio:  0.83276 real, 0.83276 virtual   dynamic_16:Many salts
Ratio:  0.83299 real, 0.83299 virtual   PKZIP:Only one salt
Ratio:  0.83807 real, 0.83807 virtual   dynamic_2008:Many salts
Ratio:  0.83907 real, 0.83907 virtual   dynamic_61:Many salts
Ratio:  0.84707 real, 0.84707 virtual   dynamic_2001:Many salts
Ratio:  0.84832 real, 0.84832 virtual   Blockchain, My Wallet (x10):Raw
Ratio:  0.86600 real, 0.86600 virtual   dynamic_1401:Only one salt
Ratio:  0.87394 real, 0.87394 virtual   dynamic_1501:Only one salt
Ratio:  0.88527 real, 0.88527 virtual   dynamic_2006:Many salts
Ratio:  13.12530 real, 13.12530 virtual SybaseASE, Sybase ASE:Many salts
Ratio:  135.64527 real, 135.64527 virtual       vtp, "MD5 based authentication" VTP:Many salts
Ratio:  32.04659 real, 31.72982 virtual PBKDF2-HMAC-SHA512, GRUB2 / OS X 10.8+:Raw

Besides the dynamics, these 6 should be looked into:

Ratio:  0.65194 real, 0.65194 virtual   PKZIP:Many salts
Ratio:  0.78435 real, 0.78435 virtual   ODF, OpenDocument Star/Libre/OpenOffice:Raw
Ratio:  0.78830 real, 0.79614 virtual   net-sha1, "Keyed SHA1" BFD:Many salts
Ratio:  0.83207 real, 0.82367 virtual   ZIP, WinZip:Raw
Ratio:  0.83299 real, 0.83299 virtual   PKZIP:Only one salt
Ratio:  0.84832 real, 0.84832 virtual   Blockchain, My Wallet (x10):Raw

We already have issues opened for several of them, but "ODF" and "ZIP" are new here - it might be something similar to "OpenBSD-SoftRAID" with the second test vector exposing different costs, but this needs to be confirmed.

solardiz commented 5 years ago

For "ODF", we previously had two formats (with the same label?!)

Benchmarking: ODF [SHA1 BF / SHA256 AES 4x SSE2]... DONE
Raw:    5680 c/s real, 5680 c/s virtual

Benchmarking: ODF, OpenDocument Star/Libre/OpenOffice [SHA1 Blowfish 4x SSE2]... DONE
Raw:    7104 c/s real, 7104 c/s virtual

Now it's just one:

Benchmarking: ODF, OpenDocument Star/Libre/OpenOffice [PBKDF2-SHA1 128/128 AVX 4x BF/AES]... DONE
Speed for cost 1 (iteration count) of 1024, cost 2 (crypto [0=Blowfish, 1=AES]) of 0 and 1
Raw:    5572 c/s real, 5572 c/s virtual

I wonder why it doesn't produce an in-between speed, but for now I'll just assume it's different and not trivial to compare to what we had before.

solardiz commented 5 years ago

"ZIP" regression from:

Benchmarking: ZIP, WinZip [PBKDF2-SHA1 4x SSE2]... DONE
Raw:    3692 c/s real, 3692 c/s virtual

to:

Benchmarking: ZIP, WinZip [PBKDF2-SHA1 128/128 AVX 4x]... DONE
Raw:    3072 c/s real, 3041 c/s virtual

No mention of multiple tunable costs, although I think there are some, @frank-dittrich.

Just before the benchmark fix:

Benchmarking: ZIP, WinZip [PBKDF2-SHA1 128/128 AVX 4x]... DONE
Raw:    3548 c/s real, 3548 c/s virtual

The first two test vectors appear unchanged. Old:

$ egrep '9ffba|855f6' ../src/zip_fmt_plug.c|md5sum
ea90d8b2fd5d0e0a977f1cac67aca045  -

New:

$ egrep '9ffba|855f6' ../src/pkzip.c|md5sum
ea90d8b2fd5d0e0a977f1cac67aca045  -

I'm puzzled, and will appreciate it if someone else looks into this. Edit: this is now #3818.

solardiz commented 5 years ago

As I understand it, we figured out and accepted or otherwise took care of all but the dynamic formats regressions (#3814, stalled by @jfoug's unavailability).

In case we'd like to retest the latest code and possibly identify other regressions, we'd want to unify "Many salts" to "Raw" in our --test output (for both old and new versions) before running relbench.

solardiz commented 5 years ago

I still wouldn't mind more work on this issue - such as including specific relbench numbers in our release announcement. Also for OpenCL.

solardiz commented 5 years ago

We seem to have exposed more weirdness for OpenBSD-SoftRAID, maybe through the salt sorting fixes. I think the current behavior of benchmarks is correct - we've merely exposed the format's issues.

We now have BENCHMARK_COMMENT inconsistent with what's actually benchmarked. It's also unneeded because we have proper tunable costs reporting for this format.

Benchmarking: OpenBSD-SoftRAID (8192 iterations) [PBKDF2-SHA1 128/128 XOP 4x]... DONE
Speed for cost 1 (kdf) of 1 and 3, cost 2 (iteration count) of 8192 and 16
Raw:    24.1 c/s real, 24.1 c/s virtual

Changing 0x107 to 0x507, I get:

Benchmarking: OpenBSD-SoftRAID (8192 iterations) [PBKDF2-SHA1 128/128 XOP 4x]... DONE
Speed for cost 1 (kdf) of 1, cost 2 (iteration count) of 8192
Raw:    688 c/s real, 681 c/s virtual

This is now internally consistent, but it's a lot slower than what we had in 1.8.0-jumbo-1 (actual regression, it seems):

Benchmarking: OpenBSD-SoftRAID (8192 iterations) [PBKDF2-SHA1 8x SSE2]... DONE
Raw:    990 c/s real, 990 c/s virtual

(The "SSE2" reporting was probably a cosmetic error.)

This is in non-OpenMP builds. Hopefully, the regression goes away when running 2 threads/module (this is on FX-8120), which is probably why we ended up lowering the interleaving for SHA-1 here.

We should probably at least drop BENCHMARK_COMMENT and maybe change 0x107 to 0x507.

solardiz commented 5 years ago

There's a similar slowdown seen on XOP for other iterated SHA-1 formats.

solardiz commented 5 years ago

The regression was introduced in b3fb17c7bb1d3537778408dfce3bc911529f0102:

commit b3fb17c7bb1d3537778408dfce3bc911529f0102
Author: magnum <john.magnum@hushmail.com>
Date:   Tue Jun 2 01:46:49 2015 +0200

    SHA1 intrinsics: Drop a whole array of redundant temp. We
    can use the existing pad array! And decrease SHA1_PARA to
    1 because that's faster now.

solardiz commented 5 years ago

I think I'll use:

diff --git a/src/openbsdsoftraid_fmt_plug.c b/src/openbsdsoftraid_fmt_plug.c
index 625210c..f353188 100644
--- a/src/openbsdsoftraid_fmt_plug.c
+++ b/src/openbsdsoftraid_fmt_plug.c
@@ -47,8 +47,8 @@ john_register_one(&fmt_openbsd_softraid);
 #else
 #define ALGORITHM_NAME              "PBKDF2-SHA1 32/" ARCH_BITS_STR
 #endif
-#define BENCHMARK_COMMENT           " (8192 iterations)"
-#define BENCHMARK_LENGTH            0x107
+#define BENCHMARK_COMMENT           ""
+#define BENCHMARK_LENGTH            0x507
 #define PLAINTEXT_LENGTH            125
 #define SALT_SIZE                   sizeof(struct custom_salt)
 #define SALT_ALIGN                  4
diff --git a/src/x86-64.h b/src/x86-64.h
index 1683a9b..f77c44a 100644
--- a/src/x86-64.h
+++ b/src/x86-64.h
@@ -269,10 +269,8 @@
 #define SIMD_PARA_SHA1                 2
 #elif defined(__llvm__)
 #define SIMD_PARA_SHA1                 2
-#elif defined(__GNUC__) && GCC_VERSION < 40504 // 4.5.4
-#define SIMD_PARA_SHA1                 1
-#elif !defined(__AVX__) && defined(__GNUC__) && GCC_VERSION > 40700 // 4.7.0
-#define SIMD_PARA_SHA1                 1
+#elif defined(__XOP__)
+#define SIMD_PARA_SHA1                 2
 #else
 #define SIMD_PARA_SHA1                 1
 #endif

solardiz commented 5 years ago

This change provides a good speedup for raw-SHA1. And we need to retest AVX on Intel - maybe there's a missed speedup opportunity there.

solardiz commented 5 years ago

No, looks like SHA-1 x1 is faster on Intel AVX - ran 4 tests on super (two gcc versions, raw-sha1 non-OMP and mscash2 OMP). So we only need the change for XOP.

solardiz commented 5 years ago

With the fix above, XOP non-OMP with ./benchmark-unify | sed 's/Many salts/Raw/' on 1.8.0-jumbo-1 vs. 1.9.0-jumbo-1 with Benchmarks_1_8 = Y:

Number of benchmarks:           374
Minimum:                        0.59245 real, 0.59245 virtual
Maximum:                        133.71622 real, 133.71622 virtual
Median:                         1.05118 real, 1.05248 virtual
Median absolute deviation:      0.06856 real, 0.06788 virtual
Geometric mean:                 1.26666 real, 1.26680 virtual
Geometric standard deviation:   1.78772 real, 1.78789 virtual

Worse than 2% regressions:

Ratio:  0.59245 real, 0.59245 virtual   PKZIP:Raw
Ratio:  0.67494 real, 0.66749 virtual   tc_whirlpool, TrueCrypt AES256_XTS:Raw
Ratio:  0.71919 real, 0.71919 virtual   PKZIP:Only one salt
Ratio:  0.72432 real, 0.72432 virtual   Raw-Keccak:Raw
Ratio:  0.73344 real, 0.73344 virtual   Raw-Keccak-256:Raw
Ratio:  0.80089 real, 0.80089 virtual   Panama:Raw
Ratio:  0.86031 real, 0.86058 virtual   ZIP, WinZip:Raw
Ratio:  0.87716 real, 0.87716 virtual   ODF, OpenDocument Star/Libre/OpenOffice:Raw
Ratio:  0.88119 real, 0.88119 virtual   net-sha1, "Keyed SHA1" BFD:Raw
Ratio:  0.88400 real, 0.88400 virtual   krb5pa-md5, Kerberos 5 AS-REQ Pre-Auth etype 23:Raw
Ratio:  0.89250 real, 0.89250 virtual   netntlmv2, NTLMv2 C/R:Only one salt
Ratio:  0.89883 real, 0.89883 virtual   dynamic_1504:Raw
Ratio:  0.90098 real, 0.90098 virtual   dynamic_1501:Only one salt
Ratio:  0.90192 real, 0.91103 virtual   dynamic_24:Raw
Ratio:  0.90217 real, 0.90217 virtual   krb5pa-md5, Kerberos 5 AS-REQ Pre-Auth etype 23:Only one salt
Ratio:  0.90628 real, 0.90628 virtual   dynamic_35:Raw
Ratio:  0.90769 real, 0.90769 virtual   dynamic_36:Raw
Ratio:  0.90775 real, 0.90775 virtual   dynamic_37:Raw
Ratio:  0.91296 real, 0.91296 virtual   dynamic_40:Raw
Ratio:  0.91732 real, 0.91732 virtual   dynamic_38:Raw
Ratio:  0.93094 real, 0.93094 virtual   dynamic_38:Only one salt
Ratio:  0.93271 real, 0.93271 virtual   AFS, Kerberos AFS:Long
Ratio:  0.93359 real, 0.93359 virtual   EPI, EPiServer SID:Raw
Ratio:  0.93462 real, 0.93462 virtual   dynamic_1502:Only one salt
Ratio:  0.93588 real, 0.93588 virtual   dynamic_140:Raw
Ratio:  0.93906 real, 0.93906 virtual   dynamic_1504:Only one salt
Ratio:  0.94394 real, 0.94394 virtual   dynamic_25:Raw
Ratio:  0.94906 real, 0.94906 virtual   o5logon, Oracle O5LOGON protocol:Raw
Ratio:  0.94926 real, 0.94926 virtual   dynamic_36:Only one salt
Ratio:  0.95358 real, 0.95358 virtual   dynamic_35:Only one salt
Ratio:  0.95448 real, 0.95448 virtual   dynamic_2009:Raw
Ratio:  0.95465 real, 0.95465 virtual   dynamic_37:Only one salt
Ratio:  0.95525 real, 0.95525 virtual   MongoDB, system / network:Raw
Ratio:  0.95716 real, 0.95716 virtual   nk, Nuked-Klan CMS:Raw
Ratio:  0.96078 real, 0.96078 virtual   Citrix_NS10, Netscaler 10:Raw
Ratio:  0.96549 real, 0.96549 virtual   dynamic_1401:Raw
Ratio:  0.96683 real, 0.96683 virtual   wpapsk, WPA/WPA2/PMF/PMKID PSK:Raw
Ratio:  0.96850 real, 0.96850 virtual   dynamic_15:Raw
Ratio:  0.96876 real, 0.96876 virtual   RAKP, IPMI 2.0 RAKP (RMCP+):Only one salt
Ratio:  0.96883 real, 0.96883 virtual   dynamic_1027:Raw
Ratio:  0.96978 real, 0.96978 virtual   dynamic_26:Raw
Ratio:  0.97017 real, 0.97017 virtual   dynamic_25:Only one salt
Ratio:  0.97044 real, 0.97044 virtual   dynamic_18:Raw
Ratio:  0.97068 real, 0.97068 virtual   net-sha1, "Keyed SHA1" BFD:Only one salt
Ratio:  0.97143 real, 0.97143 virtual   mscash2, MS Cache Hash 2 (DCC2):Raw
Ratio:  0.97289 real, 0.98273 virtual   fde, Android FDE:Raw
Ratio:  0.97305 real, 0.97305 virtual   dynamic_1028:Raw
Ratio:  0.97430 real, 0.97430 virtual   dynamic_2014:Raw
Ratio:  0.97570 real, 0.97570 virtual   xsha, Mac OS X 10.4 - 10.6:Raw
Ratio:  0.97722 real, 0.97722 virtual   netlmv2, LMv2 C/R:Only one salt

Some of these are known to us and aren't actually regressions (things changed in other ways). Most are probably for real.

Edit: re-ran the benchmarks as the first time there appeared to be a glitch (other load?) causing at least AFS to appear slower than it actually is.

solardiz commented 5 years ago

mscash2, old:

Benchmarking: mscash2, MS Cache Hash 2 (DCC2) [PBKDF2-SHA1 128/128 XOP 8x]... DONE
Raw:    1680 c/s real, 1680 c/s virtual

new:

Benchmarking: mscash2, MS Cache Hash 2 (DCC2) [PBKDF2-SHA1 128/128 XOP 4x2]... DONE
Raw:    1631 c/s real, 1631 c/s virtual

This is entirely reproducible. And things were far worse before I increased the SHA-1 interleaving back to 2x.

solardiz commented 5 years ago

It's puzzling that a mscash2 regression is still present after my SHA-1 interleaving increase, because raw-sha1 actually became faster than before. 1.8.0-jumbo-1:

Benchmarking: Raw-SHA1 [SHA1 128/128 XOP 8x]... DONE
Raw:    17372K c/s real, 17372K c/s virtual

now:

Benchmarking: Raw-SHA1 [SHA1 128/128 XOP 4x2]... DONE
Raw:    22998K c/s real, 22998K c/s virtual

magnumripper commented 5 years ago

Looking at diffs of mscash2, we bumped the max. salt size (i.e. login name) significantly. Before that, a user name like "Administrator" in Greek or Russian would be uncrackable if encoded with UTF-8 because it got too long. Other than that, it's just clean-ups. Oh BTW I think our GETPOS macros might not be quite as optimizable by the compiler now, because the older way to write them could not account for all vector sizes. This should affect all formats, though.

magnumripper commented 5 years ago

a user name like "Administrator" in Greek or Russian would be uncrackable if encoded with UTF-8

I must be recalling the above incorrectly: regardless of input file format, it's encoded as UTF-16, so Greek or Russian wouldn't matter. Anyway, we did bump the max salt length.