Closed jfoug closed 9 years ago
This may not fit here 100%, but it is where I am putting it for now. Here are benchmark comparisons across 3 runs, on cygwin64 and Ubuntu-64 (VirtualBox VM), on my AMD XOP dual laptop. There are some MAX_KEYS and OMP_SCALE things we should address here. I want to get the same data from my Core i7 quad HT at work, since it will likely scale differently. Then we can smooth some of these issues out, even before starting on auto-scaling for OMP.
FIXED:
Ratio: 0.15884 real, 0.13887 virtual openssl-enc, OpenSSL "enc" encryption:Raw
Ratio: 1.00000 real, 0.25000 virtual openssl-enc, OpenSSL "enc" encryption:Raw
This one was cygwin only. It worked fine on Ubuntu. I added an #ifdef __CYGWIN__ to turn off OMP.
all tc_* (true crypt) hashes on cygwin
cygwin64 $ ../run/relbench -v omp-1-xop.log omp-0-xop.log
Ratio: 1.85919 real, 1.85881 virtual Blockchain, My Wallet (x10):Raw
Ratio: 2.18651 real, 2.18615 virtual EPiServer:Many salts
Ratio: 2.19742 real, 2.21316 virtual EPiServer:Only one salt
Ratio: 1.34197 real, 1.34105 virtual HMAC-MD5:Only one salt
Ratio: 1.33636 real, 1.33969 virtual HMAC-SHA1:Only one salt
Ratio: 1.28210 real, 1.28137 virtual HMAC-SHA224:Many salts
Ratio: 1.24982 real, 1.24792 virtual HMAC-SHA512:Many salts
Ratio: 1.15183 real, 1.15139 virtual HMAC-SHA512:Only one salt
Ratio: 1.35976 real, 1.35325 virtual LM:Raw
Ratio: 0.89873 real, 0.89687 virtual LastPass, sniffed sessions:Raw
Ratio: 1.96742 real, 1.96912 virtual MongoDB, system / network:Raw
Ratio: 1.77983 real, 1.77802 virtual PST, custom CRC-32:Raw
Ratio: 1.92509 real, 1.91920 virtual RACF:Many salts
Ratio: 1.64315 real, 1.64180 virtual RACF:Only one salt
Ratio: 1.31663 real, 1.31514 virtual RAKP, IPMI 2.0 RAKP (RMCP+):Only one salt
Ratio: 0.87120 real, 0.86849 virtual Raw-Blake2:Raw
Ratio: 0.80101 real, 0.80025 virtual Raw-Keccak:Raw
Ratio: 0.86252 real, 0.86252 virtual Raw-Keccak-256:Raw
Ratio: 1.19374 real, 1.18459 virtual Raw-MD4:Raw
Ratio: 0.92907 real, 0.92695 virtual Raw-SHA512:Raw
Ratio: 0.91581 real, 0.91350 virtual Raw-SHA1-ng, (pwlen <= 15):Raw
Ratio: 1.15896 real, 1.16040 virtual Raw-SHA512-ng:Raw
Ratio: 0.91991 real, 0.91628 virtual SIP:Only one salt
Ratio: 0.90118 real, 0.90389 virtual SSH (one 2048-bit RSA and one 1024-bit DSA key):Raw
Ratio: 4.09753 real, 4.09598 virtual SSH-ng:Raw
Ratio: 0.94087 real, 0.94382 virtual Snefru-128:Raw
Ratio: 0.84281 real, 0.84123 virtual Snefru-256:Raw
Ratio: 0.92757 real, 0.92518 virtual Sybase-PROP:Many salts
Ratio: 0.92937 real, 0.92280 virtual Sybase-PROP:Only one salt
Ratio: 0.91833 real, 0.91758 virtual Tiger:Raw
Ratio: 1.15613 real, 1.15681 virtual agilekeychain, 1Password Agile Keychain:Raw
Ratio: 1.17803 real, 1.17727 virtual bsdicrypt, BSDI crypt(3) ("_J9..", 725 iterations):Many salts
Ratio: 2.35986 real, 2.37017 virtual chap, iSCSI CHAP authentication:Raw
Ratio: 1.18942 real, 1.18884 virtual dragonfly3-32, DragonFly BSD $3$ w/ bug, 32-bit:Many salts
Ratio: 1.21620 real, 1.21620 virtual dragonfly3-32, DragonFly BSD $3$ w/ bug, 32-bit:Only one salt
Ratio: 1.30824 real, 1.30745 virtual dragonfly3-64, DragonFly BSD $3$ w/ bug, 64-bit:Many salts
Ratio: 1.18535 real, 1.18331 virtual dragonfly3-64, DragonFly BSD $3$ w/ bug, 64-bit:Only one salt
Ratio: 1.16511 real, 1.16313 virtual dragonfly4-32, DragonFly BSD $4$ w/ bugs, 32-bit:Many salts
Ratio: 1.16561 real, 1.16455 virtual dragonfly4-64, DragonFly BSD $4$ w/ bugs, 64-bit:Only one salt
Ratio: 1.15637 real, 1.16207 virtual fde, Android FDE:Raw
Ratio: 1.11659 real, 1.11406 virtual hdaa, HTTP Digest access authentication:Many salts
Ratio: 1.10031 real, 1.10060 virtual ipb2, Invision Power Board 2.x:Many salts
Ratio: 1.11996 real, 1.12052 virtual keychain, Mac OS X Keychain:Raw
Ratio: 1.21872 real, 1.21897 virtual keystore, Java KeyStore:Raw
Ratio: 1.14208 real, 1.13958 virtual lotus85, Lotus Notes/Domino 8.5:Raw
Ratio: 1.71362 real, 1.71173 virtual mscash, MS Cache Hash (DCC):Many salts
Ratio: 1.54250 real, 1.53922 virtual mscash, MS Cache Hash (DCC):Only one salt
Ratio: 1.31895 real, 1.31614 virtual mschapv2-naive, MSCHAPv2 C/R:Many salts
Ratio: 1.54376 real, 1.54425 virtual mssql12, MS SQL 2012/2014:Many salts
Ratio: 1.32595 real, 1.32425 virtual mssql12, MS SQL 2012/2014:Only one salt
Ratio: 1.59822 real, 1.60298 virtual mysqlna, MySQL Network Authentication:Raw
Ratio: 1.29011 real, 1.28920 virtual net-md5, "Keyed MD5" RIPv2, OSPF, BGP, SNMPv2:Many salts
Ratio: 1.22309 real, 1.22330 virtual net-md5, "Keyed MD5" RIPv2, OSPF, BGP, SNMPv2:Only one salt
Ratio: 1.39862 real, 1.39617 virtual net-sha1, "Keyed SHA1" BFD:Many salts
Ratio: 1.43371 real, 1.43077 virtual net-sha1, "Keyed SHA1" BFD:Only one salt
Ratio: 1.16528 real, 1.16641 virtual nethalflm, HalfLM C/R:Many salts
Ratio: 1.22222 real, 1.22358 virtual netlm, LM C/R:Many salts
Ratio: 1.18733 real, 1.18456 virtual netlm, LM C/R:Only one salt
Ratio: 1.30871 real, 1.31416 virtual netntlm-naive, NTLMv1 C/R:Many salts
Ratio: 1.15936 real, 1.15877 virtual netntlmv2, NTLMv2 C/R:Many salts
Ratio: 1.37253 real, 1.36914 virtual nt2, NT:Raw
Ratio: 1.16799 real, 1.16840 virtual oldoffice, MS Office <= 2003:Many salts
Ratio: 1.18387 real, 1.18222 virtual openssl-enc, OpenSSL "enc" encryption:Raw
Ratio: 1.70178 real, 1.69890 virtual postgres, PostgreSQL C/R:Raw
Ratio: 0.91070 real, 0.90869 virtual ripemd-128, RIPEMD 128:Raw
Ratio: 0.94408 real, 0.94032 virtual ripemd-160, RIPEMD 160:Raw
Ratio: 0.87103 real, 0.86959 virtual skein-256, Skein 256:Raw
Ratio: 0.88118 real, 0.87953 virtual skein-512, Skein 512:Raw
Ratio: 0.90698 real, 0.89862 virtual tc_ripemd160, TrueCrypt RIPEMD160 AES256_XTS:Raw
Ratio: 0.88230 real, 0.88230 virtual tc_sha512, TrueCrypt SHA512 AES256_XTS:Raw
Ratio: 0.89068 real, 0.88782 virtual tc_whirlpool, TrueCrypt WHIRLPOOL AES256_XTS:Raw
Ratio: 0.95394 real, 0.95651 virtual tcp-md5, TCP MD5 Signatures, BGP:Many salts
Ratio: 0.93251 real, 0.93166 virtual tcp-md5, TCP MD5 Signatures, BGP:Only one salt
Ratio: 1.15416 real, 1.15516 virtual tripcode:Raw
Ratio: 0.88678 real, 0.88869 virtual vtp, "MD5 based authentication" VTP:Many salts
Ratio: 0.90883 real, 0.90864 virtual vtp, "MD5 based authentication" VTP:Only one salt
Ratio: 0.89586 real, 0.89650 virtual whirlpool1:Raw
Ratio: 1.34711 real, 1.34591 virtual xsha, Mac OS X 10.4 - 10.6:Many salts
Ratio: 1.25367 real, 1.25248 virtual xsha, Mac OS X 10.4 - 10.6:Only one salt
Ratio: 0.91102 real, 0.90596 virtual xsha512, Mac OS X 10.7:Only one salt
cygwin64 $ ../run/relbench -v omp-4-xop.log omp-0-xop.log
Ratio: 1.11980 real, 2.83895 virtual Blockchain, My Wallet (x10):Raw
Ratio: 1.50262 real, 3.07782 virtual EPiServer:Many salts
Ratio: 1.43417 real, 3.09400 virtual EPiServer:Only one salt
Ratio: 1.03194 real, 1.99459 virtual HMAC-SHA1:Only one salt
Ratio: 1.07199 real, 2.05080 virtual LM:Raw
Ratio: 1.15328 real, 2.61449 virtual MongoDB, system / network:Raw
Ratio: 1.08891 real, 2.30125 virtual PFX, PKCS12 (.pfx, .p12):Raw
Ratio: 1.28908 real, 1.89985 virtual PST, custom CRC-32:Raw
Ratio: 1.04591 real, 2.40295 virtual RACF:Many salts
Ratio: 1.06827 real, 2.23486 virtual RACF:Only one salt
Ratio: 1.02438 real, 1.02400 virtual SSH-ng:Raw
Ratio: 1.61978 real, 3.48579 virtual chap, iSCSI CHAP authentication:Raw
Ratio: 1.26333 real, 2.29408 virtual cq, ClearQuest:Raw
Ratio: 1.08666 real, 2.48443 virtual mscash, MS Cache Hash (DCC):Many salts
Ratio: 0.95866 real, 2.10208 virtual mscash, MS Cache Hash (DCC):Only one salt
Ratio: 0.95244 real, 1.89957 virtual net-sha1, "Keyed SHA1" BFD:Only one salt
Ratio: 1.02455 real, 2.39711 virtual postgres, PostgreSQL C/R:Raw
$ ../run/relbench -v omp-1-u64-vm-sse41.log omp-4-u64-vm-sse41.log
Ratio: 0.53856 real, 0.13893 virtual CRC32:Only one salt
Ratio: 1.01620 real, 0.25885 virtual Citrix_NS10, Netscaler 10:Only one salt
Ratio: 0.70132 real, 0.25495 virtual Fortigate, FortiOS:Many salts
Ratio: 0.44795 real, 0.16514 virtual Fortigate, FortiOS:Only one salt
Ratio: 0.65770 real, 0.24255 virtual HAVAL-128-4:Raw
Ratio: 0.58542 real, 0.21491 virtual HAVAL-256-3:Raw
Ratio: 0.74744 real, 0.27421 virtual HMAC-MD5:Many salts
Ratio: 0.63618 real, 0.24241 virtual HMAC-MD5:Only one salt
Ratio: 0.72531 real, 0.27220 virtual HMAC-SHA1:Many salts
Ratio: 0.56471 real, 0.21481 virtual HMAC-SHA1:Only one salt
Ratio: 0.44187 real, 0.11733 virtual LM:Raw
Ratio: 0.56230 real, 0.14567 virtual PST, custom CRC-32:Raw
Ratio: 0.58217 real, 0.14959 virtual Raw-MD4:Raw
Ratio: 0.72124 real, 0.18656 virtual Raw-MD5:Raw
Ratio: 1.05441 real, 0.27153 virtual Raw-SHA1:Raw
Ratio: 0.89824 real, 0.24002 virtual Raw-SHA256-ng:Raw
Ratio: 0.89948 real, 0.32948 virtual gost, GOST R 34.11-94:Raw
Ratio: 0.76972 real, 0.28384 virtual hdaa, HTTP Digest access authentication:Many salts
Ratio: 0.75825 real, 0.28126 virtual hdaa, HTTP Digest access authentication:Only one salt
Ratio: 1.00578 real, 0.26544 virtual mschapv2-naive, MSCHAPv2 C/R:Only one salt
Ratio: 1.00074 real, 0.29016 virtual nethalflm, HalfLM C/R:Only one salt
Ratio: 0.95518 real, 0.37010 virtual netlm, LM C/R:Only one salt
Ratio: 0.99599 real, 0.25135 virtual xsha, Mac OS X 10.4 - 10.6:Only one salt
../run/relbench -v omp-1-u64-vm-sse41.log omp-0-u64-vm-sse41.log
Ratio: 0.88462 real, 0.91026 virtual 7z, 7-Zip (512K iterations):Raw
Ratio: 0.76471 real, 0.81176 virtual Bitcoin:Raw
Ratio: 0.86384 real, 0.90940 virtual Blockchain, My Wallet (x10):Raw
Ratio: 0.88777 real, 0.92107 virtual CRC32:Many salts
Ratio: 0.88885 real, 0.90889 virtual CRC32:Only one salt
Ratio: 0.89803 real, 0.92203 virtual Clipperz, SRP:Raw
Ratio: 0.97037 real, 0.97778 virtual EncFS:Raw
Ratio: 0.98929 real, 0.98735 virtual Fortigate, FortiOS:Many salts
Ratio: 0.97571 real, 0.97976 virtual Fortigate, FortiOS:Only one salt
Ratio: 0.83792 real, 0.83980 virtual HAVAL-128-4:Raw
Ratio: 0.87919 real, 0.88427 virtual HAVAL-256-3:Raw
Ratio: 1.33038 real, 1.33026 virtual HMAC-MD5:Only one salt
Ratio: 1.14986 real, 1.15024 virtual HMAC-SHA1:Only one salt
Ratio: 1.16423 real, 1.16890 virtual HMAC-SHA512:Many salts
Ratio: 1.19083 real, 1.19560 virtual HMAC-SHA512:Only one salt
Ratio: 0.76804 real, 0.76651 virtual IKE, PSK:Raw
Ratio: 0.88045 real, 0.92872 virtual LM:Raw
Ratio: 0.89655 real, 0.89655 virtual LUKS:Raw
Ratio: 0.81342 real, 0.83502 virtual LastPass, sniffed sessions:Raw
Ratio: 0.86168 real, 0.85949 virtual Office, 2007/2010 (SHA-1) / 2013 (SHA-512), with AES:Raw
Ratio: 0.86985 real, 0.90815 virtual OpenVMS, Purdy:Raw
Ratio: 0.87207 real, 0.91026 virtual PBKDF2-HMAC-SHA1:Raw
Ratio: 0.72477 real, 0.74429 virtual PBKDF2-HMAC-SHA256, rounds=12000:Raw
Ratio: 0.84516 real, 0.85806 virtual PBKDF2-HMAC-SHA512, GRUB2 / OS X 10.8+:Raw
Ratio: 0.87224 real, 0.89544 virtual PFX, PKCS12 (.pfx, .p12):Raw
Ratio: 0.88536 real, 0.91468 virtual PKZIP:Only one salt
Ratio: 1.25522 real, 1.32128 virtual PST, custom CRC-32:Raw
Ratio: 0.85991 real, 0.87929 virtual Panama:Raw
Ratio: 1.27746 real, 1.35625 virtual PuTTY, Private Key:Raw
Ratio: 0.85978 real, 0.90130 virtual RACF:Only one salt
Ratio: 0.87630 real, 0.88879 virtual RAKP, IPMI 2.0 RAKP (RMCP+):Many salts
Ratio: 0.86953 real, 0.90583 virtual Raw-Keccak:Raw
Ratio: 0.85778 real, 0.92249 virtual Raw-Keccak-256:Raw
Ratio: 0.82277 real, 0.88470 virtual Raw-MD4:Raw
Ratio: 0.74434 real, 0.81807 virtual Raw-SHA224:Raw
Ratio: 0.75153 real, 0.78938 virtual Raw-SHA256:Raw
Ratio: 0.76071 real, 0.79268 virtual Raw-SHA384:Raw
Ratio: 0.86875 real, 0.93477 virtual Raw-SHA512:Raw
Ratio: 0.78627 real, 0.81560 virtual Raw-SHA1-ng, (pwlen <= 15):Raw
Ratio: 0.78804 real, 0.85467 virtual Raw-SHA256-ng:Raw
Ratio: 0.83193 real, 0.87252 virtual Raw-SHA512-ng:Raw
Ratio: 0.84257 real, 0.88728 virtual SIP:Many salts
Ratio: 0.84481 real, 0.86964 virtual SIP:Only one salt
Ratio: 0.66253 real, 0.67333 virtual SSH (one 2048-bit RSA and one 1024-bit DSA key):Raw
Ratio: 0.68603 real, 0.70730 virtual SSH-ng:Raw
Ratio: 0.86412 real, 0.90380 virtual SSHA512, LDAP:Many salts
Ratio: 0.83089 real, 0.87967 virtual SSHA512, LDAP:Only one salt
Ratio: 0.88188 real, 0.91276 virtual STRIP, Password Manager:Raw
Ratio: 0.89958 real, 0.93912 virtual Salted-SHA1:Many salts
Ratio: 0.89833 real, 0.93783 virtual Salted-SHA1:Only one salt
Ratio: 0.82300 real, 0.83471 virtual Siemens-S7:Many salts
Ratio: 0.84723 real, 0.86098 virtual Siemens-S7:Only one salt
Ratio: 0.78742 real, 0.82366 virtual Snefru-256:Raw
Ratio: 0.86935 real, 0.88899 virtual Sybase-PROP:Many salts
Ratio: 0.71526 real, 0.77302 virtual Tiger:Raw
Ratio: 0.86037 real, 0.88894 virtual WoWSRP, Battlenet:Raw
Ratio: 0.82293 real, 0.89155 virtual ZIP, WinZip:Raw
Ratio: 0.85227 real, 0.91822 virtual agilekeychain, 1Password Agile Keychain:Raw
Ratio: 0.79779 real, 0.81061 virtual aix-smd5, AIX LPA {smd5} (modified crypt-md5):Raw
Ratio: 0.84565 real, 0.92540 virtual aix-ssha1, AIX LPA {ssha1}:Raw
Ratio: 0.78509 real, 0.83887 virtual aix-ssha256, AIX LPA {ssha256}:Raw
Ratio: 0.79376 real, 0.85736 virtual aix-ssha512, AIX LPA {ssha512}:Raw
Ratio: 0.83851 real, 0.87771 virtual bcrypt ("$2a$05", 32 iterations):Raw
Ratio: 0.83942 real, 0.88749 virtual blackberry-es10:Raw
Ratio: 0.82927 real, 0.82927 virtual cloudkeychain, 1Password Cloud Keychain:Raw
Ratio: 0.47893 real, 0.49174 virtual cq, ClearQuest:Raw
Ratio: 0.88643 real, 0.93901 virtual crypt, generic crypt(3) DES:Many salts
Ratio: 0.85718 real, 0.92184 virtual crypt, generic crypt(3) DES:Only one salt
Ratio: 0.77505 real, 0.77550 virtual hsrp, "MD5 authentication" HSRP, VRRP, GLBP:Only one salt
Ratio: 0.89531 real, 0.89530 virtual krb5pa-md5, Kerberos 5 AS-REQ Pre-Auth etype 23:Many salts
Ratio: 0.88705 real, 0.88705 virtual krb5pa-md5, Kerberos 5 AS-REQ Pre-Auth etype 23:Only one salt
Ratio: 0.81316 real, 0.87826 virtual md5crypt, crypt(3) $1$:Raw
Ratio: 0.65473 real, 0.65602 virtual mscash, MS Cache Hash (DCC):Many salts
Ratio: 0.80075 real, 0.79918 virtual mscash, MS Cache Hash (DCC):Only one salt
Ratio: 0.87783 real, 0.87429 virtual mssql12, MS SQL 2012/2014:Many salts
Ratio: 0.89158 real, 0.89324 virtual mssql12, MS SQL 2012/2014:Only one salt
Ratio: 1.26422 real, 1.27191 virtual net-md5, "Keyed MD5" RIPv2, OSPF, BGP, SNMPv2:Many salts
Ratio: 1.19603 real, 1.19560 virtual net-md5, "Keyed MD5" RIPv2, OSPF, BGP, SNMPv2:Only one salt
Ratio: 1.30800 real, 1.31040 virtual net-sha1, "Keyed SHA1" BFD:Many salts
Ratio: 1.25947 real, 1.25447 virtual net-sha1, "Keyed SHA1" BFD:Only one salt
Ratio: 0.84865 real, 0.84696 virtual netntlmv2, NTLMv2 C/R:Only one salt
Ratio: 0.86876 real, 0.90692 virtual oldoffice, MS Office <= 2003:Only one salt
Ratio: 0.82383 real, 0.86002 virtual openssl-enc, OpenSSL "enc" encryption:Raw
Ratio: 0.82195 real, 0.86364 virtual rar, RAR3 (4 characters):Raw
Ratio: 0.72014 real, 0.74109 virtual ripemd-128, RIPEMD 128:Raw
Ratio: 0.74238 real, 0.74691 virtual ripemd-160, RIPEMD 160:Raw
Ratio: 0.71129 real, 0.72288 virtual rsvp, HMAC-MD5 / HMAC-SHA1, RSVP, IS-IS:Many salts
Ratio: 0.78651 real, 0.79767 virtual rsvp, HMAC-MD5 / HMAC-SHA1, RSVP, IS-IS:Only one salt
Ratio: 0.88927 real, 0.95034 virtual sapb, SAP CODVN B (BCODE):Many salts
Ratio: 0.88227 real, 0.94267 virtual sapb, SAP CODVN B (BCODE):Only one salt
Ratio: 0.88743 real, 0.91268 virtual sapg, SAP CODVN F/G (PASSCODE):Only one salt
Ratio: 0.80841 real, 0.84077 virtual skein-256, Skein 256:Raw
Ratio: 0.80699 real, 0.84430 virtual skein-512, Skein 512:Raw
Ratio: 0.88449 real, 0.90813 virtual sxc, StarOffice .sxc:Raw
Ratio: 0.84594 real, 0.87571 virtual sybasease, Sybase ASE:Many salts
Ratio: 0.84086 real, 0.88334 virtual sybasease, Sybase ASE:Only one salt
Ratio: 0.81169 real, 0.87379 virtual tc_ripemd160, TrueCrypt RIPEMD160 AES256_XTS:Raw
Ratio: 0.82677 real, 0.89764 virtual tc_sha512, TrueCrypt SHA512 AES256_XTS:Raw
Ratio: 0.82540 real, 0.86825 virtual tc_whirlpool, TrueCrypt WHIRLPOOL AES256_XTS:Raw
Ratio: 0.80706 real, 0.84420 virtual tcp-md5, TCP MD5 Signatures, BGP:Many salts
Ratio: 0.79632 real, 0.85431 virtual tcp-md5, TCP MD5 Signatures, BGP:Only one salt
Ratio: 0.87893 real, 0.88763 virtual vtp, "MD5 based authentication" VTP:Only one salt
Ratio: 0.73033 real, 0.74832 virtual wbb3, WoltLab BB3:Raw
Ratio: 0.80433 real, 0.85773 virtual whirlpool:Raw
Ratio: 0.79311 real, 0.84194 virtual whirlpool0:Raw
Ratio: 0.80481 real, 0.83660 virtual whirlpool1:Raw
Ratio: 0.86097 real, 0.87625 virtual wpapsk, WPA/WPA2 PSK:Raw
Ratio: 0.78116 real, 0.82060 virtual xsha, Mac OS X 10.4 - 10.6:Many salts
Ratio: 0.68642 real, 0.73651 virtual xsha, Mac OS X 10.4 - 10.6:Only one salt
Ratio: 0.75988 real, 0.81722 virtual xsha512, Mac OS X 10.7:Many salts
Ratio: 0.79821 real, 0.87032 virtual xsha512, Mac OS X 10.7:Only one salt
../run/relbench -v omp-4-u64-vm-sse41.log omp-0-u64-vm-sse41.log
Ratio: 1.65040 real, 6.54213 virtual CRC32:Only one salt
Ratio: 1.41061 real, 3.87267 virtual Fortigate, FortiOS:Many salts
Ratio: 2.17818 real, 5.93296 virtual Fortigate, FortiOS:Only one salt
Ratio: 1.27402 real, 3.46236 virtual HAVAL-128-4:Raw
Ratio: 1.50181 real, 4.11467 virtual HAVAL-256-3:Raw
Ratio: 1.49367 real, 4.07146 virtual HMAC-MD5:Many salts
Ratio: 2.09120 real, 5.48767 virtual HMAC-MD5:Only one salt
Ratio: 1.28490 real, 3.43064 virtual HMAC-SHA1:Many salts
Ratio: 2.03621 real, 5.35467 virtual HMAC-SHA1:Only one salt
Ratio: 1.99254 real, 7.91541 virtual LM:Raw
Ratio: 2.23229 real, 9.07040 virtual PST, custom CRC-32:Raw
Ratio: 1.41327 real, 5.91417 virtual Raw-MD4:Raw
Ratio: 1.23823 real, 5.10423 virtual Raw-MD5:Raw
Ratio: 1.08677 real, 2.96098 virtual gost, GOST R 34.11-94:Raw
Ratio: 1.31823 real, 3.57481 virtual hdaa, HTTP Digest access authentication:Many salts
Ratio: 1.35650 real, 3.66392 virtual hdaa, HTTP Digest access authentication:Only one salt
Ratio: 1.05858 real, 4.19895 virtual net-md5, "Keyed MD5" RIPv2, OSPF, BGP, SNMPv2:Only one salt
Ratio: 1.15457 real, 4.51186 virtual net-sha1, "Keyed SHA1" BFD:Only one salt
Original timings using PKCS5_PBKDF2_HMAC() in truecrypt:
$ OMP_NUM_THREADS=1 ../run/john -test=5 -form=tc_sha512
Warning: OpenMP is disabled; a non-OpenMP build may be faster
Benchmarking: tc_sha512, TrueCrypt SHA512 AES256_XTS [64/64]... DONE
Raw: 204 c/s real, 208 c/s virtual
$ OMP_NUM_THREADS=8 ../run/john -test=5 -form=tc_sha512
Will run 8 OpenMP threads
Benchmarking: tc_sha512, TrueCrypt SHA512 AES256_XTS [64/64]... (8xOMP) DONE
Raw: 31.6 c/s real, 28.8 c/s virtual
New timings using pbkdf2_sha512 (not to mention that we CAN do this with SIMD):
$ OMP_NUM_THREADS=1 ../run/john -test=5 -form=tc_sha512
Warning: OpenMP is disabled; a non-OpenMP build may be faster
Benchmarking: tc_sha512, TrueCrypt SHA512 AES256_XTS [64/64]... DONE
Raw: 352 c/s real, 358 c/s virtual
$ OMP_NUM_THREADS=8 ../run/john -test=5 -form=tc_sha512
Will run 8 OpenMP threads
Benchmarking: tc_sha512, TrueCrypt SHA512 AES256_XTS [64/64]... (8xOMP) DONE
Raw: 1342 c/s real, 187 c/s virtual
This is really a NO BRAINER!!!
I was also able to get ripemd160 working instantly in pass_gen.pl by passing a &ripemd160 parameter to pp_pbkdf2 (and 2000 iterations). It works like a champ. I have to build a pbkdf2_hmac_ripemd160 for this.
However, whirlpool did NOT give me the same results for the pbkdf2 in pass_gen.pl. I do not know why but it did not.
Now:
$ OMP_NUM_THREADS=1 ../run/john -test=5 -form=tc_sha512
Warning: OpenMP is disabled; a non-OpenMP build may be faster
Benchmarking: tc_sha512, TrueCrypt SHA512 AES256_XTS [128/128 SSE4.1 2x]... DONE
Raw: 641 c/s real, 647 c/s virtual
$ OMP_NUM_THREADS=8 ../run/john -test=5 -form=tc_sha512
Will run 8 OpenMP threads
Benchmarking: tc_sha512, TrueCrypt SHA512 AES256_XTS [128/128 SSE4.1 2x]... (8xOMP) DONE
Raw: 2439 c/s real, 343 c/s virtual
I have it ready to go. ONLY sha512 is overridden (for now). I do not have a pbkdf2 done for ripemd160 (yet). It should not be hard however, and THAT hash is very slow on my machine, so there may be a LOT to gain.
5c36dfb
Good stuff!
I now have pbkdf2-hmac-whirlpool.h and pbkdf2-hmac-ripemd160.h in the JtR code, included and used in truecrypt_fmt_plug.c. Whirlpool only got about a 60% speedup, but ripemd got about a 3x improvement.
oSSL also contains RIPEMD160. I added code to configure to autodetect it. There is now a HAVE_RIPEMD160 in autoconfig.h. I also put code in to use this within pbkdf2_hmac_ripemd160.h (used by truecrypt_fmt). I found out that on my 64-bit cygwin system it is about 10% slower than the sph code. It is almost certain that oSSL is using the same reference-level RipeMD that we are, but ours is done 'properly' for a 64-bit build (such as accumulating the bit count in a single var). oSSL may be doing it using 2 vars or something (which is slower). Well, whatever it is, I have commented it out for now (with an #if HAVE_RIPEMD160 && 0). I might have a look on other systems. I bet that oSSL on 32 bit will be faster than sph, but I could be wrong in that assumption also.
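To make the "bit count accumulation in a single var" guess concrete, here is a minimal sketch of the two bookkeeping styles. The struct and function names are hypothetical and are not copied from oSSL or sph; whether oSSL really does it the second way is only the guess stated above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical contexts, for illustration only. */
typedef struct { uint64_t nbits; } count64_ctx;     /* one 64-bit counter  */
typedef struct { uint32_t lo, hi; } count32x2_ctx;  /* two 32-bit halves   */

/* 64-bit style: a single add. */
static void count64_add(count64_ctx *c, uint32_t nbytes)
{
    c->nbits += (uint64_t)nbytes << 3;
}

/* 32-bit style: add the low half, propagate the carry by hand. */
static void count32x2_add(count32x2_ctx *c, uint32_t nbytes)
{
    uint32_t old = c->lo;

    c->lo += nbytes << 3;      /* low 32 bits of nbytes * 8           */
    if (c->lo < old)
        c->hi++;               /* carry out of the low half           */
    c->hi += nbytes >> 29;     /* bits that the shift pushed past 32  */
}

/* Recombine the two halves so both styles can be compared. */
static uint64_t count32x2_value(const count32x2_ctx *c)
{
    return ((uint64_t)c->hi << 32) | c->lo;
}
```

Both keep the same count; the second just needs an extra compare and add per update, which is the kind of small per-block overhead the guess is about.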
Here is a breakdown of oSSL and sph_* for ripemd160 and whirlpool for 32 bit and 64 bit systems I have. I think this shows that running oSSL for whirlpool (if supported) is best, or almost even, and that running sph_ripemd160 is best or almost even, for all systems.
OS | algo | oSSL | sph_xx |
---|---|---|---|
Cygwin32 | ripemd160 | 135 | 153** |
Cygwin64 | ripemd160 | 152 | 150 |
Ubuntu32 | ripemd160 | 91.6 | 91.6 |
Ubuntu64 | ripemd160 | 142 | 163** |
Cygwin32 | whirlpool | 229*** | 63.5 |
Cygwin64 | whirlpool | 230 | 227 |
Ubuntu32 | whirlpool | 133** | 34.8 |
Ubuntu64 | whirlpool | 190** | 145 |
Are you running Ubuntu inside a VM, on different hardware, or why are the numbers so poor (compared to cygwin)? Can you post the patches needed to reproduce this? I'd like to try it on 32-bit and 64-bit Fedora.
In a VM. That 32-bit VM only has 2 cores. All other tests had 4 (real CPU is DUAL-HT).
Btw, here are the new speeds after the latest commit of this format, cygwin64 speeds only, but on same hardware as the above table
$ ../run/john -test=5 -format=tc_*
Will run 4 OpenMP threads
Benchmarking: tc_aes_xts, TrueCrypt (RIPEMD160/SHA512/WHIRLPOOL) AES256_XTS [128/128 XOP 2x]... (4xOMP) DONE
Raw: 2361 c/s real, 1223 c/s virtual
Benchmarking: tc_ripemd160, TrueCrypt RIPEMD160 AES256_XTS [32/64]... (4xOMP) DONE
Raw: 469 c/s real, 124 c/s virtual
Benchmarking: tc_sha512, TrueCrypt SHA512 AES256_XTS [128/128 XOP 2x]... (4xOMP) DONE
Raw: 3718 c/s real, 994 c/s virtual
Benchmarking: tc_whirlpool, TrueCrypt WHIRLPOOL AES256_XTS [64/64]... (4xOMP) DONE
Raw: 709 c/s real, 189 c/s virtual
All 4 formats passed self-tests!
That's beautiful for the relbench figures :) TC did exist in J7, didn't it? Heck it was so long ago I do not really know.
I am pretty sure TC has been there a bit. It is 10 to 20x faster now. (or on my cygwin omp build, about 4000x faster for sha512, LOL).
Now I gotta get krb5-18 done. That one should also get a huge improvement if we can get past the libkrb5 stuff.
That actually is not a bad idea. Identifying the hashes that use slow high level libs, but where the hash was present in J7, 'fixing' them to be faster native, or even SIMD, and fudging the relbench numbers :)
Btw, I have the TC_* hashes in pass_gen.pl now, so I can also add them to jtrts.pl. I have added them to the 'add missing hashes' issue on jtrts.
krb5-18 is now done. This was the last super bad one on cygwin (where OMP was SLOWER by far, than OMP=1)
Benchmark:
$ ../run/john-no-omp -test -form=cpu | tee no-omp.txt
$ OMP_NUM_THREADS=1 ../run/john-omp -test -form=cpu | tee omp1.txt
$ OMP_NUM_THREADS=4 ../run/john-omp -test -form=cpu | tee omp4.txt
First I thought the below said phpass should run larger batches in non-omp builds:
$ ../run/relbench -v no-omp.txt omp1.txt | grep ^Ratio | sort -k2,2nr | head
Ratio: 6.16163 real, 6.10060 virtual phpass ($P$9):Raw
Ratio: 6.01463 real, 6.13602 virtual dynamic_17:Raw
Ratio: 1.77308 real, 1.73814 virtual cq, ClearQuest:Raw
Ratio: 1.26606 real, 1.24109 virtual xsha, Mac OS X 10.4 - 10.6:Only one salt
Ratio: 1.24405 real, 1.23173 virtual Raw-MD5:Raw
Ratio: 1.22452 real, 1.21230 virtual nt2, NT:Raw
Ratio: 1.21681 real, 1.20477 virtual Raw-SHA1:Raw
Ratio: 1.20770 real, 1.21976 virtual Raw-MD4:Raw
Ratio: 1.17541 real, 1.17541 virtual Citrix_NS10, Netscaler 10:Only one salt
Ratio: 1.16873 real, 1.14574 virtual HMAC-SHA1:Many salts
Then I saw this
$ grep -A1 phpass no-omp.txt omp1.txt
no-omp.txt:Benchmarking: dynamic_17 [phpass ($P$ or $H$) 32/64 1x2 (MD5_body)]... DONE
no-omp.txt-Raw: 3690 c/s real, 3617 c/s virtual
--
no-omp.txt:Benchmarking: phpass ($P$9) [phpass ($P$ or $H$) 32/64 1x2 (MD5_body)]... DONE
no-omp.txt-Raw: 3638 c/s real, 3638 c/s virtual
--
omp1.txt:Benchmarking: dynamic_17 [phpass ($P$ or $H$) 128/128 AVX 4x4x3]... DONE
omp1.txt-Raw: 22194 c/s real, 22194 c/s virtual
--
omp1.txt:Benchmarking: phpass ($P$9) [phpass ($P$ or $H$) 128/128 AVX 4x4x3]... DONE
omp1.txt-Raw: 22416 c/s real, 22194 c/s virtual
Why is a non-omp build using MD5_body!? It's the same for dynamic_17.
Very poor figures below. Ideally it should be 4.0
$ grep -A2 4xOMP omp4.txt | sed 's/^--$//' > only_omp.txt
$ ../run/relbench -v no-omp.txt only_omp.txt | grep ^Ratio | sort -k2,2n | head
Ratio: 1.00828 real, 0.99830 virtual crypt, generic crypt(3) DES:Many salts
Ratio: 1.02357 real, 0.77956 virtual PST, custom CRC-32:Raw
Ratio: 1.02592 real, 1.03618 virtual crypt, generic crypt(3) DES:Only one salt
Ratio: 1.03944 real, 0.46869 virtual EPiServer:Many salts
Ratio: 1.04432 real, 0.50199 virtual EPiServer:Only one salt
Ratio: 1.04622 real, 0.46495 virtual chap, iSCSI CHAP authentication:Raw
Ratio: 1.04902 real, 0.79894 virtual dynamic_2006:Only one salt
Ratio: 1.08379 real, 0.78540 virtual dynamic_2009:Only one salt
Ratio: 1.11291 real, 0.76219 virtual dynamic_2014:Only one salt
Ratio: 1.12326 real, 0.81991 virtual dynamic_1003:Raw
The list is a lot longer and doesn't reach 2.0 (50% efficiency) until line 90 out of 511.
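For reference, the "ideally 4.0" and "50% efficiency" remarks come down to dividing the relbench ratio by the thread count. A minimal sketch, with function names that are mine and not anything in the tree:

```c
#include <assert.h>

/* Speedup of one run over another, from the "real" c/s figures.
 * relbench prints second-file over first-file, so pass them that way. */
static double omp_speedup(double cps_second, double cps_first)
{
    assert(cps_first > 0.0);
    return cps_second / cps_first;
}

/* Fraction of ideal linear scaling reached with `threads` threads. */
static double omp_efficiency(double speedup, int threads)
{
    assert(threads > 0);
    return speedup / (double)threads;
}
```

With the numbers in this thread, `omp_speedup(22416, 3638)` reproduces the 6.16 phpass ratio above, and `omp_efficiency(2.0, 4)` is the 0.5 cut-off used for the list. The first line of the sorted output (1.00828) works out to roughly 25% efficiency on 4 threads.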
Phpass should be fixed. Yes, this was a bug, with #defines mixed up. The #define block was a little complex (like 4 different things going on, between OMP, SIMD, BE and MD5_X2). Thank god for syntax highlighting and smart #ifdef highlighting.
Fixed the few worst. cq got some awesome speedup even without OMP: 50337K -> 90725K and OMPx4 60633K -> 258998K.
I bet we have 30 or so formats that could benefit a lot from OMP_SCALE tuning, out of at least twice that many that need testing.
Made good progress today. Still some to go, since my auto code only works for formats that do a max *= (scale*omp_t), and many formats do not do that. I do have a list of other formats that 'look' like we could do some better work.
Some of these may not benefit. Some may not have OMP at all. These are just ones that mostly appear like they may have benefits we can give them, OR when trying to kick up OMP_SCALE, it kept a fixed size. I just did IPB2 to get an idea of how best to attack some of these formats that do not allow the external upward adjustment of OMP_SCALE env var.
made into a thin format, dyna_2004 2-3x improvement + OMP
not enough gain to matter
not enough gain to matter
not enough gain to matter
not enough gain to matter
not enough gain to matter
not enough gain to matter
not enough gain to matter
not enough gain to matter
not enough gain to matter
not enough gain to matter
not enough gain to matter
not enough gain to matter
not enough gain to matter
not enough gain to matter
To make that a checklist you need to start each line with a dash
- [ ] like this
md4gen is an old thing made by Solar (but not in core iirc). I used to ignore it because Dynamic obsoletes it. Actually we could drop it IMHO.
mssql and mssql05 do not support OMP. They probably should.
$ OMP_SCALE=2 OMP_NUM_THREADS=4 ../run/john -test -form:mssql
Benchmarking: mssql, MS SQL [SHA1 128/128 AVX 8x]... DONE
Many salts: 20360K c/s real, 20564K c/s virtual
Only one salt: 10540K c/s real, 10540K c/s virtual
Your script could detect that we did not get the (4xOMP), so no need to bother.
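A check like that could be as small as looking for the marker in the "Benchmarking:" line. This is only a sketch with a made-up helper name; it assumes the marker always has the `(NxOMP)` shape seen in the logs in this thread.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if a "Benchmarking: ..." line carries an "(<n>xOMP)" marker,
 * 0 otherwise. Hypothetical helper, not part of john or relbench. */
static int has_omp_marker(const char *line)
{
    assert(line != NULL);
    return strstr(line, "xOMP)") != NULL;
}
```

The mssql line above has no marker, so it would be skipped, while the tc_sha512 lines earlier in the thread would be kept.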
NT does not and can not support OMP.
I have pulled md4-gen and NT. I probably should also pull sha1-gen. It is also somewhat redundant.
Dyna formats may need to have scale looked at, BUT they should ALL get updated with a single change. Right now, they are 6144 (I think) in scale (OMP-4). It may be 6144 fixed no matter what the OMP level. But I think it should be increased (some), or possibly recomputed for each type.
578fe51 fixed a whole lot of formats with this one-liner[tm]
$ git grep -El "keys_per_crypt = omp_t \* M(IN|AX)_KEYS_PER_CRYPT;" | xargs sed -ri 's/keys_per_crypt = omp_t \* M(IN|AX)_KEYS_PER_CRYPT;/keys_per_crypt *= omp_t;/'
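The difference between the two patterns in that one-liner, as a minimal sketch. The struct and the default value are stand-ins, and the sketch assumes some earlier step (for instance the auto-scaling code mentioned above) has already changed the field before init runs; how john actually does that is not shown in this thread.

```c
#include <assert.h>

#define MAX_KEYS_PER_CRYPT 64   /* hypothetical compile-time default */

/* Stand-in for the relevant part of a format's parameters. */
struct fmt_params_sketch { int max_keys_per_crypt; };

/* Old pattern: overwrites the field, discarding any earlier adjustment. */
static void init_overwrite(struct fmt_params_sketch *p, int omp_t)
{
    assert(omp_t > 0);
    p->max_keys_per_crypt = omp_t * MAX_KEYS_PER_CRYPT;
}

/* New pattern: scales whatever value is already there. */
static void init_scale(struct fmt_params_sketch *p, int omp_t)
{
    assert(omp_t > 0);
    p->max_keys_per_crypt *= omp_t;
}
```

If the field was doubled before init, the first version ends up at 4 * 64 = 256 with 4 threads and the doubling is lost, while the second ends up at 512.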
So we can use the OMP_SCALE environment variable now?
I would guess so but haven't bothered. Go ahead if you like.
Note to self: md5crypt is odd - I tried unifying it but reverted. And scrypt does not benefit at all from scaling - it's just too slow.
I doubt scrypt or bcrypt for that matter will have any benefit. Super slow ones just don't matter, and are better with scale 1
I did some work on dyna. I was able to get things to have more or less values in OMP. Now comes the tough part. There are some formats that get slower (some much slower), when you increase count, and some that get faster. So, I may have to make some changes. For these changes I will need to:
But all in all, I think I can get 150% for many formats (the faster ones), might get 200% for some that are over scaled today, and get 110-120% for most others. Some may have no change at all.
For example, here were some timings I saw
8x-OMP on my Core-i7 quad HT
raw-md5
68000k
dyna_0
44000k (6144 scale)
54000k (6144*2 scale)
56000k (6144*3 scale)
56000k (6144*4 scale) (this one fluctuated quite a bit)
So you can see for this format, we got about half of the loss back. There is no way we can get it all, there simply is WAY too much overhead in dyna that I simply can not eliminate.
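The "about half of the loss back" figure can be checked with a one-line helper (the name is mine): the gap to raw-md5 is 68000k - 44000k, and the best dyna_0 run closes 56000k - 44000k of it.

```c
#include <assert.h>

/* Fraction of the gap between `before` and `reference` that `after` closes. */
static double recovered_fraction(double reference, double before, double after)
{
    assert(reference > before);
    return (after - before) / (reference - before);
}
```

`recovered_fraction(68000, 44000, 56000)` is 12000 / 24000 = 0.5, and the 6144*2 run (54000k) is already at about 0.42, so most of the gain comes from the first doubling.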
But I do think I can get dyna working quite a bit quicker for many of the formats. It may not be a trivial undertaking, but I really think it needs done. At the same time, I really would love to simplify this somewhat, but I may not be able to do that. Being able to handle SIMD (of multiple flavors), oSSL, md5_go, md5_body and md5_body-x2 is NOT trivial, especially allowing switching in and out of SIMD/flat. The code prior to md5_body-x2 was quite a bit more simplistic, BUT that md5_body_x2 does provide a significant improvement in performance, so I think coding for it, even though it adds 100's of extra lines of #ifdef code, does make the format faster.
Is Dynamic all-or-nothing OMP? I usually build with --disable-openmp-for-fast-formats. I use --fork for the fast ones.
I believe it is all or nothing. I may have to address the disable-omp-for-fast-formats switch.
Ok, here is a 'checking' program (to test BE and LE GETPOS macros).
#include <stdio.h>
#include <string.h>
// GETPOS test.
// This will work with generating a proper get-pos for SIMD_COEF_32 of
// 2, 4, 6, 16 limbs, for LE or BE.
void dump4(int, unsigned char *);
// I did this without using #defines, to allow easier debugging in MSVC.
// These will be defines in 'real' code
int SIMD_COEF_32=2;
int SIMD_SHIFT;
int SHA_BUF_SIZ=2;
int MD5_BUF_SIZ=2;
#define GETPOS_BE32(i, idx) ((idx & (SIMD_COEF_32-1)) * 4 + ((i) & (0xffffffff - 3)) * SIMD_COEF_32 + (((i) & 3) ^ 3) + (idx >> SIMD_SHIFT) * SHA_BUF_SIZ * SIMD_COEF_32 * 4)
#define GETPOS_LE32(i, idx) ((idx & (SIMD_COEF_32-1)) * 4 + ((i) & (0xffffffff - 3)) * SIMD_COEF_32 + ((i) & 3) + (idx >> SIMD_SHIFT) * MD5_BUF_SIZ * SIMD_COEF_32 * 4)
int main() {
    unsigned char Buf[512];
    int i, idx;

    for (SIMD_COEF_32 = 2; SIMD_COEF_32 <= 16; SIMD_COEF_32 <<= 1) {
        // SIMD_SHIFT is log2(SIMD_COEF_32)
        if (SIMD_COEF_32==16) SIMD_SHIFT=4;
        else if (SIMD_COEF_32==8) SIMD_SHIFT=3;
        else if (SIMD_COEF_32==4) SIMD_SHIFT=2;
        else if (SIMD_COEF_32==2) SIMD_SHIFT=1;

        // BE layout: write 4 tagged bytes for each of 2*SIMD_COEF_32 keys
        memset(Buf, 0, sizeof(Buf));
        for (idx = 0; idx < SIMD_COEF_32*2; ++idx) {
            for (i = 0; i < 4; ++i) {
                Buf[GETPOS_BE32(i,idx)]= (unsigned char)((i*16)+idx);
            }
        }
        dump4(1, Buf);

        // LE layout: same data, written through the LE macro
        memset(Buf, 0, sizeof(Buf));
        for (idx = 0; idx < SIMD_COEF_32*2; ++idx) {
            for (i = 0; i < 4; ++i) {
                Buf[GETPOS_LE32(i,idx)]= (unsigned char)((i*16)+idx);
            }
        }
        dump4(0, Buf);
        printf("\n");
    }
}
/* Dumps 4 limbs. Does BE to LE conversion if in BE format. */
void dump4(int isBE, unsigned char *buf) {
    int i;

    printf ("%s_%d: ", isBE?"BE":"LE", SIMD_COEF_32);
    for (i = 0; i < SIMD_COEF_32*4*4; i += 4) {
        if (buf[i] == 0 && buf[i+1] == 0 && buf[i+2] == 0 && buf[i+3] == 0)
            printf ("0 ");
        else {
            if (isBE)
                printf ("%02x%02x%02x%02x ", buf[i+3], buf[i+2],buf[i+1],buf[i]);
            else
                printf ("%02x%02x%02x%02x ", buf[i], buf[i+1],buf[i+2],buf[i+3]);
        }
    }
    printf ("\n");
}
Here are some results. @magnum, can you please validate that the results for 8 and 16 coef are correct. I 'think' they are, but I am not 100% sure.
BE_2: 00102030 01112131 0 0 02122232 03132333 0 0
LE_2: 00102030 01112131 0 0 02122232 03132333 0 0
BE_4: 00102030 01112131 02122232 03132333 0 0 0 0 04142434 05152535 06162636 07172737 0 0 0 0
LE_4: 00102030 01112131 02122232 03132333 0 0 0 0 04142434 05152535 06162636 07172737 0 0 0 0
BE_8: 00102030 01112131 02122232 03132333 04142434 05152535 06162636 07172737 0 0 0 0 0 0 0 0 08182838 09192939 0a1a2a3a 0b1b2b3b 0c1c2c3c 0d1d2d3d 0e1e2e3e 0f1f2f3f 0 0 0 0 0 0 0 0
LE_8: 00102030 01112131 02122232 03132333 04142434 05152535 06162636 07172737 0 0 0 0 0 0 0 0 08182838 09192939 0a1a2a3a 0b1b2b3b 0c1c2c3c 0d1d2d3d 0e1e2e3e 0f1f2f3f 0 0 0 0 0 0 0 0
BE_16: 00102030 01112131 02122232 03132333 04142434 05152535 06162636 07172737 08182838 09192939 0a1a2a3a 0b1b2b3b 0c1c2c3c 0d1d2d3d 0e1e2e3e 0f1f2f3f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10203040 11213141 12223242 13233343 14243444 15253545 16263646 17273747 18283848 19293949 1a2a3a4a 1b2b3b4b 1c2c3c4c 1d2d3d4d 1e2e3e4e 1f2f3f4f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
LE_16: 00102030 01112131 02122232 03132333 04142434 05152535 06162636 07172737 08182838 09192939 0a1a2a3a 0b1b2b3b 0c1c2c3c 0d1d2d3d 0e1e2e3e 0f1f2f3f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10203040 11213141 12223242 13233343 14243444 15253545 16263646 17273747 18283848 19293949 1a2a3a4a 1b2b3b4b 1c2c3c4c 1d2d3d4d 1e2e3e4e 1f2f3f4f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
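Eyeballing dumps like these only goes so far, so here is a rough sketch of a mechanical cross-check: write every byte of every key through the same position formula and count how many writes land on a byte that was already used. The names `getpos32`, `count_collisions` and `buf_siz` are made up for this sketch (they are not in the tree), and `buf_siz` just stands in for MD5_BUF_SIZ / SHA_BUF_SIZ. It only proves the keys do not overlap; it does not prove the layout matches what the SIMD code expects.

```c
#include <string.h>

/* Same arithmetic as GETPOS_BE32 / GETPOS_LE32 in the test program,
 * written as a function so coef and shift can be varied at run time.
 * All names here are hypothetical, for illustration only. */
static unsigned getpos32(unsigned i, unsigned idx, unsigned coef,
                         unsigned shift, unsigned buf_siz, int is_be)
{
    unsigned byte = is_be ? ((i & 3) ^ 3) : (i & 3);

    return (idx & (coef - 1)) * 4 + (i & (0xffffffff - 3)) * coef +
           byte + (idx >> shift) * buf_siz * coef * 4;
}

/* Writes buf_siz*4 bytes for each of 2*coef keys and returns how many
 * writes hit an already used (or out of range) position. */
static int count_collisions(unsigned coef, unsigned shift,
                            unsigned buf_siz, int is_be)
{
    unsigned char seen[512];
    unsigned i, idx;
    int collisions = 0;

    memset(seen, 0, sizeof(seen));
    for (idx = 0; idx < coef * 2; ++idx) {
        for (i = 0; i < buf_siz * 4; ++i) {
            unsigned pos = getpos32(i, idx, coef, shift, buf_siz, is_be);

            if (pos >= sizeof(seen) || seen[pos]++)
                ++collisions;
        }
    }
    return collisions;
}
```

Calling `count_collisions(coef, shift, 2, is_be)` with shift 1, 2, 3, 4 for coef 2, 4, 8, 16 should report no collisions in either byte order. A shift that is too large (say 8 for 16 lanes) makes the second group of keys land on top of the first, and the count goes non-zero.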
Here were the 'originals'
#define GETPOS(i, idx) ((idx & (SIMD_COEF_32 - 1)) * 4 + \
((i) & (0xffffffff - 3)) * SIMD_COEF_32 + \
(((i) & 3) ^ 3) + (idx >> (SIMD_COEF_32 >> 1)) * \
SHA_BUF_SIZ * SIMD_COEF_32 * 4)
#define GETPOS(i, idx) ((idx & (SIMD_COEF_32 - 1)) * 4 + \
((i) & (0xffffffff - 3)) * SIMD_COEF_32 + \
((i) & 3) + (idx >> (SIMD_COEF_32 >> 1)) * \
MD5_BUF_SIZ * SIMD_COEF_32 * 4)
Here are my modifications
#define GETPOS_BE32(i,idx) ((idx & (SIMD_COEF_32-1)) * 4 + \
((i) & (0xffffffff - 3)) * SIMD_COEF_32 + \
(((i) & 3) ^ 3) + (idx >> SIMD_SHIFT) * \
SHA_BUF_SIZ * SIMD_COEF_32 * 4)
#define GETPOS_LE32(i,idx) ((idx & (SIMD_COEF_32-1)) * 4 + \
((i) & (0xffffffff - 3)) * SIMD_COEF_32 + \
((i) & 3) + (idx >> SIMD_SHIFT) * \
MD5_BUF_SIZ * SIMD_COEF_32 * 4)
The only thing that was changed was `(SIMD_COEF_32>>1)` becoming `SIMD_SHIFT`, and SIMD_SHIFT will probably have to be SIMD32_SHIFT, since we may have the same thing for SIMD64.
I think they are OK, but I have been confused before :-)
The only thing that was changed was `(SIMD_COEF_32>>1)` becoming `SIMD_SHIFT`
This should be correct.
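For anyone wondering why the shift had to change at all: `SIMD_COEF_32 >> 1` only happens to equal log2(SIMD_COEF_32) for 2 and 4. A small sketch that puts the two side by side (the function names `old_shift`, `log2_shift` and `compare_shifts` are invented for this illustration):

```c
#include <assert.h>
#include <stdio.h>

/* Shift used by the original GETPOS macros. */
static unsigned old_shift(unsigned coef)
{
    return coef >> 1;
}

/* Shift that is actually wanted: log2(coef), for power-of-two coef. */
static unsigned log2_shift(unsigned coef)
{
    unsigned shift = 0;

    assert(coef != 0 && (coef & (coef - 1)) == 0);
    while (coef > 1) {
        coef >>= 1;
        ++shift;
    }
    return shift;
}

/* Prints both shifts for coef = 2, 4, 8, 16 and returns how many differ. */
static int compare_shifts(void)
{
    unsigned coef;
    int mismatches = 0;

    for (coef = 2; coef <= 16; coef <<= 1) {
        unsigned a = old_shift(coef), b = log2_shift(coef);

        printf("SIMD_COEF_32=%2u  old shift=%u  log2 shift=%u%s\n",
               coef, a, b, a == b ? "" : "  <-- differs");
        if (a != b)
            ++mismatches;
    }
    return mismatches;
}
```

The two agree for 2 and 4 (shifts 1 and 2) and part ways at 8 (4 vs 3) and 16 (8 vs 4), which is why the old macro was fine for SSE-sized vectors but not for anything wider.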
I am pretty sure dyna will be a real bitch when it comes to COEF > 4
I think for GSOC, we should probably just have them first create the SIMD code, and then simply create as small an update as possible for the raw format. So they should update:
rawMD4_fmt_plug.c rawMD5_fmt_plug.c rawSHA1_fmt_plug.c rawSHA256_fmt_plug.c rawSHA512_fmt_plug.c
at first. Get them solid, with as 'minimal' a change as possible, so that COEF==4|8|16 all work. Once we get to that point, we can start to move out. I think the pbkdf2_*.h files would probably be the 2nd thing(s) to do, as they are somewhat simple to change and impact a bunch of formats. Then start doing one-offs on the other 70 or so SIMD formats, and also start on dyna.
Yes, getting some of these things done up front, like we are doing in this thread, is good. I really think we could get the raw formats 'ready', just waiting for the code. But on a lot of it, we simply will have to put out fires.
Here is an example (cygwin64, OMP build)
I have fixed one of these: https://github.com/magnumripper/JohnTheRipper/commit/38aa13b3d9d1999be70973117d2f1a893190aaf5 But we need to identify them, and at least work around them (like this). It would be nice to know WHY, and once we know why, to find some way to auto-detect it.
It is likely (for the tc_* hashes on cygwin) that the PKCS5_PBKDF2_HMAC() oSSL function was made thread safe by simply putting a mutex around the call, OR worse, that they put a mutex around some internal part of the call (such as the HMAC). The 2nd is more likely. So what happens is 8 threads start into the PKCS5_PBKDF2_HMAC() call, then all but one block, and that one does an HMAC. That thread is then stopped, and the next thread gets its chance to go. HORRIBLE performance!!
That is just a theory, but from dealing with poorly coded MT code before, it is what it looks like it is doing.
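To make the theory concrete, here is a toy sketch of what "thread safe by one big mutex" looks like from the caller's side. This is NOT OpenSSL code and says nothing about what the cygwin build really does; `fake_hmac`, `big_lock` and `run_workers` are invented stand-ins. Eight threads hammer a function that holds a global lock for its whole body, and we record how many threads were ever inside it at the same time. Build with -pthread.

```c
#include <pthread.h>

#define NUM_THREADS      8
#define CALLS_PER_THREAD 1000

/* Hypothetical library-wide lock; everything below it is guarded by it. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static int in_flight;      /* threads currently inside fake_hmac() */
static int max_in_flight;  /* highest value in_flight ever reached */
static long total_calls;   /* number of completed fake_hmac() calls */

/* Stand-in for a library call made "thread safe" with one global mutex. */
static void fake_hmac(void)
{
    pthread_mutex_lock(&big_lock);
    if (++in_flight > max_in_flight)
        max_in_flight = in_flight;
    ++total_calls;          /* the real work would happen here */
    --in_flight;
    pthread_mutex_unlock(&big_lock);
}

static void *worker(void *arg)
{
    int n;

    (void)arg;
    for (n = 0; n < CALLS_PER_THREAD; ++n)
        fake_hmac();
    return NULL;
}

/* Starts the workers, waits for them, and returns the largest number
 * of threads that were ever inside fake_hmac() at once. */
static int run_workers(void)
{
    pthread_t tid[NUM_THREADS];
    int t, started = 0;

    in_flight = 0;
    max_in_flight = 0;
    total_calls = 0;
    for (t = 0; t < NUM_THREADS; ++t) {
        if (pthread_create(&tid[started], NULL, worker, NULL) != 0)
            break;
        ++started;
    }
    for (t = 0; t < started; ++t)
        pthread_join(tid[t], NULL);
    return max_in_flight;
}
```

`run_workers()` comes back as 1 no matter how many threads are started: the work is done strictly one thread at a time, and the other seven only add lock contention and context switches. If the real library behaves anything like this, an OMP build can easily end up slower than a non-OMP one.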
I will add some relbench runs where I cut out things that did not look quite right, and put them into a follow-up post.