openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

Allow very long duration for crypt_all() #1480

Open magnumripper opened 9 years ago

magnumripper commented 9 years ago

With OpenCL and our current format interface, very slow formats and/or weak devices may lead to situations where we "can't" run at optimal work size because the total duration of each crypt_all() call would be too long (even tens of minutes).

Here's an example: Office2013 running on an NVIDIA GT 650M:

Benchmarking: office2013-opencl, MS Office 2013 (100,000 iterations) [SHA512 OpenCL AES]... (8xOMP) Calculating best global worksize (GWS); max. 10s total for crypt_all()
Raw speed figures including buffer transfers:
xfer: 33.216us, xfer: 4.320us, init: 213.888us, loop: 1000x16.438ms,  final: 447.136us, xfer: 9.920us
gws:      1024          62 c/s     6200248 rounds/s   16.439s per crypt_all()!
xfer: 62.880us, xfer: 6.336us, init: 218.720us, loop: 1000x16.436ms,  final: 576.128us, xfer: 14.592us
gws:      2048         124 c/s    12400496 rounds/s   16.437s per crypt_all()+
xfer: 124.448us, xfer: 9.024us, init: 437.024us, loop: 1000x32.882ms,  final: 1.151ms, xfer: 24.320us
gws:      4096         124 c/s    12400496 rounds/s   32.885s per crypt_all() - too slow
Local worksize (LWS) 1024, global worksize (GWS) 2048

So we have a limit of 10 seconds on the total crypt_all() duration. But this format, on this device, already takes 16 seconds at a GWS of 1024, so we allow it anyway. Then we see that 2048 takes about as long even though it computes twice as many hashes, so obviously we allow that too. At 4096 the duration doubles with no speedup, so we give up and settle for 2048. The duration of a single kernel call, though, is only about 16 ms (the loop kernel is called a thousand times per crypt_all()).
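The selection behaviour described above can be sketched roughly as follows. This is a simplified rule inferred from the two logs in this issue, not the actual autotune code: the names `gws_sample` and `pick_gws` and the 1% improvement threshold are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One autotune measurement, as printed on a "gws:" line of the log. */
struct gws_sample {
    size_t gws;      /* global work size that was tested */
    double speed;    /* candidates per second (c/s) */
    double duration; /* seconds per crypt_all() call */
};

/*
 * Hypothetical selection rule (inferred from the logs, not John's code):
 * a sample becomes the new best when it is more than 1% faster than the
 * current best; the search stops at the first sample that exceeds the
 * duration limit without being faster.
 */
static size_t pick_gws(const struct gws_sample *s, size_t n,
                       double max_duration)
{
    assert(s != NULL && n > 0);

    size_t best_gws = 0;
    double best_speed = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (s[i].speed > 1.01 * best_speed) {
            /* Faster: accept it even if it is over the limit. */
            best_speed = s[i].speed;
            best_gws = s[i].gws;
        } else if (s[i].duration > max_duration) {
            break; /* over the limit with no gain: "too slow" */
        }
    }
    return best_gws;
}
```

Fed the three samples from the 10-second run, this rule settles on 2048; fed the samples from the 3600-second run, it settles on 65536, matching both logs.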

If we ditch that "max 10 seconds" rule, we get this:

Benchmarking: office2013-opencl, MS Office 2013 (100,000 iterations) [SHA512 OpenCL AES]... (8xOMP) Calculating best global worksize (GWS); max. 3600s total for crypt_all()
Raw speed figures including buffer transfers:
xfer: 33.408us, xfer: 4.736us, init: 222.592us, loop: 1000x17.200ms,  final: 458.592us, xfer: 9.056us
gws:      1024          59 c/s     5900236 rounds/s   17.202s per crypt_all()!
xfer: 62.976us, xfer: 6.784us, init: 230.176us, loop: 1000x17.200ms,  final: 579.808us, xfer: 14.240us
gws:      2048         119 c/s    11900476 rounds/s   17.202s per crypt_all()!
xfer: 123.968us, xfer: 9.088us, init: 458.144us, loop: 1000x34.411ms,  final: 1.158ms, xfer: 24.352us
gws:      4096         119 c/s    11900476 rounds/s   34.414s per crypt_all()
xfer: 243.296us, xfer: 14.208us, init: 909.152us, loop: 1000x65.768ms,  final: 2.288ms, xfer: 43.744us
gws:      8192         124 c/s    12400496 rounds/s   65.774s per crypt_all()+
xfer: 483.680us, xfer: 23.744us, init: 1.751ms, loop: 1000x130.414ms,  final: 3.280ms, xfer: 45.312us
gws:     16384         125 c/s    12500500 rounds/s  130.425s per crypt_all()
xfer: 512us, xfer: 25.184us, init: 3.127ms, loop: 1000x225.690ms,  final: 5.760ms, xfer: 85.408us
gws:     32768         145 c/s    14500580 rounds/s  225.708s per crypt_all()+
xfer: 1.045ms, xfer: 47.072us, init: 5.359ms, loop: 1000x394.534ms,  final: 10.818ms, xfer: 171.712us
gws:     65536         166 c/s    16600664 rounds/s  394.568s per crypt_all()+
xfer: 2.175ms, xfer: 90.976us, init: 9.896ms, loop: 1000x789.068ms,  final: 21.637ms, xfer: 332.992us
gws:    131072         166 c/s    16600664 rounds/s  789.134s per crypt_all()
xfer: 4.724ms, xfer: 199.776us, init: 19.780ms, loop: 1000x1.578s,  final: 43.275ms, xfer: 680.736us
gws:    262144         166 c/s    16600664 rounds/s 1578.267s per crypt_all()
xfer: 8.243ms, xfer: 344.800us, init: 39.577ms, loop: 1000x3.156s,  final: 86.542ms, xfer: 1.429ms
gws:    524288         166 c/s    16600664 rounds/s 3156.537s per crypt_all()
xfer: 16.466ms, xfer: 687.168us, init: 79.183ms, loop: 1000x6.312s,  final: 172.884ms, xfer: 2.867ms
gws:   1048576         166 c/s    16600664 rounds/s 6313.090s per crypt_all() - too slow
Local worksize (LWS) 1024, global worksize (GWS) 65536

We see here that if we could allow a duration of about 395 seconds(!) we'd get a performance boost of 33% (166 c/s versus 124 c/s) at a work size of 65536.

The problem is that we'd get extremely bad "response time" (over six minutes!) for things like pressing 'q' to quit. However, this kernel is obviously a split one: the longest single kernel duration is still below 400 ms. So maybe we can work something out.
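One way to picture "working something out" with a split kernel: if the host checks for a pending abort between loop-kernel calls, the response time is bounded by a single kernel call rather than by the whole crypt_all(). The sketch below is only an illustration under that assumption. `run_split_loop`, `kernel_fn`, `abort_fn` and the `demo_*` helpers are hypothetical names, the kernel call is a plain callback rather than real OpenCL, and a real format would also need to discard or resume the partially computed batch.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for enqueueing one loop-kernel call and waiting for it. */
typedef void (*kernel_fn)(void *ctx);
/* Stand-in for "has the user asked to quit?" (e.g. pressed 'q'). */
typedef bool (*abort_fn)(void *ctx);

/*
 * Run up to `iterations` kernel calls, polling for an abort request
 * before each one. Returns the number of calls actually completed.
 */
static int run_split_loop(int iterations, kernel_fn run, abort_fn aborted,
                          void *ctx)
{
    assert(iterations >= 0 && run && aborted);

    int done = 0;
    while (done < iterations) {
        if (aborted(ctx))
            break;  /* worst-case latency: one kernel call */
        run(ctx);
        done++;
    }
    return done;
}

/* Demo context: counts kernel calls and requests an abort after N calls. */
struct demo_ctx {
    int calls;
    int abort_after; /* negative means "never abort" */
};

static void demo_kernel(void *p)
{
    ((struct demo_ctx *)p)->calls++;
}

static bool demo_aborted(void *p)
{
    const struct demo_ctx *c = p;
    return c->abort_after >= 0 && c->calls >= c->abort_after;
}
```

With the figures above, polling once per loop-kernel call at GWS 65536 would mean a worst-case delay of roughly 400 ms instead of several minutes.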

magnumripper commented 9 years ago

I'm thinking that after the extended autotune (as above) we'd set the initial GWS to 2048, but keep the GWS of 65536 as a card up our sleeve. After a couple of minutes of running (with no keypresses etc.), it could gear up and bump GWS to 65536.
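A minimal sketch of that gearing idea, under stated assumptions: `gws_gear` and `choose_gws` are hypothetical names, the 120-second idle threshold is just one reading of "a couple of minutes", and dropping back to the low gear after a keypress is an assumption that goes beyond what is proposed above.

```c
#include <assert.h>
#include <stddef.h>

/* Two work sizes found by the extended autotune, plus an idle threshold. */
struct gws_gear {
    size_t low_gws;     /* responsive size, e.g. 2048 */
    size_t high_gws;    /* fastest size, e.g. 65536 */
    double idle_needed; /* seconds without user events before gearing up */
};

/*
 * Pick the work size for the next crypt_all() call. `now` and
 * `last_event` are timestamps in seconds; `last_event` is the session
 * start or the most recent keypress, whichever is later.
 */
static size_t choose_gws(const struct gws_gear *g, double now,
                         double last_event)
{
    assert(g != NULL && now >= last_event);
    return (now - last_event >= g->idle_needed) ? g->high_gws : g->low_gws;
}
```

For example, with `{2048, 65536, 120.0}` the session would run at 2048 for the first two minutes and at 65536 afterwards, until the next keypress resets `last_event`.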