openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

bcrypt-opencl autotuning #3673

Open solardiz opened 5 years ago

solardiz commented 5 years ago

Inspired by hashcat's tweet/commit:

https://twitter.com/hashcat/status/1107399818740203520 https://github.com/hashcat/hashcat/commit/5ecbcde94515a31113bc2d33a5188ffcbe2e7dcf

which does basically:

   fixed_local_size = (device_param->device_local_mem_size - 4) / 4096;
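As a minimal sketch of that one-liner: the 4096 is the per-instance local-memory cost of bcrypt's S-boxes and the 4 is the shared overhead discussed below; the device sizes in the assertions are typical values, assumed for illustration:

```c
/* Sketch of hashcat's LWS calculation: the largest number of bcrypt
   instances (4 KiB of local memory each, plus a shared 4-byte overhead)
   that fit in a device's local memory. */
static unsigned long max_lws_hashcat(unsigned long device_local_mem_size)
{
    return (device_local_mem_size - 4) / 4096;
}
```

With 48 KiB (NVIDIA) this yields 11, and with 64 KiB (Vega) it yields 15, matching the LWS values reported later in this thread.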

I took a look at what we do in JtR. Turns out, we do have similar LWS autotuning since commit efd44decdd75436fe7df0e01d9882e3589c0be2f back in 2013. However, instead of directly calculating the max LWS that fits, we halve the LWS until it fits:

        const int       lmem_per_th = ((1024 + 4) * sizeof(cl_uint) + 64);
[...]
        if ((get_device_type(gpu_id) != CL_DEVICE_TYPE_CPU) &&
            lmem_per_th < get_local_memory_size(gpu_id))
                while (local_work_size >
                       get_local_memory_size(gpu_id) / lmem_per_th)
                        local_work_size >>= 1;
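The difference between the two policies can be sketched as follows, assuming a starting LWS of 32 (illustrative, not necessarily JtR's actual initial value) and the lmem_per_th of 4176 bytes ((1024 + 4) * 4 + 64) from the code above:

```c
/* JtR's current policy: halve the LWS until its local-memory footprint fits. */
static unsigned long lws_by_halving(unsigned long lws,
                                    unsigned long local_mem,
                                    unsigned long lmem_per_th)
{
    while (lws > local_mem / lmem_per_th)
        lws >>= 1;
    return lws;
}

/* hashcat's policy: directly take the maximum LWS that fits. */
static unsigned long lws_direct(unsigned long local_mem,
                                unsigned long lmem_per_th)
{
    return local_mem / lmem_per_th;
}
```

On 48 KiB the direct calculation gives 11 and on 64 KiB it gives 15, whereas halving from 32 lands on 8 in both cases, leaving several instances' worth of local memory unused.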

I guess we could gain some performance by directly setting LWS to the maximum that fits, like hashcat does now. For example, on GTX 1080 we're now getting:

ptxas info    : Used 72 registers, 32772 bytes smem, 376 bytes cmem[0]

Notice that this is 32K + 4 bytes. On this device, we could probably go up to 48K-4K. (It's a pity that 4 bytes are somehow lost, just like in hashcat. Ideally, we'd also figure this out and avoid it, which would let both projects fit an extra instance of bcrypt in the last 4K.)

Also, we have a hardcoded maximum GWS of 4096, which isn't necessarily optimal. We should improve bcrypt-opencl's GWS autotuning as well.

Edit: see also: #2620 #3638.

solardiz commented 5 years ago

What we have now:

[solar@super run]$ ./john -test -form=bcrypt-opencl -v=5 -dev=1
initUnicode(UNICODE, ASCII/ASCII)
ASCII -> ASCII -> ASCII
Device 1: gfx900 [Radeon RX Vega]
Local worksize (LWS) 8, Global worksize (GWS) 4096
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... Loaded 13 hashes with 6 different salts to test db from test vectors
DONE
Speed for cost 1 (iteration count) of 32
Raw:    25600 c/s real, 1433K c/s virtual

[solar@super run]$ ./john -test -form=bcrypt-opencl -v=5 -dev=4
initUnicode(UNICODE, ASCII/ASCII)
ASCII -> ASCII -> ASCII
Device 4: GeForce GTX 1080
Options used: -I /home/solar/j/bleeding-jumbo-20190305/run/kernels -cl-mad-enable -DSM_MAJOR=6 -DSM_MINOR=1 -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=524306 -DSIZEOF_SIZE_T=8 -DDEV_VER_MAJOR=418 -DDEV_VER_MINOR=39 -D_OPENCL_COMPILER -DWORK_GROUP_SIZE=8 $JOHN/kernels/bf_kernel.cl
Local worksize (LWS) 8, Global worksize (GWS) 4096
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... Loaded 13 hashes with 6 different salts to test db from test vectors
DONE
Speed for cost 1 (iteration count) of 32
Raw:    6501 c/s real, 6553 c/s virtual, GPU util: 100%

[solar@super run]$ ./john -test -form=bcrypt-opencl -v=5 -dev=5
initUnicode(UNICODE, ASCII/ASCII)
ASCII -> ASCII -> ASCII
Device 5: GeForce GTX TITAN X
Options used: -I /home/solar/j/bleeding-jumbo-20190305/run/kernels -cl-mad-enable -DSM_MAJOR=5 -DSM_MINOR=2 -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=262162 -DSIZEOF_SIZE_T=8 -DDEV_VER_MAJOR=418 -DDEV_VER_MINOR=39 -D_OPENCL_COMPILER -DWORK_GROUP_SIZE=8 $JOHN/kernels/bf_kernel.cl
Local worksize (LWS) 8, Global worksize (GWS) 4096
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... Loaded 13 hashes with 6 different salts to test db from test vectors
DONE
Speed for cost 1 (iteration count) of 32
Raw:    4762 c/s real, 4762 c/s virtual, GPU util: 100%

[solar@super run]$ ./john -test -form=bcrypt-opencl -v=5 -dev=6 
initUnicode(UNICODE, ASCII/ASCII)
ASCII -> ASCII -> ASCII
Device 6: GeForce GTX TITAN
Options used: -I /home/solar/j/bleeding-jumbo-20190305/run/kernels -cl-mad-enable -DSM_MAJOR=3 -DSM_MINOR=5 -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=131090 -DSIZEOF_SIZE_T=8 -DDEV_VER_MAJOR=418 -DDEV_VER_MINOR=39 -D_OPENCL_COMPILER -DWORK_GROUP_SIZE=8 $JOHN/kernels/bf_kernel.cl
Local worksize (LWS) 8, Global worksize (GWS) 1024
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... Loaded 13 hashes with 6 different salts to test db from test vectors
DONE
Speed for cost 1 (iteration count) of 32
Raw:    875 c/s real, 867 c/s virtual, GPU util: 100%

Attempt at a trivial change:

+++ b/src/opencl_bf_std_plug.c
@@ -172,9 +172,7 @@ void BF_select_device(struct fmt_main *fmt) {
           In extreme cases we even fallback to using CPU kernel. */
        if ((get_device_type(gpu_id) != CL_DEVICE_TYPE_CPU) &&
            lmem_per_th < get_local_memory_size(gpu_id))
-               while (local_work_size >
-                      get_local_memory_size(gpu_id) / lmem_per_th)
-                       local_work_size >>= 1;
+               local_work_size = get_local_memory_size(gpu_id) / lmem_per_th;

        if ((get_device_type(gpu_id) == CL_DEVICE_TYPE_CPU) ||
            amd_vliw5(device_info[gpu_id]) ||

This gives the expected LWS=15 on AMD and LWS=11 on NVIDIA. In fact, I expected LWS=15 to fail on AMD because, IIRC, per documentation even though the total local memory is 64 KB, only 32 KB can be allocated at once - maybe that's no longer the case with Vega or with AMDGPU-PRO? Surprisingly, LWS=15 works, but gives lower speeds (could be corrected with a different GWS, maybe?). However, on NVIDIA there are various failures (some limitation in our code needing to be relaxed, maybe?):

[solar@super run]$ ./john -test -form=bcrypt-opencl -v=5 -dev=1
initUnicode(UNICODE, ASCII/ASCII)
ASCII -> ASCII -> ASCII
Device 1: gfx900 [Radeon RX Vega]
Options used: -I /home/solar/j/bleeding-jumbo-20190305/run/kernels -cl-mad-enable -D__GPU__ -DDEVICE_INFO=522 -DSIZEOF_SIZE_T=8 -DDEV_VER_MAJOR=2766 -DDEV_VER_MINOR=4 -D_OPENCL_COMPILER -DWORK_GROUP_SIZE=15 $JOHN/kernels/bf_kernel.cl
Build log: /tmp/OCL25493T1.cl:117:17: warning: unknown attribute 'max_constant_size' ignored
        __attribute__((max_constant_size(16)))
                       ^
/tmp/OCL25493T1.cl:121:17: warning: unknown attribute 'max_constant_size' ignored
        __attribute__((max_constant_size(72)))
                       ^
/tmp/OCL25493T1.cl:129:17: warning: unknown attribute 'max_constant_size' ignored
        __attribute__((max_constant_size(4096)))
                       ^
3 warnings generated.

binary size 1374632
Local worksize (LWS) 15, Global worksize (GWS) 3840
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... Loaded 13 hashes with 6 different salts to test db from test vectors
DONE
Speed for cost 1 (iteration count) of 32
Raw:    17943 c/s real, 1920K c/s virtual

[solar@super run]$ ./john -test -form=bcrypt-opencl -v=5 -dev=4
initUnicode(UNICODE, ASCII/ASCII)
ASCII -> ASCII -> ASCII
Device 4: GeForce GTX 1080
Options used: -I /home/solar/j/bleeding-jumbo-20190305/run/kernels -cl-mad-enable -DSM_MAJOR=6 -DSM_MINOR=1 -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=524306 -DSIZEOF_SIZE_T=8 -DDEV_VER_MAJOR=418 -DDEV_VER_MINOR=39 -D_OPENCL_COMPILER -DWORK_GROUP_SIZE=11 $JOHN/kernels/bf_kernel.cl
Build log: 
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'blowfish' for 'sm_61'
ptxas info    : Function properties for blowfish
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 72 registers, 45060 bytes smem, 376 bytes cmem[0]
0: OpenCL UNKNOWN OPENCL ERROR (-9999) error in opencl_bf_std_plug.c:322 - Sync :FAILED
[solar@super run]$ ./john -test -form=bcrypt-opencl -v=5 -dev=5
initUnicode(UNICODE, ASCII/ASCII)
ASCII -> ASCII -> ASCII
Device 5: GeForce GTX TITAN X
Options used: -I /home/solar/j/bleeding-jumbo-20190305/run/kernels -cl-mad-enable -DSM_MAJOR=5 -DSM_MINOR=2 -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=262162 -DSIZEOF_SIZE_T=8 -DDEV_VER_MAJOR=418 -DDEV_VER_MINOR=39 -D_OPENCL_COMPILER -DWORK_GROUP_SIZE=11 $JOHN/kernels/bf_kernel.cl
Build log: 
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'blowfish' for 'sm_52'
ptxas info    : Function properties for blowfish
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 72 registers, 45060 bytes smem, 376 bytes cmem[0]
0: OpenCL UNKNOWN OPENCL ERROR (-9999) error in opencl_bf_std_plug.c:322 - Sync :FAILED
[solar@super run]$ ./john -test -form=bcrypt-opencl -v=5 -dev=6
initUnicode(UNICODE, ASCII/ASCII)
ASCII -> ASCII -> ASCII
Device 6: GeForce GTX TITAN
Options used: -I /home/solar/j/bleeding-jumbo-20190305/run/kernels -cl-mad-enable -DSM_MAJOR=3 -DSM_MINOR=5 -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=131090 -DSIZEOF_SIZE_T=8 -DDEV_VER_MAJOR=418 -DDEV_VER_MINOR=39 -D_OPENCL_COMPILER -DWORK_GROUP_SIZE=11 $JOHN/kernels/bf_kernel.cl
Build log: 
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'blowfish' for 'sm_35'
ptxas info    : Function properties for blowfish
ptxas         .     45056 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 56 registers, 376 bytes cmem[0]
Local worksize (LWS) 11, Global worksize (GWS) 1408
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... Loaded 13 hashes with 6 different salts to test db from test vectors
FAILED (cmp_all(1) $2a$05$CCCCCCCCCCCCCCCCCCCCC.E5YPO9kmyuRGyh0XouQYb4YMJKvyOeW)
solardiz commented 5 years ago

Surprisingly, LWS=15 works, but gives lower speeds (could be corrected by a different GWS maybe?)

Higher GWS only partially recovers the speed on Vega 64, not to the level we had with LWS=8:

[solar@super run]$ ./john -test -form=bcrypt-opencl -v=5 -dev=1
initUnicode(UNICODE, ASCII/ASCII)
ASCII -> ASCII -> ASCII
Device 1: gfx900 [Radeon RX Vega]
Local worksize (LWS) 15, Global worksize (GWS) 15360
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... Loaded 13 hashes with 6 different salts to test db from test vectors
DONE
Speed for cost 1 (iteration count) of 32
Raw:    20756 c/s real, 3072K c/s virtual
jsteube commented 5 years ago

Notice that this is 32K + 4 bytes. On this device, we could probably go up to 48K-4K. (It's a pity that 4 bytes are somehow lost, just like in hashcat. Ideally, we'd also figure this out and avoid it, which would let both projects fit an extra instance of bcrypt in the last 4K.)

I experimented a bit with this, too. I couldn't find an explanation in the documentation. The only thing I found out is that if you use a u8 datatype instead, the +4 becomes a +1. Still, 1 byte too much.

This gives the expected LWS=15 on AMD and LWS=11 on NVIDIA. In fact, I expected LWS=15 on AMD would fail because IIRC per documentation even though the total LWS is 64 KB, only 32 KB can be allocated at once - maybe no longer the case with Vega or with AMDGPU-PRO?

On AMD these 4 extra bytes are not required. Therefore, LWS should be autotuned to 16 (or 32). On my Vega64, which runs on ROCm, I can make full use of the 64 kB, but strangely without a speed increase compared to 32 kB.

solardiz commented 5 years ago

Thanks @jsteube! I also just got LWS=16 working on Vega64, and yes it's almost the same speed as, or even slightly slower than, LWS=8:

-       const int       lmem_per_th = ((1024 + 4) * sizeof(cl_uint) + 64);
+       const int       lmem_per_th = 4096;

With GWS=4096, there's a slight slowdown going from LWS=8 (we had 25600) to LWS=16 (now 25150):

Local worksize (LWS) 16, Global worksize (GWS) 4096
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... Loaded 13 hashes with 6 different salts to test db from test vectors
DONE
Speed for cost 1 (iteration count) of 32
Raw:    25150 c/s real, 2867K c/s virtual

With GWS=16384, it's the same speed (25801) for LWS 8 vs. 16:

Local worksize (LWS) 8, Global worksize (GWS) 16384
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... Loaded 13 hashes with 6 different salts to test db from test vectors
DONE
Speed for cost 1 (iteration count) of 32
Raw:    25801 c/s real, 3276K c/s virtual
Local worksize (LWS) 16, Global worksize (GWS) 16384
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... Loaded 13 hashes with 6 different salts to test db from test vectors
DONE
Speed for cost 1 (iteration count) of 32
Raw:    25801 c/s real, 3276K c/s virtual
solardiz commented 5 years ago

Surprisingly, I am getting better speeds on NVIDIA at LWS=4, especially at higher GWS. I guess the physical 48 KB are then split between 3 (or just 2?) workgroups, and thus are used more fully than they could be with our LWS=8, which took 32 KB and thus permitted only one?
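A quick check against the ptxas smem figures in this thread favors the "just 2" guess: with the 4-byte overhead, three LWS=4 groups would need 3 * 16388 = 49164 bytes, slightly more than 48 KB. A hedged sketch:

```c
/* Sketch: how many workgroups' local-memory allocations fit in one SM's
   48 KiB, using the smem-per-workgroup sizes ptxas reports in this thread
   (16388 for LWS=4, 32772 for LWS=8, 45060 for LWS=11). */
static unsigned long groups_fitting(unsigned long sm_local_mem,
                                    unsigned long smem_per_group)
{
    return sm_local_mem / smem_per_group;
}
```

So LWS=4 doubles the number of concurrently resident workgroups per SM relative to LWS=8, at the cost of using 44 of the 48 KiB rather than trying to fill it with one large group.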

Device 4: GeForce GTX 1080
Options used: -I /home/solar/j/bleeding-jumbo-20190305/run/kernels -cl-mad-enable -DSM_MAJOR=6 -DSM_MINOR=1 -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=524306 -DSIZEOF_SIZE_T=8 -DDEV_VER_MAJOR=418 -DDEV_VER_MINOR=39 -D_OPENCL_COMPILER -DWORK_GROUP_SIZE=4 $JOHN/kernels/bf_kernel.cl
Build log:
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'blowfish' for 'sm_61'
ptxas info    : Function properties for blowfish
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 72 registers, 16388 bytes smem, 376 bytes cmem[0]
Local worksize (LWS) 4, Global worksize (GWS) 16384
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... Loaded 13 hashes with 6 different salts to test db from test vectors
DONE
Speed for cost 1 (iteration count) of 32
Raw:    8402 c/s real, 8402 c/s virtual, GPU util: 100%

At LWS=8 GWS=4096, we had ~6500.

@jsteube What are your current speeds on such devices?

solardiz commented 5 years ago

Per my experiments, our bcrypt-opencl fails on NVIDIA when LWS isn't a power of 2. I'd suspect a genuine dependency on that property somewhere in our code, but I can't find one, and the format works just fine with e.g. LWS=15 on AMD. So I'm puzzled.

jsteube commented 5 years ago

I can confirm that NV allows LWS=11. That's what hashcat calculates if there's 48 kB of local memory on NV. But something seems odd with the reported 8402; it should be faster.

On super, running hashcat (v5.1.0-762-g59ecdbd):

[atom@super hashcat]$ ./hashcat -b -m 3200 -d 4 --mac
...
4:3200:1898:4513:29.65:21058

Your reported speed on the Vega is equal to hashcat's.

jsteube commented 5 years ago

Vega on super. Had to manually tune to allow kernel runtime > 90ms.

[atom@super hashcat]$ ./hashcat -b -m 3200 -d 1 --mac --force -u 16 -n 4 
...
1:3200:1663:945:390.60:25701
solardiz commented 5 years ago

Yes, our speeds on NVIDIA Maxwell and above are odd (Hashcat's are much better), irrespective of LWS. This is issue #2620.

I doubt I'd look into any of this and make any fixes before our upcoming release. I'd appreciate it if someone else in our community takes care of some or all of this.

solardiz commented 5 years ago

Looks like part of the issue with non-power-of-2 LWS failing on NVIDIA is this in opencl_bf_std_plug.c:

        /* N has to be a multiple of M */
        N = (N + M - 1) / M * M;

The rounding up brings N above the compile-time maximum GWS for bcrypt-opencl. This happens on AMD too; no idea why it only causes a visible failure on NVIDIA.

Changing that line to round down:

        N -= N % M;

makes the error "OpenCL UNKNOWN OPENCL ERROR (-9999) error in opencl_bf_std_plug.c:322 - Sync :FAILED" go away, and auto-tuning completes OK, but then we get "FAILED (cmp_all(29) $2a$05$/OK.fbVrR/bpIqNJ5ianF.swQOIzjOiJ9GHEPuhEkvqrUyvWhEMx6)".
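The effect of the two roundings can be sketched with the hardcoded maximum GWS of 4096 and a non-power-of-2 LWS of 11 (a standalone sketch, not the actual plug code):

```c
/* Sketch of the two roundings of N (the GWS) to a multiple of M (the LWS). */
static int round_up_to_multiple(int n, int m)
{
    return (n + m - 1) / m * m;   /* can exceed the original n */
}

static int round_down_to_multiple(int n, int m)
{
    return n - n % m;             /* never exceeds the original n */
}
```

With N=4096 and M=11, rounding up gives 4103, exceeding the 4096 compile-time maximum, while rounding down gives 4092 and stays within it. With a power-of-2 LWS such as 8, both roundings leave 4096 unchanged, which is consistent with the failure only appearing for non-power-of-2 LWS.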

I think I'm not going to look into this any further. This is overly complicated code that is due for a rewrite.

solardiz commented 5 years ago

This passes self-test:

+++ b/src/opencl_bf_std_plug.c
@@ -172,9 +172,13 @@ void BF_select_device(struct fmt_main *fmt) {
           In extreme cases we even fallback to using CPU kernel. */
        if ((get_device_type(gpu_id) != CL_DEVICE_TYPE_CPU) &&
            lmem_per_th < get_local_memory_size(gpu_id))
+#if 1
+               local_work_size = get_local_memory_size(gpu_id) / lmem_per_th;
+#else
                while (local_work_size >
                       get_local_memory_size(gpu_id) / lmem_per_th)
                        local_work_size >>= 1;
+#endif

        if ((get_device_type(gpu_id) == CL_DEVICE_TYPE_CPU) ||
            amd_vliw5(device_info[gpu_id]) ||
@@ -311,6 +315,8 @@ void exec_bf(cl_uint *salt_api, cl_uint *BF_out, cl_uint rounds, int n) {

        /* N has to be a multiple of M */
        N = (N + M - 1) / M * M;
+       if (N > BF_N)
+               N -= M;

        errMsg = "Copy data to device: Failed" ;

but I'm not sure it's correct, and the speed is poor (worse than without these changes):

$ ./john -test -form=bcrypt-opencl -v=5 -dev=4
initUnicode(UNICODE, ASCII/ASCII)
ASCII -> ASCII -> ASCII
Device 4: GeForce GTX 1080
Options used: -I /home/solar/j/bleeding-jumbo-20190331/run/kernels -cl-mad-enable -DSM_MAJOR=6 -DSM_MINOR=1 -cl-nv-verbose -D__GPU_l
0: OpenCL CL_INVALID_WORK_GROUP_SIZE (-54) error in opencl_bf_std_plug.c:328 - Enque Kernel Failed
[solar@super run]$ ./john -test -form=bcrypt-opencl -v=5 -dev=4
initUnicode(UNICODE, ASCII/ASCII)
ASCII -> ASCII -> ASCII
Device 4: GeForce GTX 1080
Options used: -I /home/solar/j/bleeding-jumbo-20190331/run/kernels -cl-mad-enable -DSM_MAJOR=6 -DSM_MINOR=1 -cl-nv-verbose -D__GPU_l
Local worksize (LWS) 11, Global worksize (GWS) 2816
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... Loaded 13 hashes with 6 different salts to test db from s
DONE
Speed for cost 1 (iteration count) of 32
Raw:    5120 c/s real, 5120 c/s virtual, GPU util: 100%