openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/
descrypt-opencl: Compile per-salt kernels using OpenMP #1618

Open magnumripper opened 8 years ago

magnumripper commented 8 years ago

This is for HARDCODE_SALT of course.

@Sayantan2048 this should basically be really trivial and would speed up the initial building of all salts' kernels a whole lot on multicore systems (e.g. super would do it 32x faster). However, you'd probably need to move some code around.

What do you think?

sayan1an commented 8 years ago

It's doable, but I don't expect a speedup of 32x or anywhere near it, due to disk I/O limitations. I'll move the code around anyway because this format doesn't support LWS autotune. Also, if we are to build all kernels, we'd do it in reset() before starting the actual cracking. So, it should be easy.

magnumripper commented 8 years ago

Even if it's not 32x, I'm sure it will be several hundred percent faster. Building JtR with or without -j32 makes a huge difference, and this is nearly the same thing.

You could add a macro to opencl_DES_hst_dev_shared.h, for having it optional (in case there are problems on some systems, as Solar imagined).

#define HARDCODE_SALT          0
#define PARALLEL_BUILD         0
#define FULL_UNROLL            0
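
A minimal sketch of how such a switch could gate the per-salt build loop. Everything apart from the PARALLEL_BUILD name is an assumption made for illustration: build_salt_kernel() is a hypothetical stand-in for the real per-salt compile, and NUM_SALTS is just the 4096 possible descrypt salts. Without OpenMP enabled the pragma is ignored and the loop runs sequentially.

```c
#include <assert.h>
#include <stddef.h>

#define PARALLEL_BUILD 1      /* 0 would keep the sequential path */
#define NUM_SALTS      4096   /* assumption: 64 * 64 descrypt salts */

/* Hypothetical stand-in for compiling one salt-specific kernel. */
static int build_salt_kernel(int salt)
{
    assert(salt >= 0 && salt < NUM_SALTS);
    return salt;              /* pretend this is a kernel handle */
}

/* Builds kernels for salts [0, count) and returns how many were built. */
int build_all_salts(int *handles, int count)
{
    int built = 0;
    int salt;

    assert(handles != NULL && count >= 0 && count <= NUM_SALTS);
#if PARALLEL_BUILD
#pragma omp parallel for reduction(+:built)
#endif
    for (salt = 0; salt < count; salt++) {
        handles[salt] = build_salt_kernel(salt);  /* each salt has its own slot */
        built++;
    }
    return built;
}
```

Keeping the pragma behind the macro means a system with a problematic OpenCL runtime could fall back to the sequential loop by flipping one define.
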

magnumripper commented 8 years ago

Bump! Now even more kernels seem to be used just for -test, and they don't even seem to be cached to disk? I had to cancel it after it compiled 45 kernels.

All formats except yours do cache binaries, except on NVIDIA devices, which have caching in the driver. Regardless, I really think parallelizing builds with OpenMP would be a good idea.

magnumripper commented 8 years ago

Hmm, I see that on super's Tahiti, only 9 kernels were built for -test. And on Titan X, 12 were built. On my MacBook using Intel Graphics HD5000, like I said, I aborted it after 45... Why the difference?

magnumripper commented 8 years ago

Oh, maybe it's because of the "autotune fail"...

../run/john -test -form:descrypt-opencl
Device 1: Iris
Benchmarking: descrypt-opencl, traditional crypt(3) [DES OpenCL]... Possible auto_tune fail!!.
Salt compiled from Source:1
Salt compiled from Source:2
Salt compiled from Source:3
(...)
Salt compiled from Source:43
Salt compiled from Source:44
Salt compiled from Source:45
^CSession aborted

sayan1an commented 8 years ago

Are clCreateProgramWithSource and clGetProgramBuildInfo thread-safe? Even if they are, they operate on one variable, 'program[sequential_id]', which makes it unsafe.

magnumripper commented 8 years ago

In OpenCL 1.1, all functions are thread-safe except for clSetKernelArg(). But that one is quick, so we'd just call it sequentially.

sayan1an commented 8 years ago

I think the program object needs to be thread-safe, i.e. one per thread: 'program[sequential_id][thread_id]'. Otherwise we can't call clBuildProgram in parallel.
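
A rough sketch of the two points made so far: give every build its own program slot so no two threads write the same variable, and keep the argument-setting step sequential. This is not the format's actual code. program_t, build_program_for_salt() and the arrays are hypothetical stand-ins (an int plays the role of a cl_program), and the slots are indexed per salt here rather than per thread, which is one possible way to avoid sharing.

```c
#include <assert.h>

#define MAX_SALTS 4096

/* Hypothetical stand-in: an int plays the role of a cl_program handle. */
typedef int program_t;

static program_t programs[MAX_SALTS];   /* one slot per salt, never shared */
static int       args_set[MAX_SALTS];

/* Stand-in for the source-to-program build of one salt. */
static program_t build_program_for_salt(int salt)
{
    return (program_t)(salt + 1);        /* nonzero marks "built" */
}

/* Phase 1 may run in parallel; phase 2 stays sequential on purpose. */
int build_then_set_args(int count)
{
    int salt, ready = 0;

    assert(count >= 0 && count <= MAX_SALTS);

#pragma omp parallel for
    for (salt = 0; salt < count; salt++)
        programs[salt] = build_program_for_salt(salt);  /* private slot */

    for (salt = 0; salt < count; salt++) {  /* stand-in for clSetKernelArg() */
        args_set[salt] = (programs[salt] != 0);
        ready += args_set[salt];
    }
    return ready;
}
```
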

magnumripper commented 8 years ago

You may be right. We could probably work around it if we want to but it might be more complex than I hoped.

sayan1an commented 8 years ago

Apart from this, the include_source() function is not thread-safe. Also, we must ensure kernel_source is not modified while a kernel is being built and that no other instance of opencl_read_source() is running in parallel.

magnumripper commented 8 years ago

Our own functions are no problem. I'll just make them thread-safe.

sayan1an commented 8 years ago

I just made some changes required for parallel build and thread safety. Please review 0dbaf3a5bacbd34.

magnumripper commented 8 years ago

I'll try to digest it. I was thinking we could pass a (thread local) buffer (or rather a pointer) to opencl_read_source() and then pass the same pointer to opencl_build() (or vice versa, something along the lines of that).

The fact we had hard-coded program[0] was kind of funny :fail:
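
The buffer-passing idea above could look roughly like this. It is only a sketch under assumptions: read_source_into() and build_from_buffer() are hypothetical names and do not reflect the real signatures of opencl_read_source() or opencl_build(). The point is that each caller owns its buffer, so parallel callers never touch shared state.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical reentrant reader: formats a salt-specific "source" into a
 * buffer owned by the caller instead of a shared global.
 * Returns the length written, or -1 if the buffer is too small.
 */
int read_source_into(char *buf, size_t size, int salt)
{
    int n;

    assert(buf != NULL && size > 0);
    n = snprintf(buf, size, "#define SALT %d\n", salt);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Hypothetical build step that only looks at the buffer it is given. */
int build_from_buffer(const char *buf)
{
    assert(buf != NULL);
    return strncmp(buf, "#define SALT ", 13) == 0;
}
```

A thread would declare its own `char buf[...]` on the stack, call read_source_into() on it and hand the same pointer to the build step.
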

sayan1an commented 8 years ago

program[0] is a local object, not the global one. We'll be passing a thread-local program buffer to opencl_build() and build_from_binary().

magnumripper commented 8 years ago

As far as I can see it's a global, declared as cl_program program[MAX_GPU_DEVICES];.

sayan1an commented 8 years ago

Unfortunately, we have the same name for the global program object and the pointer to the program object (as a function argument). I plan on using the program object supplied by the format for descrypt-opencl. All other formats would be using the global program object declared in common-opencl.

sayan1an commented 8 years ago

@magnumripper I have implemented parallel build in 830f6a0b7d77. You may turn it on by setting PARALLEL_BUILD to 1 in opencl_DES_hst_dev_shared.h. However, we should really make the path_expand() function thread-safe in order to reduce the number of critical sections and speed up the build process.
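
For illustration, this is the kind of critical section a non-thread-safe helper forces on its callers. It is a sketch, not the code from the commit: expand_path_unsafe() is a hypothetical stand-in that returns a pointer to a static buffer (the real path_expand() is not reproduced here), and the "kernels/" prefix is made up.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical non-thread-safe helper: returns a pointer to a static
 * buffer, which is the pattern that forces callers to serialize access. */
static const char *expand_path_unsafe(const char *name)
{
    static char out[128];
    snprintf(out, sizeof out, "kernels/%s", name);
    return out;
}

/* Copies the expanded path into caller storage inside a critical section,
 * so only the short copy is serialized, not the whole build step. */
int expand_path_locked(const char *name, char *dst, size_t size)
{
    int ok = 0;

    assert(name != NULL && dst != NULL && size > 0);
#pragma omp critical(path_expand_lock)
    {
        const char *p = expand_path_unsafe(name);
        if (strlen(p) < size) {
            strcpy(dst, p);
            ok = 1;
        }
    }
    return ok;
}
```

If the helper itself wrote into a caller-supplied buffer, the critical section could be dropped entirely, which is the speedup hinted at above.
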

magnumripper commented 8 years ago

Cool, I'll try it out.

magnumripper commented 8 years ago

I used Solar's way of pre-compiling all 4096 salts to benchmark this. While the format works fine with other test files, this one makes it segfault:

$ perl -e '$c64 = "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"; foreach $c1 (split //, $c64) { foreach $c2 (split //, $c64) { print "$c1$c2...........\n"; } }' > pw-fakedes

$ head pw-fakedes 
.............
./...........
.0...........
.1...........
.2...........
.3...........
.4...........
.5...........
.6...........
.7...........

$ ../run/john pw-fakedes -form:descrypt-opencl -dev=2
Device 2: Tahiti [AMD Radeon HD 7900 Series]
Using default input encoding: UTF-8
Loaded 4096 password hashes with 4096 different salts (descrypt-opencl, traditional crypt(3) [DES OpenCL])
Salt compiled from Source:910
Salt compiled from Source:910
Salt compiled from Source:2275
Salt compiled from Source:990
Salt compiled from Source:0
Segmentation fault

$ ../run/john pw-fakedes -form:descrypt-opencl -dev=6
Device 6: GeForce GTX TITAN X
Using default input encoding: UTF-8
Loaded 4096 password hashes with 4096 different salts (descrypt-opencl, traditional crypt(3) [DES OpenCL])
Salt compiled from Source:910
Salt compiled from Source:910
Salt compiled from Source:2275
Salt compiled from Source:990
Salt compiled from Source:0
Segmentation fault

magnumripper commented 8 years ago

$ gdb --args ../run/john pw-fakedes -form:descrypt-opencl -dev=6
(gdb) r
Starting program: /home/magnum/src/john/run/john pw-fakedes -form:descrypt-opencl -dev=6
Device 6: GeForce GTX TITAN X
Using default input encoding: UTF-8
Loaded 4096 password hashes with 4096 different salts (descrypt-opencl, traditional crypt(3) [DES OpenCL])
Salt compiled from Source:910
Salt compiled from Binary:910
Salt compiled from Binary:2275
Salt compiled from Binary:990
Salt compiled from Binary:0
Program received signal SIGSEGV, Segmentation fault.
0x000000000074fc23 in remove_duplicates_64 (num_loaded_hashes=1, hash_table_size=128, verbosity=0)
    at bt_hash_type_64.c:440
440             loaded_hashes_64[i] = loaded_hashes_64[num_unique_hashes];
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.149.el6_6.7.x86_64 gmp-4.3.1-7.el6_2.2.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-37.el6_6.x86_64 libX11-1.6.0-2.2.el6.x86_64 libXau-1.0.6-4.el6.x86_64 libXext-1.3.2-2.1.el6.x86_64 libXinerama-1.1.3-2.1.el6.x86_64 libcom_err-1.41.12-21.el6.x86_64 libgcc-4.4.7-11.el6.x86_64 libgomp-4.4.7-11.el6.x86_64 libselinux-2.0.94-5.8.el6.x86_64 libstdc++-4.4.7-11.el6.x86_64 libxcb-1.9.1-2.el6.x86_64 mesa-libGL-10.1.2-2.el6.x86_64 nss-softokn-freebl-3.14.3-22.el6_6.x86_64 numactl-2.0.9-2.el6.x86_64 opencl-1.2-intel-cpu-3.1.1.11385-1.x86_64 opencl-1.2-intel-mic-3.1.1.11385-1.x86_64 openssl-1.0.1e-30.el6_6.11.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x000000000074fc23 in remove_duplicates_64 (num_loaded_hashes=1, hash_table_size=128, 
    verbosity=0) at bt_hash_type_64.c:440
#1  0x000000000074cd62 in create_perfect_hash_table (htype=<value optimized out>, 
    loaded_hashes_ptr=<value optimized out>, num_ld_hashes=1, offset_table_ptr=0x7fffffffa0d0, 
    offset_table_sz_ptr=0x7fffffffa0d8, hash_table_sz_ptr=0x7fffffffa0dc, verb=0) at bt.c:680
#2  0x00000000005de2a2 in fill_buffer (salt=<value optimized out>, 
    max_uncracked_hashes=<value optimized out>, max_hash_table_size=0xd0c2a4)
    at opencl_DES_bs_plug.c:224
#3  0x00000000005de842 in build_tables (db=<value optimized out>) at opencl_DES_bs_plug.c:423
#4  0x00000000005d678f in reset (db=0xdb3da0) at opencl_DES_bs_f_plug.c:645
#5  0x00000000007053d7 in john_run () at john.c:1587
#6  0x0000000000705bc7 in main (argc=4, argv=0x7fffffffe3d8) at john.c:1883
(gdb) 

magnumripper commented 8 years ago

Hmm @Sayantan2048 is the problem that we have 4096 unique salts but only 1 unique binary?

magnumripper commented 8 years ago

No, that doesn't seem to be it.

magnumripper commented 8 years ago

BTW note that I did not even enable OpenMP builds yet! I was going to make a baseline first.

sayan1an commented 8 years ago

On Wed, Oct 7, 2015 at 3:13 AM, magnum notifications@github.com wrote:

BTW note that I did not even enable OpenMP builds yet!

Another edge-case issue!! The binaries are all zero!! These tables use zero to mark invalid/duplicate hashes.
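
To make the sentinel problem concrete, here is a small hypothetical compaction routine that uses 0 as its "removed" marker. It is not the code in bt_hash_type_64.c; compact_unique() and its behavior are assumptions chosen only to show why an all-zero binary is indistinguishable from a removed entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical compaction pass: later duplicates are overwritten with 0,
 * then nonzero entries are moved to the front. Because 0 doubles as the
 * "removed" marker, a genuine all-zero value cannot be told apart from a
 * removed one and is dropped as well.
 * Returns the number of entries kept.
 */
size_t compact_unique(uint64_t *h, size_t n)
{
    size_t i, j, kept = 0;

    assert(h != NULL || n == 0);

    for (i = 0; i < n; i++)              /* mark later duplicates with 0 */
        for (j = i + 1; j < n; j++)
            if (h[i] != 0 && h[j] == h[i])
                h[j] = 0;

    for (i = 0; i < n; i++)              /* keep only nonzero entries */
        if (h[i] != 0)
            h[kept++] = h[i];

    return kept;
}
```

With a single all-zero hash the routine keeps nothing, so any caller that assumes at least one surviving entry would need an explicit check for a zero count.
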

magnumripper commented 8 years ago

LOL. OK, the following works better:

perl -e '$c64 = "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"; foreach $c1 (split //, $c64) { foreach $c2 (split //, $c64) { print "$c1${c2}Nf8Sbh3HDfQ\n"; } }' > pw-fakedes

magnumripper commented 8 years ago

It really shouldn't segfault though, that's nasty.