openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

OpenCL support for eCryptfs and other formats with similar inner loop #358

Open kholia opened 11 years ago

kholia commented 11 years ago

The algorithm is 65536 iterations of SHA-512.

@ukasz, you recently worked with SHA-512. Are you interested in this one too?

ukasz commented 11 years ago

Sure, that's an easy one. But I don't think we need CUDA support for this.

kholia commented 11 years ago

Sounds good. I hope it lands in bleeding-jumbo soon.

solardiz commented 3 months ago

The main loop for ecryptfs is exactly the same as it is for several other formats, so perhaps we could have a piece of shared OpenCL code (shared kernel or just a shared function used by several kernels) and even a shared FPGA design/bitstream. Right now, we have this in OpenCL for bitcoin-opencl, but not for any others that use the exact same loop.

Exact same loop in:

bitcoin_fmt_plug.c:     SIMDSHA512body(key_iv, key_iv, &rounds, SSEi_HALF_IN|SSEi_LOOP);
blackberry_ES10_fmt_plug.c:     SIMDSHA512body(keys, keys64, &rounds, SSEi_HALF_IN|SSEi_LOOP);
ecryptfs_fmt_plug.c:        SIMDSHA512body(keys, keys64, &rounds, SSEi_HALF_IN|SSEi_LOOP);
pkcs12_plug.c:      SIMDSHA512body(sse_buf, (uint64_t*)sse_buf, &rounds, SSEi_HALF_IN|SSEi_LOOP);
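
As a rough illustration of what "shared code" could mean here, the loop these four call sites have in common can be modelled as one function that rehashes a 64-byte state a given number of times. This is a minimal Python sketch, not the OpenCL or SIMD code. The reading of the flags (a 64-byte input in place of a full block, repeated `rounds` times) is an assumption based on the call sites above, and `sha512_loop` and `toy_format` are hypothetical names. The setup step in `toy_format` is illustrative only and is not any listed format's actual key derivation.

```python
import hashlib

def sha512_loop(state: bytes, rounds: int) -> bytes:
    """Rehash a 64-byte state `rounds` times: state = SHA-512(state)."""
    if len(state) != 64:
        raise ValueError("state must be one 64-byte SHA-512 digest")
    for _ in range(rounds):
        state = hashlib.sha512(state).digest()
    return state

def toy_format(password: bytes, salt: bytes, rounds: int = 65536) -> bytes:
    """Hypothetical per-format wrapper: only the setup differs per format."""
    # Assumed setup for illustration; real formats build this first digest differently.
    first = hashlib.sha512(salt + password).digest()
    return sha512_loop(first, rounds)
```

With this split, each format would keep its own setup and post-processing and call the same inner function, which is the part worth sharing across kernels.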

Also similar in:

armory_fmt_plug.c:      SIMDSHA512body(x, lut[1][0].u64, lut[n][0].u64, SSEi_HALF_IN|SSEi_LOOP|SSEi_FLAT_OUT);
drupal7_fmt_plug.c:     SIMDSHA512body(keys, keys64, &Lcount, SSEi_MIXED_IN|SSEi_LOOP|SSEi_OUTPUT_AS_INP_FMT);

Drupal7's loop is very similar and could use shared code too. We don't have it in OpenCL, but we do have it on FPGA (ZTEX). A slight revision of the existing FPGA design (perhaps just a third program for the soft CPUs) should make it usable for the first four formats listed above.
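
To show how close the Drupal7 loop is to the plain one, here is a minimal Python sketch of the iteration as Drupal 7's password hashing defines it: the first digest is SHA-512 of salt plus password, and each later iteration hashes the previous digest concatenated with the password. The function name `drupal7_style_loop` is hypothetical, and encoding of the final digest and of the iteration count is left out.

```python
import hashlib

def drupal7_style_loop(password: bytes, salt: bytes, count: int) -> bytes:
    """Iterate state = SHA-512(state + password), starting from SHA-512(salt + password)."""
    state = hashlib.sha512(salt + password).digest()
    for _ in range(count):
        # The only difference from the plain loop: the password is appended each time.
        state = hashlib.sha512(state + password).digest()
    return state
```

The per-iteration input is the 64-byte digest followed by a short, constant suffix, so a shared implementation would mainly need a way to supply that suffix.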

Armory's loop differs in that it needs to save the output from each iteration, and in that it is only part of the total processing; the other major part is inefficient on GPU. Yet perhaps SHA-512 is slow enough that a GPU+CPU design is possible, splitting the parts of processing such that the CPU part for the current batch of candidates would overlap with the GPU part (and transfer to host) for the next. This would perhaps double or triple the speed.
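
The overlap described above can be sketched as a two-stage pipeline. This is only a structural illustration in Python, under loud assumptions: `gpu_stage` stands in for the GPU part (iterated SHA-512 that keeps every iteration's output), `cpu_stage` is a placeholder for the part that is inefficient on GPU and does not reproduce Armory's actual processing, and a worker thread stands in for the device. All names are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def gpu_stage(batch, rounds):
    """Stand-in for the GPU part: iterate SHA-512 and save each iteration's output."""
    trails = []
    for seed in batch:
        state, trail = seed, []
        for _ in range(rounds):
            state = hashlib.sha512(state).digest()
            trail.append(state)
        trails.append(trail)
    return trails

def cpu_stage(trails):
    """Placeholder for the CPU part: consumes the saved outputs (not Armory's real step)."""
    return [hashlib.sha512(b"".join(reversed(t))).digest() for t in trails]

def run_pipelined(batches, rounds):
    """Overlap the CPU stage of batch n with the 'GPU' stage of batch n+1."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as device:
        pending = None
        for batch in batches:
            upcoming = device.submit(gpu_stage, batch, rounds)  # queue next batch
            if pending is not None:
                # Runs on the host while the worker processes the queued batch.
                results.append(cpu_stage(pending.result()))
            pending = upcoming
        if pending is not None:
            results.append(cpu_stage(pending.result()))  # drain the last batch
    return results
```

In this toy version the Python interpreter lock limits real parallelism, so it shows the scheduling order only. Any speedup on real hardware would depend on how the two stages' times compare.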