argon2-opencl fails on CPU and MIC

openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs

https://www.openwall.com/john/

9.63k stars 2.04k forks source link

argon2-opencl fails on CPU and MIC #5417

Open solardiz opened 6 months ago

solardiz commented 6 months ago

A known shortcoming/bug of the argon2-opencl format is that it fails self-test on CPU(-like) devices, as tested with ancient Intel OpenCL and AMD APP SDK that we have on our online dev boxes and with recent Intel OpenCL that @alainesp has on his laptop. We don't know exactly why - a guess is this has something to do with our usage of local memory.

The format works on most GPUs, the only exception identified so far being Intel HD Graphics, where it also fails.

The failures on CPUs and Intel GPU are FAILED (cmp_one(1)). The failure on MIC includes segfaults.

solardiz commented 6 months ago

FWIW, the contents of out after the pre_processing kernel on Intel and AMD OpenCL on CPU match GPU's (so must be correct). On Intel HD Graphics, they don't match, so we seem to have/trigger a separate bug there.

So, not surprisingly, the main issue appears to be beyond pre-processing. This is consistent with this format already failing on CPUs before @alainesp moved the pre-processing from host to device.

solardiz commented 6 months ago

Overriding these didn't make a difference (still works on GPUs, fails on CPUs):

#define upsample(a, b) (((ulong)(a) << 32) | (b))
#define mul_hi(a, b) ((ulong)(a) * (b) >> 32)

solardiz commented 6 months ago

With the below hack and shmemSize forced to 32 KiB, it still works on a GPU, but still fails on CPUs like before:

-       uint warp   = (get_local_id(1) * get_local_size(0) + get_local_id(0)) / THREADS_PER_LANE;
+       uint warp   = (get_global_id(1) * get_global_size(0) + get_global_id(0)) / THREADS_PER_LANE;

So the issue is probably not specific to behavior of get_local_* on CPU.

alainesp commented 6 months ago

Maybe we should print a warning to the user when detecting CPU or Intel GPUs besides the self-test fail? Explain the situation a little more.

solardiz commented 3 months ago

In #5420, @magnumripper shows a macOS system where the format works for the first few test vectors on HD Graphics (edit: specifically, on Intel(R) UHD Graphics 630), only failing at FAILED (cmp_one(10)).