Open solardiz opened 6 months ago
FWIW, the contents of out after the pre_processing kernel on Intel and AMD OpenCL on CPU match GPU's (so must be correct). On Intel HD Graphics, they don't match, so we seem to have/trigger a separate bug there.
So, not surprisingly, the main issue appears to be beyond pre-processing. This is consistent with this format already failing on CPUs before @alainesp moved the pre-processing from host to device.
Overriding these didn't make a difference (still works on GPUs, fails on CPUs):
#define upsample(a, b) (((ulong)(a) << 32) | (b))
#define mul_hi(a, b) ((ulong)(a) * (b) >> 32)
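For reference, here is a minimal host-side sketch of what the two overrides compute, assuming 32-bit operands (the kernel's actual operand widths are not shown here). The names upsample_u32 and mul_hi_u32 are hypothetical and exist only for this illustration; uint64_t stands in for OpenCL's ulong.

```c
#include <stdint.h>

/* Host-side stand-ins for the two overrides, restricted to 32-bit
 * operands (an assumption made for this sketch). */

/* Concatenate two 32-bit halves into one 64-bit value,
 * mirroring upsample(hi, lo). */
static inline uint64_t upsample_u32(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Upper 32 bits of the full 64-bit product, mirroring mul_hi(a, b).
 * The multiplication binds tighter than the shift, as in the macro. */
static inline uint64_t mul_hi_u32(uint32_t a, uint32_t b)
{
    return ((uint64_t)a * b) >> 32;
}
```

Since these plain-arithmetic forms should agree with the built-ins for 32-bit unsigned inputs, the lack of any change in behavior suggests the built-ins themselves are not what breaks on CPUs.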
With the below hack and shmemSize forced to 32 KiB, it still works on a GPU, but still fails on CPUs like before:
- uint warp = (get_local_id(1) * get_local_size(0) + get_local_id(0)) / THREADS_PER_LANE;
+ uint warp = (get_global_id(1) * get_global_size(0) + get_global_id(0)) / THREADS_PER_LANE;
So the issue is probably not specific to behavior of get_local_* on CPU.
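To make the difference between the two index formulas concrete, here is a small host-side sketch of the arithmetic. THREADS_PER_LANE is assumed to be 32 purely for illustration, and warp_index is a hypothetical helper, not a function from the format. Passing local IDs and sizes gives the original expression; passing global ones gives the hacked expression.

```c
#include <stdint.h>

#define THREADS_PER_LANE 32u /* assumed value, for illustration only */

/* Linearize a 2-D work-item ID (row-major over dimension 0) and divide
 * by the lane width. Feed it get_local_* values for the original code,
 * or get_global_* values for the hacked version. */
static inline uint32_t warp_index(uint32_t id0, uint32_t id1, uint32_t size0)
{
    return (id1 * size0 + id0) / THREADS_PER_LANE;
}
```

For example, with a local size of 64 in dimension 0 and two work-groups (global size 128), the first work-item of the second group has local ID 0 and global ID 64, so warp_index(0, 0, 64) is 0 while warp_index(64, 0, 128) is 2. The two formulas agree only when there is a single work-group.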
Maybe we should print a warning to the user when we detect a CPU or an Intel GPU, in addition to the self-test failure? That would explain the situation a little more.
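A rough sketch of what such a check could look like, assuming the device type and vendor string have already been queried (e.g. via clGetDeviceInfo with CL_DEVICE_TYPE and CL_DEVICE_VENDOR). The function name argon2_opencl_should_warn is hypothetical, and the DEV_TYPE_* constants are local copies of the OpenCL CL_DEVICE_TYPE_CPU and CL_DEVICE_TYPE_GPU bit values so that the snippet builds without the OpenCL headers. Matching "Intel" as a substring of the vendor string is also an assumption.

```c
#include <stdint.h>
#include <string.h>

/* Local copies of the OpenCL device type bits (CL_DEVICE_TYPE_CPU and
 * CL_DEVICE_TYPE_GPU), defined here to avoid needing CL/cl.h. */
#define DEV_TYPE_CPU (1u << 1)
#define DEV_TYPE_GPU (1u << 2)

/* Hypothetical helper: returns 1 if the device is one of the kinds this
 * issue reports as failing (any CPU device, or a GPU from Intel). */
static int argon2_opencl_should_warn(uint64_t dev_type, const char *vendor)
{
    if (dev_type & DEV_TYPE_CPU)
        return 1;
    if ((dev_type & DEV_TYPE_GPU) && vendor && strstr(vendor, "Intel"))
        return 1;
    return 0;
}
```

The caller would print the warning once, before the self-test runs, so the user sees the explanation next to the FAILED line.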
In #5420, @magnumripper shows a macOS system where the format works for the first few test vectors on HD Graphics (edit: specifically, on Intel(R) UHD Graphics 630), only failing at FAILED (cmp_one(10)).
A known shortcoming/bug of the argon2-opencl format is that it fails self-test on CPU(-like) devices, as tested with ancient Intel OpenCL and AMD APP SDK that we have on our online dev boxes and with recent Intel OpenCL that @alainesp has on his laptop. We don't know exactly why - a guess is this has something to do with our usage of local memory. The format works on most GPUs, the only exception identified so far being Intel HD Graphics, where it also fails.
The failures on CPUs and Intel GPU are FAILED (cmp_one(1)). The failure on MIC includes segfaults.