skeeto / hash-prospector

Automated integer hash function discovery

Class of Bijective 128-Bit Hashes for GPU #26

Open myth0genesis opened 1 year ago

myth0genesis commented 1 year ago

First of all, big fan of this class of hashes. I came across your work through Shadertoy and went way too far down the rabbit hole trying to see if I could come up with a good adaptation of this for 128-bit hashes on a GPU. I don't really know any C, so I couldn't give you the C equivalent here, but these are super fast on a GPU because 4-element vectors can swap 32-bit values between lanes at zero cost. I found an article by Marc Reynolds http://marc-b-reynolds.github.io/math/2017/10/13/XorRotate.html about reversing double XOR-rotates and came to the realization that a vector swizzle is just a rotation by a multiple of 32 bits. So the GLSL code goes like this:

const uvec4 shft = uvec4(14U, 15U, 16U, 17U);
uvec4 bjhash128(uvec4 p) {
    p ^= p >> shft;
    p *= uvec4(0xEAF649A9U, 0x6AF649A9U, 0x050C2D35U, 0xAAF649A9U);
    p ^= p.yzwx ^ p.zwxy;
    p ^= p >> shft.yzwx;
    p *= uvec4(0x21F0AAADU, 0x0D0C2D35U, 0xE8F649A9U, 0xD30332D3U);
    p ^= p.yzwx ^ p.zwxy;
    return p ^ p >> shft.wxyz;
}
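To make the swizzle-as-rotation observation concrete: if the uvec4 is viewed as a single 128-bit little-endian integer, the .yzwx swizzle is exactly a 128-bit rotation by 32 bits. A quick C check of that claim (a sketch; it assumes GCC/clang's unsigned __int128 extension and a little-endian host):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef unsigned __int128 u128;

/* rotate a 128-bit value right by r bits (0 < r < 128) */
static u128 ror128(u128 v, unsigned r) {
    return (v >> r) | (v << (128u - r));
}

int main(void) {
    uint32_t p[4] = {0x11111111u, 0x22222222u, 0x33333333u, 0x44444444u};
    uint32_t s[4] = {p[1], p[2], p[3], p[0]};   /* the .yzwx swizzle */
    u128 v, w;
    memcpy(&v, p, 16);   /* little-endian: p[0] (.x) is the low word */
    memcpy(&w, s, 16);
    printf("%s\n", w == ror128(v, 32) ? "swizzle == rotate" : "mismatch");
    return 0;
}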

And the inverse:

uvec4 bjhash128_inv(uvec4 p) {
    p ^= p >> shft.wxyz;
    p.yz ^= p.yz >> (shft.xy << 1U);
    p = dblxoro128_inv(p);
    p *= uvec4(0x333C4925U, 0x12DC7D1DU, 0x349E6A99U, 0x3B43F55BU);
    p ^= p >> shft.yzwx;
    p.xw ^= p.xw >> (shft.yx << 1U);
    p = dblxoro128_inv(p);
    p *= uvec4(0x529E6A99U, 0xD29E6A99U, 0x5ADC7D1DU, 0x929E6A99U);
    p ^= p >> shft;
    p.xy ^= p.xy >> (shft.xy << 1U);
    return p;
}
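For anyone puzzling over the first two lines of the inverse: undoing a 32-bit xorshift y = x ^ (x >> s) takes repeated XOR-shifts with the shift amount doubling until it reaches 32, so the lanes shifted by 16 and 17 need only one step, while the lanes shifted by 14 and 15 (y and z here) need the extra p.yz ^= ... step with shifts of 28 and 30. A scalar C sketch of the general inverse:

#include <stdint.h>

/* invert y = x ^ (x >> s) for 0 < s < 32; each pass doubles the shift:
   entering with y == x ^ (x >> sh), it leaves y == x ^ (x >> 2*sh) */
uint32_t unxorshift32(uint32_t y, unsigned s) {
    for (unsigned sh = s; sh < 32u; sh <<= 1)
        y ^= y >> sh;
    return y;
}

So unxorshift32(y, 17) costs one XOR-shift, while unxorshift32(y, 14) costs two.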

And the Shadertoy is here https://www.shadertoy.com/view/mstXD2, with further explanation and some basic debugging tools. Obviously the above code was written with vector parallelism in mind, so I apologize if it doesn't translate to C very well, but I'm hoping it might give you a good starting point for developing a faster 128-bit variant in C. I've been rolling around the idea of writing a compute shader to do what your Hash Prospector does, but on the GPU, so it could scale efficiently to 128-bit testing. As such, I still haven't done any tests for statistical robustness beyond a visual inspection, so there's room for improvement in choosing the constants and bitwise shifts. That said, all the bits seem visually random, both the lowest and the highest, which gives me hope that this is a feasible variant. I'm just curious what your thoughts are.

myth0genesis commented 1 year ago

Here's the double XOR-rotate inversion function I forgot to include:

uvec4 dblxoro128_inv(uvec4 x) {
    for (int i = 0; i < 7; i++) x ^= x.yzwx ^ x.zwxy;
    return x;
}

This can probably be collapsed somewhat. I'll try to do it when I have time.

myth0genesis commented 1 year ago

Okay. I found that only three iterations of the rotation inversion above, instead of seven, are needed. I collapsed the inversion function by simplifying the cyclic permutations, writing Cᵏ for a cyclic rotation by k bits (so C¹²⁸ = C⁰ = I) with coefficients taken mod 2, since XOR is addition in GF(2):

(I+C³²+C⁶⁴)(I+C³²+C⁶⁴)(I+C³²+C⁶⁴)
= (I+C³²+C⁶⁴)(I(I+C³²+C⁶⁴) + C³²(I+C³²+C⁶⁴) + C⁶⁴(I+C³²+C⁶⁴))
= (I+C³²+C⁶⁴)((I+C³²+C⁶⁴) + (C³²+C⁶⁴+C⁹⁶) + (C⁶⁴+C⁹⁶+C⁰))
= (I+C³²+C⁶⁴)(2I + 2C³² + 3C⁶⁴ + 2C⁹⁶)    [C⁰ = I]
= (I+C³²+C⁶⁴)C⁶⁴                          [coefficients mod 2]
= C⁶⁴ + C⁹⁶ + C¹²⁸
= I + C⁶⁴ + C⁹⁶

Which means the inversion function above can be simplified to:

uvec4 dblxoro128_inv(uvec4 x) {
    return x ^ x.zwxy ^ x.wxyz;
}
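A quick brute-force check of this identity in C (a sketch using plain arrays in place of GLSL vectors and rand() for the lanes):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* forward step: x ^= x.yzwx ^ x.zwxy */
static void dblxoro128(uint32_t x[4]) {
    uint32_t r1[4] = {x[1], x[2], x[3], x[0]};  /* .yzwx */
    uint32_t r2[4] = {x[2], x[3], x[0], x[1]};  /* .zwxy */
    for (int i = 0; i < 4; i++) x[i] ^= r1[i] ^ r2[i];
}

/* collapsed inverse: x ^ x.zwxy ^ x.wxyz */
static void dblxoro128_inv(uint32_t x[4]) {
    uint32_t r2[4] = {x[2], x[3], x[0], x[1]};  /* .zwxy */
    uint32_t r3[4] = {x[3], x[0], x[1], x[2]};  /* .wxyz */
    for (int i = 0; i < 4; i++) x[i] ^= r2[i] ^ r3[i];
}

int main(void) {
    for (int t = 0; t < 1000000; t++) {
        uint32_t x[4], y[4];
        for (int i = 0; i < 4; i++)
            y[i] = x[i] = (uint32_t)rand() << 16 ^ (uint32_t)rand();
        dblxoro128(x);
        dblxoro128_inv(x);
        for (int i = 0; i < 4; i++)
            if (x[i] != y[i]) { puts("mismatch"); return 1; }
    }
    puts("inverse verified");
    return 0;
}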

The Shadertoy link I gave earlier has been updated to reflect this.

tommyettinger commented 1 year ago

This is very cool. I'm not sure how hash-prospector could test this, since even getting reasonable comparisons of 64-bit hashes is challenging... The component swizzling approach for uvec4 seems solid, and I can't find any easy ways to speed up your hash (maybe the inverse can be faster if shft uses only values that are 16 or greater, but I don't know if that worsens quality). I think modern CPUs can work with groups of four 32-bit uints using SIMD instructions, so this could be applicable to more than GPUs.

fp64 commented 1 year ago

> I don't really know any C, so I couldn't give you the C equivalent here

For modern GCC/clang (and icc/icx), I think, it translates rather straightforwardly:

#include <stdint.h>

typedef uint32_t uvec4 __attribute__((vector_size(16)));

uvec4 bjhash128(uvec4 p) {
    #define SWIZZLE(v, a, b, c, d) (uvec4){v[a], v[b], v[c], v[d]}
    const uvec4 shft = (uvec4){14U, 15U, 16U, 17U};
    p ^= p >> shft;
    p *= (uvec4){0xEAF649A9U, 0x6AF649A9U, 0x050C2D35U, 0xAAF649A9U};
    p ^= SWIZZLE(p, 1, 2, 3, 0) ^ SWIZZLE(p, 2, 3, 0, 1);             //p ^= p.yzwx ^ p.zwxy;
    p ^= p >> SWIZZLE(shft, 1, 2, 3, 0);                              //p ^= p >> shft.yzwx;
    p *= (uvec4){0x21F0AAADU, 0x0D0C2D35U, 0xE8F649A9U, 0xD30332D3U};
    p ^= SWIZZLE(p, 1, 2, 3, 0) ^ SWIZZLE(p, 2, 3, 0, 1);             //p ^= p.yzwx ^ p.zwxy;
    return p ^ p >> SWIZZLE(shft, 3, 0, 1, 2);                        //return p ^ p >> shft.wxyz;
    #undef SWIZZLE
}

Of course, you can also use intrinsics for specific instruction set (SSE2/NEON/whatever) instead.

Note: parenthesizing (uvec4) like this is only necessary in C, not in C++, but does not hurt.

This definition of SWIZZLE optimizes OK (at least in this instance), but you can also do:

    #define SWIZZLE(v, a, b, c, d) __builtin_shufflevector(v, v, a, b, c, d) // <-- clang (6.0.0+) and newer GCC (12.1+)
    #define SWIZZLE(v, a, b, c, d) __builtin_shuffle(v, uvec4{a, b, c, d})   // <-- older GCC (4.8.1+), but not clang

As for the hash itself, bjhash128(uvec4(counter,0,0,0)).x fails PractRand (-te 0 -tf 2) at 128KB (still better than some popular hashes).
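For reference, a driver for that kind of test could look something like this (a sketch: it streams the .x lane to PractRand's RNG_test via stdin, with bjhash128 and the uvec4 typedef as above):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t uvec4 __attribute__((vector_size(16)));

uvec4 bjhash128(uvec4 p);   /* as defined above */

int main(void) {
    /* usage (hypothetical): ./driver | ./RNG_test stdin32 -te 0 -tf 2 */
    for (uint32_t counter = 0; ; counter++) {
        uvec4 h = bjhash128((uvec4){counter, 0, 0, 0});
        uint32_t out = h[0];                /* the .x component under test */
        fwrite(&out, sizeof out, 1, stdout);
    }
}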

fp64 commented 1 year ago

Note that bjhash128(uvec4(counter,0,0,0)).y is much better: it fails at 16GB (any uint32 -> anything hash must fail at 32GB or earlier, due to the input repeating). And bjhash128(uvec4(0,counter,0,0)).y fails at 8GB, so it's not as if diagonal elements are necessarily bad. Edit: however, bjhash128(uvec4(0,counter,0,0)).x fails at 16KB, bjhash128(uvec4(0,0,counter,0)).x at 128KB, and bjhash128(uvec4(0,0,0,counter)).x also at 128KB. This seems like a pretty clear indication that something is going on with the x component of the output.

All 16 possible versions of this pass the birthday test, incidentally.

fp64 commented 1 year ago

The constants being so similar

D30332D3
050C2D35
0D0C2D35
E8F649A9
6AF649A9
AAF649A9
EAF649A9
21F0AAAD

looks a bit suspicious.

Edit: just realized the constants are from https://github.com/skeeto/hash-prospector/issues/19.

myth0genesis commented 1 year ago

Yeah, the constants were chosen arbitrarily from the list in #19 just to have something to start with. There was no statistical motivation for choosing them, but I'm reasonably confident that tweaking the constants and shift values could yield better results. It was somewhat disappointing to see that one of the swizzles failed at just 16KB. I'm still working on a battery of tests for the GPU. I started off with OpenGL 4.6, but decided to switch to WGSL, as the WebGPU API is supposed to be broadly available in Chromium-based browsers sometime near the end of May, I believe, and it supports compute shaders. It would also probably be quicker to prototype with, since I wouldn't have to compile any code before running it.

fp64 commented 1 year ago

It's just how the constants interact, e.g.

    p *= uvec4(0xEAF649A9U, 0x6AF649A9U, 0x050C2D35U, 0xAAF649A9U);
    p ^= p.yzwx ^ p.zwxy;

is the same as

    p = uvec4(p.x * 0xEAF649A9U, p.y * 0x6AF649A9U, p.z * 0x050C2D35U, p.w * 0xAAF649A9U)
      ^ uvec4(p.y * 0x6AF649A9U, p.z * 0x050C2D35U, p.w * 0xAAF649A9U, p.x * 0xEAF649A9U)
      ^ uvec4(p.z * 0x050C2D35U, p.w * 0xAAF649A9U, p.x * 0xEAF649A9U, p.y * 0x6AF649A9U);

and the resulting p.w looks like (w * M) ^ (x * M) ^ (y * M) when viewed mod $2^{30}$, since three of the four constants share the same low 30 bits. Not sure if it affects quality.
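The coincidence is easy to check by masking off the top two bits (a quick sketch):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* the first-round constants for lanes x, y, z, w */
    uint32_t c[4] = {0xEAF649A9, 0x6AF649A9, 0x050C2D35, 0xAAF649A9};
    for (int i = 0; i < 4; i++)
        printf("%08X mod 2^30 = %08X\n", c[i], c[i] & 0x3FFFFFFF);
    /* lanes x, y, and w all print 2AF649A9 */
    return 0;
}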

Of course, in the above tests only one input component was non-zero anyway.

I myself tend to use constants from https://gist.github.com/Marc-B-Reynolds/53ecc4f9648d9ba4348403b9afd06804 with equally little justification.

Edit: another thing: the shift amounts forming a linear progression {14,15,16,17} may interact undesirably with swizzles that move components by one place (or maybe not).

myth0genesis commented 1 year ago

The linear shifts were also an arbitrary choice; I swizzle the constant vector that contains them between steps, hoping that would be enough. I also thought about using const uvec2 shft = uvec2(15U, 16U); and just using different swizzles like

p ^= p >> shft.xyxy;
p ^= p >> shft.xyyx;
p ^= p >> shft.xxyy;

etc... But when I tried that, the visual randomness seemed to suffer. Experimenting with different swizzles of the original 4-vector might improve things, though, like

p ^= p >> shft.xywz;
p ^= p >> shft.zywx;

and you're not constrained to using all the swizzles, so an option could be

p ^= p >> shft.yywz;
p ^= p >> shft.xzzy;

and the like.

Adam-Vandervorst commented 1 year ago

Interesting to see the generalization to higher bit counts. I'm particularly interested in very high bit counts ($2^{11}$ to $2^{16}$) and wonder how well these compose.

That is, to what extent can you apply different generated hash functions one after the other, instead of applying one repeatedly over the whole bitstring? Some early thoughts:

They don't quite need to be secure for my use case, but anything naive fails (like literally putting them side by side, because the effects of bit changes stay within their own bucket).

myth0genesis commented 1 year ago

If you wanted to extend the hash above to higher bit counts, it would probably work fine for any power-of-2-times-32-bit input, but in order to ensure that bits propagate across the entire state, you would have to do more rounds of double-XOR-swizzles; see the sketch below.
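For illustration, one round of the double-XOR-swizzle step generalized to N 32-bit lanes might look like the following (a hypothetical sketch; more rounds would be needed for full diffusion as N grows):

#include <stdint.h>
#include <string.h>

#define N 8   /* e.g. a 256-bit state */

/* one round of x ^= rot1(x) ^ rot2(x), the N-lane analogue of
   p ^= p.yzwx ^ p.zwxy */
static void mix_round(uint32_t x[N]) {
    uint32_t t[N];
    for (int i = 0; i < N; i++)
        t[i] = x[i] ^ x[(i + 1) % N] ^ x[(i + 2) % N];
    memcpy(x, t, sizeof t);
}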

I'm not sure why you would want to compare hash functions directly. Could you explain your reason for that? It might help me answer your question more clearly. The typical way to test the quality of a hash is to run a battery of tests that output some kind of p-value, with the threshold set where a truly random sequence would pass every test (though false negatives do occasionally come up, even among truly random sequences). So in essence, this is equivalent to testing a hash against randomness itself, and the comparison comes from how closely each hash stacks up against that.

As far as directly comparing hashes goes, depending on what exactly you want to do, it could be trivially fast on a GPU in a fragment or compute shader, or it could be as rigorous as running a battery of tests like those mentioned above, possibly even more so. A direct comparison of output values or bits could be done in realtime over a very large set of inputs in a fragment shader and would be very easy to implement. Checking for bit-pattern overlaps between hashes would be far more intensive, though, especially among bijective hashes, as the full period of the input state is so prodigiously large that checking the full domain isn't reasonable (128 bits give over 340 undecillion possible outputs). Using the results of a test to drive another hash in realtime (which is what it sounds like you want to do) would likely be very difficult, and maybe not even possible, depending on what kinds of comparisons you want to drive it with.

Bijective hashes may be used for encryption, but that doesn't necessarily mean they are "secure". It depends on what you want. If you want something fast, you'll likely have to sacrifice quality. If you want something more random, you'll probably have to forgo speed. Finding something in between is where testing hashes is invaluable.

myth0genesis commented 1 year ago

Just thought I'd give an update on my GPU hash test suite progress. I'm too clumsy with GitHub to really maintain a codebase here yet, as I'm still pretty new to the platform. Since I'm going with the NIST SP 800-22 recommendations for tests, I needed to build a small codebase of fast (though slightly less accurate) variants of some of the functions in cephes and math.h. Namely, I've made shader-language equivalents of erf() and the gamma-related functions that are so ubiquitous in the last step of each test the NIST recommends.

Some tests will be blazing fast on a GPU, like the random walk test, where you could test strings of hundreds of millions of bits in under a second on a modern mid-tier graphics card. Even the discrete Fourier transform test can be performed on a very large set in almost no time. But the monobit frequency test is trickier with respect to fast memory management and parallelism: it's more involved to keep the results of the different stages of computation in GPU memory without leaving a huge gap of under-utilized memory somewhere, or having to write to and read from non-volatile memory. The main difficulty, though, is that workgroups are limited to a 3D structure while the computation is 4D, and the transform from workgroup IDs to the integer space of a 4x32-bit hash shouldn't add too much overhead.
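For reference, the CPU version of the monobit frequency test is tiny; the hard part on a GPU is parallelizing the bit-count reduction. A sketch following the SP 800-22 definition (using GCC/clang's __builtin_popcount):

#include <stdint.h>
#include <stddef.h>
#include <math.h>

/* NIST SP 800-22 frequency (monobit) test: sum the bits as +1/-1,
   then p = erfc(|S| / sqrt(2n)); the sequence passes if p >= 0.01 */
double monobit_p(const uint32_t *words, size_t nwords) {
    long long s = 0;
    for (size_t i = 0; i < nwords; i++)
        s += 2LL * __builtin_popcount(words[i]) - 32;  /* ones minus zeros */
    double n = 32.0 * (double)nwords;
    return erfc(fabs((double)s) / sqrt(2.0 * n));
}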

tommyettinger commented 1 year ago

It seems odd that the NIST recommends erf() for evaluating the statistical properties of a digital algorithm, since there are many different approximations of it but it doesn't seem to be exactly computable. I wonder what alternative hash test suites like SMHasher use... It sounds like you've already done a lot of hard work, thank you for that!

myth0genesis commented 1 year ago

In the case of the NIST's recommendations, erf() and igamc() only seem to be used in the final step of each test, namely calculating p-values, so there really shouldn't be much in the way of cumulative error. The erf() approximation I settled on is Winitzki's, as it's a lot faster than using continued fractions and more accurate than the Bürmann series; it's accurate to about 7 significant digits.

The upper incomplete gamma function was tougher. I went through about a dozen "approximation" methods that seemed to work fine with really high upper bounds, but were terrible when the upper bound was small and close to the lower bound; some were off by more than a factor of 2 for low values. I actually had to go back to the 1992 edition of "Numerical Recipes in FORTRAN 77" to find that the best way was just to use the odd-term continued fraction, which converges really quickly. I've found the accuracy to be anywhere between 6 and 11 significant digits with just four iterations, with the accuracy on the higher end for lower values. An accuracy of 3 significant digits is all that's needed for the NIST's tests, as the p-value threshold is 0.01 for all of them, as far as I'm aware.

Either way, all of the transcendental versions of these functions can be represented as continued fractions with integer terms, so the accuracy can be whatever a user desires. Good point, @tommyettinger! Perhaps I should give users the option to choose which form of the functions to use: the faster ones, or the transcendental ones where they can pick the number of iterations to suit their needs.
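For anyone curious, one commonly cited form of Winitzki's approximation looks like this (a sketch; the exact variant and constant used in the test suite may differ):

#include <math.h>

/* Winitzki-style erf() approximation; a ~= 0.147 is the commonly
   quoted constant (M_PI is from POSIX math.h) */
double erf_approx(double x) {
    const double a = 0.147;
    double x2 = x * x;
    double t = x2 * (4.0 / M_PI + a * x2) / (1.0 + a * x2);
    double y = sqrt(1.0 - exp(-t));
    return x < 0.0 ? -y : y;
}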