skeeto / hash-prospector

Automated integer hash function discovery
The Unlicense
672 stars 27 forks source link

[Q] Seeded integer hash functions #16

Open hauntsaninja opened 3 years ago

hauntsaninja commented 3 years ago

Thank you for the cool project and writeup!

I was curious — do you have recommendations for integer hash function families? That is, given some amount of seed entropy, choose an integer hash function; for example, a naive thing one could do with a 32 bit seed is return lowbias32(x ^ seed).

Bulat-Ziganshin commented 3 years ago

the first idea that come to mind crc32(uint64(x)*seed), of course for seed!=0 and 32-bit x value

hauntsaninja commented 3 years ago

Thanks! I guess I meant "integer hash function" in the sense that the README uses it, which seems to imply reversibility. crc32(uint64) in general can't be reversible, but maybe I'm misunderstanding something.

Logan007 commented 3 years ago

a naive thing one could do with a 32 bit seed is return lowbias32(x ^ seed)

From my experience getting Pearson B. Hashing to pass SMHasher especially with seeding, I would use some scramble of the seed, e.g. lowbias32( x ^ scramble(seed) ); where scramble() does not necessarily need to be the full lowbias() – though, it could.

For example, and depending on your needs, a simple two-step seed ^= seed >> 13; seed *= 0x7896b6ab; might be sufficient for scrambling the seed upfront.

skeeto commented 3 years ago

You're on the right track. XORing the input is what I've done in the past myself. It's safe in that it cannot negatively affect the permutation regardless of the seed. It's is better than ADDing (hash32(x + seed)) which merely rotates the permutation, and definitely better than XORing the output, in which case the seed doesn't play a role in avalanche.

I haven't done any thorough testing on this yet, but another possibility I've considered is mixing the seed in the middle:

uint32_t hash32(uint32_t x, uint32_t seed)
{
    x ^= x >> A;
    x *= B;
    x ^= x >> C;
    x ^= seed;
    x *= D;
    x ^= x >> E;
}

This more fundamentally effects the permutation, though maybe in a bad way (why this needs more testing). Also, is it better before or after the middle xorshift? I don't know yet.

There are also other exotic options:

// WARNING: This is just for illustration purposes!
uint32_t hash32(uint32_t x, uint32_t seed)
{
    uint32_t hi = (seed >> 15) & 0x1fffe;
    uint32_t lo = (seed <<  1) & 0x1fffe;
    x ^= x >> A;
    x *= B ^ hi;
    x ^= x >> C;
    x *= D ^ lo;
    x ^= x >> E;
}

However changing the multiplier constants undermines the whole point of using the good constants, and some seeds are going to create lousy permutations. Probably not worth it.

Bulat-Ziganshin commented 3 years ago

afair, crcN(uintN) is always reversible function, so author can include it in his list of reversible primitives

that said, half of his primitives are parameterized, so you can use any sequence of the primitives that employs enough entropy from the seed. but since crc and mul are better bit-mixers than other ops, crc(x*(seed|1)) still looks for me like the most efficient reversible hash function

and regarding quality, you can reuse his code to compute AVERAGE bias for random-generated operation parameters instead of searching for MINIMAL one, for various operation sequences. actually, it would make nice addition to the current mode of operation

some seeds are going to create lousy permutations

they can't do it as far as each individual operation is reversible

I would use some scramble of the seed

it's useless except when your seeds are from extremally poor PRNG such as return n++. It may be useful only if you use seed more than once such as

x *= seed
x <<<= 17
x *= scramble(seed)
hauntsaninja commented 3 years ago

Thanks all!

lowbias32( x ^ scramble(seed) );

Makes sense, luckily for my use case I know the seeds will be good.

XORing the input is what I've done in the past myself.

Good to know I'm not doing something terrible. Although it does seem to me that lowbias32((x + 1) + seed) == lowbias32(x + (1 + seed)) is very similar to lowbias32((x ^ 1) ^ seed) == lowbias32(x ^ (1 ^ seed)), so given high quality seeds I'm not sure I see that XOR is better than ADD.

I'm also curious about thoughts on how to use, say, 128 or 256 or more bits of seed. The naive thing I thought of is repeatedly mixing the seed in the middle, but maybe there's something more efficient:

for seed in seeds:
    x = lowbias32(x ^ seed)

 compute AVERAGE bias for random-generated operation parameters instead of searching for MINIMAL one

Yeah, measurement is an interesting question (and the path forward from here). You'd want to ensure hash functions with different seeds aren't correlated (e.g., if you just ignored the seed, lowbias32 would perform pretty well if you just examined each seeded hash function in isolation). I have a measurement that's specific to my use case, but it's pretty blunt. Like skeeto, I'm a little untrusting of solutions that mix seeds into the multiplier constants; your suggestion is probably a good way to validate whether that's a good idea.

Logan007 commented 3 years ago

128 or 256 or more bits of seed

This calls for cryptography. SPECK offers a version featuring 32-bit block size and 64-bit key size.

Other than that, if you implemented reversible bit-permutation as an intermediate step, it would consume log₂ (32!) ≈ 118 bits of your seed to name a specific permutation. Leaves a few more for an XOR mask to invert some of the bits – or CLMUL, see below... However, I doubt that bit-permutation meets computation efficiency requirements... :wink:

x = lowbias32(x ^ seed)

Why not use hardware-accelerated carry-less multiplication CLMUL instead of XOR? ./prospector -p mul,xorr,clmul -t 1000 gives better results, i.e. lower bias, than ./prospector -p mul,xorr,xor -t 1000. In this case, seed's LSB is always set, so it takes 31 bits off your long seed.

CLMUL changes the (2³² --> 2³²) function's structure in a more complex way than adding or exclusive-oring a constant at some (intermediate) point(s). It steers how and which of x's bits are XORed together for the result. Due to 32-bit cut-off, the left bits get mixed most (otherwise, the middle ones) while the lower bits see less mixing.

Also, I would not speak against arithmetic multiplication by the seed (with LSB set) as long as it remains an additional intermediate step and does not replace the multiplication with the good constants – and well, neither immediately follows nor directly precedes it to not de-associativate its effect. Compared to CLMUL, the propagated carry will have an increasing smearing effect on the upper-most bits (this imbalance is mitigated by the following and well matching xor-shift-right in mul,xorr-patterns, bringing back parts of it down to the lower bits).

The CLMUL-step, lacking the carry, could be combined with a prepended shift-right-add (ADDR not implemented yet, but you can modify ADDL's 0xe7 to 0xef) to leverage some carry-smearing at least.

P.S. Still calculating log₂ (2³²!) for optimal minimum seed length... :wink:

pspeter3 commented 2 years ago

@skeeto have you considered testing the bias of the options you've suggested vs the XOR? In the meantime, would you recommend XOR?

pspeter3 commented 2 years ago

Related, could use the XOR method for a list of integers? Eg

declare function triple32inc(x: number): number;

function hash(seed: number, values: number[]): number {
    for (const value of values) {
        seed = triple32inc(seed ^ value);
    }
    return seed;
}
account-login commented 2 years ago

What's the purpose of seeding? I think the purpose of seeding can be viewed as generating different permutations, with this in mind, xor the input with seed is not sufficient.

Consider two permutations, one that maps 0, 1, 2, ..., 2^32-1 to h(0), h(1), h(2^32-1), the other maps 0, 1, 2, ..., 2^32-1 to h(0^seed), h(1^seed), h((2^32-1)^seed). Now consider 2 adjacent input that differs only on the LSB, after xor with the seed, the 2 input still differs only on LSB, thus still on an adjacent position, so adjacent pairs on the first permutation output can be found on the second permutation output, thus the permutation is not changed sufficiently.

Another view on this is that the hash function itself is multiple rounds of weak permutation stacked on top of another, xor the input can be viewed as adding another round of weak permutation to the hash function, which does not change the hash function much.

By the way, I have tested ideas of adding seed to the hash32 function.