Woodland perfect hash - Githubissues

Kubuxu commented 6 months ago

This lookup table works with no collisions because each mansion seed has a unique lower 32 bits, somehow. I don't understand why that's true, it's fascinating. You'd think it wouldn't work. But I think the coefficients 341873128712 and 132897987541 may have been specially chosen to make this work? Like, if you have 2.2 billion marbles, and 4.3 billion buckets, and you independently put each marble in a random bucket, what are the odds that each marble gets its own bucket? Essentially zero. Nearing the end, each new marble has a more than 50% chance of hitting a bucket that's already filled. Yet, clearly, these are not independently random, so somehow it works. Unironically if you're reading this and understand how this works or why those two specific coefficients make this work, please let me know.

To shine some light on the topic. The woodland seed function is a perfect hash of the cords (x,z) in the limited range. It is from the family of linear hash functions. Usual construction would require that parameters are co-prime to modulus and each-other but due to limited range of x and z the function is still injective despite one of the parameters not being co-prime with the modulus.

In case of only one input, if the parameter is co-prime to the modulus, it will create an additive group of size N, meaning it it will create a perfect hash for all values [0, N). (I can write more on it latter).

Lightwood13 commented 6 months ago

Hey @leijurv. Here's an explanation I came up with, why the mansion seed's lower 32 bits happen to have no collisions, and it turns out it involves some lattices again!

The seed is calculated as

seed = x * 341873128712 + z * 132897987541 - 4172144997891902323 mod 2^48

or introducing some notation:

seed = ax + bz + c mod 2^48

We're looking at the lower 32 bits, i.e. seed mod 2^32 = ax + bz + c mod 2^48 mod 2^32 and because 2^48 is divisible by 2^32 we can drop the inner mod and get

hash(x, z) = ax + bz + c mod m, m = 2^32

We want to check if there are any collisions:

hash(x1, z1) = hash(x2, z2)
ax1 + bz1 + c = ax2 + bz2 + c mod m
a(x1 - x2) + b(z1 - z2) = 0 mod m
aΔx + bΔz = 0 mod m

where Δx = x1 - x2, Δz = z1 - z2. Now we can use the fact, that b is invertible mod m:

Δz = -ab^(-1)Δx mod m

or introducing k = -ab^(-1) mod m, which is k = 1675244312 for our case:

Δz = kΔx mod m   (1)

So, two points (x1, z1) and (x2, z2) have the same hash if and only if differences in their coordinates satisfy equation (1). Now, because (1) is a simple linear relation, if we plot points, that satisfy it, we get a lattice! It's basis vectors are (1, k) and (0, m). After performing LLL reduction for our case we get a reduced basis (27794, -48208) and (56211, 57032).

Now all we have to do, is to check that no pair of our input points have differences of coordinates Δx and Δz that are contained in this lattice. Our input points are contained in a square -23440 < x < 23440, -23440 < z < 23440. That means Δx and Δz lie in a square that is twice as big: -46880 < Δx < 46880, -46880 < Δz < 46880 (to understand why, imagine two input points on opposite sides of the square, and what their coordinate differences would be). Plotting this square together with a lattice we can see, that it indeed doesn't overlap with any lattice points (except (0, 0), but it's not a hash collision, because then x1 = x2, z1 = z2, so it's the same point):

Red arrows are the reduced basis vectors.

Now let's look into the question of whether this is a rare situation, that is caused by a specific selection of coefficients a and b or would it work with any coefficients. On the first glance, it may seem strange that points in our square generate 46880^2 = 2197734400 unique hashes, that is more than half of possible 2^32 values. But let's estimate the average distance between points in the lattice.

If we change Δx from 0 to m - 1, and calculate Δz from equation (1), then Δz will also change in the range [0, m - 1] and we should thus have m lattice points that fill a square 0 <= Δx < m, 0 <= Δz < m. It seems logical to assume, that if our hash function is supposed to look random, then lattice points would be distributed almost evenly in all directions (and also length of both reduced basis vectors will be close to each other). If m points are distributed evenly inside a square of side m, which has an area A = m^2, that means area per point is Ap = A / m = m and the distance between neigbouring points would be on the order of sqrt(Ap) = sqrt(m). This means that there should be an area free of any lattice point around origin with the size on the order of Ap = m. In other words, around any point (x, z) there's an area that has on the order of m points with no hash collisions and the closest point with a hash collision is at a distance on the order of sqrt(m) and this property is not a coincidence but a property of such linear hash functions.

This is of course a heurisitic argument, that only gives approximate values, without any specific coefficients, but here's an interesting specific example. Let's assume that our input points lie inside a circle instead of a square, and then try to make this circle as big as possible by choosing an appropriate value of coefficient k. This means we want the closest lattice point from origin to be as far from it as possible, which is the same as minimizing the length of vectors in reduced basis. By checking all values of k from 0 to 2^32 - 1 with some code, I found that the best value is k = 721003138 which produces reduced basis vectors (-5671, 70194) and (57995, 40006). The lattice is an almost perfect triangular lattice with the angle between basis vectors of 60.0018 degrees.

The radius of the circle is the length of the smallest of basis vectors, which is approximately 70422. To go back from Δx and Δz back to x and z we'll have to divide size of the circle by 2, the same as with square. Then its area will be π(70422/2)^2 ≈ 3.89 * 10^9 which is 90.69% of possible hash values. So, with this k value, we can have a circle of radius 35211 around any point, with any two points inside the circle having different hashes and all hashes covering 90% of all possible hash values. (Also, an interesting problem is to calculate this ratio exactly, when we have a perfect triangular grid and m is big. It works out to be π/(2*sqrt(3))).

Here's also an example of a bad value of k (specifically k = 149943) for which points are distributed quite anisotropically (have different spacings in different directions) and therefore hash collisions in one of the directions will occur more often, compared to when points are distributed isotropically.

So, to sum up, it turns out that with any linear hash function that generates outputs that are sufficiently random (lattice is isotropic enough) we'll have a pretty large region around any input point, that has no hash collisions. Also all hash collisions will occur in a pattern that forms a lattice, which is also pretty interesting.

leijurv commented 6 months ago

very cool!! thanks for the analysis

spawnmason / randar-explanation

Woodland perfect hash #4