Open Kubuxu opened 6 months ago
Hey @leijurv. Here's an explanation I came up with, why the mansion seed's lower 32 bits happen to have no collisions, and it turns out it involves some lattices again!
The seed is calculated as
seed = x * 341873128712 + z * 132897987541 - 4172144997891902323 mod 2^48
or introducing some notation:
seed = ax + bz + c mod 2^48
We're looking at the lower 32 bits, i.e. seed mod 2^32 = ax + bz + c mod 2^48 mod 2^32
and because 2^48 is divisible by 2^32 we can drop the inner mod and get
hash(x, z) = ax + bz + c mod m, m = 2^32
We want to check if there are any collisions:
hash(x1, z1) = hash(x2, z2)
ax1 + bz1 + c = ax2 + bz2 + c mod m
a(x1 - x2) + b(z1 - z2) = 0 mod m
aΔx + bΔz = 0 mod m
where Δx = x1 - x2
, Δz = z1 - z2
. Now we can use the fact, that b is invertible mod m:
Δz = -ab^(-1)Δx mod m
or introducing k = -ab^(-1) mod m
, which is k = 1675244312
for our case:
Δz = kΔx mod m (1)
So, two points (x1, z1)
and (x2, z2)
have the same hash if and only if differences in their coordinates satisfy equation (1).
Now, because (1) is a simple linear relation, if we plot points, that satisfy it, we get a lattice! It's basis vectors are (1, k)
and (0, m)
. After performing LLL reduction for our case we get a reduced basis (27794, -48208)
and (56211, 57032)
.
Now all we have to do, is to check that no pair of our input points have differences of coordinates Δx
and Δz
that are contained in this lattice. Our input points are contained in a square -23440 < x < 23440, -23440 < z < 23440
. That means Δx
and Δz
lie in a square that is twice as big: -46880 < Δx < 46880, -46880 < Δz < 46880
(to understand why, imagine two input points on opposite sides of the square, and what their coordinate differences would be). Plotting this square together with a lattice we can see, that it indeed doesn't overlap with any lattice points (except (0, 0)
, but it's not a hash collision, because then x1 = x2, z1 = z2
, so it's the same point):
Red arrows are the reduced basis vectors.
Now let's look into the question of whether this is a rare situation, that is caused by a specific selection of coefficients a
and b
or would it work with any coefficients. On the first glance, it may seem strange that points in our square generate 46880^2 = 2197734400
unique hashes, that is more than half of possible 2^32
values. But let's estimate the average distance between points in the lattice.
If we change Δx
from 0
to m - 1
, and calculate Δz
from equation (1), then Δz
will also change in the range [0, m - 1]
and we should thus have m lattice points that fill a square 0 <= Δx < m, 0 <= Δz < m
. It seems logical to assume, that if our hash function is supposed to look random, then lattice points would be distributed almost evenly in all directions (and also length of both reduced basis vectors will be close to each other). If m
points are distributed evenly inside a square of side m
, which has an area A = m^2
, that means area per point is Ap = A / m = m
and the distance between neigbouring points would be on the order of sqrt(Ap) = sqrt(m)
. This means that there should be an area free of any lattice point around origin with the size on the order of Ap = m
. In other words, around any point (x, z)
there's an area that has on the order of m
points with no hash collisions and the closest point with a hash collision is at a distance on the order of sqrt(m)
and this property is not a coincidence but a property of such linear hash functions.
This is of course a heurisitic argument, that only gives approximate values, without any specific coefficients, but here's an interesting specific example. Let's assume that our input points lie inside a circle instead of a square, and then try to make this circle as big as possible by choosing an appropriate value of coefficient k
. This means we want the closest lattice point from origin to be as far from it as possible, which is the same as minimizing the length of vectors in reduced basis. By checking all values of k
from 0 to 2^32 - 1 with some code, I found that the best value is k = 721003138
which produces reduced basis vectors (-5671, 70194)
and (57995, 40006)
. The lattice is an almost perfect triangular lattice with the angle between basis vectors of 60.0018
degrees.
The radius of the circle is the length of the smallest of basis vectors, which is approximately 70422. To go back from Δx
and Δz
back to x
and z
we'll have to divide size of the circle by 2, the same as with square. Then its area will be π(70422/2)^2 ≈ 3.89 * 10^9
which is 90.69%
of possible hash values. So, with this k value, we can have a circle of radius 35211 around any point, with any two points inside the circle having different hashes and all hashes covering 90% of all possible hash values. (Also, an interesting problem is to calculate this ratio exactly, when we have a perfect triangular grid and m is big. It works out to be π/(2*sqrt(3))
).
Here's also an example of a bad value of k (specifically k = 149943
) for which points are distributed quite anisotropically (have different spacings in different directions) and therefore hash collisions in one of the directions will occur more often, compared to when points are distributed isotropically.
So, to sum up, it turns out that with any linear hash function that generates outputs that are sufficiently random (lattice is isotropic enough) we'll have a pretty large region around any input point, that has no hash collisions. Also all hash collisions will occur in a pattern that forms a lattice, which is also pretty interesting.
very cool!! thanks for the analysis
To shine some light on the topic. The woodland seed function is a perfect hash of the cords (x,z) in the limited range. It is from the family of linear hash functions. Usual construction would require that parameters are co-prime to modulus and each-other but due to limited range of x and z the function is still injective despite one of the parameters not being co-prime with the modulus.
In case of only one input, if the parameter is co-prime to the modulus, it will create an additive group of size N, meaning it it will create a perfect hash for all values [0, N). (I can write more on it latter).