**fp64** opened this issue 2 years ago:
```c
x^=((x+a)*(x+a))&-2;
```
FWIW: I spent a little bit of time looking at this function WRT strict avalanche criterion measures (more like a spoonful than a grain of salt for these observations). The constant 'a' doesn't seem to justify itself (adding one to the dependency chain). Continued here: https://twitter.com/marc_b_reynolds/status/1576848952846319616
Somewhat lagged reply.
> The constant 'a' doesn't seem to justify itself

This matches my experience trying to use it as a step of a stateless PRNG.

Note that `x^=((x+a)*(x+a))&-2;` is particularly bad latency-wise; `x=((x+a)&-2)^(x*x);` may be ok.
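Side by side, the point is the dependency chains (a sketch; note these are two different mixers, the second being the proposed alternative, not a rewrite of the first):

```c
#include <stdint.h>

// Long chain: the add feeds the square, which feeds the mask, then the xor.
uint64_t step_serial(uint64_t x, uint64_t a) {
    return x ^ (((x + a) * (x + a)) & ~1ull);   // &-2 written as &~1
}

// Here x*x no longer waits on the add: the two halves of the expression
// can execute in parallel before the final xor.
uint64_t step_parallel(uint64_t x, uint64_t a) {
    return ((x + a) & ~1ull) ^ (x * x);
}
```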
I might as well mention (even though it may be obvious) that the inverse may be computed with:

```cpp
template<typename T>
T inverse(T (&fn)(T), T x)
{
    T ret = 0;
    // Recover the preimage one bit at a time, low to high: bit k of ret
    // is wrong iff fn(ret) and x still disagree somewhere in bits 0..k.
    for (T bit = 1; bit; bit *= 2)
        ret ^= bit * (((fn(ret) ^ x) & (2 * bit - 1)) != 0);
    return ret;
}
```

(edit: simplified mask)

which should work on any uintN->uintN bijection where high bits (of input) do not affect low bits (of output), e.g. a mix of conventional and bitwise arithmetic, without right shifts.
I don't know of any "really O(1)" inverse for this.
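For example, applied to the 32-bit `x^(x*x|1)` form (a sketch; the wrapper name is mine, and the `inverse` template above is assumed to be in scope):

```cpp
#include <stdint.h>
#include <stdio.h>

uint32_t xorsquare32(uint32_t x) { return x ^ (x * x | 1u); }

int main(void) {
    // T is deduced as uint32_t; inverse() walks the bits low to high.
    uint32_t y = xorsquare32(12345u);
    printf("%u\n", inverse(xorsquare32, y));  // prints 12345
    return 0;
}
```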
Oh, whoops, didn't mean to link the issue. But the `x ^= x * x | 1` version of xorsquare at least seems to be one-to-one for all 16-bit inputs, and because it always changes the LSB, there's no case where n maps to n itself. Xorsquare is a really neat find!
Actually, I also settled on the XQO form in my PRNGs. Essentially, once you realize that `x^(x*x)` produces every even integer exactly twice, there's a whole lot of ways to "fix" that. Arguably, the well-known triangular numbers offer a similar idea.
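That two-to-one claim is quick to verify exhaustively at 8 bits (a sketch):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int counts[256] = {0};
    for (int x = 0; x < 256; x++)
        counts[(uint8_t)(x ^ (x * x))]++;   // x^(x*x) mod 2^8
    // Expected: every even value hit exactly twice, every odd value never.
    int ok = 1;
    for (int y = 0; y < 256; y++)
        if (counts[y] != (y % 2 == 0 ? 2 : 0)) ok = 0;
    printf(ok ? "2-to-1 onto the evens\n" : "claim does not hold\n");
    return 0;
}
```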
Another trick with XQO: the constant used with OR can be any odd number, not just 1, and you end up with something on the way to `~x` (which is equivalent to `x ^= (x * x) | -1;` if negative numbers are two's-complement and such, like how Java and C# have it). This makes me wonder about a constant that changes itself, such as an MCG (since those are always odd). `x ^= x * x | (c *= 0xf1357aea2e62a9c5ULL);` uses a 64-bit MCG constant from https://arxiv.org/abs/2001.05304, which seems fine to me, though I haven't tested this. `0x93d765ddUL` is a 32-bit alternative. The period of the combined construction could be interesting when used as an RNG state transition -- the MCG has a shorter period than some other options, like an additive counter that adds twice an odd number. The Hamming weight of `c` determines how little the square of `x` matters to the result. Using `x ^= x * x | (c & (c *= 0xf1357aea2e62a9c5ULL));` still guarantees that the ORed value is odd, but reduces its Hamming weight (the ORed value can't ever be -1 with this multiplier, either). I'll tinker a bit with this. Right now the machine I use for testing is running two different long tests (64TB on the CPU and 100+ PB on the GPU), so I'll have to wait to examine these more thoroughly.
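An untested sketch of that self-updating-constant idea as a stateful step (the naming is mine; `c` must be seeded odd so the MCG step keeps it odd):

```c
#include <stdint.h>

typedef struct { uint64_t x, c; } xqo_mcg;   // seed c to any odd value

uint64_t xqo_mcg_next(xqo_mcg *s) {
    s->c *= 0xf1357aea2e62a9c5ULL;   // MCG step: odd stays odd
    s->x ^= s->x * s->x | s->c;      // XQO with a varying odd OR-constant
    return s->x;
}
```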
OK, new finding; this isn't useful for a unary hash, but for the related field of pseudo-random number generation, it might be?

```c
state = (state ^ (state * state | o5o7)) * m3;
```

where `(o5o7 & 7)` must equal 5 or 7, and `(m3 & 3)` must equal 3. This appears to pass through all states in a random-seeming order for 16-bit states, 32-bit states, and probably 64-bit states, for any `o5o7` and any `m3` I have tested that meet the mentioned criteria. Like an LCG, it alternates even and odd states, and the low 10 or so bits are very predictable. For 64-bit states, XORing the upper 32 bits with the lower 32 bits (but not storing that in `state`) seems to make all bits fairly random.
```c
state = -(state ^ (state * state | o5o7));
```

is a special case here, since negating is the same as multiplying by -1, and -1 fits the criterion for `m3`. The most extreme limit is `state = -(state ^ -1);`, which is the same as `state = state + 1;`, and that's certainly full-period.
The downside to this is that it's much harder computationally to go backwards than forwards with what we know now, though it should be possible. Skipping around in this sequence seems very challenging.
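The full-period claim is easy to brute-force at 16 bits; a sketch (the particular `o5o7` and `m3` values are just examples meeting the stated criteria):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint32_t o5o7 = 5;       // (o5o7 & 7) == 5
    const uint32_t m3   = 0xDE03;  // (m3 & 3) == 3
    static uint8_t seen[1 << 16];
    uint32_t state = 0, visited = 0;
    for (uint32_t i = 0; i < (1u << 16); i++) {
        if (!seen[state]) { seen[state] = 1; visited++; }
        state = ((state ^ (state * state | o5o7)) * m3) & 0xFFFFu;
    }
    // Full period iff the orbit of 0 visits all 2^16 states.
    printf("visited %u of 65536 states\n", visited);
    return 0;
}
```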
It's the same, isn't it? The final integer product peels off into a second logical step.
The difference is the period; repeatedly calling `state ^= (state * state) | 1;` will cycle before all possible values are reached, with cycles of different lengths (e.g. if state is 0 or 1, the cycle length is 2). The variant with `o5o7` and `m3` doesn't have shorter subcycles, which seems unusual. Just multiplying by 3 isn't full-period (it covers a quarter of all possible values), and 7 is even worse (an eighth). Multiplying by -1 works fine inside the XQO construction and is full-period there, but on its own has a period of 2 (of course, since it just flips the sign forever).
Yes. I was just talking about forming the inverse function. That last product is simply reversed by the mod-inverse of `m3`.
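For completeness, a sketch of computing that mod-inverse (the standard Newton/Hensel iteration for odd constants; a well-known trick, not code from this thread):

```c
#include <stdint.h>

// Multiplicative inverse of odd a modulo 2^64: x = a is correct to 3 bits
// (odd squares are 1 mod 8); each step x *= 2 - a*x doubles the correct bits.
uint64_t modinv64(uint64_t a) {
    uint64_t x = a;
    for (int i = 0; i < 5; i++)   // 3 -> 6 -> 12 -> 24 -> 48 -> 96 bits
        x *= 2 - a * x;
    return x;
}
// Undoing the last step: prev = state * modinv64(m3), then invert the XQO part.
```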
Here are some mixing functions (based on CRC32-C) which are cheapish and strongly mix the bottom 32 bits: https://gist.github.com/Marc-B-Reynolds/7db198a050c422936be4520fb0655a6f
> Another trick with XQO: the constant used with OR can be any odd number, not just 1

Well, this also yields much less conspicuous zeroes: e.g. `(x^(x*x|1))==0` happens at `x==1`, but `(x^(x*x|5))==0` happens at `x==0x...3EF7226D` (lower bits are the same for all wordsizes).
The odd constant only works in the `x^(x*x|c)` arrangement though, not `(x|c)^(x*x)` (e.g. `(x|5)^(x*x)` is not a bijection). I assumed that form is more latency-friendly (though is it even true?).
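The non-bijectivity of the `(x|c)^(x*x)` arrangement is easy to confirm exhaustively at 8 bits (a sketch):

```c
#include <stdint.h>
#include <stdio.h>

static int is_bijection8(uint8_t (*f)(uint8_t)) {
    int hits[256] = {0};
    for (int x = 0; x < 256; x++) hits[f((uint8_t)x)]++;
    for (int y = 0; y < 256; y++) if (hits[y] != 1) return 0;
    return 1;
}

static uint8_t xqo5(uint8_t x) { return (uint8_t)(x ^ (x * x | 5)); }
static uint8_t or5 (uint8_t x) { return (uint8_t)((x | 5) ^ (x * x)); }

int main(void) {
    printf("x^(x*x|5)   bijective: %d\n", is_bijection8(xqo5)); // expect 1
    printf("(x|5)^(x*x) bijective: %d\n", is_bijection8(or5));  // expect 0
    return 0;
}
```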
> Where (o5o7 & 7) must equal 5 or 7, and (m3 & 3) must equal 3.

Interesting. The proof eludes me so far (does anyone know?). I did exhaustively verify that it is both necessary and sufficient up to, and including, `uint10_t`.
> XORing the upper 32 bits with the lower 32 bits (but not storing that in state) seems to make all bits fairly random.

This does fail PractRand at 16GB (2^32 outputs). For comparison, `high^low` of a 64-bit LCG fails at 32MB (2^23 outputs). Still, this probably means that you need a fancier output function, at which point it is not obvious whether there is any advantage over PCG (e.g. with the RS output function the quality seems about the same for this and an LCG).
Speaking of PCG, using XQO instead of the first step in `pcg_hash` from Hash Functions for GPU Rendering (see also) like this

```c
uint32_t pxq_hash(uint32_t input)
{
    uint32_t state = (input | 1u) ^ (input * input);  // XQO replaces the LCG step
    uint32_t word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    return (word >> 22u) ^ word;
}
```

seems to improve PractRand results (from failing at 4MB to failing at 512MB).
Both fail `gjrand --tiny` badly, though. Both also fail the birthday test, being bijections. Doesn't seem better than `triple32` in either quality or speed. Update: this (and `pcg_hash`) appears to be massively worse (avalanche-wise) than even `lowbias32`, with very evident structure in the avalanche matrix.
The (hand-made) function

```c
uint32_t hash(uint32_t x)
{
    x ^= x >> 15;
    x ^= (x * x) | 1u;
    x ^= x >> 17;
    x *= 0x9E3779B9u;
    x ^= x >> 13;
    return x;
}
```

appears to have better avalanche scores than known murmur3-style two-rounders (but weaker than `triple32`), and does decently as a PRNG (according to PractRand).
| Function | Linf | RMS | PractRand |
|---|---|---|---|
| xxHash | 8658392 | 1195645.4 | Fails at 16KB |
| MurmurHash3 | 4044016 | 566904.4 | Fails at 16KB |
| lowbias32 | 2023971 | 372660.4 | Fails at 32KB |
| lowbias32_tib | 1211488 | 231074.2 | Fails at 64KB |
| this | 836260 | 121867.6 | Fails at 1GB |
| triple32 | 167788 | 44857.8 | Fails at 1GB ('mildly suspicious' at 8KB) |
`lowbias32_tib` is TheIronBorn's version.

Avalanche scores use unscaled `abs(counts[i][j]-(1ul<<31))`. Looks like RMS is 2^31/1000 times what `prospector -E` reports.

PractRand is the result of `while(1) output(hash(i++));` piped to `RNG_test stdin32 -tlmin 1K -te 0 -tf 2` (you may want `-te 1` instead).
Would be nice if someone double-checks this.
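For anyone double-checking, the counts can be gathered like this (a sketch of an exhaustive 32-bit avalanche measurement; my arrangement, not necessarily the harness used, and slow at 2^32 * 33 hash calls):

```c
#include <stdint.h>
#include <stdio.h>
#include <math.h>

uint32_t hash(uint32_t x);  // the function under test, e.g. the one above

int main(void) {
    // counts[i][j]: how often flipping input bit i flips output bit j.
    // Exhaustive over all 2^32 inputs, so the ideal value is 2^31 per cell.
    static uint64_t counts[32][32];
    uint32_t x = 0;
    do {
        uint32_t h = hash(x);
        for (int i = 0; i < 32; i++) {
            uint32_t d = h ^ hash(x ^ (1u << i));
            for (int j = 0; j < 32; j++)
                counts[i][j] += (d >> j) & 1u;
        }
    } while (++x != 0);
    uint64_t linf = 0;
    double sum2 = 0;
    for (int i = 0; i < 32; i++)
        for (int j = 0; j < 32; j++) {
            int64_t e = (int64_t)counts[i][j] - (1ll << 31);
            uint64_t a = (uint64_t)(e < 0 ? -e : e);  // abs(counts-(1ul<<31))
            if (a > linf) linf = a;
            sum2 += (double)e * (double)e;
        }
    printf("Linf=%llu RMS=%.1f\n", (unsigned long long)linf,
           sqrt(sum2 / (32.0 * 32.0)));
    return 0;
}
```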
Might be worthwhile to prospect around a bit for functions of this form.
Update: https://www.shadertoy.com/view/dllSW7 .
> I assumed that form is more latency-friendly

It may be, slightly: https://godbolt.org/z/5jznWeoW5 .
So you might be better off writing it as

```c
uint32_t hash(uint32_t x)
{
    x ^= x >> 15;
    x = (x | 1u) ^ (x * x);
    x ^= x >> 17;
    x *= 0x9E3779B9u;
    x ^= x >> 13;
    return x;
}
```

which gives identical results (when `x` is even, `x|1` equals `x^1` and `x*x` is even, so the flipped low bit moves across the XOR; when `x` is odd the two forms coincide bit for bit).
Might be worth comparing some of these to those on https://mostlymangling.blogspot.com/, which uses a similar PractRand technique.
Unless I'm missing something (and I well may be), measuring things becomes somewhat harder at 64 bits. E.g. I just spent a number of minutes running a Monte-Carlo avalanche calculation (exhaustive is not practical for 64 bits), and got... this:

| Function | Emax | Erms |
|---|---|---|
| hash64 | 1.967±0.406 | 0.503±0.011 |
| murmur64 | 1.841±0.378 | 0.498±0.015 |
| lea64 | 1.869±0.298 | 0.499±0.011 |
| degski64 | 1.812±0.259 | 0.500±0.017 |
| nasam | 1.863±0.373 | 0.501±0.016 |
| moremur | 1.924±0.467 | 0.503±0.019 |
| ettingerMixer | 490.445±0.210 | 53.035±0.044 |
| rrmxmx | 1.856±0.239 | 0.499±0.019 |
| rrxmrrxmsx_0 | 1.895±0.415 | 0.498±0.010 |
Not very precise (assuming I haven't botched it in the first place)... Notes:

- K=8 runs of N=10^6 samples for each function. The "X ± Y" are simply the average and 3*σ of the resulting 8 values (edit: without Bessel's correction).
- Emax/Erms are deviations of the counts from N/2, scaled by sqrt(N); an ideal mixer should give Erms near 0.5.
- No idea what's up with ettingerMixer.
- hash64 is an adaptation ([31,35,⌊2^64/φ⌋,27], probably not actually good) of the above function to 64 bit.
- Similarly with PractRand: for 32 bit, 32GB is enough, but e.g. RRC-64-42-TF2-0.94 is 4TB, and some of the mentioned tests are up to 128TB.
Someone with access to more computing power might want to give it a try.
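For reference, a Monte-Carlo avalanche run along those lines might look like this (a sketch; the splitmix-style input sampler and the exact normalization are my assumptions, chosen so an ideal mixer lands near Erms = 0.5):

```c
#include <stdint.h>
#include <stdio.h>
#include <math.h>

uint64_t hash64(uint64_t x);  // mixer under test

// Splitmix64-style sampler for inputs (any decent source would do).
static uint64_t sample(uint64_t *s) {
    uint64_t z = (*s += 0x9E3779B97F4A7C15ull);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
    return z ^ (z >> 31);
}

int main(void) {
    const uint64_t N = 1000000;  // one run; the table above averages K=8 runs
    static uint64_t counts[64][64];
    uint64_t seed = 1;
    for (uint64_t n = 0; n < N; n++) {
        uint64_t x = sample(&seed), h = hash64(x);
        for (int i = 0; i < 64; i++) {
            uint64_t d = h ^ hash64(x ^ (1ull << i));
            for (int j = 0; j < 64; j++)
                counts[i][j] += (d >> j) & 1ull;
        }
    }
    // Deviation from N/2 in units of sqrt(N); ideal mixer: Erms ~ 0.5.
    double emax = 0, erms = 0;
    for (int i = 0; i < 64; i++)
        for (int j = 0; j < 64; j++) {
            double e = fabs((double)counts[i][j] - 0.5 * N) / sqrt((double)N);
            if (e > emax) emax = e;
            erms += e * e;
        }
    printf("Emax=%.3f Erms=%.3f\n", emax, sqrt(erms / 4096.0));
    return 0;
}
```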
I (Tommy Ettinger) also wonder what's up with ettingerMixer, but I wouldn't be surprised if it's just bad when measured with specific tests. @fp64 , can you post or link the source of the ettingerMixer you used? It might give us some insight into what makes generators fail, since not many seem to do that right now.
From https://mostlymangling.blogspot.com/2019/01/better-stronger-mixer-and-test-procedure.html (except I changed the suffixes to `ull`):

```c
// rotate-left helper (not in the original snippet)
static uint64_t rol64(uint64_t x, int r) { return (x << r) | (x >> (64 - r)); }

uint64_t ettingerMixer(uint64_t v) {
    uint64_t z = (v ^ 0xDB4F0B9175AE2165ull) * 0x4823A80B2006E21Bull;
    z ^= rol64(z, 52) ^ rol64(z, 21) ^ 0x9E3779B97F4A7C15ull;
    z *= 0x81383173ull;
    return z ^ z >> 28;
}
```
Update: heatmap (scaled to 0..9): [image]
> Not very precise

Some more context on the accuracy of Monte Carlo. As we can see, reasonably good functions look basically indistinguishable under Monte Carlo for fairly sizeable N, but Monte Carlo can detect sharper deviations a lot earlier.
Speaking of measuring things exactly, some optimal-in-their-class `uint8_t -> uint8_t` hash functions look like this:

| Type | Constants | Linf | RMS | Notes |
|---|---|---|---|---|
| xorr,mul,xorr,mul,xorr | [ 1,0x0F, 3,0xDD, 4] | 16 | 7.3654599 | Optimal for both; [ 1,0x0F, 3,0x5D, 4] has the same Linf |
| xorr,xqo,xorr,mul,xorr | [ 2, 5,0x93, 5] | 20 | 8.9442719 | Optimal for both; [ 5, 4,0x9F, 2] and [ 3, 6,0x75, 1] have the same Linf |
| xorr,mul,xorr,xqo,xorr | [ 5,0x7B, 4, 4] | 20 | 11.1691540 | Linf-optimal; [ 4,0x43, 3, 4], [ 4,0xC3, 3, 4] have the same Linf |
| xorr,mul,xorr,xqo,xorr | [ 3,0x3B, 4, 4] | 24 | 9.4339811 | RMS-optimal |
| xorr,mul,xorr,xqo,xorr | [ 3,0xBB, 4, 4] | 24 | 9.4339811 | RMS-optimal |
| xorr,xqo,xorr,xqo,xorr | [ 4, 5, 2] | 24 | 11.8532696 | Linf-optimal |
| xorr,xqo,xorr,xqo,xorr | [ 2, 5, 4] | 32 | 10.5237826 | RMS-optimal |
The function

```c
uint32_t hash(uint32_t x)
{
    x ^= x >> 23;
    x = (x | 1u) ^ (x * x);
    x ^= x >> 13;
    x = (x | 1u) ^ (x * x);
    x ^= x >> 9;
    x = (x | 1u) ^ (x * x);
    x ^= x >> 17;
    return x;
}
```
appears to produce better avalanche scores than `triple32` (actually there are a lot of functions with better Linf scores, like the one below, but lower RMS is harder to obtain).
| Function | Linf | RMS | PractRand |
|---|---|---|---|
| triple32 | 167788 | 44857.8 | Fails at 1GB ('mildly suspicious' at 8KB) |
| [23,13, 9,17] | 158752 | 44181.3 | Fails at 1GB |
| [14,14,14,14] | 133756 | 46713.3 | Fails at 1GB |
The [14,14,14,14] one is easy to memorize.
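Spelled out, that memorable variant is just the function above with every shift set to 14 (a direct transcription; the name is mine):

```c
uint32_t hash14(uint32_t x)   // the [14,14,14,14] entry
{
    x ^= x >> 14;
    x = (x | 1u) ^ (x * x);
    x ^= x >> 14;
    x = (x | 1u) ^ (x * x);
    x ^= x >> 14;
    x = (x | 1u) ^ (x * x);
    x ^= x >> 14;
    return x;
}
```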
@tommyettinger, I just noticed something curious. As mentioned before, the 32-bit RNG

```c
uint32_t rng32()
{
    static uint64_t state = 0ull;  // 'static' so the state persists across calls
    state = -(state ^ (state * state | 5ull));
    return (uint32_t)(state >> 32);
}
```

performs very smoothly in PractRand, up until it abruptly and harshly fails at 16GB. Similarly for the 16-bit RNG, which fails at 512KB. This makes sense, since the period of the k-th bit of the state is 2^k (tests fail close to the period of the lowest bit of the output). However, the 64-bit version
```c
uint64_t rng64()
{
    static __uint128_t state = 0ull;
    state = -(state ^ (state * state | 5ull));
    return (uint64_t)(state >> 64);
}
```

seems to fail very fast (but not immediately!):
```
rng=RNG_stdin64, seed=unknown
length= 1 kilobyte (2^10 bytes), time= 0.5 seconds
  Test Name                  Raw       Processed   Evaluation
  DC6-9x1Bytes-1             R=  +5.7  p =  0.019  unusual
  ...and 5 test result(s) without anomalies

rng=RNG_stdin64, seed=unknown
length= 2 kilobytes (2^11 bytes), time= 0.7 seconds
  Test Name                  Raw       Processed   Evaluation
  DC6-9x1Bytes-1             R=  +6.6  p = 8.5e-3  unusual
  ...and 7 test result(s) without anomalies

rng=RNG_stdin64, seed=unknown
length= 4 kilobytes (2^12 bytes), time= 1.0 seconds
  Test Name                  Raw       Processed   Evaluation
  Gap-16:A                   R= +15.4  p = 1.6e-13 FAIL
  Gap-16:B                   R= +16.2  p = 1.0e-13 FAIL
  ...and 22 test result(s) without anomalies
```
Seeding the state with something like `-1ull/5` makes it OK for at least a couple of gigabytes. Some small seeds yield "suspicious" early on, but proceed smoothly afterwards. Returning `(uint64_t)(state^(state>>64))` produces "mildly suspicious" at 2KB and a few "unusual" later, but passes for at least a few gigabytes (for the 32-bit version there seems to be not much difference between returning `state>>32` and `state^(state>>32)`).
Note: the actual code didn't use `__uint128_t`, emulating it with `uint64_t`s, but the results should match.
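For reference, such an emulation might look like this (my reconstruction, not the code actually used; only a 64x64->128 square and a 128-bit negate are needed):

```c
#include <stdint.h>

// Portable high half of a 64x64-bit multiply, via 32-bit halves.
static uint64_t mulhi64(uint64_t a, uint64_t b) {
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t mid = ((a0 * b0) >> 32) + (uint32_t)(a0 * b1) + (uint32_t)(a1 * b0);
    return a1 * b1 + ((a0 * b1) >> 32) + ((a1 * b0) >> 32) + (mid >> 32);
}

typedef struct { uint64_t lo, hi; } u128;

uint64_t rng64_step(u128 *s) {
    // (hi*2^64 + lo)^2 mod 2^128 = lo^2 + (2*hi*lo mod 2^64)*2^64
    uint64_t sq_lo = s->lo * s->lo;
    uint64_t sq_hi = mulhi64(s->lo, s->lo) + 2 * s->hi * s->lo;
    uint64_t lo = s->lo ^ (sq_lo | 5);   // state ^ (state*state | 5)
    uint64_t hi = s->hi ^ sq_hi;
    s->lo = ~lo + 1;                     // 128-bit negation: ~x + 1 ...
    s->hi = ~hi + (lo == 0);             // ... with carry into the high word
    return s->hi;
}
```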
The "xorsquare" function (which I briefly describe here) is of the form:
for any integer
a
Note: the alternative version (`x^=((x+a)*(x+a))&-2;`) produces identical results for values of `a` that differ in the most significant bit.

It is straightforward to show (e.g. by induction on wordsize) that it is invertible: bit k of `((x+a)*(x+a))|1` depends only on bits of `x` below k (and bit 0 is constant), so the outer XOR lets the input bits be recovered one at a time, from low to high.