**fp64** opened this issue 2 years ago:
```c
x^=((x+a)*(x+a))&-2;
```
FWIW: I spent a little bit of time looking at this function WRT strict avalanche criterion measures (more like a spoonful than a grain of salt for these observations). The constant 'a' doesn't seem to justify itself (adding one to the dependency chain). Continued here: https://twitter.com/marc_b_reynolds/status/1576848952846319616
Somewhat lagged reply.
> The constant 'a' doesn't seem to justify itself

This matches my experience trying to use it as a step of a stateless PRNG.

Note that `x^=((x+a)*(x+a))&-2;` is particularly bad latency-wise; `x=((x+a)&-2)^(x*x);` may be ok.
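Side by side, the point is the dependency chains (a sketch; note these are two different mixers, the second being the proposed alternative, not a rewrite of the first):

```c
#include <stdint.h>

// Long chain: the add feeds the square, which feeds the mask, then the xor.
uint64_t step_serial(uint64_t x, uint64_t a) {
    return x ^ (((x + a) * (x + a)) & ~1ull);   // &-2 written as &~1
}

// Here x*x no longer waits on the add: the two halves of the expression
// can execute in parallel before the final xor.
uint64_t step_parallel(uint64_t x, uint64_t a) {
    return ((x + a) & ~1ull) ^ (x * x);
}
```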
I might as well mention (even though it may be obvious) that the inverse may be computed with:

```cpp
template<typename T>
T inverse(T (&fn)(T), T x)
{
    T ret = 0;
    // Recover the preimage one bit at a time, low to high: bit k of ret
    // is wrong iff fn(ret) and x still disagree somewhere in bits 0..k.
    for (T bit = 1; bit; bit *= 2)
        ret ^= bit * (((fn(ret) ^ x) & (2 * bit - 1)) != 0);
    return ret;
}
```

(edit: simplified mask)

which should work on any uintN->uintN bijection where high bits (of input) do not affect low bits (of output), e.g. a mix of conventional and bitwise arithmetic, without right shifts.
I don't know of any "really O(1)" inverse for this.
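For example, applied to the 32-bit `x^(x*x|1)` form (a sketch; the wrapper name is mine, and the `inverse` template above is assumed to be in scope):

```cpp
#include <stdint.h>
#include <stdio.h>

uint32_t xorsquare32(uint32_t x) { return x ^ (x * x | 1u); }

int main(void) {
    // T is deduced as uint32_t; inverse() walks the bits low to high.
    uint32_t y = xorsquare32(12345u);
    printf("%u\n", inverse(xorsquare32, y));  // prints 12345
    return 0;
}
```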
Oh, whoops, didn't mean to link the issue. But the `x ^= x * x | 1` version of xorsquare at least seems to be one-to-one for all 16-bit inputs, and because it always changes the LSB, there's no case where n maps to n itself. Xorsquare is a really neat find!
Actually, I also settled on the XQO form in my PRNGs. Essentially, once you realize that `x^(x*x)` produces every even integer exactly twice, there's a whole lot of ways to "fix" that. Arguably, the well-known triangular numbers offer a similar idea.
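That two-to-one claim is quick to verify exhaustively at 8 bits (a sketch):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int counts[256] = {0};
    for (int x = 0; x < 256; x++)
        counts[(uint8_t)(x ^ (x * x))]++;   // x^(x*x) mod 2^8
    // Expected: every even value hit exactly twice, every odd value never.
    int ok = 1;
    for (int y = 0; y < 256; y++)
        if (counts[y] != (y % 2 == 0 ? 2 : 0)) ok = 0;
    printf(ok ? "2-to-1 onto the evens\n" : "claim does not hold\n");
    return 0;
}
```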
Another trick with XQO: the constant used with OR can be any odd number, not just 1, and you end up with something on the way to `~x` (which is equivalent to `x ^= (x * x) | -1;` if negative numbers are two's-complement and such, like how Java and C# have it). This makes me wonder about a constant that changes itself, such as an MCG (since those are always odd). `x ^= x * x | (c *= 0xf1357aea2e62a9c5ULL);` uses a 64-bit MCG constant from https://arxiv.org/abs/2001.05304, which seems fine to me, though I haven't tested this. `0x93d765ddUL` is a 32-bit alternative. The period of the combined construction could be interesting when used as an RNG state transition -- the MCG has a shorter period than some other options, like an additive counter that adds twice an odd number. The Hamming weight of `c` determines how little the square of `x` matters to the result. Using `x ^= x * x | (c & (c *= 0xf1357aea2e62a9c5ULL));` still guarantees that the ORed value is odd, but reduces its Hamming weight (the ORed value can't ever be -1 with this multiplier, either). I'll tinker a bit with this. Right now the machine I use for testing is running two different long tests (64TB on the CPU and 100+ PB on the GPU), so I'll have to wait to examine these more thoroughly.
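An untested sketch of that self-updating-constant idea as a stateful step (the naming is mine; `c` must be seeded odd so the MCG step keeps it odd):

```c
#include <stdint.h>

typedef struct { uint64_t x, c; } xqo_mcg;   // seed c to any odd value

uint64_t xqo_mcg_next(xqo_mcg *s) {
    s->c *= 0xf1357aea2e62a9c5ULL;   // MCG step: odd stays odd
    s->x ^= s->x * s->x | s->c;      // XQO with a varying odd OR-constant
    return s->x;
}
```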
OK, new finding; this isn't useful for a unary hash, but for the related field of pseudo-random number generation, it might be?

```c
state = (state ^ (state * state | o5o7)) * m3;
```

where `(o5o7 & 7)` must equal 5 or 7, and `(m3 & 3)` must equal 3. This appears to pass through all states in a random-seeming order for 16-bit states, 32-bit states, and probably 64-bit states, for any `o5o7` and any `m3` I have tested that meet the mentioned criteria. Like an LCG, it alternates even and odd states, and the low 10 or so bits are very predictable. For 64-bit states, XORing the upper 32 bits with the lower 32 bits (but not storing that in `state`) seems to make all bits fairly random.
```c
state = -(state ^ (state * state | o5o7));
```

is a special case here, since negating is the same as multiplying by -1, and -1 fits the criterion for `m3`. The most extreme limit is `state = -(state ^ -1);`, which is the same as `state = state + 1;`, and that's certainly full-period.
The downside to this is that it's much harder computationally to go backwards than forwards with what we know now, though it should be possible. Skipping around in this sequence seems very challenging.
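The full-period claim is easy to brute-force at 16 bits; a sketch (the particular `o5o7` and `m3` values are just examples meeting the stated criteria):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint32_t o5o7 = 5;       // (o5o7 & 7) == 5
    const uint32_t m3   = 0xDE03;  // (m3 & 3) == 3
    static uint8_t seen[1 << 16];
    uint32_t state = 0, visited = 0;
    for (uint32_t i = 0; i < (1u << 16); i++) {
        if (!seen[state]) { seen[state] = 1; visited++; }
        state = ((state ^ (state * state | o5o7)) * m3) & 0xFFFFu;
    }
    // Full period iff the orbit of 0 visits all 2^16 states.
    printf("visited %u of 65536 states\n", visited);
    return 0;
}
```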
It's the same, isn't it? The final integer product peels off into a second logical step.
The difference is the period; repeatedly calling `state ^= (state * state) | 1;` will cycle before all possible values are reached, with cycles of different lengths (e.g. if state is 0 or 1, the cycle length is 2). The variant with `o5o7` and `m3` doesn't have shorter subcycles, which seems unusual. Just multiplying by 3 isn't full-period (it covers a quarter of all possible values), and 7 is even worse (an eighth). Multiplying by -1 works fine inside the XQO construction and is full-period there, but on its own has a period of 2 (of course, since it just flips the sign forever).
Yes. I was just talking about forming the inverse function. That last product is simply reversed by the mod-inverse of `m3`.
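For completeness, a sketch of computing that mod-inverse (the standard Newton/Hensel iteration for odd constants; a well-known trick, not code from this thread):

```c
#include <stdint.h>

// Multiplicative inverse of odd a modulo 2^64: x = a is correct to 3 bits
// (odd squares are 1 mod 8); each step x *= 2 - a*x doubles the correct bits.
uint64_t modinv64(uint64_t a) {
    uint64_t x = a;
    for (int i = 0; i < 5; i++)   // 3 -> 6 -> 12 -> 24 -> 48 -> 96 bits
        x *= 2 - a * x;
    return x;
}
// Undoing the last step: prev = state * modinv64(m3), then invert the XQO part.
```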
Here are some mixing functions (based on CRC32-C) which are cheapish and strongly mix the bottom 32 bits: https://gist.github.com/Marc-B-Reynolds/7db198a050c422936be4520fb0655a6f
> Another trick with XQO: the constant used with OR can be any odd number, not just 1

Well, this also yields much less conspicuous zeroes: e.g. `(x^(x*x|1))==0` happens at `x==1`, but `(x^(x*x|5))==0` happens at `x==0x...3EF7226D` (lower bits are the same for all wordsizes).
The odd constant only works in the `x^(x*x|c)` arrangement though, not `(x|c)^(x*x)` (e.g. `(x|5)^(x*x)` is not a bijection). I assumed that form is more latency-friendly (though is it even true?).
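The non-bijectivity of the `(x|c)^(x*x)` arrangement is easy to confirm exhaustively at 8 bits (a sketch):

```c
#include <stdint.h>
#include <stdio.h>

static int is_bijection8(uint8_t (*f)(uint8_t)) {
    int hits[256] = {0};
    for (int x = 0; x < 256; x++) hits[f((uint8_t)x)]++;
    for (int y = 0; y < 256; y++) if (hits[y] != 1) return 0;
    return 1;
}

static uint8_t xqo5(uint8_t x) { return (uint8_t)(x ^ (x * x | 5)); }
static uint8_t or5 (uint8_t x) { return (uint8_t)((x | 5) ^ (x * x)); }

int main(void) {
    printf("x^(x*x|5)   bijective: %d\n", is_bijection8(xqo5)); // expect 1
    printf("(x|5)^(x*x) bijective: %d\n", is_bijection8(or5));  // expect 0
    return 0;
}
```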
> Where (o5o7 & 7) must equal 5 or 7, and (m3 & 3) must equal 3.

Interesting. The proof eludes me so far (does anyone know?). I did exhaustively verify that it is both necessary and sufficient up to, and including, `uint10_t`.
> XORing the upper 32 bits with the lower 32 bits (but not storing that in state) seems to make all bits fairly random.

This does fail PractRand at 16GB (2^32 outputs). For comparison, `high^low` of a 64-bit LCG fails at 32MB (2^23 outputs). Still, this probably means that you need a fancier output function, at which point it is not obvious whether there is any advantage over PCG (e.g. with the RS output function the quality seems about the same for this and an LCG).
Speaking of PCG, using XQO instead of the first step in `pcg_hash` from Hash Functions for GPU Rendering (see also) like this

```c
uint32_t pxq_hash(uint32_t input)
{
    uint32_t state = (input | 1u) ^ (input * input);  // XQO replaces the LCG step
    uint32_t word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    return (word >> 22u) ^ word;
}
```

seems to improve PractRand results (from failing at 4MB to failing at 512MB).
Both fail `gjrand --tiny` badly, though. Both also fail the birthday test, being bijections. Doesn't seem better than `triple32` in either quality or speed. Update: this (and `pcg_hash`) appears to be massively worse (avalanche-wise) than even `lowbias32`, with very evident structure in the avalanche matrix.
The (hand-made) function

```c
uint32_t hash(uint32_t x)
{
    x ^= x >> 15;
    x ^= (x * x) | 1u;
    x ^= x >> 17;
    x *= 0x9E3779B9u;
    x ^= x >> 13;
    return x;
}
```

appears to have better avalanche scores than known murmur3-style two-rounders (but weaker than `triple32`), and does decently as a PRNG (according to PractRand).
| Function | Linf | RMS | PractRand |
|---|---|---|---|
| xxHash | 8658392 | 1195645.4 | Fails at 16KB |
| MurmurHash3 | 4044016 | 566904.4 | Fails at 16KB |
| lowbias32 | 2023971 | 372660.4 | Fails at 32KB |
| lowbias32_tib | 1211488 | 231074.2 | Fails at 64KB |
| this | 836260 | 121867.6 | Fails at 1GB |
| triple32 | 167788 | 44857.8 | Fails at 1GB ('mildly suspicious' at 8KB) |
`lowbias32_tib` is TheIronBorn's version.

Avalanche scores use unscaled `abs(counts[i][j]-(1ul<<31))`. Looks like RMS is 2^31/1000 times what `prospector -E` reports.

PractRand is the result of `while(1) output(hash(i++));` piped to `RNG_test stdin32 -tlmin 1K -te 0 -tf 2` (you may want `-te 1` instead).
Would be nice if someone double-checks this.
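For anyone double-checking, the counts can be gathered like this (a sketch of an exhaustive 32-bit avalanche measurement; my arrangement, not necessarily the harness used, and slow at 2^32 * 33 hash calls):

```c
#include <stdint.h>
#include <stdio.h>
#include <math.h>

uint32_t hash(uint32_t x);  // the function under test, e.g. the one above

int main(void) {
    // counts[i][j]: how often flipping input bit i flips output bit j.
    // Exhaustive over all 2^32 inputs, so the ideal value is 2^31 per cell.
    static uint64_t counts[32][32];
    uint32_t x = 0;
    do {
        uint32_t h = hash(x);
        for (int i = 0; i < 32; i++) {
            uint32_t d = h ^ hash(x ^ (1u << i));
            for (int j = 0; j < 32; j++)
                counts[i][j] += (d >> j) & 1u;
        }
    } while (++x != 0);
    uint64_t linf = 0;
    double sum2 = 0;
    for (int i = 0; i < 32; i++)
        for (int j = 0; j < 32; j++) {
            int64_t e = (int64_t)counts[i][j] - (1ll << 31);
            uint64_t a = (uint64_t)(e < 0 ? -e : e);  // abs(counts-(1ul<<31))
            if (a > linf) linf = a;
            sum2 += (double)e * (double)e;
        }
    printf("Linf=%llu RMS=%.1f\n", (unsigned long long)linf,
           sqrt(sum2 / (32.0 * 32.0)));
    return 0;
}
```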
Might be worthwhile to prospect around a bit for functions of this form.
Update: https://www.shadertoy.com/view/dllSW7 .
> I assumed that form is more latency-friendly

It may be, slightly: https://godbolt.org/z/5jznWeoW5 .
So you might be better off writing it as

```c
uint32_t hash(uint32_t x)
{
    x ^= x >> 15;
    x = (x | 1u) ^ (x * x);
    x ^= x >> 17;
    x *= 0x9E3779B9u;
    x ^= x >> 13;
    return x;
}
```

which gives identical results (when `x` is even, `x|1` equals `x^1` and `x*x` is even, so the flipped low bit moves across the XOR; when `x` is odd the two forms coincide bit for bit).
Might be worth comparing some of these to those on https://mostlymangling.blogspot.com/, which uses a similar PractRand technique.
Unless I'm missing something (and I well may be), measuring things becomes somewhat harder at 64 bits. E.g. I just spent a number of minutes running a Monte-Carlo avalanche calculation (exhaustive is not practical for 64 bits), and got... this:

| Function | Emax | Erms |
|---|---|---|
| hash64 | 1.967±0.406 | 0.503±0.011 |
| murmur64 | 1.841±0.378 | 0.498±0.015 |
| lea64 | 1.869±0.298 | 0.499±0.011 |
| degski64 | 1.812±0.259 | 0.500±0.017 |
| nasam | 1.863±0.373 | 0.501±0.016 |
| moremur | 1.924±0.467 | 0.503±0.019 |
| ettingerMixer | 490.445±0.210 | 53.035±0.044 |
| rrmxmx | 1.856±0.239 | 0.499±0.019 |
| rrxmrrxmsx_0 | 1.895±0.415 | 0.498±0.010 |
Not very precise (assuming I haven't botched it in the first place)... Notes:

- K=8 runs of N=10^6 samples for each function. The "X ± Y" are simply the average and 3*σ of the resulting 8 values (edit: without Bessel's correction).
- Emax/Erms are deviations of the counts from N/2, scaled by sqrt(N); an ideal mixer should give Erms near 0.5.
- No idea what's up with ettingerMixer.
- hash64 is an adaptation ([31,35,⌊2^64/φ⌋,27], probably not actually good) of the above function to 64 bit.
- Similarly with PractRand: for 32 bit, 32GB is enough, but e.g. RRC-64-42-TF2-0.94 is 4TB, and some of the mentioned tests are up to 128TB.
Someone with access to more computing power might want to give it a try.
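For reference, a Monte-Carlo avalanche run along those lines might look like this (a sketch; the splitmix-style input sampler and the exact normalization are my assumptions, chosen so an ideal mixer lands near Erms = 0.5):

```c
#include <stdint.h>
#include <stdio.h>
#include <math.h>

uint64_t hash64(uint64_t x);  // mixer under test

// Splitmix64-style sampler for inputs (any decent source would do).
static uint64_t sample(uint64_t *s) {
    uint64_t z = (*s += 0x9E3779B97F4A7C15ull);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
    return z ^ (z >> 31);
}

int main(void) {
    const uint64_t N = 1000000;  // one run; the table above averages K=8 runs
    static uint64_t counts[64][64];
    uint64_t seed = 1;
    for (uint64_t n = 0; n < N; n++) {
        uint64_t x = sample(&seed), h = hash64(x);
        for (int i = 0; i < 64; i++) {
            uint64_t d = h ^ hash64(x ^ (1ull << i));
            for (int j = 0; j < 64; j++)
                counts[i][j] += (d >> j) & 1ull;
        }
    }
    // Deviation from N/2 in units of sqrt(N); ideal mixer: Erms ~ 0.5.
    double emax = 0, erms = 0;
    for (int i = 0; i < 64; i++)
        for (int j = 0; j < 64; j++) {
            double e = fabs((double)counts[i][j] - 0.5 * N) / sqrt((double)N);
            if (e > emax) emax = e;
            erms += e * e;
        }
    printf("Emax=%.3f Erms=%.3f\n", emax, sqrt(erms / 4096.0));
    return 0;
}
```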
I (Tommy Ettinger) also wonder what's up with ettingerMixer, but I wouldn't be surprised if it's just bad when measured with specific tests. @fp64 , can you post or link the source of the ettingerMixer you used? It might give us some insight into what makes generators fail, since not many seem to do that right now.
From https://mostlymangling.blogspot.com/2019/01/better-stronger-mixer-and-test-procedure.html (except I changed the suffixes to `ull`):

```c
// rotate-left helper (not in the original snippet)
static uint64_t rol64(uint64_t x, int r) { return (x << r) | (x >> (64 - r)); }

uint64_t ettingerMixer(uint64_t v) {
    uint64_t z = (v ^ 0xDB4F0B9175AE2165ull) * 0x4823A80B2006E21Bull;
    z ^= rol64(z, 52) ^ rol64(z, 21) ^ 0x9E3779B97F4A7C15ull;
    z *= 0x81383173ull;
    return z ^ z >> 28;
}
```
Update: heatmap (scaled to 0..9): [image]
> Not very precise

Some more context on the accuracy of Monte Carlo. As we can see, reasonably good functions look basically indistinguishable under Monte Carlo for fairly sizeable N, but Monte Carlo can detect sharper deviations a lot earlier.
Speaking of measuring things exactly, some optimal-in-their-class `uint8_t -> uint8_t` hash functions look like this:

| Type | Constants | Linf | RMS | Notes |
|---|---|---|---|---|
| xorr,mul,xorr,mul,xorr | [ 1,0x0F, 3,0xDD, 4] | 16 | 7.3654599 | Optimal for both; [ 1,0x0F, 3,0x5D, 4] has the same Linf |
| xorr,xqo,xorr,mul,xorr | [ 2, 5,0x93, 5] | 20 | 8.9442719 | Optimal for both; [ 5, 4,0x9F, 2] and [ 3, 6,0x75, 1] have the same Linf |
| xorr,mul,xorr,xqo,xorr | [ 5,0x7B, 4, 4] | 20 | 11.1691540 | Linf-optimal; [ 4,0x43, 3, 4], [ 4,0xC3, 3, 4] have the same Linf |
| xorr,mul,xorr,xqo,xorr | [ 3,0x3B, 4, 4] | 24 | 9.4339811 | RMS-optimal |
| xorr,mul,xorr,xqo,xorr | [ 3,0xBB, 4, 4] | 24 | 9.4339811 | RMS-optimal |
| xorr,xqo,xorr,xqo,xorr | [ 4, 5, 2] | 24 | 11.8532696 | Linf-optimal |
| xorr,xqo,xorr,xqo,xorr | [ 2, 5, 4] | 32 | 10.5237826 | RMS-optimal |
The function

```c
uint32_t hash(uint32_t x)
{
    x ^= x >> 23;
    x = (x | 1u) ^ (x * x);
    x ^= x >> 13;
    x = (x | 1u) ^ (x * x);
    x ^= x >> 9;
    x = (x | 1u) ^ (x * x);
    x ^= x >> 17;
    return x;
}
```
appears to produce better avalanche scores than `triple32` (actually there are a lot of functions with better Linf scores, like the one below, but lower RMS is harder to obtain).
| Function | Linf | RMS | PractRand |
|---|---|---|---|
| triple32 | 167788 | 44857.8 | Fails at 1GB ('mildly suspicious' at 8KB) |
| [23,13, 9,17] | 158752 | 44181.3 | Fails at 1GB |
| [14,14,14,14] | 133756 | 46713.3 | Fails at 1GB |
The [14,14,14,14] one is easy to memorize.
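Spelled out, that memorable variant is just the function above with every shift set to 14 (a direct transcription; the name is mine):

```c
uint32_t hash14(uint32_t x)   // the [14,14,14,14] entry
{
    x ^= x >> 14;
    x = (x | 1u) ^ (x * x);
    x ^= x >> 14;
    x = (x | 1u) ^ (x * x);
    x ^= x >> 14;
    x = (x | 1u) ^ (x * x);
    x ^= x >> 14;
    return x;
}
```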
@tommyettinger, I just noticed something curious. As mentioned before, the 32-bit RNG

```c
uint32_t rng32()
{
    static uint64_t state = 0ull;  // 'static' so the state persists across calls
    state = -(state ^ (state * state | 5ull));
    return (uint32_t)(state >> 32);
}
```

performs very smoothly in PractRand, up until it abruptly and harshly fails at 16GB. Similarly for the 16-bit RNG, which fails at 512KB. This makes sense, since the period of the k-th bit of the state is 2^k (tests fail close to the period of the lowest bit of the output). However, the 64-bit version
```c
uint64_t rng64()
{
    static __uint128_t state = 0ull;
    state = -(state ^ (state * state | 5ull));
    return (uint64_t)(state >> 64);
}
```

seems to fail very fast (but not immediately!):
```
rng=RNG_stdin64, seed=unknown
length= 1 kilobyte (2^10 bytes), time= 0.5 seconds
  Test Name                  Raw       Processed   Evaluation
  DC6-9x1Bytes-1             R=  +5.7  p =  0.019  unusual
  ...and 5 test result(s) without anomalies

rng=RNG_stdin64, seed=unknown
length= 2 kilobytes (2^11 bytes), time= 0.7 seconds
  Test Name                  Raw       Processed   Evaluation
  DC6-9x1Bytes-1             R=  +6.6  p = 8.5e-3  unusual
  ...and 7 test result(s) without anomalies

rng=RNG_stdin64, seed=unknown
length= 4 kilobytes (2^12 bytes), time= 1.0 seconds
  Test Name                  Raw       Processed   Evaluation
  Gap-16:A                   R= +15.4  p = 1.6e-13 FAIL
  Gap-16:B                   R= +16.2  p = 1.0e-13 FAIL
  ...and 22 test result(s) without anomalies
```
Seeding the state with something like `-1ull/5` makes it OK for at least a couple of gigabytes. Some small seeds yield "suspicious" early on, but proceed smoothly afterwards. Returning `(uint64_t)(state^(state>>64))` produces "mildly suspicious" at 2KB and a few "unusual" later, but passes for at least a few gigabytes (for the 32-bit version there seems to be not much difference between returning `state>>32` and `state^(state>>32)`).
Note: the actual code didn't use `__uint128_t`, emulating it with `uint64_t`s, but the results should match.
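For reference, such an emulation might look like this (my reconstruction, not the code actually used; only a 64x64->128 square and a 128-bit negate are needed):

```c
#include <stdint.h>

// Portable high half of a 64x64-bit multiply, via 32-bit halves.
static uint64_t mulhi64(uint64_t a, uint64_t b) {
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t mid = ((a0 * b0) >> 32) + (uint32_t)(a0 * b1) + (uint32_t)(a1 * b0);
    return a1 * b1 + ((a0 * b1) >> 32) + ((a1 * b0) >> 32) + (mid >> 32);
}

typedef struct { uint64_t lo, hi; } u128;

uint64_t rng64_step(u128 *s) {
    // (hi*2^64 + lo)^2 mod 2^128 = lo^2 + (2*hi*lo mod 2^64)*2^64
    uint64_t sq_lo = s->lo * s->lo;
    uint64_t sq_hi = mulhi64(s->lo, s->lo) + 2 * s->hi * s->lo;
    uint64_t lo = s->lo ^ (sq_lo | 5);   // state ^ (state*state | 5)
    uint64_t hi = s->hi ^ sq_hi;
    s->lo = ~lo + 1;                     // 128-bit negation: ~x + 1 ...
    s->hi = ~hi + (lo == 0);             // ... with carry into the high word
    return s->hi;
}
```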
The "xorsquare" function (which I briefly describe here) is of the form:
for any integer
a
Note: the alternative version (`x^=((x+a)*(x+a))&-2;`) produces identical results for values of `a` that differ in the most significant bit.

It is straightforward to show (e.g. by induction on wordsize) that it is invertible: bit k of `((x+a)*(x+a))|1` depends only on bits of `x` below k (and bit 0 is constant), so the outer XOR lets the input bits be recovered one at a time, from low to high.