kelindar / xxrand

Fast, scalable pseudo random number generator based on xxh3
MIT License
10 stars 1 forks source link

Other fast(er) hash algorithms #1

Open dumblob opened 2 years ago

dumblob commented 2 years ago

Just out of curiosity, let me mention there are faster alternatives to xxh3 or algos with about the same speed but jaw-droppingly simpler. See https://github.com/rurban/smhasher/ .

Namely fastest (for both bulk & short inputs) seems to be wyhash and simpliest from the fastest ones mx3 (btw. I feel mx3 could still have some space for speed optimization as I suspect some false dependencies between instructions in the pipeline... - maybe even good old manual loop unrolling could make a difference, IDK).

But maybe xxh3 is better in some ways which I don't see but would like to. I can imagine something related to DBs but I don't know what exactly :wink:.

kelindar commented 2 years ago

This is a fair point, might try one of these to see if the performance improves. In this specific use-case we are simply hashing 4 byte integer, hence the input is constant and we only need the mixer part. So we're technically not using the full xxh3 and this code might be simpler indeed. In fact, there's no loops to unroll in the first place.

Here's what the library is using now:

x := v ^ (0x1cad21f72c81017c ^ 0xdb979083e96dd4de) + seed
x ^= bits.RotateLeft64(x, 49) ^ bits.RotateLeft64(x, 24)
x *= 0x9fb21c651e98df25
x ^= (x >> 35) + 4
x *= 0x9fb21c651e98df25
x ^= (x >> 28)

Most of the hashes are pretty much the same, I just looked at mx3, we can easily benchmark. Looks quite similar for the use-case.

x ^= x >> 32;
x *= 0xbea225f9eb34556d;
x ^= x >> 29;
x *= 0xbea225f9eb34556d;
x ^= x >> 32;
x *= 0xbea225f9eb34556d;
x ^= x >> 29;

My biggest problem is the atomic counter actually. This instruction doesn't scale with manycore systems so I used RDTSCP to get the counter, along with some fences.

kelindar commented 2 years ago

In fact, I just tried by swapping it with mx3 mixer. Tests pass, but I need to add a proper chi squared test.

Here's results with mx3 hashing:

BenchmarkParallel/rand-001-8            390485431                3.095 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-008-8            383161458                3.243 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-032-8            377720888                3.240 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-128-8            383817609                3.104 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-512-8            389031010                3.138 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-2048-8           377937162                3.139 ns/op           0 B/op          0 allocs/op

And these are with xxh3 hashing:

BenchmarkParallel/rand-001-8            389891288                3.075 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-008-8            339281023                3.272 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-032-8            357304174                3.861 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-128-8            389677950                3.067 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-512-8            378438109                3.175 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-2048-8           364905903                3.083 ns/op           0 B/op          0 allocs/op
dumblob commented 2 years ago

Yep, the speed you measured is something I more or less expected (though xxh3 seems just partially implemented which is kind of an "unfair" advantage speed-wise :wink: - but that's what you said, so nothing bad about it).

But I wonder why do you consider an atomic counter non-scalable? From my humble understanding it should be enough te have just a "pre-seed" as an atomic counter (to have different random values in each thread) but this shouldn't affect the PRNG at all because you can read the process-global pre-seed into thread-local storage into a "true seed" for the PRNG and since then everything works truly in parallel.

But I probably misunderstood what you mean. In which case there are splittable PRNGs such as JAX.

kelindar commented 2 years ago

Hey, I'll have a look into JAX, never heard about it.

For the atomic increment is still a lock, so it would not scale beyond a certain number of cores. Most of the time, you don't care as it's not called that often, but when I was working on my column benchmark, I ran into the issue where atomic random called too often caused a degradation inside my benchmark.

For the thread-local storage, this was actually my initial approach this morning. The issue is that there's no native way to use it in Go, which is a bit annoying. So I started by having an array of counters where n = runtime.NumCPU() but it then each time you'd need a random number, I used a CPUID instruction to get the current core for the goroutine and then get the counter at that index. It worked, but the overhead of that CPUID instruction was pretty huge, around 30ns in my benchmarks. It was expected, since it pulls a lot more than just the id.

To go around that, I decided to try out RDTSC which reads the instruction counter and is pretty fast. As mentioned before, I still want to expose other APIs to the lib where you explicitly provide the counter and seed.

wangyi-fudan commented 2 years ago

you can try wyrand PRNG in wyhash package

dumblob commented 2 years ago

Hey, I'll have a look into JAX, never heard about it.

For the atomic increment is still a lock, so it would not scale beyond a certain number of cores. Most of the time, you don't care as it's not called that often, but when I was working on my column benchmark, I ran into the issue where atomic random called too often caused a degradation inside my benchmark.

Sure, anything "atomic" inside of a non-splittable PRNG will cause serious degradation. I'd try really hard to avoid anything shared.

For the thread-local storage, this was actually my initial approach this morning. The issue is that there's no native way to use it in Go, which is a bit annoying.

This is really bad - I didn't look at Go's support for thread local storage, but if it's not there I'd definitely invest energy (even if it's a lot) into implementing my own tiny constrained support for that (at least for Go's default compiler - not for cgo nor any other for the beginning) using unsafe... Without TLS it's becoming nowadays almost impossible to make scalable algos.