dumblob opened 2 years ago
This is a fair point, might try one of these to see if the performance improves. In this specific use-case we are simply hashing a 4-byte integer, so the input length is constant and we only need the mixer part. We're technically not using the full xxh3, so this code might indeed be simpler. In fact, there are no loops to unroll in the first place.
Here's what the library is using now:
```go
x := v ^ (0x1cad21f72c81017c ^ 0xdb979083e96dd4de) + seed
x ^= bits.RotateLeft64(x, 49) ^ bits.RotateLeft64(x, 24)
x *= 0x9fb21c651e98df25
x ^= (x >> 35) + 4
x *= 0x9fb21c651e98df25
x ^= x >> 28
```
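For reference, the snippet wraps up into a self-contained function like this (my own wrapper for benchmarking, not the library's actual code):

```go
package main

import (
	"fmt"
	"math/bits"
)

// mix64 is the xxh3-style mixer from the snippet above, wrapped as a
// standalone function so it can be tested and benchmarked in isolation.
func mix64(v, seed uint64) uint64 {
	x := v ^ (0x1cad21f72c81017c ^ 0xdb979083e96dd4de) + seed
	x ^= bits.RotateLeft64(x, 49) ^ bits.RotateLeft64(x, 24)
	x *= 0x9fb21c651e98df25
	x ^= (x >> 35) + 4
	x *= 0x9fb21c651e98df25
	x ^= x >> 28
	return x
}

func main() {
	// Every step is a bijection for a fixed seed, so distinct inputs can
	// never collide.
	fmt.Println(mix64(1, 0) != mix64(2, 0)) // true
}
```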
Most of the hashes are pretty much the same. I just looked at mx3; we can easily benchmark it. It looks quite similar for this use-case.
```c
x ^= x >> 32;
x *= 0xbea225f9eb34556d;
x ^= x >> 29;
x *= 0xbea225f9eb34556d;
x ^= x >> 32;
x *= 0xbea225f9eb34556d;
x ^= x >> 29;
```
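Translated to Go for an apples-to-apples comparison with the xxh3 snippet (my own translation of the lines above, using the same constant; not code from the mx3 repository):

```go
package main

import "fmt"

// mx3Mix is a direct Go translation of the mx3 mixer shown above; each
// xorshift and odd-multiply step is a bijection on uint64.
func mx3Mix(x uint64) uint64 {
	const c = 0xbea225f9eb34556d
	x ^= x >> 32
	x *= c
	x ^= x >> 29
	x *= c
	x ^= x >> 32
	x *= c
	x ^= x >> 29
	return x
}

func main() {
	// Bijectivity means distinct inputs always map to distinct outputs.
	fmt.Println(mx3Mix(1) != mx3Mix(2)) // true
}
```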
My biggest problem is actually the atomic counter. That operation doesn't scale on manycore systems, so I used RDTSCP to get the counter, along with some fences.
In fact, I just tried swapping it with the mx3 mixer. Tests pass, but I still need to add a proper chi-squared test.
Here are the results with mx3 hashing:
```
BenchmarkParallel/rand-001-8    390485431    3.095 ns/op    0 B/op    0 allocs/op
BenchmarkParallel/rand-008-8    383161458    3.243 ns/op    0 B/op    0 allocs/op
BenchmarkParallel/rand-032-8    377720888    3.240 ns/op    0 B/op    0 allocs/op
BenchmarkParallel/rand-128-8    383817609    3.104 ns/op    0 B/op    0 allocs/op
BenchmarkParallel/rand-512-8    389031010    3.138 ns/op    0 B/op    0 allocs/op
BenchmarkParallel/rand-2048-8   377937162    3.139 ns/op    0 B/op    0 allocs/op
```
And these are with xxh3 hashing:
```
BenchmarkParallel/rand-001-8    389891288    3.075 ns/op    0 B/op    0 allocs/op
BenchmarkParallel/rand-008-8    339281023    3.272 ns/op    0 B/op    0 allocs/op
BenchmarkParallel/rand-032-8    357304174    3.861 ns/op    0 B/op    0 allocs/op
BenchmarkParallel/rand-128-8    389677950    3.067 ns/op    0 B/op    0 allocs/op
BenchmarkParallel/rand-512-8    378438109    3.175 ns/op    0 B/op    0 allocs/op
BenchmarkParallel/rand-2048-8   364905903    3.083 ns/op    0 B/op    0 allocs/op
```
Yep, the speed you measured is more or less what I expected (though xxh3 seems only partially implemented, which is kind of an "unfair" advantage speed-wise :wink: - but that's what you said, so nothing bad about it).
But I wonder why you consider an atomic counter non-scalable. From my humble understanding, it should be enough to have just a "pre-seed" as an atomic counter (to get different random values in each thread). This shouldn't affect the PRNG at all, because you can read the process-global pre-seed into thread-local storage as the "true seed" for the PRNG, and from then on everything runs truly in parallel.
But I probably misunderstood what you mean. In that case, there are splittable PRNGs such as the one in JAX.
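A minimal sketch of that pre-seed idea in Go (the names and the splitmix64 finalizer are mine, purely for illustration): each goroutine touches the shared atomic exactly once to derive a private seed, then updates its state locally.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// preSeed is the process-global atomic counter: each goroutine reads it
// exactly once, so the atomic operation stays off the hot path.
var preSeed uint64

// splitmix64Mix is the splitmix64 finalizer, used here only as an
// illustrative bijection that expands a small counter into a 64-bit state.
func splitmix64Mix(x uint64) uint64 {
	x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9
	x = (x ^ (x >> 27)) * 0x94d049bb133111eb
	return x ^ (x >> 31)
}

func worker(out *uint64, wg *sync.WaitGroup) {
	defer wg.Done()
	// One atomic op per goroutine lifetime, then purely local updates.
	state := splitmix64Mix(atomic.AddUint64(&preSeed, 1))
	for i := 0; i < 1000; i++ {
		state = splitmix64Mix(state)
	}
	*out = state
}

func main() {
	var wg sync.WaitGroup
	results := make([]uint64, 4)
	for i := range results {
		wg.Add(1)
		go worker(&results[i], &wg)
	}
	wg.Wait()
	// Distinct pre-seeds stay distinct through a bijection, so the
	// per-goroutine streams never coincide.
	fmt.Println(results[0] != results[1]) // true
}
```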
Hey, I'll have a look into JAX, never heard about it.
The atomic increment is still effectively a lock, so it would not scale beyond a certain number of cores. Most of the time you don't care, as it's not called that often, but when I was working on my column benchmark, I ran into an issue where an atomic random called too often caused a degradation inside my benchmark.
For the thread-local storage, this was actually my initial approach this morning. The issue is that there's no native way to use it in Go, which is a bit annoying. So I started with an array of counters where n = runtime.NumCPU(), and then each time I needed a random number, I used a CPUID instruction to get the current core for the goroutine and fetched the counter at that index. It worked, but the overhead of that CPUID instruction was pretty huge, around 30ns in my benchmarks. That was expected, since it pulls a lot more than just the id.
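The per-core counter array described above could look roughly like this sketch. Note that `coreID` is a stub here: Go has no portable "current CPU" API, so a real version needs assembly (the CPUID/RDTSCP lookup whose cost is discussed above), and the padding exists to avoid false sharing between cores.

```go
package main

import (
	"fmt"
	"runtime"
)

// paddedCounter gives each slot its own cache line so that cores
// incrementing neighboring slots don't false-share.
type paddedCounter struct {
	n uint64
	_ [56]byte // pad the 8-byte counter up to a 64-byte cache line
}

// counters holds one slot per logical CPU, mirroring the approach above.
var counters = make([]paddedCounter, runtime.NumCPU())

// coreID is a stand-in for the CPUID-based lookup from the thread; a real
// implementation needs platform-specific assembly.
func coreID() int { return 0 }

func next() uint64 {
	c := &counters[coreID()]
	c.n++
	return c.n
}

func main() {
	fmt.Println(next(), next()) // 1 2
}
```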
To get around that, I decided to try RDTSC, which reads the time-stamp counter and is pretty fast. As mentioned before, I still want to expose other APIs in the lib where you explicitly provide the counter and seed.
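That explicit-state API could look something like the sketch below. The `Source` type and the mx3-style mixer are my own illustration of the idea, not the library's actual API: the caller owns both counter and seed, so nothing is shared.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Source is a hypothetical explicit-state API: the caller provides and
// owns the seed and counter, so the library keeps no global state.
type Source struct {
	Seed    uint64
	Counter uint64
}

// Uint64 hashes the incremented counter with the seed. The mixer is an
// mx3-style xorshift-multiply chain, chosen only for illustration; every
// step is a bijection, so distinct counters never collide.
func (s *Source) Uint64() uint64 {
	s.Counter++
	x := s.Counter ^ bits.RotateLeft64(s.Seed, 31)
	const c = 0xbea225f9eb34556d
	x ^= x >> 32
	x *= c
	x ^= x >> 29
	x *= c
	x ^= x >> 32
	return x
}

func main() {
	s := Source{Seed: 42}
	fmt.Println(s.Uint64() != s.Uint64()) // true: the counter advances
}
```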
You can try the wyrand PRNG in the wyhash package.
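wyrand is tiny: one add and one 64x64->128 multiply-fold. A Go sketch from memory of the wyhash repository follows; the constants are the published wyrand constants, but worth double-checking against the exact wyhash version you use.

```go
package main

import (
	"fmt"
	"math/bits"
)

// wyrand advances the seed by an odd constant, then folds the high and low
// halves of a 64x64->128-bit multiply; constants as published in wyhash.
func wyrand(seed *uint64) uint64 {
	*seed += 0xa0761d6478bd642f
	hi, lo := bits.Mul64(*seed, *seed^0xe7037ed1a0b428db)
	return hi ^ lo
}

func main() {
	a, b := uint64(7), uint64(7)
	fmt.Println(wyrand(&a) == wyrand(&b)) // true: same seed, same output
}
```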
> Hey, I'll have a look into JAX, never heard about it.
>
> The atomic increment is still effectively a lock, so it would not scale beyond a certain number of cores. Most of the time you don't care, as it's not called that often, but when I was working on my column benchmark, I ran into an issue where an atomic random called too often caused a degradation inside my benchmark.
Sure, anything "atomic" inside of a non-splittable PRNG will cause serious degradation. I'd try really hard to avoid anything shared.
> For the thread-local storage, this was actually my initial approach this morning. The issue is that there's no native way to use it in Go, which is a bit annoying.
This is really bad - I didn't look at Go's support for thread-local storage, but if it's not there, I'd definitely invest the energy (even if it's a lot) into implementing my own tiny, constrained support for it (at least for Go's default compiler - not for cgo nor any other at the beginning) using unsafe... Without TLS it's nowadays almost impossible to make scalable algos.
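Short of real TLS, one pragmatic stand-in in Go is sync.Pool: it's not thread-local storage, but its per-P caches mean that under load each P mostly reuses its own object, so there is no shared hot counter. A sketch (the counter-plus-mixer scheme is mine, just to make the idea concrete):

```go
package main

import (
	"fmt"
	"sync"
)

// rngState holds per-object PRNG state; pooled objects act as a rough
// approximation of thread-local state.
type rngState struct{ x uint64 }

var pool = sync.Pool{New: func() any { return new(rngState) }}

// mix is an mx3-style finalizer (same constant as in the thread); being a
// bijection, distinct counter values always give distinct outputs.
func mix(x uint64) uint64 {
	const c = 0xbea225f9eb34556d
	x ^= x >> 32
	x *= c
	x ^= x >> 29
	x *= c
	x ^= x >> 32
	x *= c
	x ^= x >> 29
	return x
}

func next() uint64 {
	s := pool.Get().(*rngState)
	s.x++ // purely local increment; no atomics on the hot path
	v := mix(s.x)
	pool.Put(s)
	return v
}

func main() {
	// mix is a bijection with mix(0) == 0, and the counter is always >= 1,
	// so the output is never zero.
	fmt.Println(next() != 0) // true
}
```

The obvious caveat: sync.Pool may drop or migrate objects between garbage collections, so this trades strict per-thread streams for simplicity.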
Just out of curiosity, let me mention there are faster alternatives to xxh3 or algos with about the same speed but jaw-droppingly simpler. See https://github.com/rurban/smhasher/ .
Namely, the fastest (for both bulk & short inputs) seems to be wyhash, and the simplest among the fastest ones is mx3 (btw. I feel mx3 could still have some room for speed optimization, as I suspect some false dependencies between instructions in the pipeline... maybe even good old manual loop unrolling could make a difference, IDK). But maybe xxh3 is better in some ways which I don't see but would like to. I can imagine something related to DBs, but I don't know what exactly :wink:.