nim-lang / Nim

Nim is a statically typed compiled systems programming language. It combines successful concepts from mature languages like Python, Ada and Modula. Its design focuses on efficiency, expressiveness, and elegance (in that order of priority).
https://nim-lang.org

HashSet[uint64] slow insertion depending on values #11764

Closed cwpearson closed 4 years ago

cwpearson commented 5 years ago

The rate of inserting uint64s into a hash set varies wildly with the order and the range of the integers inserted.

Example

# slow_set.nim
import sets
import times

when isMainModule:
    var hs1: HashSet[uint64]
    var hs2: HashSet[uint64]
    var hs3: HashSet[uint64]

    # insert 0..200k
    var time = cpuTime()
    for i in 0..200_000:
        let k1 = uint64(i)
        hs1.incl(k1)
    echo "time ", (cpuTime() - time)

    # interleave insert 0..100k and 100k..200k
    time = cpuTime()
    for i in 0..100_000:
        let k1 = uint64(i)
        let k2 = uint64(i + 100_000)
        hs2.incl(k1)
        hs2.incl(k2)
    echo "time ", (cpuTime() - time)

    # interleave insert 0..100k and 1.0M..1.1M
    time = cpuTime()
    for i in 0..100_000:
        let k1 = uint64(i)
        let k2 = uint64(i + 1_000_000)
        hs3.incl(k1)
        hs3.incl(k2)
    echo "time ", (cpuTime() - time)

Current Output

Compiled with nim c -d:release -r slow_set.nim.

time 0.016778
time 1.01831
time 14.043752

Expected Output

I would expect the three insertion loops to take roughly the same amount of time, since they all insert the same number of unique values. In the second loop I merely interleave the insertion order, and in the third loop I insert some larger numbers.

Possible Solution

Additional Information

This issue https://github.com/nim-lang/Nim/issues/10097 discusses integer hashing and collisions, but all of my uint64s are well below the maximum uint32.

This also happens with hashTable keys, presumably for similar reasons?

In my actual project I am deduplicating an edge list for a graph, where the source and destination vertex values are distributed much like loop 3, so the performance is not good.

$ nim -v
Nim Compiler Version 0.20.2 [MacOSX: amd64]
Compiled at 2019-07-17
Copyright (c) 2006-2019 by Andreas Rumpf

git hash: 88a0edba4b1a3d535b54336fd589746add54e937
active boot switches: -d:release
krux02 commented 5 years ago

In general I have to say that you can always construct a sequence of keys that works very badly for the hash set implementation, no matter what the hashing function does. But this does indeed look strange, as these keys don't look as if they would cause particularly many collisions. Did you measure which part exactly causes this slowdown?

EDIT: I did the benchmark, this loop here is where all the time is wasted: https://github.com/nim-lang/Nim/blob/210988c5324aa71c595e91aafe5a71dc8273a639/lib/pure/collections/hashcommon.nim#L40
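For readers following along, a minimal sketch of the probing being referenced (paraphrased, not the exact stdlib source of that era): nextTry was plain linear probing, so every collision just moves to the adjacent slot, and clustered hash values turn into long scans.

import hashes

proc nextTry(h, maxHash: Hash): Hash {.inline.} =
  # On a collision, simply try the next slot; maxHash is the power-of-two
  # table size minus one, so the "and" wraps around at the end of the array.
  result = (h + 1) and maxHash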

cwpearson commented 5 years ago

I added 3 more experiments, with resizing the hashSet first:

import sets
import times

when isMainModule:
    var hs1: HashSet[uint64]
    var hs2: HashSet[uint64]
    var hs3: HashSet[uint64]
    var hs4 = initHashSet[uint64](rightSize(100_000))
    var hs5 = initHashSet[uint64](rightSize(200_000))
    var hs6 = initHashSet[uint64](rightSize(1_100_000))

    # insert 0..200k
    var time = cpuTime()
    for i in 0..200_000:
        let k1 = uint64(i)
        hs1.incl(k1)
    echo "time ", (cpuTime() - time)

    # interleave insert 0..100k and 100k..200k
    time = cpuTime()
    for i in 0..100_000:
        let k1 = uint64(i)
        let k2 = uint64(i + 100_000)
        hs2.incl(k1)
        hs2.incl(k2)
    echo "time ", (cpuTime() - time)

    # interleave insert 0..100k and 1.0M..1.1M
    time = cpuTime()
    for i in 0..100_000:
        let k1 = uint64(i)
        let k2 = uint64(i + 1_000_000)
        hs3.incl(k1)
        hs3.incl(k2)
    echo "time ", (cpuTime() - time)

    # interleave insert 0..100k and 1.0M..1.1M
    # but insert into a hashSet with space for 100k
    time = cpuTime()
    for i in 0..100_000:
        let k1 = uint64(i)
        let k2 = uint64(i + 1_000_000)
        hs4.incl(k1)
        hs4.incl(k2)
    echo "time ", (cpuTime() - time)

    # interleave insert 0..100k and 1.0M..1.1M
    # but insert into a hashSet with space for 200K
    time = cpuTime()
    for i in 0..100_000:
        let k1 = uint64(i)
        let k2 = uint64(i + 1_000_000)
        hs5.incl(k1)
        hs5.incl(k2)
    echo "time ", (cpuTime() - time)

    # interleave insert 0..100k and 1.0M..1.1M
    # but insert into a hashSet with space for 1.1M
    time = cpuTime()
    for i in 0..100_000:
        let k1 = uint64(i)
        let k2 = uint64(i + 1_000_000)
        hs6.incl(k1)
        hs6.incl(k2)
    echo "time ", (cpuTime() - time)

By resizing the hashSet large enough to cover the largest expected integer, we recover all the performance (loop 6) and then some. Making it just big enough to hold all of the inserted integers recovers some of the performance (loop 5). Making it larger but still too small doesn't have much impact (loop 4).

time 0.017222
time 1.267755
time 16.218246
time 15.879905            # initHashSet(rightSize(100k))
time 9.667273999999999   # initHashSet(rightSize(200k))
time 0.003250000000001307 # initHashSet(rightSize(1.1M))

It's been a while since I took my CS algorithms course, but is it possible that the logic which controls when to resize the HashSet as it grows has some problems?

cwpearson commented 5 years ago

Just in case it's useful, I put the benchmark code in a repository:

https://github.com/cwpearson/nim-issue-11764

cwpearson commented 5 years ago

Hi, now that 1.0.0 has come out I re-ran this https://github.com/cwpearson/nim-issue-11764:

Nim Compiler Version 1.0.0 [Linux: amd64]
Compiled at 2019-09-23
Copyright (c) 2006-2019 by Andreas Rumpf

git hash: f7a8fc46c0012033917582eb740dc0343c093e35
active boot switches: -d:release
(1) time 0.015417049
(2) time 0.952415456
(3) time 12.335662364
(4) time 12.212460556
(5) time 7.530930397999999
(6) time 0.003257626999996432

I'd expect 1-3 to be roughly the same performance, and 4-6 to be somewhat better than 1-3 due to the pre-allocation. In practice, that is still not what we see.

Araq commented 5 years ago

Looks like you crafted the numbers to produce collisions; that is always a danger when using hashing.

cwpearson commented 5 years ago

Thanks!

It looks like the hash for int is just that int value: https://github.com/nim-lang/Nim/blob/e9f7455bd68d6b03130b0671926515967328410f/lib/pure/hashes.nim#L115-L117.

It seems like the hashset implementation tries the next consecutive bucket when there is a collision: https://github.com/nim-lang/Nim/blob/e9f7455bd68d6b03130b0671926515967328410f/lib/pure/collections/hashcommon.nim#L29-L30

The general wisdom for open addressing sets seems to be that the hash function needs to avoid placing consecutive values in consecutive buckets (wikipedia). "Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent."

1) I agree that collisions are always possible with a hash function, but this choice of hash function has collisions in a common scenario, where other choices would only have collisions in a pathological case. The "efficient" hash implementation for ints is actually causing a big performance issue.

2) Even if you don't agree with 1), I think the concern about clustering summarized in Wikipedia may argue for a redesign of the hash/hashset implementation for integers. When folks are hashing they probably want to avoid collisions, so it might be worth considering using the MurmurHash that is used for other data on ints as well, even if the cost of each hash is higher.
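To make the clustering concrete, here is a small illustrative sketch (the 2^19-slot table size is an assumption about the growth policy once ~200k keys are present, not a measured value): with the identity hash, the slot is just the key masked down to the table size, so the two dense ranges from loop 3 form two long runs that collide once the second range wraps around.

const mask = 524_288 - 1               # 2^19 slots, power-of-two table
for i in [0, 1, 48_576, 75_712]:
  let k1 = uint64(i)                   # keys from the 0..100k range
  let k2 = uint64(i + 1_000_000)       # keys from the 1.0M..1.1M range
  echo "slots: ", (k1 and uint64(mask)), " vs ", (k2 and uint64(mask))
# k2 starts at slot 475712, wraps past 524287 at i = 48576, and from there
# lands in slots already claimed by the k1 range, producing long probe chains.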

As an aside, I gave IntSet from the Nim standard library a try and the performance is much better (case (8) in https://github.com/cwpearson/nim-issue-11764).

Araq commented 5 years ago

But my point is that

   let k1 = uint64(i)
   let k2 = uint64(i + 1_000_000)

is not common. Usually IDs are mostly consecutive. May I ask where your integers come from?

rockcavera commented 5 years ago

I noticed the same problem with the sets and tables packages. In order not to open a new issue, I decided to comment here.

For testing I used the developer version 2019-11-14 of Nim. I compiled with gcc, clang and vcc.

I created a repository with demo code and a description of the problem: https://github.com/rockcavera/nim-problem

I believe the issue should be reopened.

rockcavera commented 5 years ago

I believe the C# HashSet implementation is good and can be used as an example for Nim.

I believe this is the source code

With it, the highu64.txt file was processed in 0.220885200 seconds, against 88.57006502151489 seconds for my Nim implementation with sets.

narimiran commented 5 years ago

I'm already working on it, and Python's way of probing (link) seems to work nicely with the current Nim implementation.

The remaining problem is that I was not able to make Nim bootstrap anymore with the new probing.
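For context, a rough sketch of the CPython-style probing being referred to (the constants follow CPython's dict; the exact integration in the eventual Nim change may differ):

import hashes

proc nextTry(h, maxHash: Hash, perturb: var Hash): Hash {.inline.} =
  # Feed the so-far-unused high bits of the hash ("perturb") into every
  # step, so runs of consecutive hashes no longer probe consecutive slots.
  perturb = cast[Hash](cast[uint](perturb) shr 5)
  result = (5 * h + 1 + perturb) and maxHash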

narimiran commented 4 years ago

This has been fixed in PR #12407.

rockcavera commented 4 years ago

I didn't understand why they closed this issue, since the problem was not solved.

disruptek commented 4 years ago

I think since the original issue seems to be fixed, you should create another issue that references your excellent benchmark so we can track fixing that separate, if similar, problem.

narimiran commented 4 years ago

I didn't understand why they closed this issue, since the problem was not solved.

On my machine, using Nim devel 1.1.1, the original example produces:

time 0.011247608
time 0.009370479000000001
time 0.01103503

The second example produces:

time 0.011447444
time 0.011687636
time 0.009552221
time 0.004966462000000005
time 0.003419083000000003
time 0.004879198000000008

@rockcavera can you be more specific, what problem remains unsolved? (What Nim version are you using?)

rockcavera commented 4 years ago

@narimiran https://github.com/rockcavera/nim-problem

I created this repository to show the performance problem of the sets and tables collections (I haven't tested other Nim collections). It contains three Nim solutions for eliminating duplicate entries: one using a seq and sort, another using sets, and another using tables (hash tables).

The most balanced solution, with no disparity between integers that use high or low bits, is the one using a seq and sort. With the sets and tables collections there is a dramatic drop in performance when bits above 32 are set.

In addition to the Nim solutions, the repository has C#, Perl and mIRC Scripting solutions (the last is very slow, but much more consistent than Nim).

I honestly don't understand why 400,000 integer entries with bits above 32 set take more than 88 seconds with the sets and tables collections.

If necessary, I open a new issue.

Edit: I'm using Nim 1.1.1 devel 03-02-2020 on Windows 10 64-bit.

c-blake commented 4 years ago

All that is really required to fix the above test is adding these three lines right after import times in the above program:

import hashes, bitops
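# Rotate-multiply ("RoMu"-style) mix: multiplying by a large odd constant
# spreads key bits across the whole word, and the rotation moves high-order
# bits down into the low bits that the table's mask actually uses.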
proc hash(x: uint64): Hash =
  Hash(rotateLeftBits(x * 15241094284759029579'u64, 27))

With that I get (with -d:danger, gcc-10, O3 on an i7-6700k @4.7GHz):

time 0.007326147999999999
time 0.005117073
time 0.006675165999999998

No one hash can be best for all key sets, but Nim lets you change it. The hash above is OK and actually comes from a pseudo-random number generator called "RoMu" (short for Rotate-Multiply). There are more scrambling ones, like hashWangYi1 over at https://github.com/c-blake/adix (in althash). Or use your own favorite, which might be "better than random" for your keys. E.g., for a static set of keys you can create a perfect hash (see https://en.wikipedia.org/wiki/Perfect_hash_function).

brentp commented 4 years ago

AFAICT, all of the fixes (except #13418) related to this issue have been negated in recent Nim, because of 6a0e87eb3, which made uint and uint64 Ordinal types (the docs need updating for this).

The hash used for uint64 by default is:

proc hash*[T: Ordinal](x: T): Hash {.inline.} =
  ## Efficient hashing of integers.
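  # (i.e. the identity function: the hash of an integer is the integer itself)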
  cast[Hash](ord(x))

This is silly. I am hashing kmers, and this makes things impossibly slow for a HashSet[uint64] with only 5 million values. Even the now-removed function:

proc hash*(x: uint64): Hash {.inline.} =
  ## Efficient hashing of `uint64` integers.
  result = cast[Hash](cast[uint](x) * prime)
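  # (prime was a small constant, 11, in the old stdlib hashes module)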

was not great, but it doesn't have horrible worst-case (actually normal-case) behavior. The hash from @c-blake works quite well. If the Nim way is to not even hash integer types, that should be extremely clear in the hashes and sets docs.

On @rockcavera's repo, if I run:

nim c -d:danger -d:release -r codesetsu64.nim

with Nim 1.3.1 (from today), I see:

lowu64.txt
duration: 12.89859890937805
original length: 400000
final length without repeated: 398658

highu64.txt
duration: 91.42858910560608
original length: 400000
final length without repeated: 398691

If I copy @c-blake's hash function in there, the durations go down to < 0.1 seconds.
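Concretely, the workaround looks like this: a sketch that reuses the RoMu-style hash quoted earlier (the cast avoids range-check surprises when narrowing to Hash, and the key pattern is made up just to exercise high bits):

import sets, hashes, bitops

proc hash(x: uint64): Hash =
  # Overload resolution prefers this over the generic Ordinal identity hash,
  # so any HashSet[uint64] or Table keyed on uint64 in this module uses it.
  cast[Hash](rotateLeftBits(x * 15241094284759029579'u64, 27))

var s: HashSet[uint64]
for i in 0 ..< 1_000_000:
  s.incl(uint64(i) * 0x100000000'u64)   # keys whose variation is all above bit 32
echo s.len                               # 1000000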

c-blake commented 4 years ago

You are free to copy my hash whenever you need it. It's just the first "hop" of the simplest of the "RoMu" random number generators (http://romu-random.org/) effectively taking the key as the seed for the PRNG. It's like one line of code, really.

As already mentioned, there is no "best hash for all cases". Both the identity hash, that *11 one, and that RoMu based one map integer sequences with regular spacing to non-colliding slots in a way better than random, meaning fewer collisions than you would expect from Poisson statistics. These are unusual targets/defaults that I guess most people don't understand. Even Bob Jenkins who did the progenitor hash for most popular modern hashes needed me to tell him about Knott hashes and rotl/rotr back in 1997 even though all that was in Knuth's TAOCP from the 60s.

All that said, I agree that both documentation and alternative hashes were lacking. Nim is super duper accepting of documentation PRs. You should do one! Defaults are always tricky, but basically "key semi-hostility" seems a safer, if slightly slower, default assumption, while hashes optimized for other patterns are maybe better thought of as being "for power users" willing to replace hash for their special key sets. We are working on changing the default integer hash to hashWangYi1, while also providing hashIdentity and keeping the old behavior available via -d:nimV1hash, over at https://github.com/nim-lang/Nim/pull/13823. It just didn't quite make the 1.2 cutoff. The new hash is almost as fast as RoMu and passes all SMHasher tests for output bit entropy. (SMHasher is kind of the "BigCrush" PRNG test suite equivalent for hash functions, but not really as thorough as BigCrush.) Since you seem comfortable using the devel branch, you should see it soon.

Key "total hostility", a.k.a. collision performance/DoS attacks, requires at least another level or three of finesse, which I am working on over at https://github.com/c-blake/adix. That uses probe-depth overlongness (on underfull tables) to signal ordinary resizing as well as rehashing the hash and/or Robin Hood reorganization, with the possibility of a "quasi-secret" or totally secret hash salt, and it may even be able to fall back to B-trees in the end in case your secret hash order is leaked via operation timing. That lets you have fast defaults, falling back only as needed. (Falling forward, i.e. reversing that fallback, is hard, since the only general way to know it's safe to deactivate mitigations is to "do them, too, in the background", which is both "too slow by construction" and takes at least 2x the memory.)

c-blake commented 4 years ago

As a quick "quasi-proof" that there is no truly safe default, consider that MD5 and SHA1 are both far more expensive and are now considered broken, meaning that computing another key colliding with a known hash value is no longer considered intractable. Even if that were intractable, a table may still be unsafe, since it requires a collision in only a few bits. For data-analytic purposes like yours these attack scenarios are just theoretical, but they really do happen to hash tables in the wild for things like network services.

Also, while fast&simple, my RoMu based example above does not pass SMHasher at all, but may be better for your use cases. So, even with the coming shift in Nim hashes defaults you may still want that line of code.

brentp commented 4 years ago

@c-blake thanks for your work on this. I have included a ~Yang~ Wang hash in my library to get around this.

c-blake commented 4 years ago

You're welcome. If/once we get it into lib/pure/hashes.nim you can just import hashes to get it instead of defining it locally. We just have tests failing on the CI, whose failure I cannot reproduce locally, holding us up. (Oh, and in terms of proper attribution it's Wang Yi. So, a "Yi hash" if you want just his last name, which is so short it might even confuse.)

peteroupc commented 4 years ago

I am pleased to see this debate on how to "build a better hash table". For your information, here are related links I am aware of:

I also want to give the opinion that if it's possible to mitigate the security issues of hash tables while still achieving reasonable performance, and without resorting to hash functions more complicated than multiply-and-shift, then it should be done.

@tommyettinger, @rurban, please see.

c-blake commented 4 years ago

Yeah. I'm aware of all the issues mentioned in those sources. What none of your references mention is that the original Knuth reason for recommending both multiply-shift (by anything, not just the discretized/approximate golden ratio) and the far more "propagated" modulo-prime methods is to try to do better than random by mapping sequences with regular spacing to (usually) distinct table locations. That attempt to be better than random of course makes the mapping "more scrutable" and more vulnerable to attack, but maybe better on many real-world key distributions.

You might enjoy reading the top of althash.nim at https://github.com/c-blake/adix. Rotating is as fast as shr but does not discard entropy (which doesn't matter if you are doing shr to get into the table range, but does if you are doing an and-mask as Nim tables do). Even better is one round of Wang Yi's approach, which really doesn't discard entropy from the top half of the product (what I call hashWY0 in althash.nim). Then two rounds with 3 salts achieve avalanche with far less CPU activity than Bob Jenkins' hash.

I think a natural progression might be: 1) a performance-sensitive user defines proc hash(x: SomeInteger): Hash = hashRoMu1(x); then 2) tables measure probe depths and 3) automatically switch to hashing that hash with hashWangYi1 if the worst-case probe depth is still bad; then 4) if the worst-case depth remains bad, maybe try Robin Hood reorganization or 5) 'seeding/keying/salting' the hash with the hash of the address of the data array or a Linux/Windows getrandom. If all that doesn't work you are almost surely under a very savvy attack, and a fallback to a B-tree is warranted, if possible (i.e. the type supports <). Resizing based on worst-case probe depth fits into all that naturally, since it then becomes the "0th quasi-mitigation". That might all sound expensive, but since we detect the problem before it gets very bad, each transformation only costs as much as a regular "growth" phase, and in the "common case" of non-hostile keys things just go as fast as ever.

Anyway, the adix tables develop some of these ideas. It's still kind of a "research project" and we should probably get more experience in isolation before foisting stuff on the Nim stdlib. I'm happy to discuss ideas over there. (A whole other direction is to use Fagin extensible hashing which can do away with the "lumpy" big resize costs and also kind of adapt within limits to just where the hash code is weak.)

c-blake commented 4 years ago

Also, I had already mentioned @rurban's repo in adix's althash.nim.

tommyettinger commented 4 years ago

Huh, someone noticed merry-ds. :smiley: @c-blake is right, the "Fibonacci hashing" approach that Merry relies upon to handle "somewhat bad" hash codes is not robust enough in the face of sophisticated attacks, since there are known issues when the hash codes are multiples of the same large Fibonacci number. Merry was meant to mitigate an issue that was trivially exploitable in libGDX's data structures, where under 50 Java Strings (which have hash codes that are easy to make collide) could exhaust the heap easily. The only reason collisions resulted in excess memory usage was that the existing collections handled repeated collisions by sticking the problem elements in a "stash" and doubling the data structure's internal capacity every other time the stash was added to. Repeated doubling, well, quickly ceases to be viable. Merry is a pretty basic open-addressing, linear-probing hash table implementation, other than the Fibonacci hashing it uses. I'd suggest looking at other implementations that are more robust if security is a major concern; Merry was meant to be fast to use in games. The JVM's strategy for its HashMap is similar to what @c-blake has already described, with a change in internal implementation based on the frequency of collisions, and it seems to mostly work well for the JVM.

As for the RoMu hash, I'm more concerned about that. Mark Overton doesn't seem to have any evidence that his RoMu generator can produce all possible 64-bit results on its largest cycle, and that's with two or three times the state of the rotate-multiply hash here and a more complex function. Chaining a rotate and a multiply is a bijection, so it's probably not going to have the same distribution issues as RoMu as you have it, but it isn't going to distribute some bits at all well (bit 0 of the input is always the same as bit 27 of the output). I've looked into using similar approaches myself, and anything doing that few operations typically isn't capable of achieving avalanche.

Also, "Wang hash" almost always refers to some frequently used integer hashes by Thomas Wang (see Bob Jenkins' analysis of his functions), while wyhash is, AFAICT, the first hash function made public by Wang Yi, and is not a unary hash like Thomas Wang's. I do like wyhash quite a bit, and I've made various tweaks on it for platforms without fast 128-bit multiplication.

peteroupc commented 4 years ago

By "hash functions more complicated than multiply-and-shift" I was thinking more of SipHash and similar cryptographic hash functions, as opposed to hash functions that do the least work necessary to "achieve avalanche", such as Wang Yi-like functions.

c-blake commented 4 years ago

Yeah. I think WY or WangYi are good ways to refer to that hash. I tried. :-) I wasn't saying anything about RoMu other than that it discards less entropy than the Fibonacci style, and that was itself an intro to how WangYi's works. To the extent that the Fibonacci number itself is basically unnecessary, it might even be fair to label that RoMu as "Fibonacci-esque", TBH. Trying to map sequences to non-colliding slots, which is on purpose worse than random but also on purpose better than random, is just neglected in treatments past Knuth. For hashing, all we care about is the very first hop from the key, while PRNGs are evaluated over many hops. I really just grabbed his two associated constants (the multiplier and the rotate amount). I've been using my own for decades, but figured he at least wrote up his approach.

Avalanche is one of these "safer for easier hostile key sets" properties and is actually detrimental if you suspect the "meta distribution" (distribution of the sets of keys you might encounter) gets you closer to perfect hashing with something Fibonacci-esque. Indeed, one example I have over at https://github.com/nim-lang/RFCs/issues/201 has Robin Hood hashing recover that perfect-hash-like behavior from what would be a perfectly colliding sequence under hashIdentity. If Fibonacci hashing is the "optimization" time forgot, its targeting better than random is the motivation time forgot. Only it's all done in Knuth TAOCP v3 sorting & searching which remains a classic. So, under-exposed/propagated/appreciated/mentioned is perhaps better than "forgot".

Anyway, it is heartening to hear multiple rare voices from outside the usual Nim community voice support for hashWangYi1 as an OK default. We'll try to iron out the test glitches soon and hopefully talk about other things, like mitigation sequences, in adix.

tommyettinger commented 4 years ago

I actually just ran some SMHasher tests, and XXH3, which was recently marked as stable unless bugs are found, is incredibly fast, averaging 18.816 bytes per cycle. Testing wyhash3 only gets 8.455 bytes per cycle, and it's quite a fast hash to begin with. I think there are newer wyhash versions. The fastest hash I've written that passes SMHasher, which is based on wyhash v1 but uses only operations that Java 8 can perform efficiently (no 128-bit multiply), gets a mere 7.057 bytes per cycle. XXH3 is quite a lot of code, adapting to AVX/SSE versions as available and choosing a path based on key length, so it is hard to compare it to the single-small-file relative terseness of wyhash. IIRC, when I tested XXH3 with processor feature detection disabled and the bare minimum used, it didn't perform much better than wyhash, and I think it was on par with my 7 bytes-per-cycle hash. It seems likely that wyhash will continue to evolve, in some shape or another, and may get faster or gain better guarantees of hash distribution.

Also, I'd be curious to see what advantages there are for Robin Hood hashing in Nim; Rust used it for a while and then discarded it in favor of swiss-tables, in part due to some serious performance problems when copying a hash-based data structure. My Merry repo initially was meant to use Robin Hood (and his band of Merry men) hashing, but I wasn't seeing competitive memory or time usage, in large part due to Java limitations. I think anything's better than Cuckoo hashing at this point; there appear to be multiple unresolved aspects of Cuckoo hashing when it is implemented in practice, like how a stash should behave, if there even is a stash. Hopscotch hashing is also interesting; I don't know much about it.

c-blake commented 4 years ago

Well, you should note from the title of this thread that it is about hashing a single integer. So, your bytes/cycle numbers are irrelevant. What matters is what SMHasher calls cycles/hash. But as you say, WangYi's does better here. (EDIT: address actual mention - not trying to be hostile).

SMHasher's distribution/entropy tests are OK, but I have always considered its speed testing less than ideally helpful, both because of its indirect function-pointer call for something I usually inline and from a reporting point of view. For table lookup with strings you really want an approximate time formula (I fit these via linear regression, but loop unrolling can make them more like staircases than lines): time in cycles (or ns) =~ cycles + cycles/byte * bytes. When I have tried xx-family hashes they have had a very large per-hash constant compared to the WY family. With something like that formula you can quickly estimate, back of the envelope, how that will play out for your expected string length. So, for example, one test of XX I got was:

mem-wy: (7.063  +- 0.016) +  (0.18825  +- 0.00012) * bytes
mem-xx: (38.42  +- 0.18) + (0.3236   +- 0.0020)  * bytes
mem-dlcl: (36.218 +- 0.017)  + (0.009825 +- 8.2e-05) * bytes

I inline these hash functions since a null function call is about 15 cycles on my machine and that adds a lot of noise (and I inline them in practice anyway). I only test out to about 128 bytes because I so rarely put strings longer than that in tables, but for L1-resident performance I am careful to take the min of several runs before even putting things into the regression.

That dlcl is the hash from "Daniel Lemire, Owen Kaser, Faster 64-bit universal hashing using carry-less multiplications, Journal of Cryptographic Engineering". So, you can see it is very fast per byte (approximately 32x faster than xx, 20x faster than WY, and over 5x faster than your xxh3 number), but the string needs to be very long before you start to realize those benefits over WY. How long is easy to back out from those time formulas: in this case, (36.218 - 7.063) / (0.18825 - 0.009825) = 163 bytes. That's long for a key, but short for a "document" (for message digesting). Maybe for hashing web URIs it would pay off, but I'm not even sure the average URI is that long. Anyway, "time is additive" is worth underlining here. A time formula is often an easier starting point than a "rate number".
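(As a quick sanity check of that break-even arithmetic, using the fitted constants from the regression above:)

# t_wy(n)   =  7.063 + 0.18825  * n    ns for an n-byte key
# t_dlcl(n) = 36.218 + 0.009825 * n
echo (36.218 - 7.063) / (0.18825 - 0.009825)   # ~163 bytes before dlcl wins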

Anyway, in summary: A) optimization of the non-per-byte start-up/finalization cost of a hash matters a lot past a certain point, and almost totally for hashing an only-8-byte integer; once cycles/byte gets below 1-ish, that constant tends to dominate for typical string lengths. B) When I time WY carefully it does better, and even SMHasher gives WY a big win in "cycles/hash" (which is our additive constant here) among the hashes that pass all statistical tests. C) When I time WY it's faster than the XX family as a string hash, but string hashes are not relevant to this particular Nim issue, and I don't really like SMHasher's speed benchmark anyway. D) Dan Lemire's is over 100 bytes/cycle on my Skylake core. E) Part of my selection was just "simplicity": WY specialized to 8 bytes has only 3 salt-y constants and is easy to describe in fewer words than this comment. You might see the top of my adix/althash.nim.

But thanks for that feedback/interest. I personally don't believe in a best-for-all-cases hash and think a selection of well-vetted choices with easy default switching is the right answer. If you have a hash of similar simplicity, ease of fallback to non-SSE/AVX/old-school 32-bit, and low per-hash cost, that would be interesting. Personally, I'm not even opposed to the more Rotate-Multiply, Fibonacci-esque one as a default, with auto-fallback to WangYi if necessary, but one simple, strong, broadly deployable hash seemed like a safer default for people who don't know how hash tables work, which is kind of the target client-code audience.

Oh, and I am aware of the (easily fixed) Rust problems with LP/Robin Hood. RH mostly really shines over vanilla LP when you have a lot of "miss" lookups as in an intersection of two sets where the answer is empty, say, or you want better no-unlucky-key determinism. Swiss Tables also seem very CPU architecture specific (maybe ok for Google's 50 million Xeons or whatever, but our deployment is trying for something more basic/general that might not even have a 64-byte cache line, say), and honestly the numbers in nanoseconds I've seen from Swiss Tables never impressed me much. I'm not sure they're actually faster, but I am sure they're very complex.

c-blake commented 4 years ago

Also, more general "hashing/hash table" discussions are probably better had over at https://github.com/c-blake/adix rather than bugging the Nim guys about side-details. Just raise an issue there or something.

rurban commented 4 years ago

Well, inlining is crucial for hash functions, so they need to be small. wyhash is not really that small.

Swiss tables and friends do make a huge difference, especially for such VMs. The only problem is that all 3 good implementations need C++ so far. In my VMs I was not yet comfortable with that requirement, but it's only a matter of time until someone ports one over to C. The improvements in C++ have been amazing; I used the best one for SMHasher, and it is even thread-safe.

c-blake commented 4 years ago

@rurban, the fallback is big, but the have-128-bit-double-product-on-CPU version is not. It's not Rotate-Multiply small, but it's less than 3x bigger. You may be thinking of the Wang Yi string hash, which is much more ornate due to a lot of manual unrolling, not my specialized 8-byte int adaptation.

Re: Swiss tables, it's all about "compared to what, along what dimension?": time, memory, etc. I get L1, int-keyed, hot-everything lookups of about 1/2 a pipeline depth with vanilla LP, soup-to-nuts whole-lookup time. On the giant side, time is dominated by the initial uncached memory load of 65-120 ns, unless you "cook the books" with hot-loop benchmarks. In the middle is a muddle of partial prediction/partial caching. Work prediction/prefetching in general creates an enormous amount of confusion among benchmarkers, even pros, and is a big conversation best had elsewhere (e.g. over at adix).

c-blake commented 4 years ago

(I do realize I was kind of talking about "both" string hashes & int hashes and so no offense, and that conversation going afield is what makes me keep saying to take the topic elsewhere from this uint64 issue.)

c-blake commented 4 years ago

Unless you know, as you well might!, of an 8 byte integer hash that passes SMHasher distro tests and is smaller/faster than roughly:

proc hiXorLo(a, b: uint64): uint64 {.inline.} =
  {.emit: "__uint128_t r = `a`; r *= `b`; `result` = (r >> 64) ^ r;".}

proc hashWangYi1*(x: int64|uint64|Hash): Hash {.inline.} =
  const P0 = 0xa0761d6478bd642f'u64
  const P1 = 0xe7037ed1a0b428db'u64
  const P5x8 = 0xeb44accab455d165'u64 xor 8'u64
  Hash(hiXorLo(hiXorLo(P0, uint64(x) xor P1), P5x8))
tommyettinger commented 4 years ago

I'd definitely do some double-checking and make sure that hashWangYi1 is a bijection; if it isn't a bijection but has 64-bit input, 64-bit output, then some outputs are not possible. Most testing suites have a hard time detecting gaps (or repeats) in 64-bit output ranges, and it may not be an issue because the 64-bit number space is so vast. I usually estimate functions like hiXorLo as being unable to produce about 1/3 of all possible results, based on testing with a 32-bit version of a similar wyhash-derived function that can be fully explored. If that's the case here, then two hiXorLo calls would not be able to produce roughly 5/9 of all possible outputs, and the remaining 4/9 of possible outputs would be unevenly distributed. This may be acceptable for some usages.

EDIT: It is definitely not a bijection. I ran a test by M.E. O'Neill meant for detecting random number generators that can't produce duplicates, which is a problem for some PRNGs, but a necessity for hash functions that should be able to return any hash. After about 45 minutes of testing, it had generated 50916406 64-bit values given unique inputs to hashWangYi1. Of those, there were 8 repeats, which is twice as many as it expected for a fairly-distributed PRNG (it gave a p-value of just over 0.95 when treating it as a PRNG), but as a unary hash it should have none.

c-blake commented 4 years ago

@tommyettinger - Solid point. The xor folding definitely makes it non-invertible. { Psychology Replication Crisis-wise, I'm obligated to request a statistically independent second trial to confirm that p=0.95 result. :-) :-) What was your input set? 1..50916406? }

A hot spot in the table only 2x overloaded isn't bad if those whole-64-bit repeats don't translate into >10s of times expected collisions in just lower bits. It's not Cuckoo hashing where slightly weak hashes wreak havoc. ;-) But, of course, they could be collide-y. We do and-mask reduction of hash outputs to table address. So, we care most about the lower bits.

Anyway, I absolutely see how invertibility would add safety if we don't have to give up the SMHasher avalanche-y/high-sensitivity-to-input property or much speed. Do you (or anyone else) know of a comparably small, fast, mixing/sensitive integer hash that is invertible? I have always thought the right answer was to provide a few alternatives with easy switching. So, if you have a better recommendation I am all ears. Thanks!

c-blake commented 4 years ago

(and I know about the Thomas Wang one already mentioned in this thread, but that seems like an awful lot of ALU operations..like 15..22 compared to the 5 of that WangYi1, and I don't know anyone has ever even run the relevant SMHasher 8-byte pattern tests on it or elsewise really vetted it. Someone should do an SMHasher/BigCrush suite specialized for integer hashes..Maybe you!)

tommyettinger commented 4 years ago

Pelle Evensen has an incredibly hard-core routine he evaluates unary hashes with, based on many runs of PractRand (for testing random number generators) with altered input patterns. Anything that passes it is probably more expensive than 5 operations (although I counted more in the wyhash-based hash, some may not run on the ALU: 3 xor, 2 shift, 2 128-bit multiply). There is a nice middle ground with Moremur, which is invertible and only requires 3 xor, 3 shift, 2 64-bit multiply, although they're all sequential. Moremur doesn't pass Evensen's full test battery, but it passes a lot more of it than similar hashes, like MurmurHash3's mixer.

uint64_t moremur(uint64_t x) {
  x ^= x >> 27;
  x *= 0x3C79AC492BA7B653UL;
  x ^= x >> 33;
  x *= 0x1C69B3F74AC4AE35UL;
  x ^= x >> 27;
  return x;
}

One that does pass the full several-day battery of tests is NASAM, which looks like

uint64_t nasam(uint64_t x) {
  // ror64(a, r) is a 64-bit right rotation of a by r bits.
  // the xor of an odd number of rotations of the same term is invertible.
  // here, one of the rotations is effectively a rotation by 0.
  x ^= ror64(x, 25) ^ ror64(x, 47);
  x *= 0x9E6C63D0676A9A99UL;
  // all xorshifts in the same direction, even with multiple xors and shifts, are invertible.
  x ^= x >> 23 ^ x >> 51;
  x *= 0x9E6D62D06F6A9A9BUL;
  x ^= x >> 23 ^ x >> 51;
  return x;
}

Unless you anticipate extremely strange input sets, Moremur probably provides more than enough speed and collision resistance for int hashing.

peteroupc commented 4 years ago

Do you (or anyone else) know of a comparably small, fast, mixing/sensitive integer hash that is invertible?

In this respect, I am aware of the Hash Prospector (by @skeeto), which does a computer search for good hash functions that follow a given sequence of reversible operations. See also the author's article "Prospecting for Hash Functions".

c-blake commented 4 years ago

@tommyettinger, hiXorLo is just 1 multiply (with a double-register output) and one xor. In C you "spell" the register access with a 64-bit shift and a mask. So, I was counting it as 1 initial xor with P0, then two products and two xors in the machine code. Just reading those other two makes them seem quite a bit more costly, like Thomas Wang's, but maybe more scrambling/uniform. (There are instruction latency/dependency questions, too.)

As per @peteroupc's concern (thanks also for those links), spending too much time to get safety can undermine overall goals and give a false sense of safety against intelligent attack. True attacks remain possible until you "mix in" information an attacker cannot access (which I try to address in adix, but that is far afield of this topic). So, 4 hashes (identity, multiply-rotate, WangYi-ish, and maybe one of the two you mention) might give people a good selection on the "fast <-> safe" axis. I will look into all of this; thanks much for the pointers.

c-blake commented 4 years ago

Also, @tommyettinger, that "only 4/9 of outputs" effect you mention may be "mostly canceled" by the two rounds of hiXorLo, giving 8/9 with avalanche and low bias, no? If SMHasher is not detecting the bias you suspect, that is probably an issue to report. Also, what was your sequence giving p just under .05? (The p-value is conventionally P(by chance|null), not 1 minus that.)

c-blake commented 4 years ago

For example/food for thought, this program:

# Just writes 8 byte binary hashes of 0..<n to stdout for input to PractRand.
import althash, bitops, cligen, cligen/osUt
type HashFun = enum WangYi, MoreMur
proc writeHash(n=99, r=0, fun=WangYi) =
  var h: Hash
  for i in 0 ..< n:
    let x = rotateLeftBits(uint64(i), r)
    case fun
    of WangYi:  h = hashWangYi1(x)
    of MoreMur: h = hashMoreMur(x)
    discard stdout.uriteBuffer(cast[cstring](h.addr), Hash.sizeof.Natural)
dispatch(writeHash, help = { "n": "samples", "r": "rotation",
                             "fun": "hash function"})

and this command (with PractRand-pre0.95 from sourceforge):

./writeHash -n $((1<<32)) | PractRand stdin64 -tf 2

produce "no anomalies" every time for the first 4 Gi numbers. A good initial indicator, anyway; just something I ran in 7 minutes. I realize it is not as thorough as Pelle Evensen's tests, which do like 1000x more than that, and then with a bunch of input-pattern transforms on top. I am working on reproducing those and on tracking down Melissa O'Neill's code (Harvey Mudd's entire CS website was down for me).

c-blake commented 4 years ago

So, using that same program and doing

for r in {0..63}; { echo ROTATION $r; ./writeHash -r$r -n$((1<<26)) -fWangYi | PractRand stdin64 -tf 2 }

does detect 7 "anomalies" but no actual "failures" (not even "mildly suspicious") at ROTATION's 5, 6, 19, 20, 34, 58, and 60:

  Test Name                         Raw       Processed     Evaluation
  [Low8/64]BCFN(2+0,13-3,T)         R= +10.0  p =  1.6e-4   unusual
  ...and 562 test result(s) without anomalies
--
  Test Name                         Raw       Processed     Evaluation
  [Low8/64]FPF-14+6/16:all          R=  +5.9  p =  4.7e-5   unusual
  ...and 562 test result(s) without anomalies
--
  Test Name                         Raw       Processed     Evaluation
  [Low4/16]BCFN(2+1,13-3,T)         R=  -7.7  p =1-2.6e-4   unusual
  ...and 520 test result(s) without anomalies
--
  Test Name                         Raw       Processed     Evaluation
  [Low1/32]mod3n(5):(4,9-6)         R= +13.8  p =  1.4e-5   unusual
  ...and 520 test result(s) without anomalies
--
  Test Name                         Raw       Processed     Evaluation
  [Low1/8]Gap-16:A                  R=  -5.0  p =1-1.1e-4   unusual
  ...and 562 test result(s) without anomalies
--
  Test Name                         Raw       Processed     Evaluation
  [Low8/32]BCFN(2+1,13-3,T)         R=  +9.8  p =  2.0e-4   unusual
  ...and 562 test result(s) without anomalies
--
  Test Name                         Raw       Processed     Evaluation
  [Low4/16]DC6-9x1Bytes-1           R=  -5.0  p =1-1.5e-3   unusual
  ...and 562 test result(s) without anomalies

Meanwhile your moremur (re-written in Nim):

proc hashMoreMur*(x: int64|uint64|Hash): Hash {.inline.} =
  ## This is from https://github.com/tommyettinger
  var x = uint64(x)
  x = x xor (x shr 27)
  x = x * 0x3C79AC492BA7B653'u64
  x = x xor (x shr 33)
  x = x * 0x1C69B3F74AC4AE35'u64
  x = x xor (x shr 27)
  result = Hash(x)

and this:

for r in {0..63}; { echo ROTATION $r; ./writeHash -r$r -n$((1<<26)) -fMore | PractRand stdin64 -tf 2 }

produces 247 warnings and errors..too many to list outright, but a summary is:

37 unusual
10 mildly_suspicious
14 suspicious
18 very suspicious
22 VERY SUSPICIOUS
146 FAIL

I would have to say that WangYi seems faster, smaller, and stronger than MoreMur from this preliminary evaluation (unless I screwed up translating to Nim; I get hashMoreMur(123456) = -5874516080525399851).

c-blake commented 4 years ago

Actually, that uint64_t nasam(uint64_t x) you posted gets even more "unusual" flags than my hashWangYi1 above: 12 flags vs the 7 of WangYi. So, that means A) Pelle Evensen's tests aren't excluding only "unusual" flags, and B) hashWangYi1 might well pass his whole suite, going against your expectations. I realize I'm only doing 2^26 * 8 bytes = 1/2 a GiB of tests and he's doing a terabyte or whatever, but I don't really feel like waiting 7 days, and 26/64 bits isn't so different from 40/64. They're both partial explorations. I didn't try bit reversal, though. Maybe I should just send Pelle the code and have him try it. :-)

c-blake commented 4 years ago

Also, I'm not sure if any of these PractRand tests (and possibly Melissa's) are Bonferroni corrected. I did grep -i for multiple.c and bonferroni and found nothing. With so many tests you would definitely want to correct for the "omnibus pass/fail conclusion". A single test (or 5% of them) failing at a significance threshold of 5% is to be expected even under the null.

peteroupc commented 4 years ago

With "Melissa O'Neill's tests", I think you may be referring to the Birthday Problem test at https://github.com/imneme/birthday-problem-test (see also: https://www.pcg-random.org/posts/birthday-test.html ; @imneme).

c-blake commented 4 years ago

I'm referring to whatever @tommyettinger was referring to. :-) I suspect it was a false positive for badness. It sounded borderline at worst, and the multiple comparisons point definitely matters. But thanks for those links, Peter. I'll take a look later! Cheers

c-blake commented 4 years ago

I realize I made a performance comment I should maybe have backed up. So, here is a little table I have for 1e6 numbers on 3 computers. Time is min(5 runs) in ns/hash. Compiled with Nim ...d0942d42a1; gcc-10.0.1 with PGO. Source code is in adix/tests/writeHash.nim.

CPU/Hash  WangYi  MoreMur  NASAM
i7-6700k  1.2843  1.4986   1.6379  (4.70 GHz)
amd2950x  1.7293  1.9127   2.2522  (3.78 GHz)
i5-540m   2.9324  3.1191  11.9170  (2.53 GHz)

(That time actually also counts the bit rotation that is part of the test for output randomness and whatever loop overheads if you look at writeHash.nim.) This is only a super-hot-everything benchmark, but it tracks with how the functions "look" to me. I don't see why any of the three would have a particular advantage for various warm-ups.

brentp commented 4 years ago

Is there a reason you're not comparing splitmix? That is quite fast in my tests (which include the hashtable insert+lookup). I'm curious how it performs in yours.

c-blake commented 4 years ago

Nope. No good reason. :-) { Well, other than that there are a lot of hashes out there and I had no ambition of ever doing an exhaustive comparison..just finding one quick, well mixing one that passed a lot of statistical tests. }