HashSet[uint64] slow insertion depending on values

cwpearson commented 5 years ago

The rate of inserting uint64s into a hash set varies wildly with the order and the range of integers inserted

Example

# slow_set.nim
import sets
import times

when isMainModule:
    var hs1: HashSet[uint64]
    var hs2: HashSet[uint64]
    var hs3: HashSet[uint64]

    # insert 0..200k
    var time = cpuTime()
    for i in 0..200_000:
        let k1 = uint64(i)
        hs1.incl(k1)
    echo "time ", (cpuTime() - time)

    # interleave insert 0..100k and 100k..200k
    time = cpuTime()
    for i in 0..100_000:
        let k1 = uint64(i)
        let k2 = uint64(i + 100_000)
        hs2.incl(k1)
        hs2.incl(k2)
    echo "time ", (cpuTime() - time)

    # interleave insert 0..100k and 1.0M..1.1M
    time = cpuTime()
    for i in 0..100_000:
        let k1 = uint64(i)
        let k2 = uint64(i + 1_000_000)
        hs3.incl(k1)
        hs3.incl(k2)
    echo "time ", (cpuTime() - time)

Current Output

Compiled with nim c -d:release -r slow_set.nim.

time 0.016778
time 1.01831
time 14.043752

Expected Output

I would expect the three different insertion loops to take roughly the same amount of time. They are all inserting the same amount of unique values. In the second loop, I just interleave the insertion order, and in the third loop, I insert some larger numbers.

Possible Solution

Additional Information

This issues https://github.com/nim-lang/Nim/issues/10097 talks about integer hashing and collisions, but all of my uint64s are well below the max uint32.

This also happens with hashTable keys, presumably for similar reasons?

In my actual project I am deduplicating an edge list for a graph where the source and destination vertex for a node are sort of similar to loop 3, so the performance is not good.

$ nim -v
Nim Compiler Version 0.20.2 [MacOSX: amd64]
Compiled at 2019-07-17
Copyright (c) 2006-2019 by Andreas Rumpf

git hash: 88a0edba4b1a3d535b54336fd589746add54e937
active boot switches: -d:release

c-blake commented 4 years ago

Speed-wise, splitmix64 from (http://xorshift.di.unimi.it/splitmix64.c) is slower than NASAM on that Skylake/i7-6700k (1.7123 ns = 8 cycles). Not sure I see the point in running stat tests on it since it's the slowest of the bunch. OTOH, you might find some modest 2 cycle gain (in some amortized sense) by moving to hashWangYi1. (EDIT: I got 12 "unusual" PractRand reports for that splitmix64. So, the same as NASAM and more than WangYi1, but none of them are even at the "mildly suspicious" level, except Moremur which breaks all over the place.)

I say "amortized" because, of course, a hot loop like that is really overlaying as much parallel execution as the CPU can discover/use which could be quite a lot. It may be that, e.g., it takes 12 cycles to do one Wang Yi hash but on average there are enough CPU execution units to be doing 2 at a time. I am very aware that almost all CPU/memory system engineering since the early 1990s has made interpreting microbenchmarks harder. So, I try to use careful language. The best benchmark is always your actual application.

tommyettinger commented 4 years ago

Moremur is meant to be an improved set of constants for the same algo as SplitMix64, it just doesn't include the state increment by 0x9e3779b97f4a7c15 that Vigna's code uses. SplitMix64 has known PractRand failures when that state increment (AKA gamma) is too small (1 and 3 are known problem gammas, as is 0xAAAAAAAAAAAAAAAB), and the actual set of problematic gammas, and to what degree, is unknown. Moremur doesn't fully avoid the problems in splitmix64, and it still absolutely has problem gammas, but it should do better with consecutive states than splitmix64. If you test SplitMix64 with a gamma of 1 instead of 0x9e3779b97f4a7c15 , I'd expect many more than 12 anomalies, and more severe. You can look at the analysis Pelle Evensen already did on SplitMix64 and its relative from MurmurHash3, too: http://mostlymangling.blogspot.com/2018/07/on-mixing-functions-in-fast-splittable.html

The set of numbers I used for the birthday problem test doesn't really matter in a test for the presence of repeated outputs, as long as the inputs are distinct. It does matter that it found any repeats, because it means there are fewer outputs than inputs. I'm trying to find the non-cryptographic hash construction guidelines I've been going by, and I think this was it, from Bob Jenkins: https://burtleburtle.net/bob/hash/evahash.html#Funneling (I'm considering hashWangYi1 a mixing function). I'll have to check on a different computer to see what initial seed the birthday problem test used (it was 32-bit), but it incremented that seed repeatedly by 0x6C8E9CF570932BD5, which makes a reasonable Weyl sequence (needs to generate 2 to the 64 states before it cycles or repeats an input; 0x9e3779b97f4a7c15 is used as the state transition in Vigna's splitmix64.c file), and gave that state to hashWangYi1. Repeatedly giving the output as the input to the next call would be a bad idea; I didn't do that. I can run the test on consecutive u64 as well.

c-blake commented 4 years ago

I cannot generate a single repeat using that same adix/test/writeHash.nim piped to a new adix/tests/repeats.nim. I'm just using that same 2**26 length input sequence as before with 64 rotations, but that is 4 GiNumbers. At a minimum, if this repeating effect exists at all, it must be tiny or require very specially constructed sequences which probably makes it a non-issue since any hash has some attack pattern.

Also, this follow-up makes your initial report:

generated 50916406 64-bit values given unique inputs to hashWangYi1. Of those, there were 8 repeats, which is twice as many as it expected

hard to understand - 8 repeats being 2x expected, but now you are saying we want zero repeats which is not at all 8/2=4. So, I feel like that test must be "repeats when range-reduced/masked" which is, of course, a different story than the full 64-bit range and definitely requires "repeated trials". And definitely be careful about Melissa's confusing (or enlightend?!?!) tendency to write p-values as their complements (p-value here, 1 - p-value there, etc.).

c-blake commented 4 years ago

Also, for a non-repeating sequence of inputs mod 2**anyPower, can't we simply use an odd number increment mod 2**thatPower? So, for 2**64 an set of large number|1 increments? I guess what I am thinking is to try many sequences with various odd offsets (perhaps exponentially expanding in size). This is more direct than trying to assess just via worse than Poisson collision rates. I'm also unsure this hash fits the model Bob has for his funneling theorem. It may escape his invertibility condition.

c-blake commented 4 years ago

Ok. I added a couple lines to adix/tests/writeHash.nim to write in hex format instead of binary and reproduced the greater than random collisions in the full 64-bit address space with (in the interests of reproducibility for anyone who wants):

$ writeHash -H -fW -n$[1<<32] -s7822362180758744021|sort -T. -S$[40<<30]|uniq -d
23367F3A36E60564
3AF61091CD4B167A
$ writeHash -H -fW -n$[1<<32] -s5432154321987654321|sort -T. -S$[40<<30]|uniq -d
41D5793C53CDAD25
5A5ACC1FC095E739
9897D8E21FD55055
CCCBC84FA87326E1
FDE47C0D9C053AE2
$ writeHash -H -fW -n$[1<<32] -s4321543219876543215|sort -T. -S$[40<<30]|uniq -d
<EMPTY>
$

(Related to the above - if you have an 8B-int binary external merge sort, it will use 8/17 the RAM/disk space/bandwidth, but coreutils sort does not do this, AFAIK, and you want "." in the -T. to be /dev/shm or at least an NVMe device)

So, on average for 2**32 entries (2+5+0)/3= 2.33 slots with a collision when I believe one expects k**2/(2N) = 1/2 for the full 64-bit range. So, about 4x too many in this very limited measurement (which took 3 hours, but I think is about a 4 sigma result to @tommyettinger 's 2sigma result). So, this slight elevation of 2-5x is likely real. Maybe that 4/9 gets squared (from nested application) and it's 81./16 =~ 5x.

Anyhow, I don't think it's disqualifying unless it's the tip of a very nasty iceberg which seems unlikely given how it passes SMHasher and PractRand (and how many hours it takes to even measure the defect reliably). If we were targeting Cuckoo hashing with 32-bit hash codes then it might be catastrophic, but it should be a fine default for linear probing/robin hood LP which are robust to collisions in hash codes. It still seems better in a vague/omnibus sense than Tommy's MoreMur recommendation and I'm very sympathetic (before he ever mentioned it) to @peteroupc 's concern about unnecessary randomness optimization. If there weren't 64-bit CPUs where NASAM was 4x slower then that could be the default, but as-is, it makes the most sense (to me) to provide NASAM as a fallback and a few less-random-on-purpose "fall forwards" like identity & RoMu1. (As mentioned, I'm not personally against RoMu1 or identity as defaults if there is an auto-fallback mechanism, as I am working on in adix, or at least a probe depth too large warning mechanism, and I hope to someday get such smarts incorporated into the Nim stdlib.)

Thanks for everyone's participation/help on this, btw, especially @tommyettinger (who I hope to have saved a little time/work by following up like this). Open source is too often thankless.

peteroupc commented 4 years ago

My concern was less about "over-optimization" of hash functions and more about using alternative approaches than the all-too-fashionable SipHash and other keyed hash functions to mitigate security attacks, as long as performance is maintained by using one or more of these approaches. In this sense, I am favorable of the OpenJDK strategy of switching a hash bucket to a binary tree (which has logarithmic time complexity) when it detects a high number of collisions in that bucket.

c-blake commented 4 years ago

Well, we agree 100% and adix is all about such auto/staged mitigations. I should stop trying to characterize the opinions of others. Sorry. It's often hard to voice support without some pithy reference. :-)

Clyybber commented 4 years ago

Since #13823 is merged, this can be closed.

nim-lang / Nim