rust-lang / hashbrown

Rust port of Google's SwissTable hash map
https://rust-lang.github.io/hashbrown
Apache License 2.0
2.46k stars 288 forks

[Draft] Multi-map implementation #591

Open zakarumych opened 1 week ago

zakarumych commented 1 week ago

Adds HashMultiMap that allows storing multiple values with the same key.

This is a draft to get some feedback on the idea.

Notable API differences from HashMap are the following:

TODO:

clarfonthey commented 1 week ago

I like this idea because the current definition of the tables kind of forbids this kind of invariant, and thus doesn't let you implement them yourself. Will try and comment on the code specifically when I have the time.

cuviper commented 1 week ago

There are several multi-maps on crates.io, the most downloaded being https://crates.io/crates/multimap. Maybe it would be better to contribute there? They are currently wrapping std with HashMap<K, Vec<V>>, but it would be worth comparing to a HashTable<(K, V)> with duplicates as you've done. There are tradeoffs in both directions.
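For reference, the HashMap<K, Vec<V>> approach amounts to a thin wrapper along these lines (a minimal sketch using std's HashMap, not the multimap crate's actual code):

```rust
use std::collections::HashMap;
use std::hash::Hash;

// Minimal multimap in the HashMap<K, Vec<V>> style (illustrative only;
// the multimap crate's real API is richer).
struct MultiMap<K, V> {
    inner: HashMap<K, Vec<V>>,
}

impl<K: Eq + Hash, V> MultiMap<K, V> {
    fn new() -> Self {
        MultiMap { inner: HashMap::new() }
    }

    // Every insert pushes onto the Vec for that key, so there is one
    // heap allocation per distinct key even if it only holds one value.
    fn insert(&mut self, key: K, value: V) {
        self.inner.entry(key).or_default().push(value);
    }

    fn get_vec(&self, key: &K) -> Option<&Vec<V>> {
        self.inner.get(key)
    }
}

fn main() {
    let mut m = MultiMap::new();
    m.insert("a", 1);
    m.insert("a", 2);
    m.insert("b", 3);
    assert_eq!(m.get_vec(&"a"), Some(&vec![1, 2]));
    assert_eq!(m.get_vec(&"b").map(|v| v.len()), Some(1));
    println!("ok");
}
```

This keeps all values for a key contiguous, at the cost of a Vec allocation per key, which is one side of the tradeoff being discussed.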

clarfonthey commented 1 week ago

I think that allowing a hash table to contain multiple keys directly would be nice since it helps a lot with cache locality, and generally, multimaps only contain a couple of duplicated keys. That said, you're right that the main benchmark should probably be HashTable<(K, V)>, since even among multimaps, you rarely need multiple identical key-value pairs.

zakarumych commented 1 week ago

Yeap, that was the idea. Having a whole Vec<V> when you may only have like 1 or 2 of them seems wasteful. You do lose locality of Vs for the same K though. It's a tradeoff. But I think it's good to have options to choose from.

Amanieu commented 1 week ago

One concern I have with multi-map implementations where multiple identical entries are inserted directly into the map is that you're intentionally introducing hash collisions, which open-addressing hash tables may not handle very well. This is not a concern for multimap which is based on a HashMap<K, Vec<V>> or hash tables based on chaining.

zakarumych commented 1 week ago

Is it a collision when the same key has the same hash?

This implementation forces continued probing on insert if a slot is already occupied by the same key and no empty slots are found in the same group. When you insert many entries with the same key, they won't distribute well: they'll all fall into the same group and then spill into the next one.

I think we can at least benchmark it to see, but I suppose this would work better than HashMap<K, Vec<V>> for collections with not too many values per key, probably up to the point where the values for one key no longer fit in a single group.

Amanieu commented 1 week ago

Essentially the time complexity for hash table operations ceases to be amortized O(1) and becomes O(k) where k is the number of duplicates of a key.

zakarumych commented 1 week ago

Which operations?

Insert - kinda. Finding the first match - still O(1). Finding the next match - same. Removing an entry is the same as finding it.

cuviper commented 1 week ago

Yeap, that was the idea. Having a whole Vec<V> when you may only have like 1 or 2 of them seems wasteful.

You could use SmallVec to avoid the heap for infrequent keys.
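One way to picture the inline-first idea (a hand-rolled sketch in the spirit of SmallVec, not the smallvec crate itself):

```rust
// Hand-rolled inline-first storage in the spirit of SmallVec (sketch,
// not the real smallvec crate): the common one-value-per-key case
// avoids a heap allocation entirely.
enum Values<V> {
    One(V),
    Many(Vec<V>),
}

impl<V> Values<V> {
    fn push(self, value: V) -> Self {
        match self {
            // Promoting to Many is the only point that allocates.
            Values::One(first) => Values::Many(vec![first, value]),
            Values::Many(mut vs) => {
                vs.push(value);
                Values::Many(vs)
            }
        }
    }

    fn len(&self) -> usize {
        match self {
            Values::One(_) => 1,
            Values::Many(vs) => vs.len(),
        }
    }
}

fn main() {
    let vs = Values::One(1).push(2).push(3);
    assert_eq!(vs.len(), 3);
    println!("ok");
}
```

SmallVec generalizes this to N inline elements behind a Vec-like API, so keys with few values never touch the heap.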

You do lose locality of Vs for the same K though. It's a tradeoff.

Another option compared to HashTable<(K, V)> is to bundle those in HashTable<Vec<(K, V)>> (or SmallVec). Then lookups can compare on just the first key, but you can still store duplicates that may have differences outside of Hash+Eq.

But I think it's good to have options to choose from.

Given that there are multiple reasonable options, I think it makes more sense to publish it in a separate crate, rather than "blessing" one choice in hashbrown directly.

clarfonthey commented 6 days ago

So, I should clarify my specific preferences here: I think that it would be useful to allow tables with duplicate keys in hashbrown. The reasoning for this is, while there are plenty of arguments that this is not ideal for maps, there are some situations where storing the entire map in one buffer is preferable despite the issues. Right now, there is technically no supported way of even using hashbrown to create a table with duplicate keys, meaning that having a Vec or SmallVec to store multiple values is the only supported way.

I think that it's fair to say that offering an entire dedicated Map or Set API for this is undesired since this is a nonstandard use case, but I think that at least offering some way of doing this via HashTable, or as a separate HashMultiTable type, would be desirable. That way, even if the actual implementations have to be provided by another crate, they can still reuse the hash table implementation.

cuviper commented 6 days ago

I think that it would be useful to allow tables with duplicate keys in hashbrown. [...] Right now, there is technically no supported way of even using hashbrown to create a table with duplicate keys,

What's stopping that? You can already use HashTable::insert_unique for duplicates, since nothing actually requires it to be unique, and then iter_hash to lookup multiple matches. I guess removing multiples would be inefficient, but we could add something like extract_hash_if.

zakarumych commented 6 days ago

What's stopping that?

Documentation saying bad things would happen if insert_unique is used to insert a duplicate.

clarfonthey commented 6 days ago

Yeah, the docs currently don't say what happens when duplicates are inserted, and it would be nice to find a way to allow that in a reasonable way and maybe even test for it. I wouldn't feel comfortable saying that duplicates are allowed until both the docs are updated and tests are added to verify it works as intended.

cuviper commented 6 days ago

The only relevant doc I found is at the top of HashTable:

Due to its low-level nature, this type provides fewer guarantees than HashMap and HashSet. Specifically, the API allows you to shoot yourself in the foot by having multiple elements with identical keys in the table. The table itself will still function correctly and lookups will arbitrarily return one of the matching elements. However you should avoid doing this because it changes the runtime of hash table operations from O(1) to O(k) where k is the number of duplicate entries.

Maybe we should soften the bit about "shoot yourself in the foot" since it "will still function correctly."

zakarumych commented 6 days ago

Right, it was HashMap::insert_unique_unchecked that promised all the bad things if used with duplicate keys, so HashTable::insert_unique is the way to go.

Amanieu commented 6 days ago

Maybe we should soften the bit about "shoot yourself in the foot" since it "will still function correctly."

The footgun here specifically refers to the time complexity of operations changing from O(1) to O(k) if you have k duplicates. Perhaps this comment could be clarified.

Multiple keys are otherwise perfectly fine with HashTable; it doesn't have any concept of a "key" itself, just a hash. The existence of iter_hash explicitly implies that multiple items with the same hash value are supported.

Which operations?

Insert - kinda. Finding the first match - still O(1). Finding the next match - same. Removing an entry is the same as finding it.

Actually these are still O(k) because duplicate keys will displace other elements out of their preferred position in the hash table. Consider an extreme example of a map filled with 1M entries with key "A" followed by 1M entries with key "B": you would expect a "B" to be displaced by 0.5-1M slots from its preferred spot, requiring that many probes before it is found.