zakarumych opened 1 week ago
I like this idea because the current definition of the tables kind of forbids this kind of invariant, and thus doesn't let you implement them yourself. Will try and comment on the code specifically when I have the time.
There are several multi-maps on crates.io, the most downloaded being https://crates.io/crates/multimap. Maybe it would be better to contribute there? They are currently wrapping std with HashMap<K, Vec<V>>, but it would be worth comparing to a HashTable<(K, V)> with duplicates as you've done. There are tradeoffs in both directions.
I think that allowing a hash table to contain multiple keys directly would be nice since it helps a lot with cache locality, and generally, multimaps only contain a couple of duplicated keys. That said, you're right that the main benchmark should probably be HashTable<(K, V)>, since even among multimaps, you rarely need multiple identical key-value pairs.
Yeap, that was the idea. Having a whole Vec<V> when you may only have like 1 or 2 of them seems wasteful.
You do lose locality of Vs for the same K though. It's a tradeoff.
But I think it's good to have options to choose from.
One concern I have with multi-map implementations where multiple identical entries are inserted directly into the map is that you're intentionally introducing hash collisions, which open-addressing hash tables may not handle very well. This is not a concern for multimap, which is based on a HashMap<K, Vec<V>>, or for hash tables based on chaining.
Is it really a collision when identical keys produce the same hash?
This implementation forces continued probing on insert if the key's slot is already occupied and no empty slots in the same group are found. When you insert many entries with the same key, they won't be well distributed: they'll all fall into the same group and then spill into the next one.
I think we can at least benchmark it to see, but I suppose this would work better than HashMap<K, Vec<V>> for collections without too many values per key. Probably until the values for the same key no longer fit into a group.
Essentially the time complexity for hash table operations ceases to be amortized O(1) and becomes O(k) where k is the number of duplicates of a key.
Which operations?
Insert: kinda. Finding the first match: still O(1). Finding the next match: same. Removing an entry is the same as finding it.
Yeap, that was the idea. Having a whole Vec<V> when you may only have like 1 or 2 of them seems wasteful.
You could use SmallVec to avoid the heap for infrequent keys.
You do lose locality of Vs for the same K though. It's a tradeoff.
Another option compared to HashTable<(K, V)> is to bundle those in HashTable<Vec<(K, V)>> (or SmallVec). Then lookups can compare on just the first key, but you can still store duplicates that may have differences outside of Hash+Eq.
But I think it's good to have options to choose from.
Given that there are multiple reasonable options, I think it makes more sense to publish it in a separate crate, rather than "blessing" one choice in hashbrown directly.
So, I should clarify my specific preferences here: I think that it would be useful to allow tables with duplicate keys in hashbrown. The reasoning for this is that, while there are plenty of arguments that this is not ideal for maps, there are some situations where storing the entire map in one buffer is preferable despite the issues. Right now, there is technically no supported way of even using hashbrown to create a table with duplicate keys, meaning that having a Vec or SmallVec to store multiple values is the only supported approach.
I think that it's fair to say that offering an entire dedicated Map or Set API for this is undesired since this is a nonstandard use case, but I think that at least offering some way of doing this via HashTable, or as a separate HashMultiTable type, would be desirable. That way, even if the actual implementations have to be provided by another crate, they can still reuse the hash table implementation.
I think that it would be useful to allow tables with duplicate keys in hashbrown. [...] Right now, there is technically no supported way of even using hashbrown to create a table with duplicate keys,
What's stopping that? You can already use HashTable::insert_unique for duplicates, since nothing actually requires it to be unique, and then iter_hash to look up multiple matches. I guess removing multiples would be inefficient, but we could add something like extract_hash_if.
What's stopping that?
Documentation saying bad things would happen if insert_unique is used to insert a duplicate.
Yeah, the docs currently don't say what happens when duplicates are inserted, and it would be nice to find a way to allow that in a reasonable way and maybe even test for it. I wouldn't feel comfortable saying that duplicates are allowed until both the docs are updated and tests are added to verify it works as intended.
The only relevant doc I found is at the top of HashTable:
Due to its low-level nature, this type provides fewer guarantees than HashMap and HashSet. Specifically, the API allows you to shoot yourself in the foot by having multiple elements with identical keys in the table. The table itself will still function correctly and lookups will arbitrarily return one of the matching elements. However you should avoid doing this because it changes the runtime of hash table operations from O(1) to O(k) where k is the number of duplicate entries.
Maybe we should soften the bit about "shoot yourself in the foot" since it "will still function correctly."
Right, it was HashMap::insert_unique_unchecked that promised all the bad things if used with duplicate keys, so HashTable::insert_unique is the way to go.
Maybe we should soften the bit about "shoot yourself in the foot" since it "will still function correctly."
The footgun here specifically refers to the time complexity of operations changing from O(1) to O(k) if you have k duplicates. Perhaps this comment could be clarified.
Multiple keys are otherwise perfectly fine with HashTable; it doesn't have any concept of a "key" itself, just a hash. The existence of iter_hash explicitly implies that multiple items with the same hash value are supported.
Which operations?
Insert: kinda. Finding the first match: still O(1). Finding the next match: same. Removing an entry is the same as finding it.
Actually, these are still O(k) because duplicate keys will displace other elements from their preferred position in the hash table. Consider an extreme example of a map filled with 1M keys "A" and then 1M keys "B" (in that order): you would expect a "B" to be displaced by 0.5-1M slots from its preferred spot, requiring that many probes before being found.
Adds HashMultiMap that allows storing multiple values with the same key. This is a draft to get some feedback on the idea.

Notable API differences from HashMap are the following:

- fn insert always adds a new value
- get* methods are replaced with methods that return an iterator over values with keys equal to the queried one

TODO: