Closed fulmicoton closed 7 years ago
It might be more important than I initially thought. See #135. Performance degrades extremely fast when indexing a single field with a very high cardinality.
I'm not sure RobinHood is a good fit when you don't store the hashes (you need the hash of other items on insert/delete and optionally on lookup).
Good point. What do you think would be the best hashmap impl here?
It should be properly measured, but it seems to me collisions are extremely expensive when indexing the movielens dataset. One strategy we could use to mitigate this cost would be to add some info within the table (1 or 2 extra bytes of hash, for instance) so that handling most collisions does not require jumping around in memory to compare the actual strings.
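To make the idea concrete, here is a minimal sketch of that strategy, not tantivy's actual code: each bucket keeps one spare byte of the hash inline, so most colliding probes are rejected by a cheap byte comparison instead of dereferencing and comparing the full key bytes. All names (`TagTable`, `with_capacity_pow2`) are hypothetical, and keys are stored as owned `Vec<u8>` here for simplicity where a real implementation would store an arena offset.

```rust
/// djb2 hash, as referenced in this issue.
fn djb2(bytes: &[u8]) -> u64 {
    let mut h: u64 = 5381;
    for &b in bytes {
        h = h.wrapping_mul(33).wrapping_add(b as u64);
    }
    h
}

/// Hypothetical linear-probing table with a 1-byte hash "tag" per bucket.
struct TagTable {
    // (tag byte, key bytes); None = empty bucket.
    buckets: Vec<Option<(u8, Vec<u8>)>>,
    mask: usize,
}

impl TagTable {
    fn with_capacity_pow2(cap: usize) -> Self {
        assert!(cap.is_power_of_two());
        TagTable { buckets: vec![None; cap], mask: cap - 1 }
    }

    fn insert(&mut self, key: &[u8]) {
        let h = djb2(key);
        let tag = (h >> 56) as u8; // one spare byte of the hash
        let mut idx = (h as usize) & self.mask;
        loop {
            if self.buckets[idx].is_none() {
                self.buckets[idx] = Some((tag, key.to_vec()));
                return;
            }
            let (t, k) = self.buckets[idx].as_ref().unwrap();
            if *t == tag && k.as_slice() == key {
                return; // already present
            }
            idx = (idx + 1) & self.mask; // linear probing
        }
    }

    fn contains(&self, key: &[u8]) -> bool {
        let h = djb2(key);
        let tag = (h >> 56) as u8;
        let mut idx = (h as usize) & self.mask;
        loop {
            match &self.buckets[idx] {
                None => return false,
                Some((t, k)) => {
                    // Cheap inline tag comparison: we only touch the key
                    // bytes when the tag byte already matches.
                    if *t == tag && k.as_slice() == key {
                        return true;
                    }
                    idx = (idx + 1) & self.mask;
                }
            }
        }
    }
}
```

This is the same trick SwissTable-style maps use with their control bytes: the tag filters out almost all false candidates before any pointer chase.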
We also might want to express the concept of saturation as a collision rate.
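One way to express that, purely as an illustration: instead of resizing at a fixed load factor, track the average number of extra probes per lookup and declare the table saturated past a threshold. The names and the threshold below are made up for the sketch.

```rust
/// Hypothetical saturation measure: "collision rate" as the average number
/// of probes beyond the first bucket per lookup.
struct SaturationMeter {
    lookups: u64,
    extra_probes: u64,
}

impl SaturationMeter {
    fn new() -> Self {
        SaturationMeter { lookups: 0, extra_probes: 0 }
    }

    /// Record one lookup that took `probes` bucket inspections.
    fn record(&mut self, probes: u64) {
        self.lookups += 1;
        self.extra_probes += probes.saturating_sub(1);
    }

    /// Average number of extra probes per lookup.
    fn collision_rate(&self) -> f64 {
        if self.lookups == 0 {
            0.0
        } else {
            self.extra_probes as f64 / self.lookups as f64
        }
    }

    /// Arbitrary illustrative threshold.
    fn is_saturated(&self) -> bool {
        self.collision_rate() > 0.5
    }
}
```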
taking over this issue
Hi @fulmicoton, do you think this is still an issue? I may take a stab at it.
Actually I think we are ok on that one now.
As tantivy's `SegmentWriter` uses a memory arena, tantivy does not use the Rust standard library `HashMap`. The current implementation uses very simplistic linear probing with a djb2 hash. If possible, improve the quality of the code there (without hurting performance), and replace linear probing with something more memory-efficient (Robin Hood hashing?).
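For reference, a minimal sketch of Robin Hood hashing (not tantivy's code, all names hypothetical). On insert, each occupant's probe distance is recomputed from its stored hash, and the incoming entry displaces occupants that are closer to their home slot; this is why, as noted above, the hash has to be stored in the bucket.

```rust
#[derive(Clone)]
struct Entry {
    hash: u64,
    key: Vec<u8>,
}

/// Hypothetical Robin Hood table; power-of-two capacity, no resizing.
struct RobinHood {
    buckets: Vec<Option<Entry>>,
    mask: usize,
}

/// How far `idx` is from the home slot implied by `hash`.
fn probe_dist(hash: u64, idx: usize, mask: usize) -> usize {
    idx.wrapping_sub(hash as usize) & mask
}

impl RobinHood {
    fn new(cap: usize) -> Self {
        assert!(cap.is_power_of_two());
        RobinHood { buckets: vec![None; cap], mask: cap - 1 }
    }

    fn insert(&mut self, hash: u64, key: &[u8]) {
        let mut entry = Entry { hash, key: key.to_vec() };
        let mut idx = (hash as usize) & self.mask;
        let mut dist = 0;
        loop {
            if self.buckets[idx].is_none() {
                self.buckets[idx] = Some(entry);
                return;
            }
            let occ = self.buckets[idx].as_mut().unwrap();
            if occ.hash == entry.hash && occ.key == entry.key {
                return; // already present
            }
            // "Steal from the rich": if the occupant has probed less than we
            // have, swap it out and keep probing for the displaced entry.
            let occ_dist = probe_dist(occ.hash, idx, self.mask);
            if occ_dist < dist {
                std::mem::swap(occ, &mut entry);
                dist = occ_dist;
            }
            idx = (idx + 1) & self.mask;
            dist += 1;
        }
    }

    fn contains(&self, hash: u64, key: &[u8]) -> bool {
        let mut idx = (hash as usize) & self.mask;
        let mut dist = 0;
        loop {
            match &self.buckets[idx] {
                None => return false,
                Some(occ) => {
                    if occ.hash == hash && occ.key == key {
                        return true;
                    }
                    // Robin Hood invariant: once an occupant sits closer to
                    // its home than our probe distance, the key is absent.
                    if probe_dist(occ.hash, idx, self.mask) < dist {
                        return false;
                    }
                }
            }
            idx = (idx + 1) & self.mask;
            dist += 1;
        }
    }
}
```

The payoff is a much tighter probe-length distribution at high load factors, and early termination on negative lookups via the invariant in `contains`.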