quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License

Improve the SegmentWriter's HashMap #126

Closed: fulmicoton closed this issue 7 years ago

fulmicoton commented 7 years ago

Because tantivy's SegmentWriter uses a memory arena, tantivy does not use the Rust standard library's HashMap. The current implementation uses very simplistic linear probing with a djb2 hash.

If possible, improve the quality of the code there (without hurting performance), and replace linear probing with something more memory efficient (Robin Hood hashing?).
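
For reference, a minimal sketch of the scheme described above (a djb2 hash plus open addressing with linear probing). This is an illustration of the technique, not tantivy's actual code; the table size of 8 in `main` is arbitrary.

```rust
/// djb2 hash: h = h * 33 + byte, starting from 5381.
fn djb2(key: &[u8]) -> u64 {
    let mut hash: u64 = 5381;
    for &b in key {
        hash = hash.wrapping_mul(33).wrapping_add(b as u64);
    }
    hash
}

/// Linear probing: start at `hash % capacity` and walk forward one
/// bucket at a time until the key or an empty slot is found.
fn probe_slots(hash: u64, capacity: usize) -> impl Iterator<Item = usize> {
    let start = (hash as usize) % capacity;
    (0..capacity).map(move |i| (start + i) % capacity)
}

fn main() {
    let h = djb2(b"tantivy");
    // With a table of 8 buckets, the probe sequence wraps around the table.
    let probes: Vec<usize> = probe_slots(h, 8).take(4).collect();
    println!("hash = {h}, first probes = {probes:?}");
}
```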

fulmicoton commented 7 years ago

It might be more important than I initially thought. See #135. Performance degrades extremely fast when indexing a single field with very high cardinality.

arthurprs commented 7 years ago

I'm not sure Robin Hood hashing is a good fit when you don't store the hashes (you need the hashes of the other items on insert/delete, and optionally on lookup).
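
To illustrate the point: the Robin Hood insertion rule compares probe distances, which requires the resident entry's hash. Below is a minimal sketch assuming a hypothetical bucket layout that keeps the hash next to the arena offset; if the hash were not stored, it would have to be recomputed from the key behind an arena pointer (a memory jump).

```rust
struct Bucket {
    key_addr: u32, // offset of the key bytes in the memory arena (assumed layout)
    hash: u64,     // Robin Hood wants this (or the probe distance) at hand
}

fn probe_distance(hash: u64, slot: usize, capacity: usize) -> usize {
    let home = (hash as usize) % capacity;
    (slot + capacity - home) % capacity
}

/// Robin Hood rule at `slot`: if the incoming entry has probed farther
/// than the resident one, swap them and keep inserting the displaced entry.
fn should_displace(incoming_hash: u64, resident: &Bucket, slot: usize, capacity: usize) -> bool {
    probe_distance(incoming_hash, slot, capacity) > probe_distance(resident.hash, slot, capacity)
}

fn main() {
    let capacity = 8;
    let resident = Bucket { key_addr: 0, hash: 42 };
    // Hypothetical incoming entry currently probing slot 5.
    let displace = should_displace(7, &resident, 5, capacity);
    println!("displace resident at arena offset {}? {}", resident.key_addr, displace);
}
```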

fulmicoton commented 7 years ago

Good point. What do you think would be the best hashmap impl here?

fulmicoton commented 7 years ago

It should be properly measured, but it seems to me that collisions are extremely expensive when indexing the MovieLens dataset. One strategy we could use to mitigate this cost would be to add some info within the table (1 or 2 extra bytes of hash, for instance) to make sure that handling most collisions does not require jumping in memory to compare the actual strings.
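
A minimal sketch of that strategy, assuming a hypothetical bucket layout (not tantivy's actual one) with a 1-byte hash fingerprint stored next to the arena offset: most colliding probes are rejected by a one-byte comparison, and only fingerprint matches pay the pointer chase into the arena.

```rust
fn djb2(key: &[u8]) -> u64 {
    let mut hash: u64 = 5381;
    for &b in key {
        hash = hash.wrapping_mul(33).wrapping_add(b as u64);
    }
    hash
}

#[derive(Clone)]
struct Bucket {
    fingerprint: u8, // low byte of the hash, checked before touching the arena
    key_addr: usize, // offset of the key bytes in the arena
}

struct Map {
    buckets: Vec<Option<Bucket>>,
    arena: Vec<u8>, // toy stand-in for tantivy's memory arena
}

impl Map {
    fn key_bytes(&self, bucket: &Bucket) -> &[u8] {
        // Toy encoding: a 1-byte length prefix followed by the key bytes.
        let len = self.arena[bucket.key_addr] as usize;
        &self.arena[bucket.key_addr + 1..bucket.key_addr + 1 + len]
    }

    fn lookup(&self, key: &[u8]) -> Option<usize> {
        let hash = djb2(key);
        let fingerprint = hash as u8;
        let capacity = self.buckets.len();
        let start = (hash as usize) % capacity;
        for i in 0..capacity {
            let slot = (start + i) % capacity;
            match &self.buckets[slot] {
                None => return None, // empty slot: the key is absent
                Some(bucket) => {
                    // Cheap in-table check first; only on a fingerprint match
                    // do we jump into the arena to compare the actual bytes.
                    if bucket.fingerprint == fingerprint && self.key_bytes(bucket) == key {
                        return Some(slot);
                    }
                }
            }
        }
        None
    }
}

fn main() {
    // Arena holding one key, "doc", with a 1-byte length prefix.
    let arena = vec![3, b'd', b'o', b'c'];
    let hash = djb2(b"doc");
    let mut buckets: Vec<Option<Bucket>> = vec![None; 4];
    buckets[(hash as usize) % 4] = Some(Bucket { fingerprint: hash as u8, key_addr: 0 });
    let map = Map { buckets, arena };
    println!("lookup(\"doc\") -> {:?}", map.lookup(b"doc"));
}
```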

fulmicoton commented 7 years ago

We might also want to express the concept of saturation as a collision rate.
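
A rough sketch of what that could look like, with the counters and the 10% threshold chosen purely for illustration (they are not values taken from tantivy):

```rust
#[derive(Default)]
struct CollisionStats {
    probes: u64,     // total slots inspected
    collisions: u64, // probes that had to move past an occupied slot
}

impl CollisionStats {
    fn record(&mut self, probes_for_op: u64) {
        self.probes += probes_for_op;
        // Every probe beyond the first one is a collision.
        self.collisions += probes_for_op.saturating_sub(1);
    }

    /// Treat the table as saturated once the observed collision rate
    /// crosses a threshold, even if the raw load factor still looks fine.
    fn is_saturated(&self) -> bool {
        self.probes > 0 && (self.collisions as f64 / self.probes as f64) > 0.10
    }
}

fn main() {
    let mut stats = CollisionStats::default();
    for probes in [1, 1, 3, 1, 2] {
        stats.record(probes);
    }
    println!("saturated? {}", stats.is_saturated());
}
```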

fulmicoton commented 7 years ago

taking over this issue

arthurprs commented 7 years ago

Hi @fulmicoton, do you think this is still an issue? I may take a stab at it.

fulmicoton commented 7 years ago

Actually I think we are ok on that one now.