Open Kerollmops opened 2 years ago
We had a conversation with @MarinPostma, and we thought we could experiment with the standard binary heap and some `RefCell`s: storing the `EntryBoundAlignedBuffer` in a `RefCell` and an `Rc`, and using this `Rc<RefCell<EntryBoundAlignedBuffer>>` in the `EntryBound`s to be able to implement `PartialOrd`/`Eq` on them and push them into the binary heap.
Regarding the possibility of duplicate keys, I proposed that we store the `bounds_count` number directly in the `EntryBound` struct to be able to differentiate them. This number always increments and is reset when the sorter is dumped into the writer. We must be careful to call `Ordering::reverse`, as the standard binary heap is a max-heap and we want a min-heap.
We use grenad in Meilisearch, and when we benchmark and flamegraph the indexing process, which uses this library intensively, we can see that 12-15% of the time is spent in the `sort_unstable_by_key` call of the sorter. This sort is executed to ensure that the entries are ordered when we need to dump them all from memory into a writer.

According to the documentation, sorting the entries with `sort_unstable_by_key` is O(m n log(n)), where O(m) is the cost of the key function. As our key function, the function that computes our key, can be considered O(1), our current sorting solution should be O(n log(n)). Inserting an entry into our current data structure is O(1), but dumping into the writer is not.
https://github.com/meilisearch/grenad/blob/dd79a338b4be073913cb538b173769042188558d/src/sorter.rs#L246-L253
The sorter constraints
Some of the rules that apply to the sorter are:
Those constraints made me think of a more straightforward use of a binary heap, where we do not need to implement removing arbitrary values but only reading them in order, i.e. the binary heap's `into_sorted_iter`.

Using a binary heap
The only downside I see to using a heap is supporting entries with duplicate keys. One solution that keeps the algorithm simple on both sides (the binary heap insertion system and the current merging system) is to postfix every key with an incrementing big-endian number. This way, we keep both the order and the uniqueness of the entries. The downside of this solution is that it can impact memory usage; the impact depends on the workload and the size of the inserted keys.
According to the documentation of the standard binary heap, inserting a value is expected O(1), and popping values is O(log(n)).
Adding benchmarks
We need to introduce more benchmarks to measure our progress on this subject and avoid regressions. To do so, we need to find suitable real-world datasets or find a way to generate them properly. Measuring is complex.