quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License
11.75k stars 649 forks source link

Merge policy works unexpectedly for uncommitted documents #2454

Closed patrik-simunic-cz closed 1 month ago

patrik-simunic-cz commented 1 month ago

Describe the bug

Context: I'm trying to drastically reduce the number of segments in an index to reduce IO latency (we're running Tantivy in Lambda with EFS attached, so network latency is an issue with 100(0)s of files). I've written a custom merge policy intended to merge segments with less then some threshold (a few thousands) of documents. In the Lambda (test/prod) we commit after every change (delete + add document), however during local testing, I add all (about 22k) documents at once and only then call commit once.

What I noticed locally is that with the custom merge policy didn't really work. So I tried to use the MergeWheneverPossible merge policy that I found in tests here: https://github.com/quickwit-oss/tantivy/blob/main/src/indexer/merge_policy.rs#L48 To my surprise, the MergeWheneverPossible merge policy produced even more segments than the default one.

Looking at the implementation at indexer/segment_updater.rs#L544, I see that committed and uncommitted segments are handled separately. I guess that makes sense. However, I wasn't trying to merge committed and uncommitted segments together - all 22k documents were uncommitted, yet it seems the MergeWheneverPossible merge policy didn't apply to all those uncommitted documents.

When I tried to (locally) call commit after every single add_document, it run as a charm. I mean, true, the indexing took more than an hour, but the number of emitted segments successfully dropped to just a few segments. So I guess the merge policy is correctly applied to committed documents, it's just not applied correctly to all those tens of thousands of uncommitted documents.

When I tried to log what's going on, the logs confirmed an issue:

fn compute_merge_candidates(&self, segment_metas: &[SegmentMeta]) -> Vec<MergeCandidate> {
    println!("Segment metas (# = {}): {:?}", segment_metas.len(), segment_metas);
    ...
}

When commit was called after every change, segment_metas was indeed a vector full of segments to be considered for merging. However, when commit was called only once, after adding 22k documents, then the segment_metas vector never contained more then a single item. Sometimes it was empty, sometimes it contained 1 single item. Therefore no merge could really be considered.

I don't think it's a problem for our use-case, since we're committing after every change anyway (as SQS records keep coming). It's just something I noticed under the unique conditions of local testing. Although, I imagine not everyone commits after every change, so there might exist some use-cases in which this might cause some unexpected behaviour.

Which version of tantivy are you using? tantivy = "0.22.0"

To Reproduce

I will try to remember to create a minimal reproducible demo once I have some capacity, but I hope for the moment the above description is sufficient.

patrik-simunic-cz commented 1 month ago

Alright, something even fishier is going on. It seem the issue is not even related to uncommitted documents. I once again tried to index all 22k documents one by one, calling commit after every add_document. I tried it twice - once with the MergeWheneverPossiblePolicy:

#[derive(Debug, Clone)]
pub struct MergeWheneverPossiblePolicy {}

impl MergeWheneverPossiblePolicy {
    pub fn new() -> Self {
        MergeWheneverPossiblePolicy{}
    }

    pub fn as_box(self) -> Box<Self> {
        return Box::new(self);
    }
}

impl MergePolicy for MergeWheneverPossiblePolicy {
    fn compute_merge_candidates(&self, segment_metas: &[SegmentMeta]) -> Vec<MergeCandidate> {
        let segment_ids = segment_metas
            .iter()
            .map(|segment_meta| segment_meta.id())
            .collect::<Vec<SegmentId>>();

        if segment_ids.len() > 1 {
            return vec![MergeCandidate(segment_ids)];
        }

        vec![]
    }
}

and second time with my own MarlinPolicy:

#[derive(Debug, Clone)]
pub struct MarlinMergePolicy {
    target_docs_per_segment: u32,
}

impl MarlinMergePolicy {
    pub fn new(target_docs_per_segment: u32) -> Self {
        MarlinMergePolicy{ target_docs_per_segment }
    }

    pub fn as_box(self) -> Box<Self> {
        return Box::new(self);
    }
}

impl MergePolicy for MarlinMergePolicy {
    fn compute_merge_candidates(&self, segment_metas: &[SegmentMeta]) -> Vec<MergeCandidate> {
        let mut merge_candidates: Vec<(u32, Vec<SegmentId>)> = Vec::new();

        'merge_segment_loop: for segment in segment_metas {
            let segment_id = segment.id();
            let num_docs = segment.num_docs();

            for group in merge_candidates.iter_mut() {
                if group.0 + num_docs < self.target_docs_per_segment {
                    group.1.push(segment_id);
                    continue 'merge_segment_loop;
                }
            }

            if merge_candidates.len() < 1 {
                merge_candidates.push((num_docs, vec![segment_id]));
            }
        }

        merge_candidates
            .into_iter()
            .map(|group| MergeCandidate(group.1))
            .collect::<Vec<MergeCandidate>>()
    }
}

When using the MergeWheneverPossiblePolicy, then segment_metas argument contains between 0-2 items. When 2 items are passed, a merge happens. It works.

But when I use the MarlinPolicy, then segment_metas argument always contains between 0-1 item. There are never 2 items passed down to the MarlinPolicy as they are to the MergeWheneverPossiblePolicy.

I'd be inclined to believe there is something wrong with my code, but I don't really see how I could have possibly cause Tantivy to never pass more then 1 item to the compute_merge_candidates method.

It's quite perplexing and I have no idea what's going on... 😆

fulmicoton commented 1 month ago

I totally forgot about this. Can you check if the merge operations are executed/ finish successfully/ get rejected when the merge finishes etc.?

patrik-simunic-cz commented 1 month ago

@fulmicoton I will tell you in the morning (mine 😅). This has caught my attention - first think tomorrow I will create a reproducible demo testing various scenarios and share it here.

As a side note: it might be worth not calling the MergePolicy's compute_merge_candidates at all and exit the consider_merge_options prematurely if there are less then 2 mergeable segments. It seems like unnecessary overhead (couple extra allocations and function calls) - true, really tiny overhead, but still.

Btw. awesome work, Tantivy is amazing! I still can't believe (after merging the segments) how blazingly fast the queries are even with IO over the network (double ms digits - constantly) and how little memory the Lambda consumes. ❤️

patrik-simunic-cz commented 1 month ago

@fulmicoton Alright. Apologies for the the late response. As far as I understand your question, I have not encountered any merging operation errors/rejections (per se), however:

First thing on Thursday, I wrote a program to test different conditions and started the execution. We then had an inter-company party and when I came back some 12h later, one of the program variations was still running. It seems an infinite loop was somehow created.

Reproducible demo

You can find the reproducible demo here: https://github.com/testuj-to/tantivy-merge-policy-demo It contains different variations, stats from these runs, observations and some conclusions.

~What I think is going on:~

PSeitz commented 1 month ago

I think the issue here is to try to merge single segments although there is nothing to do (no deletes). So they get marked as in merge, while the merge does nothing. In the next iteration they may still be in the merge pipeline and not considered for merge.

patrik-simunic-cz commented 1 month ago

I've run all the runs from the demo against the commit with a proposed fix and now all seems to be working like a clockwork. No suspected infinite loops detected, the merge policy was always called with either 0 or 2 candidates and every run had at least 1 call with 2 candidates => the merge policy took effect in every single case.

The demo repo has been updated to contain stats from all the runs against the commit with a proposed fix.

I'll reserve from making rush conclusions next time 😅