Open brainlock opened 4 years ago
Hi,
I'm working on the possibility of searching on a subset (only known at runtime) of the documents. Think searching (with facets) for example in the set of items already bought in the past, in a situation where I can't add this information to the index, but I get from the application, at query time, a set of documents to search in.
To identify the documents I can use a field which is the primary key on the application's side, and of type u64, indexed, stored as a fast field in the tantivy index.
My baseline implementation is good enough while we work on the API and are still integrating the library: it just adds a bunch of terms to the generated boolean query. I should add, I'm interfacing with tantivy programmatically, so I don't have any concern about the QueryParser. I was already generating a query programmatically before, now it just contains more terms when the application wants to restrict the search to a subset.
Preliminary question: have I missed some other way to achieve this with the current API?
Yes this is probably rather inefficient.
A way I can see to make this faster: instead of generating so many query terms, I tried implementing a `SetMembershipQuery`, which is very similar to `RangeQuery`. It carries the set of constraint values, and its associated `Weight` adds all the matching documents from a segment into a `BitSet` and returns a `ConstScorer`, just like `RangeQuery`.
This ends up being way, way faster, even with up to 10000 constraint values, which is absolutely enough for my use case.
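A minimal, self-contained sketch of the bitset-materialization idea (toy types, not tantivy's actual API; the inverted index is simulated with a `HashMap` and `matching_docs` is a hypothetical helper): for each value in the constraint set, walk its posting list and set one bit per matching doc, then iterate the bitset as the document set.

```rust
use std::collections::HashMap;

// Toy bitset over doc ids; tantivy's BitSetDocSet plays a similar role.
struct BitSet {
    words: Vec<u64>,
}

impl BitSet {
    fn with_capacity(num_docs: u32) -> Self {
        BitSet { words: vec![0; (num_docs as usize + 63) / 64] }
    }

    fn insert(&mut self, doc: u32) {
        self.words[(doc / 64) as usize] |= 1u64 << (doc % 64);
    }

    // Yields the set doc ids in increasing order.
    fn iter(&self) -> impl Iterator<Item = u32> + '_ {
        (0..self.words.len() as u32 * 64).filter(move |&d| {
            self.words[(d / 64) as usize] & (1u64 << (d % 64)) != 0
        })
    }
}

// Simulated inverted index: value -> sorted doc ids containing it.
// The union of the postings comes out deduplicated and sorted for free.
fn matching_docs(index: &HashMap<u64, Vec<u32>>, constraint: &[u64], num_docs: u32) -> Vec<u32> {
    let mut bitset = BitSet::with_capacity(num_docs);
    for value in constraint {
        if let Some(postings) = index.get(value) {
            for &doc in postings {
                bitset.insert(doc);
            }
        }
    }
    bitset.iter().collect()
}

fn main() {
    let mut index = HashMap::new();
    index.insert(10u64, vec![0u32, 3]);
    index.insert(20u64, vec![1]);
    index.insert(30u64, vec![2, 4]);
    // Constraint set {10, 30}: union of the two posting lists.
    assert_eq!(matching_docs(&index, &[10, 30], 5), vec![0, 2, 3, 4]);
}
```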
Question: do you think this would be a sensible addition to tantivy itself?
This could be interesting, yes. One way to make it generic would be a `MaterializeToBitSetQuery` wrapper that takes another query and materializes it entirely into a `BitSet`.
> If no, in order to ship this, I would need a few types that are ATM not public (`BitSetDocSet` and `BitSet` for sure; `ConstScorer` is, I think, already public).
We can make those public, but I'd rather expose them as a `MaterializeToBitSetQuery` as described above.
Another possible alternative would be to use a fast field. You could store your doc ids as a fast field (this has a RAM cost of number of docs × minimal number of bits needed to represent your ids), and filter on that. Filtering can be done as a query wrapper or at the collector level.
You would then look up your ids in a hashset, or, if they are small, in a bitset. The right solution is really a matter of the cardinalities at stake here: how many documents get dropped due to this, and how many terms we are filtering on.
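A sketch of that fast-field alternative (the fast field is simulated as a plain slice of per-doc values, and `filter_by_set` is a hypothetical helper, not a tantivy function): scan all docs, read each doc's value, and keep the docs whose value is in the allowed set.

```rust
use std::collections::HashSet;

// Simulated fast field: one u64 value per doc. Tantivy's fast field
// readers give similar per-doc random access to a column of values.
fn filter_by_set(fast_field: &[u64], allowed: &HashSet<u64>) -> Vec<u32> {
    fast_field
        .iter()
        .enumerate()
        // Keep only docs whose stored value is in the allowed set.
        .filter(|(_, value)| allowed.contains(*value))
        .map(|(doc, _)| doc as u32)
        .collect()
}

fn main() {
    let fast_field = vec![10u64, 20, 10, 30, 40];
    let allowed: HashSet<u64> = vec![10u64, 30].into_iter().collect();
    assert_eq!(filter_by_set(&fast_field, &allowed), vec![0, 2, 3]);
}
```

The trade-off mentioned above is visible here: this scan is O(number of docs) regardless of how selective the constraint is, while the inverted-index approach is driven by the posting lists of the constraint values.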
There is probably a sweet spot where it is faster to ask for more than the top K results and postfilter the results that are irrelevant.
Hi @fulmicoton,
thanks for your reply. I'm not sure I see a way of materializing a query directly, as query and weight objects are disconnected, and the actual work of filtering happens in the `Weight`. So in a scheme where the `Query` is instead responsible for filtering valid values, I'd need to keep a reference to the `Query` from the `Weight` object. Did you mean we could expose a `MaterializedWeight`, instead?
This would implement the `scorer` method, i.e. it would get a `&SegmentReader` and would have to produce the valid matches for that segment.
My next question then would be how we could return those matches appropriately. `RangeWeight` and `AutomatonWeight` both use a `TermStreamer`, but a generalized way would be to allow returning the matching `DocId`s directly. Maybe in a `BitSet`?
(Aside: RangeWeight and AutomatonWeight could also be refactored to use this common materialization strategy)
> I'm not sure I see a way of materializing a query directly, as query and weight objects are disconnected, and the actual work of filtering happens in the `Weight`. So in a scheme where the `Query` is instead responsible for filtering valid values, I'd need to keep a reference to the `Query` from the `Weight` object. Did you mean we could expose a `MaterializedWeight`, instead?
> This would implement the `scorer` method, i.e. it would get a `&SegmentReader` and would have to produce the valid matches for that segment.
I am thinking of implementing both a `MaterializedQuery` and a `MaterializedWeight`, yes. The `MaterializedWeight` does not actually need to be pub, since `MaterializedQuery` needs to return a boxed trait object:
```rust
pub struct MaterializedQuery {
    wrapped: Box<dyn Query>,
}

pub(crate) struct MaterializedWeight {
    wrapped: Box<dyn Weight>,
}
```
All the work is done in the `MaterializedWeight`, which produces a `BitSetDocSet` or something similar that gets some score information.
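A simplified sketch of that materialization step (stand-in traits, not tantivy's real `Weight`/`Scorer` signatures, which take a `&SegmentReader` and carry scores): drain the wrapped docset once, record every matching doc, and hand back a docset that can be replayed with a constant score.

```rust
// Simplified stand-in for tantivy's DocSet/Scorer traits.
trait DocSet {
    /// Returns the next matching doc id, or None when exhausted.
    fn next_doc(&mut self) -> Option<u32>;
}

struct VecDocSet {
    docs: Vec<u32>,
    cursor: usize,
}

impl DocSet for VecDocSet {
    fn next_doc(&mut self) -> Option<u32> {
        let doc = self.docs.get(self.cursor).copied();
        self.cursor += 1;
        doc
    }
}

/// Drain the wrapped docset into a materialized docset.
/// In the real design this is where a BitSetDocSet would be built
/// and wrapped in a constant scorer.
fn materialize(mut wrapped: Box<dyn DocSet>) -> VecDocSet {
    let mut docs = Vec::new();
    while let Some(doc) = wrapped.next_doc() {
        docs.push(doc);
    }
    VecDocSet { docs, cursor: 0 }
}

fn main() {
    let wrapped = Box::new(VecDocSet { docs: vec![1, 4, 7], cursor: 0 });
    let mut materialized = materialize(wrapped);
    let mut collected = Vec::new();
    while let Some(doc) = materialized.next_doc() {
        collected.push(doc);
    }
    assert_eq!(collected, vec![1, 4, 7]);
}
```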
> (Aside: `RangeWeight` and `AutomatonWeight` could also be refactored to use this common materialization strategy)
I did not investigate much but yes I think this is correct.
The more I think about this, the less sure I am that such query optimization should be handled outside of tantivy.
Introducing a `MaterializedWeight` would mean that the user of tantivy has to detect such optimization opportunities, in my case replacing a big `AND(f:x OR f:y OR f:z OR ...)` with a materialized `BitSetDocSet`.
Alternatively, tantivy itself could do this optimization: the `scorer` method for `BooleanWeight` is where things get slow when working with thousands of small subqueries. We could extend the `complex_scorer()` method with the ability to detect this pattern and replace it with a materialized docset, wrapped in a constant scorer.
The pattern is `(Occur::Must, BooleanWeight(subclauses))`, where `subclauses` is made entirely of `(Occur::Should, TermWeight)` clauses and is big enough that this becomes worth it. Additionally, all the `TermWeight`s have `index_record_option = IndexRecordOption::Basic` (i.e. the user is only interested in whether there is a match or not, if I'm reading this right).
Its replacement would be `(Occur::Must, MaterializedWeight)`, where `MaterializedWeight::scorer` builds a `ConstScorer(BitSetDocSet)` containing all the docs in the segment that matched one of the terms in the original pattern.
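The detection step could look roughly like this (a toy query AST, not tantivy's types; `min_terms` stands in for the "big enough" threshold discussed below):

```rust
// Toy query AST to illustrate the pattern detection.
enum Occur {
    Must,
    Should,
}

enum Query {
    Term(u64),
    Boolean(Vec<(Occur, Query)>),
}

/// If `clause` is a Must over a pure disjunction of at least `min_terms`
/// term queries, return the terms so the caller can materialize them.
/// Any non-Should or non-Term subclause disqualifies the pattern.
fn detect_term_disjunction(clause: &(Occur, Query), min_terms: usize) -> Option<Vec<u64>> {
    let (Occur::Must, Query::Boolean(subclauses)) = clause else {
        return None;
    };
    let mut terms = Vec::new();
    for (occur, sub) in subclauses {
        match (occur, sub) {
            (Occur::Should, Query::Term(t)) => terms.push(*t),
            _ => return None,
        }
    }
    (terms.len() >= min_terms).then_some(terms)
}

fn main() {
    let clause = (
        Occur::Must,
        Query::Boolean(vec![
            (Occur::Should, Query::Term(10)),
            (Occur::Should, Query::Term(20)),
            (Occur::Should, Query::Term(30)),
        ]),
    );
    // Pattern matches with a low threshold, but not with a high one.
    assert_eq!(detect_term_disjunction(&clause, 2), Some(vec![10, 20, 30]));
    assert_eq!(detect_term_disjunction(&clause, 8), None);
}
```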
Cons:
- The cost of searching for the pattern would be paid by every query, though I can't estimate without measuring how big it would be in practice. Maybe it can be made small enough that it's not relevant.
- "Big enough that this becomes worth it": how do we determine this? At the beginning a hardcoded number would maybe be good enough, but this begs the question of whether we can compare the cost of the many small term queries with the cost of the materialized weight.
- There are many conditionals in my description of the optimization above: a specific pattern, "is big enough", the caller only being interested in whether the document contains the term or not. Maybe I'm the only one who actually needs this optimization :-)

One option would be to leave aside the question of "is big enough" and always replace a disjunction of `TermWeight`s with the materialized version.
In the end, it really boils down to where you want to draw the line between tantivy and its user code: do you think tantivy should try to optimize a case like this, or should the user code?
In my particular use case, I can tell you that it's very easy to detect and optimize this on my side, so from my perspective it's not worth complicating tantivy for this (as long as I have a way to implement a materialized weight/scorer somehow :-)).
I thought it was worth discussing this before spending more time optimizing this case, but feel free to ignore this comment if it's out of the scope of tantivy as a library.
@brainlock Sorry I have been very sloppy yes. Thank you so much for allocating brain time to this. You are absolutely right.
You can check out `scorer_union` in `boolean_weight`.
There is already a special case when there is only one term (of course). I am not entirely sure what the right heuristic is, but I think you can just add a special case for when there are more than N terms or so. I suspect N = 8 is an OK boundary.
Do not necessarily expect a super major performance boost, however. The way the union is computed by default is similar, but works over a small horizon: we add all of the legs to a bitset that covers `[0..130_000]` docs. For a segment smaller than 130K docs, the difference should, in theory, be really tiny.
For larger segments and a high number of terms, it may or may not be useful. Can you implement it and report if you observe a benefit on your use case? Do you need help?
Thanks for the pointer to scorer_union! I wasn't sure how to deal with trait objects when I want to detect a specific concrete type... that's easier than I thought :)
OK cool, I tried to implement this, and realized that we can do it at the Weight level, producing a specialized Scorer, or maybe at the Query level, producing a specialized MaterializedTermQuery.
I pushed a proof of concept here which does it on the BooleanQuery (but the same can be done in a very similar way at the Weight level I think): https://github.com/brainlock/tantivy/commit/dcc21b80fa7441c613ed9184c0368b51ead89054
This detects `(Occur::Must, BooleanQuery(...))` where all subclauses are `(Occur::Should, TermQuery)` with `index_record_option == IndexRecordOption::Basic`, and replaces them with a `MaterializedTermQuery`.
Warning: code quality is very proof of concept. But it's enough to show the idea and to verify that it does indeed have a huge impact on my use case.
If we go through with this I'd like to also add a proper benchmark, but for now this is my very unscientific result:
- baseline: a query which does full-text search, plus a term like `some_attribute=12`, and collects lots of facets: ~2ms on my machine.
- same query as above, plus a restriction to search on a set of 3000 product ids, implemented as 3000 `(Occur::Should, prodid=x)` clauses in a boolean query, without the optimization: ~70ms.
- same query as above, with the optimization which detects the 3000 `prodid` terms and replaces them with a `MaterializedTermQuery`: ~8ms.
8ms is absolutely enough for my use case.
Can you comment directly on the code if I don't make a draft PR?
If you don't want to include the query-rewriting I can easily do this on my side, all I need is MaterializedTermQuery or a way to implement it on my side.
To sum up what we found while looking at my commit (here: https://github.com/brainlock/tantivy/commit/dcc21b80fa7441c613ed9184c0368b51ead89054#r35354810) and to revive the discussion of this issue:
It seems like we cannot automatically rewrite a union of `TermQuery` on the same field into a `SetMembershipQuery`, as the two would not be equivalent w.r.t. scoring: a union scorer would still score depending on what actually matched.
We need a way to determine that the user is not interested in scoring at all: my attempt at it with `is_match_only` is not good enough, due to the characteristics of union scorers described above.
Do you see a way for us to be able to infer this while looking at the query only?
If not, we could go the route taken by Lucene on the matter. In Lucene, there is a TermInSetQuery which is explicitly defined as being equivalent to a disjunction of Term queries but with a constant scorer. This allows the user to express the intent of only being interested in the filtering.
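To illustrate the semantic difference on a toy example (illustrative scoring only, nothing like Lucene's or tantivy's actual formulas): a disjunction's score depends on which terms matched, while a set-membership query reports the same constant score for any match, which is exactly why the two cannot be rewritten into each other automatically.

```rust
use std::collections::HashSet;

/// Disjunction of term queries: the score depends on what matched.
/// Here, illustratively, one point per matching term.
fn disjunction_score(doc_terms: &[u64], query_terms: &[u64]) -> f32 {
    query_terms
        .iter()
        .filter(|t| doc_terms.contains(*t))
        .count() as f32
}

/// TermInSetQuery-style semantics: same matching doc set,
/// but a constant score regardless of how many terms matched.
fn term_in_set_score(doc_terms: &[u64], query_terms: &HashSet<u64>) -> Option<f32> {
    doc_terms
        .iter()
        .any(|t| query_terms.contains(t))
        .then_some(1.0)
}

fn main() {
    let doc = vec![10u64, 20, 30];
    let terms = vec![10u64, 30, 40];
    let set: HashSet<u64> = terms.iter().copied().collect();
    // Two terms match, so the disjunction score reflects that...
    assert_eq!(disjunction_score(&doc, &terms), 2.0);
    // ...while the set-membership query only reports "matched".
    assert_eq!(term_in_set_score(&doc, &set), Some(1.0));
}
```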
The first thing I would implement is a TermInSetQuery: this is absolutely good enough for my use case.
One further step would be to also introduce something that allows us to wrap any query and make it return a constant score (i.e. only check for "did it match or not"). This would be the MaterializedQuery you initially proposed.
Doing this would give the user more freedom: the same constraint (filter by "field X is in {a, b, c, d ...}") could then be expressed equivalently with a TermInSetQuery or with a MaterializedQuery(BooleanQuery(...)).
What are your thoughts on this?
Thanks!
The size of this constraint varies quite a lot. It's sometimes as small as 10 or 20 IDs, while it can get into the thousands for some use cases.
When generating thousands of boolean query terms, things get slower than they could be :-) At a first cursory glance, a lot of time is spent in generating the combined scorer for all these query terms (the `complex_scorer` method, if I recall).
If you think this is a sensible addition, I can continue working on this and make a PR, but it will first require some discussion and some benchmarking. In particular, I experimented with two ways of implementing `SetMembershipQuery`: one iterates over the valid values and uses the inverted index to fetch the matching docs, like `RangeQuery`. Another walks over the full set of `doc_ids_alive()` in a segment, fetching the value (using a fast field reader) and checking whether it's in the allowed set. The former seems faster, but I'm not sure what happens when the number of values in the constraint goes up a lot. At some point I would expect the latter method to be faster, but I haven't measured this at all yet, so it's just guessing.
Any thoughts?
Thank you in advance!
Albe