quickwit-oss / tantivy-py

Python bindings for Tantivy
MIT License

[feature request] Adding boolean query method #240

Closed: alex-au-922 closed this issue 5 months ago

alex-au-922 commented 5 months ago

Boolean queries can already be built through index.parse_query, as long as we use the right operator characters, such as + for must and - for must_not.
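As a rough illustration of the string-based approach, the sketch below assembles such a query string from (occur, field, text) clauses using only the standard library. The helper build_query_string and the clause tuples are hypothetical and not part of tantivy-py; the resulting string is what would then be handed to index.parse_query.

```python
# Hypothetical helper, not part of tantivy-py: builds a query string in the
# tantivy query language from simple (occur, field, text) clauses.
PREFIX = {"must": "+", "must_not": "-", "should": ""}


def build_query_string(clauses):
    """Join clauses into a single string using + and - prefixes."""
    parts = []
    for occur, field, text in clauses:
        # Quote multi-word text so that it is treated as a phrase.
        term = f'"{text}"' if " " in text else text
        parts.append(f"{PREFIX[occur]}{field}:{term}")
    return " ".join(parts)


query_string = build_query_string(
    [("must", "title", "sea whale"), ("must_not", "body", "dog")]
)
print(query_string)  # +title:"sea whale" -body:dog
```

Note that this sketch does not escape special characters in the text, so a real helper would need to handle that by hand as well.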

However, there could be cases where users would like to create their inner query dynamically, or where, for the sake of readability, they would like a container for their other query types such as FuzzyTermQuery and PhraseQuery.

Currently the Rust tantivy crate allows creating a boolean query from the struct tantivy::query::BooleanQuery. Could tantivy-py also provide a boolean_query static method on the Query class?

cjrh commented 5 months ago

However, there could be cases where users would like to create their inner query dynamically, or where, for the sake of readability, they would like a container for their other query types such as FuzzyTermQuery and PhraseQuery.

I agree with you that parse_query covers a lot of ground, using the tantivy query language. With my maintainer hat on, I see that as less code in tantivy-py, compared to adding extra explicit query types. However, the request does come up a fair bit and so I was wondering whether you could describe a specific use case here?

alex-au-922 commented 5 months ago

Sure. To give us common ground for the discussion, consider the following Elasticsearch query:

{
    "query": {
        "bool": {
            "must": [
                {
                    "dis_max": {
                        "queries": [
                            {
                                "match": {
                                    "title": {
                                        "query": "sea whale",
                                        "boost": 2
                                    }
                                }
                            },
                            {
                                "match": {
                                    "body": {
                                        "query": "white dog",
                                        "boost": 1.5
                                    }
                                }
                            }
                        ],
                        "tie_breaker": 0.3
                    }
                }
            ]
        }
    }
}

It is impossible to construct this query with the current parse_query method, because the tantivy query language cannot currently express other query types such as regex or disjunction max queries. However, this functionality is available through Rust's BooleanQuery and PyLucene's equivalent.
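For context on why the dis_max part cannot be emulated with + and - clauses: as documented for Elasticsearch's dis_max query, a document's score is the score of its best-matching clause plus tie_breaker times the scores of the other matching clauses. A small sketch of that arithmetic follows; the function name and the per-clause scores are made up for illustration.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Combine per-clause scores the way a disjunction max query does."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    # Every matching clause other than the best one is scaled by tie_breaker.
    others = sum(clause_scores) - best
    return best + tie_breaker * others


# Made-up scores for the "title" and "body" clauses of a single document.
score = dis_max_score([2.0, 1.0], tie_breaker=0.3)
print(score)
```

With tie_breaker set to 0 only the best clause counts, and with 1 the result is the plain sum of all clause scores.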

For tantivy-py's case, we might consider the following function signature:

from typing import Iterable

class Query:
    ...
    @staticmethod
    def boolean_query(subqueries: Iterable[tuple[Occur, "Query"]]) -> "Query":
        ...

This requires exposing the Occur enum from the tantivy Rust package in tantivy-py.

The Elasticsearch query above can then be written as:

Query.boolean_query(
    [
        (
            Occur.MUST,
            Query.dis_max_query(
                [
                    Query.phrase_query("title", "sea whale", boost=2),
                    Query.phrase_query("body", "white dog", boost=1.5)
                ],
                tie_breaker=0.3
            )
        )
    ]
)

which provides three benefits to developers:

  1. Easier syntax (compared to tantivy's query language and even Elasticsearch) and better maintainability.
  2. Easier unit testing, since the query is built block by block instead of being crammed into a single query string.
  3. Access to query types that tantivy's query language does not currently support, e.g. ConstScoreQuery, DisjunctionMaxQuery, ExistsQuery.
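
To make the block-by-block testing point concrete, here is a minimal pure-Python stand-in for the proposed API. It is only a sketch: the Query and Occur classes below are hypothetical models that record their arguments, not the tantivy-py implementation, and phrase_query with a boost argument simply follows the example above.

```python
from dataclasses import dataclass
from enum import Enum


class Occur(Enum):
    # Modelled on the three variants of the Rust tantivy Occur enum.
    MUST = "must"
    SHOULD = "should"
    MUST_NOT = "must_not"


@dataclass(frozen=True)
class Query:
    """Stand-in query node: a kind plus its arguments, nothing more."""

    kind: str
    args: tuple

    @staticmethod
    def phrase_query(field, text, boost=1.0):
        return Query("phrase", (field, text, boost))

    @staticmethod
    def dis_max_query(subqueries, tie_breaker=0.0):
        return Query("dis_max", (tuple(subqueries), tie_breaker))

    @staticmethod
    def boolean_query(subqueries):
        return Query("boolean", (tuple(subqueries),))


# Each block can be built and checked on its own before it is nested.
title = Query.phrase_query("title", "sea whale", boost=2)
body = Query.phrase_query("body", "white dog", boost=1.5)
dis_max = Query.dis_max_query([title, body], tie_breaker=0.3)
query = Query.boolean_query([(Occur.MUST, dis_max)])

assert dis_max.args[0] == (title, body)
assert query.args[0][0] == (Occur.MUST, dis_max)
```

Because every node is a plain value, a test can compare sub-queries directly instead of inspecting a generated query string.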

cjrh commented 5 months ago

Thanks for taking the time to write it out. You've explained it well 👍🏼

We are currently tracking progress on wrapping these query types in this comment in #20. I see BooleanQuery is already there along with the disjunction max and the regex query.

alex-au-922 commented 5 months ago

I've opened pull request #243 with the implementation.

cjrh commented 5 months ago

PR has been merged, thanks!