quickwit-oss / tantivy-py

Python bindings for Tantivy

Term Query is not tokenized (?) #296

Open afbarbaro opened 5 months ago

afbarbaro commented 5 months ago

I'm testing tantivy-py, which I'm finding pretty great. However, I bumped into what seems to be an issue with the Python package: term queries apparently are not tokenized when using the searcher.search(query, ..) method, so I can't really use the en_stem tokenizer (it is applied when indexing documents, but it is not exposed for me to tokenize the query text myself).

I'm testing tantivy-py with the Simple Wikipedia Example Set from Cohere, and here's what I see with a few sample queries:

Is this a "feature" or a "bug"? I don't mind tokenizing the query myself before calling the search method, but tokenizers are not exposed in the Python bindings.

Any suggestions?

cjrh commented 5 months ago

Hi @afbarbaro 😄

This is the old problem of how stemmers don't always do what we think they do.

Here's a small program I made to test queries quickly:

from tantivy import SchemaBuilder, Index, Document
schema = SchemaBuilder().add_text_field("body", stored=True, tokenizer_name="en_stem").build()
index = Index(schema=schema, path=None)
writer = index.writer(heap_size=15_000_000, num_threads=1)
doc = Document()
doc.add_text("body", "lovely freshness older")
writer.add_document(doc)
doc = Document()
doc.add_text("body", "Australian monarchy")
writer.add_document(doc)
doc = Document()
doc.add_text("body", "Titanic sinks")
writer.add_document(doc)
writer.commit()
index.reload()

searcher = index.searcher()

def find(query_text: str):
    q = index.parse_query(query_text, ["body"])
    if hits := searcher.search(q).hits:
        for _, doc_address in hits:
            doc = searcher.doc(doc_address)
            print(f"{query_text} hit: {doc['body']}")
    else:
        print(f"{query_text} not found")

# Run with `python -i main.py`

The idea is to run it with python -i main.py, so that when the script ends it leaves you in an interpreter where you can test using the find() function.

Indeed, as you saw, monarchy does not stem to monarch:

$ python -i main.py
>>> find('monarch')
monarch not found
>>> find('monarchs')
monarchs not found
>>> find('monarchy')
monarchy hit: ['Australian monarchy']

However, you'll see that if we search for Australians (with an "s"), which was not the term indexed, we'll get a hit:

>>> find('Australians')
Australians hit: ['Australian monarchy']

This means that tantivy-py is indeed stemming the query text. I checked the code, and the query parser does make use of the tokenizers registered on the fields.

Stemming can be very surprising, for example:

>>> find('monarchies')
monarchies hit: ['Australian monarchy']

Did you expect monarchies to work? The reason this happens is that the shared stem of both monarchy and monarchies is in fact monarchi, which you can verify on the demo page for the Snowball stemmer: https://snowballstem.org/demo.html

[screenshot: the Snowball stemmer demo showing that "monarchy" and "monarchies" both stem to "monarchi"]
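If you want to check stems locally instead of using the demo page, the PyStemmer package (a separate library, not part of tantivy-py; this is just a hedged side-check) wraps the same family of Snowball algorithms that en_stem uses:

import Stemmer  # assumption: installed via `pip install PyStemmer`

stemmer = Stemmer.Stemmer("english")
print(stemmer.stemWords(["monarch", "monarchy", "monarchies"]))
# expected: ['monarch', 'monarchi', 'monarchi']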

In my test I do get a hit using Titan so I'm not sure why that doesn't work for you:

>>> find('Titan')
Titan hit: ['Titanic sinks']
>>> find('Titans')
Titans hit: ['Titanic sinks']
>>> find('Titanics')
Titanics hit: ['Titanic sinks']
>>> find('titanics')
titanics hit: ['Titanic sinks']

Sometimes stemming can be very frustrating. For example, in the first document you would think older should stem to old but this doesn't happen:

>>> find('old')
old not found
>>> find('older')
older hit: ['lovely freshness older']

Even oldest doesn't stem to the same root as older:

>>> find('oldest')
oldest not found

This is because the stem for older is just older.

[screenshot: the Snowball stemmer demo showing that the stem of "older" is just "older"]
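The same side-check (again assuming PyStemmer, as in the sketch above) shows that none of these forms share a stem:

import Stemmer  # assumption: installed via `pip install PyStemmer`

stemmer = Stemmer.Stemmer("english")
print(stemmer.stemWords(["old", "older", "oldest"]))
# expected: ['old', 'older', 'oldest'] -- no common stem, so none of them match the others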

afbarbaro commented 5 months ago

@cjrh thanks so much for the detailed explanation. I now see the difference between your example and what I was doing: you're using index.parse_query(query_text) and I was building the query with this:

from tantivy import Occur, Query  # `terms` here is my list of raw query terms

subqueries = [(Occur.Should, Query.term_query(index.schema, 'body', term)) for term in terms]
query = Query.boolean_query(subqueries)

so I see that the stemming is indeed applied to the query text when using index.parse_query but it is not applied to the text given to Query.term_query. Does this make sense? And is this by design?
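To make the difference concrete, here is a quick check, a sketch that assumes the index and searcher from your script above are still in scope:

from tantivy import Query

# The parsed query goes through the field's en_stem tokenizer; the raw term query does not.
q_parsed = index.parse_query("Australians", ["body"])
q_term = Query.term_query(index.schema, "body", "Australians")
print(len(searcher.search(q_parsed).hits))  # expected: 1 (stemmed/lowercased to "australian")
print(len(searcher.search(q_term).hits))    # expected: 0 (no literal term "Australians" in the index)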

One of the reasons I was constructing the query myself is that in the Python version of index.parse_query you cannot pass the additional arguments available in the underlying Rust implementation, particularly the map that configures fuzzy search for a given field.

So I guess my question now is: should term_query stem? If not, should there be a way to access the tokenizer for the field so I can invoke it before giving the term to term_query?

Again, thank you for your help!

afbarbaro commented 5 months ago

Also, for this:

should term_query stem? if not, should there be a way to access the tokenizer for the field and I can invoke it before giving the term to term_query.

There is a hack I can use, at least temporarily, to get the stemmed terms: run parsed = index.parse_query(query_text), then cast the parsed variable to a str and pull the stemmed words for each term query out of that. Pretty nasty, but it gets me unblocked.
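Roughly, the hack looks like this (a sketch only; the string form of a parsed Query is an implementation detail, so the regex is a guess and may break between versions):

import re

def stemmed_terms(index, query_text, field="body"):
    parsed = index.parse_query(query_text, [field])
    # str(parsed) embeds each term's stemmed text in double quotes
    # (the exact format is not guaranteed).
    return re.findall(r'"([^"]+)"', str(parsed))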

cjrh commented 5 months ago

so I see that the stemming is indeed applied to the query text when using index.parse_query but it is not applied to the text given to Query.term_query. Does this make sense? And is this by design?

Ah, I see. To answer your question: this is not by design. Tantivy-py has been built by a fairly large number of volunteers and drive-by contributors over the years, so there is relatively little that is specifically "designed". Earlier on we wanted to avoid adding all the fine-grained query classes (and other classes) and just have the parse_query function, but in the end this proved too limiting, so recently we started adding those classes.

I think more people will come across this now that we have the Query class, so we'll need some kind of answer for how to do stemming on the query here. If you were using the upstream tantivy in Rust, what would the approach be? As a general guiding principle, I'd like tantivy-py to be a recognisable wrapper of tantivy, so that familiarity with one allows familiarity with the other. Python in general has had great success with this approach of wrapping C libraries over the decades.

cjrh commented 5 months ago

should term_query stem?

Maybe.

It wraps TermQuery, so I'd want to know how stemming is handled in tantivy first, before deciding on a path for us.

cjrh commented 5 months ago

We do have the fuzzy map in parse_query, at least in main:

[screenshot: the parse_query signature in main, including the fuzzy-fields argument]
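For reference, a hedged sketch of what a call might look like, assuming the argument is named fuzzy_fields and maps a field name to a (prefix, distance, transpose_cost_one) tuple (check the signature in main before relying on this):

q = index.parse_query(
    "monarky",  # deliberate misspelling to exercise fuzzy matching
    ["body"],
    fuzzy_fields={"body": (False, 1, True)},  # tuple shape is an assumption
)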

afbarbaro commented 5 months ago

Thanks @cjrh. The reason I didn't see these additional args being there is that the Python type stubs are outdated.

https://github.com/quickwit-oss/tantivy-py/blob/e3de7b1aab90aacae04d87fcb0ad62c3e895f508/tantivy/tantivy.pyi#L364

Are these created by hand or by some process? Maybe fixing this is a simple PR I can contribute.

cjrh commented 5 months ago

Yes please, that would be great if you could update those type annotations.

cjrh commented 5 months ago

Are these created by hand or by some process? Maybe fixing this is a simple PR I can contribute.

Yes created by hand. It would be a simple PR.