afbarbaro opened this issue 5 months ago
Hi @afbarbaro 😄
This is the old problem of how stemmers don't always do what we think they do.
Here's a small program I made to test queries quickly:
from tantivy import SchemaBuilder, Index, Document

# In-memory index with the English stemming tokenizer on the "body" field.
schema = SchemaBuilder().add_text_field("body", stored=True, tokenizer_name="en_stem").build()
index = Index(schema=schema, path=None)
writer = index.writer(heap_size=15_000_000, num_threads=1)

doc = Document()
doc.add_text("body", "lovely freshness older")
writer.add_document(doc)

doc = Document()
doc.add_text("body", "Australian monarchy")
writer.add_document(doc)

doc = Document()
doc.add_text("body", "Titanic sinks")
writer.add_document(doc)

writer.commit()
index.reload()
searcher = index.searcher()

def find(query_text: str):
    # parse_query runs the field's tokenizer (including stemming) over the query text.
    q = index.parse_query(query_text, ["body"])
    if hits := searcher.search(q).hits:
        for _, doc_address in hits:
            doc = searcher.doc(doc_address)
            print(f"{query_text} hit: {doc['body']}")
    else:
        print(f"{query_text} not found")

# Run with `python -i main.py`
The idea is to run it with python -i main.py, so that when the script ends it leaves you in an interpreter where you can test using the find() function.
Indeed, as you saw, monarchy does not stem to monarch:
$ python -i main.py
>>> find('monarch')
monarch not found
>>> find('monarchs')
monarchs not found
>>> find('monarchy')
monarchy hit: ['Australian monarchy']
However, you'll see that if we search for Australians (with an "s"), which was not the term indexed, we'll get a hit:
>>> find('Australians')
Australians hit: ['Australian monarchy']
This means that tantivy-py is indeed stemming the query text. I checked the code and the query parser does make use of the tokenizers registered on the fields.
Stemming can be very surprising, for example:
>>> find('monarchies')
monarchies hit: ['Australian monarchy']
Did you expect monarchies to work? The reason this happens is because the shared stem of both monarchy and monarchies is in fact monarchi, which you can verify on this demo page for the Snowball stemmer: https://snowballstem.org/demo.html
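If you want to check stems locally rather than on the demo page, here is a minimal sketch; it assumes the third-party snowballstemmer package (pip install snowballstemmer), which implements the same Snowball English algorithm, and that en_stem lowercases tokens before stemming:

# Quick local check of what the Snowball English stemmer produces.
# snowballstemmer is a separate package, not part of tantivy-py.
import snowballstemmer
stemmer = snowballstemmer.stemmer("english")
for word in ["monarch", "monarchs", "monarchy", "monarchies", "Australian", "Australians"]:
    # lowercase first, mirroring what en_stem does before stemming
    print(word, "->", stemmer.stemWord(word.lower()))

monarchy and monarchies both come out as monarchi while monarch and monarchs stay monarch, and Australian and Australians both reduce to australian, which lines up with the hits above.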
In my test I do get a hit using Titan so I'm not sure why that doesn't work for you:
>>> find('Titan')
Titan hit: ['Titanic sinks']
>>> find('Titans')
Titans hit: ['Titanic sinks']
>>> find('Titanics')
Titanics hit: ['Titanic sinks']
>>> find('titanics')
titanics hit: ['Titanic sinks']
Sometimes stemming can be very frustrating. For example, in the first document you would think older should stem to old, but this doesn't happen:
>>> find('old')
old not found
>>> find('older')
older hit: ['lovely freshness older']
Even oldest doesn't stem back to older:
>>> find('oldest')
oldest not found
This is because the stem for older is just older.
@cjrh thanks so much for the detailed explanation. I now see the difference between your example and what I was doing: you're using index.parse_query(query_text), whereas I was building the query myself like this:
subqueries = [(Occur.Should, Query.term_query(index.schema, 'body', term)) for term in terms]
query = tantivy.Query.boolean_query(subqueries)
So I see that the stemming is indeed applied to the query text when using index.parse_query, but it is not applied to the text given to Query.term_query. Does this make sense? And is this by design?
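To make the asymmetry concrete, here is a rough sketch of the workaround I'd need today: stem the terms myself before handing them to Query.term_query (the snowballstemmer package is my own addition here, and I'm assuming its english stems match what en_stem wrote into the index):

import snowballstemmer
from tantivy import Occur, Query

stemmer = snowballstemmer.stemmer("english")
terms = ["monarchies", "sinks"]
subqueries = [
    # lowercase + stem each term so it matches the tokens en_stem indexed
    (Occur.Should, Query.term_query(index.schema, "body", stemmer.stemWord(t.lower())))
    for t in terms
]
query = Query.boolean_query(subqueries)

With the raw terms, the same boolean query finds nothing, because "monarchies" and "sinks" never appear as indexed terms.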
One of the reasons I was constructing the query myself is that, in the Python version of index.parse_query, you cannot pass the additional arguments available in the underlying Rust implementation, particularly the map that controls fuzzy search for a given field.
So I guess my questions now are:
1. Should those additional arguments (like the fuzzy-search map) be exposed in parse_query?
2. Should term_query stem? If not, should there be a way to access the tokenizer for the field so I can invoke it before giving the term to term_query?

Again, thank you for your help!
Also, for this:
should term_query stem? If not, should there be a way to access the tokenizer for the field so I can invoke it before giving the term to term_query?
There is a hack that I can use, at least temporarily, to get the stemmed terms: run parsed = index.parse_query(query_text), then cast the parsed variable to a str and pull the stemmed words for each term query out of that. Pretty nasty, but it gets me unblocked.
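For the record, the hack looks roughly like this (the string form of a parsed query is an implementation detail, so the regex below is fragile and only a stopgap):

import re
# Parse once so the field tokenizer (en_stem) is applied, then scrape the stemmed
# terms back out of the query's string representation. Not a stable API.
parsed = index.parse_query("Australian monarchies", ["body"])
stemmed_terms = re.findall(r'"([^"]+)"', str(parsed))
print(stemmed_terms)  # expecting something like ['australian', 'monarchi']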
so I see that the stemming is indeed applied to the query text when using index.parse_query but it is not applied to the text given to Query.term_query. Does this make sense? And is this by design?
Ah I see. To answer your question, this is not by design. Tantivy-py has been built by a fairly large number of volunteers and drive-by contributors over the years, so there is relatively little that is specifically "designed". Earlier on we wanted to avoid adding all the fine-grained query classes (and other classes) and just have the parse_query function, but in the end this proved to be too limiting and so recently we started adding those classes in.
I think more people will come across this now that we have the Query class, so we'll need some kind of answer for how to do stemming on the query here. If you were using the upstream tantivy in Rust, what would the approach be? As a general guiding principle I'd like tantivy-py to be a recognisable wrapper of tantivy, so that familiarity with one allows familiarity with the other. Python in general has had great success with this approach of wrapping C libraries over the decades.
should term_query stem?
Maybe.
It wraps TermQuery, so I'd want to know how stemming is handled in tantivy first, before deciding on a path for us.
We do have the fuzzy map in parse_query, at least in main:
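Something along these lines should work on main (the exact parameter name and the shape of the per-field tuple are my reading of the current code, so double-check them):

# Sketch only: enable fuzzy matching for the "body" field when parsing the query.
# Assumed shape: fuzzy_fields maps field name -> (prefix, distance, transpose_cost_one).
q = index.parse_query(
    "monarky",
    ["body"],
    fuzzy_fields={"body": (False, 1, True)},
)
print(searcher.search(q).hits)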
Thanks @cjrh. The reason I didn't see these additional args is that the type annotations in the Python bindings are outdated. Are these created by hand or by some process? Maybe fixing this is a simple PR I can contribute.
Yes please that would be great if you can update those type annotations.
Are these created by hand or by some process? Maybe fixing this is a simple PR I can contribute.
Yes created by hand. It would be a simple PR.
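For reference, the stub update would be along these lines; this is only a sketch, and the parameter names and defaults should be copied from the actual Rust signature in main rather than from here:

# tantivy.pyi (sketch, not the real file contents)
class Index:
    def parse_query(
        self,
        query: str,
        default_field_names: list[str] | None = None,
        field_boosts: dict[str, float] = ...,
        fuzzy_fields: dict[str, tuple[bool, int, bool]] = ...,
    ) -> Query: ...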
I'm testing tantivy-py, which I'm finding pretty great. However, I bumped into what seems to be an issue with the Python package: it seems that term queries are not tokenized when using the searcher.search(query, ..) method, so I can't really use the en_stem tokenizer (since it's not exposed for me to tokenize the query, only the indexing of documents).

I'm testing tantivy-py with the Simple Wikipedia Example Set from Cohere, and here's what I see with a few sample queries:

Is this a "feature" or a "bug"? I don't mind tokenizing the query myself before calling the search method, but tokenizers are not exposed in the Python bindings.

Any suggestions?