Description:
The proximity search (~N) in Whoosh behaves inconsistently depending on which words appear in the indexed document. Single letters and common filler words appear to be ignored when the distance between the query terms is measured, whereas other words are counted.
Expected Behavior:
A search for "hello world"~2 should match strings where "hello" and "world" are separated by up to two terms, regardless of the nature or semantic value of the intervening terms.
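To make the expected semantics concrete, here is a minimal plain-Python sketch of the matching rule as this report reads it. It does not use Whoosh, and the helper name `matches_within`, the whitespace tokenization, and the in-order requirement are all assumptions made for illustration.

```python
def matches_within(text, first, second, max_between):
    """Return True if `second` follows `first` with at most `max_between`
    tokens in between. Every whitespace-separated token counts, whatever
    its length or meaning (this is the behavior the report expects)."""
    tokens = text.lower().split()
    first_positions = [i for i, tok in enumerate(tokens) if tok == first]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second]
    return any(
        0 < j - i <= max_between + 1
        for i in first_positions
        for j in second_positions
    )


# Two intervening tokens: expected to match under "~2" as read here.
print(matches_within("hello big wide world", "hello", "world", 2))
# Three intervening tokens: expected not to match.
print(matches_within("hello one two three world", "hello", "world", 2))
```

Under this reading, none of the three test strings in the MWE should match, since each has seven tokens between "hello" and "world".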
Actual Behavior:
The behavior of the proximity search appears inconsistent:
It matches strings like "hello X Y Z A B C D world" and "hello to a the but and for this world", even though seven terms separate "hello" and "world" in each.
It does not match "hello add more words to illustrate the problem world", which also has seven intervening terms, so here the ~2 constraint is applied as expected.
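The pattern above would be consistent with short tokens and filler words being dropped before positions are compared. The following plain-Python sketch models that hypothesis only; it is not Whoosh's code, and the stop list, the minimum token length, and the helper names are assumptions chosen for illustration.

```python
# Hypothetical model of the suspected cause; not taken from Whoosh.
ASSUMED_STOP_WORDS = {"a", "and", "for", "the", "this", "to"}  # illustrative list
ASSUMED_MIN_LENGTH = 2  # assumed minimum token length


def surviving_tokens(text):
    """Lowercase, split on whitespace, and drop tokens that are shorter
    than the assumed minimum length or in the assumed stop list."""
    return [
        tok
        for tok in text.lower().split()
        if len(tok) >= ASSUMED_MIN_LENGTH and tok not in ASSUMED_STOP_WORDS
    ]


def terms_between(tokens, first, second):
    """Count the tokens strictly between `first` and the next `second`."""
    start = tokens.index(first)
    end = tokens.index(second, start + 1)
    return end - start - 1


cases = [
    "hello X Y Z A B C D world",
    "hello to a the but and for this world",
    "hello add more words to illustrate the problem world",
]
gaps = [terms_between(surviving_tokens(c), "hello", "world") for c in cases]
for case, gap in zip(cases, gaps):
    print(f"{gap} surviving term(s) between hello and world: {case!r}")
```

With these assumed settings the first two strings keep at most one term between "hello" and "world", while the third keeps five, which lines up with the match results reported here. Whether Whoosh actually filters and renumbers tokens this way would need to be confirmed against its analyzer configuration.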
Minimal Working Example (MWE):
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in
from whoosh.qparser import QueryParser
from whoosh.filedb.filestore import RamStorage

schema = Schema(content=TEXT(stored=True))
storage = RamStorage()


def create_new_index():
    return storage.create_index(schema)


def add_to_index(idx, content):
    writer = idx.writer()
    writer.add_document(content=content)
    writer.commit()


def matches_whoosh(query, indexed_opinion):
    with indexed_opinion.searcher() as searcher:
        parsed_query = QueryParser("content", indexed_opinion.schema).parse(query)
        results = searcher.search(parsed_query)
        return len(results) > 0


# Add test cases and print results
test_cases = {
    "Case 1": "hello X Y Z A B C D world",
    "Case 2": "hello to a the but and for this world",
    "Case 3": "hello add more words to illustrate the problem world",
}

query = '"hello world"~2'

for case_name, content in test_cases.items():
    idx = create_new_index()
    add_to_index(idx, content)
    print(f"{case_name}: {matches_whoosh(query, idx)}")
Environment: