mchaput / whoosh

Pure-Python full-text search library

Inconsistent behavior of proximity search (`~N`) in Whoosh based on word type #48

Open bbernicker opened 12 months ago

bbernicker commented 12 months ago

Description: Proximity search (~N) in Whoosh behaves inconsistently depending on which words appear in the indexed document. Single letters and common filler words appear to be ignored when counting the terms between the query words, whereas semantically meaningful words are counted.

Expected Behavior: A search for "hello world"~2 should match strings where "hello" and "world" are separated by up to two terms, regardless of the nature or semantic value of the intervening terms.

Actual Behavior: The behavior of the proximity search appears inconsistent:

  1. It matches strings like "hello X Y Z A B C D world" and "hello to a the but and for this world", even though seven terms separate "hello" and "world" in each.
  2. It does not match strings like "hello add more words to illustrate the problem world", which is the correct result under the ~2 constraint.
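
To make the suspected cause concrete, here is a minimal pure-Python sketch of the two counting rules. It does not use Whoosh and is not Whoosh's implementation. The stop list, the two-character minimum, and the helper names `tokenize` and `within_slop` are all assumptions made for illustration. It follows the reading of ~N used in this report (at most N terms between the two words).

```python
# Illustrative stop list: an assumption for this sketch, not Whoosh's actual list.
STOP_WORDS = frozenset({"a", "and", "for", "the", "this", "to"})


def tokenize(text, drop_fillers):
    """Split on whitespace; optionally drop stop words and single-character tokens."""
    tokens = text.lower().split()
    if drop_fillers:
        tokens = [t for t in tokens if len(t) >= 2 and t not in STOP_WORDS]
    return tokens


def within_slop(tokens, first, second, max_between):
    """True if `second` follows `first` with at most `max_between` tokens in between."""
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return any(
        0 <= s - f - 1 <= max_between
        for f in first_positions
        for s in second_positions
    )


test_cases = {
    "Case 1": "hello X Y Z A B C D world",
    "Case 2": "hello to a the but and for this world",
    "Case 3": "hello add more words to illustrate the problem world",
}

for name, content in test_cases.items():
    all_counted = within_slop(tokenize(content, False), "hello", "world", 2)
    fillers_dropped = within_slop(tokenize(content, True), "hello", "world", 2)
    print(f"{name}: all terms counted={all_counted}, fillers dropped={fillers_dropped}")
```

If every term is counted, none of the three cases should match. If fillers are dropped before positions are compared, Cases 1 and 2 match and Case 3 does not, which is the pattern reported above.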

Minimal Working Example (MWE):

from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser
from whoosh.filedb.filestore import RamStorage

schema = Schema(content=TEXT(stored=True))
storage = RamStorage()

def create_new_index():
    return storage.create_index(schema)

def add_to_index(idx, content):
    writer = idx.writer()
    writer.add_document(content=content)
    writer.commit()

def matches_whoosh(query, indexed_opinion):
    with indexed_opinion.searcher() as searcher:
        parsed_query = QueryParser("content", indexed_opinion.schema).parse(query)
        results = searcher.search(parsed_query)
        return len(results) > 0

# Add test cases and print results
test_cases = {
    "Case 1": "hello X Y Z A B C D world",
    "Case 2": "hello to a the but and for this world",
    "Case 3": "hello add more words to illustrate the problem world"
}

query = '"hello world"~2'
for case_name, content in test_cases.items():
    idx = create_new_index()
    add_to_index(idx, content)
    print(f"{case_name}: {matches_whoosh(query, idx)}")

Environment: