mchaput / whoosh

Pure-Python full-text search library

Weird behavior of Sequence query with wildcards #49

Open dbealthy opened 10 months ago

Problem

I am trying to implement behaviour similar to how the Sphinx search engine handles phrases with wildcards, and I am using the whoosh library for this. However, when I use sequence queries that combine short words (2 characters long) with wildcards, I get an error:

    289     return [Span(pos) for pos in self.value_as("positions")]
    290 else:
--> 291     raise Exception("Field does not support spans")

Exception: Field does not support spans

I noticed that this happens when I add a lot of documents to the index; it does not happen with a small number of documents.

I want to be able to search with queries like:

Here "по мест проживания" causes the error. When I reduce it to "по" it runs fine, but if I change it slightly to "по дороге" I still get the same error.
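
The matching behaviour described above can be illustrated in plain Python, independent of whoosh. This is only a minimal sketch of the intended semantics under stated assumptions: tokens come from a simple whitespace split, wildcards are shell-style `*` patterns handled by the standard `fnmatch` module, and the function name `phrase_matches` and the sample text are hypothetical. It is not how whoosh or Sphinx implement phrase queries.

```python
from fnmatch import fnmatchcase


def phrase_matches(tokens, patterns):
    """Return True if some run of consecutive tokens matches the patterns in order.

    Each pattern is either a literal word or a shell-style wildcard such as "по*".
    """
    n, m = len(tokens), len(patterns)
    # Slide a window of len(patterns) tokens over the document.
    for start in range(n - m + 1):
        window = tokens[start:start + m]
        if all(fnmatchcase(tok, pat) for tok, pat in zip(window, patterns)):
            return True
    return False


# Hypothetical document text (assumption, not taken from the real data).
doc = "найденный по месту проживания".split()

print(phrase_matches(doc, ["по*", "мест*", "проживания"]))  # True
print(phrase_matches(doc, ["рядом", "с", "домом"]))         # False
```

The point of the sketch is that each wildcard term must match a token at its own position in the phrase, which is why the query needs position (span) information from the index.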

Code and expected results

from pprint import pprint

from whoosh.fields import Schema, TEXT, NUMERIC
from whoosh.qparser import QueryParser, PhrasePlugin, SequencePlugin, OperatorsPlugin
from whoosh import analysis
from whoosh.filedb.filestore import RamStorage

analyzer = analysis.StandardAnalyzer(minsize=None, stoplist=None)
schema = Schema(
    item_id=NUMERIC(stored=True, bits=64),
    type=NUMERIC(stored=True),
    content=TEXT(analyzer=analyzer, stored=True, phrase=True),
)
storage = RamStorage()

ix = storage.create_index(schema)
writer = ix.writer()

# get_db() is an application-specific helper (not shown) that provides the source data
with get_db() as db:
    for item in db["items"][0:500]:
        writer.add_document(
            item_id=item["id"], type=item["type"], content=item["content"]
        )
writer.commit(optimize=True)

parser = QueryParser("content", schema=schema)
op = OperatorsPlugin(
    And="AND", Or="OR", AndNot="ANT", Not=None, AndMaybe=None, Require=None
)
parser.remove_plugin_class(PhrasePlugin)
parser.add_plugin(SequencePlugin())
parser.replace_plugin(op)

with ix.searcher() as searcher:
    query = '"найденный" AND ("по* мест* проживания" OR "рядом с домом")'
    query = parser.parse(query, debug=True)
    hits = searcher.search(query, terms=True, limit=None)
    pprint(list(hits))

I expect to get a list of hits, but instead I get the exception: Field does not support spans.

My content is text of variable length in different languages. Queries might also be in different languages.