mchaput / whoosh

Pure-Python full-text search library

Spelling corrector breaks on NUMERIC fields #55

Open CodeOptimist opened 6 months ago

When a user makes a typo in a numeric field in their query, the spelling corrector raises:

  File "C:\Users\Chris\AppData\Local\Python\venv\document-search\Lib\site-packages\whoosh\searching.py", line 931, in correct_query
    return sqc.correct_query(q, qstring)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Chris\AppData\Local\Python\venv\document-search\Lib\site-packages\whoosh\spelling.py", line 327, in correct_query
    sugs = c.suggest(token.text, prefix=prefix, maxdist=maxdist)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Chris\AppData\Local\Python\venv\document-search\Lib\site-packages\whoosh\spelling.py", line 66, in suggest
    for item in _suggestions(text, maxdist, prefix):
  File "C:\Users\Chris\AppData\Local\Python\venv\document-search\Lib\site-packages\whoosh\spelling.py", line 111, in _suggestions
    for sug in reader.terms_within(sugfield, text, maxdist, prefix=prefix):
  File "C:\Users\Chris\AppData\Local\Python\venv\document-search\Lib\site-packages\whoosh\codec\base.py", line 364, in find_matches
    match = dfa.next_valid_string(term)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Chris\AppData\Local\Python\venv\document-search\Lib\site-packages\whoosh\automata\fsa.py", line 267, in next_valid_string
    for i, label in enumerate(string):
                    ^^^^^^^^^^^^^^^^^
TypeError: 'int' object is not iterable

If we look here:

https://github.com/mchaput/whoosh/blob/d9a3fa2a4905e7326c9623c89e6395713c189161/src/whoosh/searching.py#L914

This doesn't seem right. Are we really using ReaderCorrector as the default for all fields? SegmentReader.terms_within() uses Automata.terms_within(), which does Levenshtein-distance matching: https://github.com/mchaput/whoosh/blob/d9a3fa2a4905e7326c9623c89e6395713c189161/src/whoosh/codec/base.py#L376 It chokes on the int that NUMERIC.from_bytes() (correctly) returns inside W3FieldCursor: https://github.com/mchaput/whoosh/blob/d9a3fa2a4905e7326c9623c89e6395713c189161/src/whoosh/codec/whoosh3.py#L541
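
To make the failure mode concrete, here is a minimal sketch that does not use whoosh at all. `walk_labels` is a hypothetical stand-in for the `for i, label in enumerate(string)` loop shown at the bottom of the traceback; it only assumes that the automaton expects each term to be an iterable sequence of labels.

```python
def walk_labels(term):
    """Hypothetical stand-in for the loop in the traceback.

    Visits each label of a term the way an automaton walking a string would.
    """
    return [(i, label) for i, label in enumerate(term)]


# A text term is a sequence of characters, so this works.
print(walk_labels("2024"))

# A decoded NUMERIC term is a plain int, which cannot be enumerated.
try:
    walk_labels(2024)
except TypeError as exc:
    print(f"TypeError: {exc}")
```

The second call fails with the same message as the traceback, which fits the reading that the decoded int reaches code that was written for string terms.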

I'm pretty sure Levenshtein distance isn't meant to be supported on NUMERIC fields, since they're stored like this: https://github.com/mchaput/whoosh/blob/d9a3fa2a4905e7326c9623c89e6395713c189161/src/whoosh/fields.py#L712 Though maybe I'm missing something.
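
Independent of how the values are stored, edit distance is a poor notion of "closeness" for numbers. The sketch below uses a plain textbook Levenshtein implementation written for this example (it is not whoosh's automaton code) and applies it to the decimal strings of a few numbers.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook dynamic-programming edit distance (illustration only)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


# Numerically adjacent, but every digit differs.
print(levenshtein("1999", "2000"))  # 4

# One edit apart, but the values differ by 900.
print(levenshtein("100", "1000"))   # 1
```

So even if the corrector could iterate over a numeric term, "terms within edit distance 2" would not correspond to "numbers near the one the user typed".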

Shouldn't searching.py be something more like:

        # Fill in default corrector objects for fields that don't have a custom
        # one in the "correctors" dictionary
        from whoosh.fields import TEXT  # <-----
        for fieldname, field in self.schema.items():  # <-----
            fieldname = aliases.get(fieldname, fieldname)
            if isinstance(field, TEXT) and fieldname not in correctors:  # <-----
                correctors[fieldname] = self.reader().corrector(fieldname)

Anyway that's the fix I'm using for now. I only need corrections for text fields.
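
For anyone who wants to check the selection logic without building an index, here is a self-contained sketch of the same filter. The `TEXT`, `NUMERIC` and `ID` classes below are empty stand-ins defined just for this example, not the real `whoosh.fields` classes, and `default_corrector_fields` is a hypothetical helper name.

```python
class FieldType:
    """Stand-in base class (not whoosh.fields.FieldType)."""


class TEXT(FieldType):
    pass


class NUMERIC(FieldType):
    pass


class ID(FieldType):
    pass


def default_corrector_fields(schema_items, correctors, aliases=None):
    """Return the field names that would get a default corrector.

    Mirrors the patched loop: only TEXT fields that do not already
    have a custom corrector are selected.
    """
    aliases = aliases or {}
    selected = []
    for fieldname, field in schema_items:
        fieldname = aliases.get(fieldname, fieldname)
        if isinstance(field, TEXT) and fieldname not in correctors:
            selected.append(fieldname)
    return selected


schema_items = [
    ("title", TEXT()),
    ("body", TEXT()),
    ("year", NUMERIC()),
    ("path", ID()),
]

# "title" already has a custom corrector, so only "body" is selected.
print(default_corrector_fields(schema_items, correctors={"title": object()}))
```

Note that in this sketch every non-TEXT field is skipped, including the `ID` stand-in, not just numeric ones. That matches what I need, but a more general fix might want to exclude only the field types whose terms aren't strings.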