Open simonw opened 2 years ago
Seems like spellfix1 might be a more powerful transform to try if an initial soundex version shows promise: https://www.sqlite.org/spellfix1.html
Could try this: https://github.com/karlb/sqlite-spellfix
Interesting challenge with soundex: is there a smart way to display highlighted snippets?
Doing so would require somehow mapping back from the soundex indexed text to the plain text, which seems difficult.
Easiest alternative: don't offer search match snippets in "fuzzy" (aka soundex) mode.
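If snippets did turn out to be worth the effort, one possible route is to skip the indexed text entirely at display time: re-tokenize the original plain text, run each word through the same transform, and wrap the words whose transformed form matches a transformed query term. A rough sketch of that idea follows. Everything in it is an assumption for illustration: `highlight` and `crude_key` are made-up names, and `crude_key` is only a stand-in for a real soundex function so the example runs on its own.

```python
import re


def crude_key(word):
    # Stand-in for soundex(), NOT the real algorithm: lowercase the word,
    # keep the first letter, drop later vowels and immediate repeats.
    word = word.lower()
    key = word[:1]
    for ch in word[1:]:
        if ch in "aeiou" or ch == key[-1]:
            continue
        key += ch
    return key


def highlight(text, query, transform=crude_key, before="<b>", after="</b>"):
    # Transformed forms of the search terms we want to match.
    wanted = {transform(w) for w in re.findall(r"\w+", query)}
    out = []
    last = 0
    # finditer keeps the character offsets of each word in the ORIGINAL text,
    # which is the mapping back from transformed text to plain text.
    for m in re.finditer(r"\w+", text):
        if transform(m.group()) in wanted:
            out.append(text[last:m.start()])
            out.append(before + m.group() + after)
            last = m.end()
    out.append(text[last:])
    return "".join(out)


print(highlight("Please recieve the parcel", "receive"))
```

This only highlights whole words and re-scans the page text for every result, so it may be too slow for long OCR pages. Dropping snippets in fuzzy mode is still the simpler option.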
Here's how I ended up creating a table of words for spellfix1:
insert into spell(word) select distinct lower(json_extract(value, '$.word')) from pages, json_each(regexp_matches(
'(?P<word>\w+)(?P<b>\b)',
text
))
Using datasette-rure.
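For anyone without datasette-rure installed, roughly the same word extraction can be done from Python. This is a sketch under a few assumptions: `populate_spell` is a hypothetical helper name, `spell` is created here as a plain table rather than a spellfix1 virtual table (which needs the extension loaded), and Python's `\w+` is assumed to be close enough to the regex used in the SQL above.

```python
import re
import sqlite3


def populate_spell(conn):
    # Assumption: a plain table stands in for the spellfix1 virtual table,
    # so this runs without loading any SQLite extension.
    conn.execute("create table if not exists spell (word text)")
    words = set()
    # `pages` and its `text` column are the names used in the SQL above.
    for (text,) in conn.execute("select text from pages"):
        # Lowercase every word and let the set take care of "distinct".
        words.update(w.lower() for w in re.findall(r"\w+", text))
    conn.executemany(
        "insert into spell(word) values (?)",
        [(w,) for w in sorted(words)],
    )
    return len(words)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("create table pages (text text)")
    conn.execute("insert into pages (text) values ('The cat. the Dog')")
    print(populate_spell(conn), "distinct words inserted")
```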
Suggestion from @mattb: our OCRd text has lots of flaky, badly spelled words in it. Try indexing a version of it that consists of the words run through soundex(), then try a search implementation that runs soundex() against the search terms.
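A minimal sketch of what that could look like, with several assumptions. SQLite's built-in soundex() only exists when the library is compiled with SQLITE_SOUNDEX, so this uses a pure-Python American Soundex instead, and its codes may differ from SQLite's in edge cases. The `soundex_index` table and the `index_pages` and `fuzzy_search` functions are hypothetical names, and a plain lookup table is used here where a real implementation would more likely use full-text search.

```python
import re
import sqlite3

# Letter groups for American Soundex.
CODES = {}
for digit, letters in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                       ("4", "l"), ("5", "mn"), ("6", "r")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    out = [letters[0].upper()]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # vowels reset prev, so a repeated code counts again
    return ("".join(out) + "000")[:4]


def index_pages(conn):
    # One row per (page, distinct soundex code) pair.
    conn.execute(
        "create table if not exists soundex_index (page_id integer, code text)"
    )
    rows = conn.execute("select rowid, text from pages").fetchall()
    for page_id, text in rows:
        codes = {soundex(w) for w in re.findall(r"\w+", text)}
        codes.discard("")  # tokens with no ASCII letters, e.g. numbers
        conn.executemany(
            "insert into soundex_index values (?, ?)",
            [(page_id, c) for c in sorted(codes)],
        )


def fuzzy_search(conn, query):
    # Run the same transform over the search terms, then require every
    # distinct code to be present on the page.
    codes = sorted({soundex(w) for w in re.findall(r"\w+", query)} - {""})
    if not codes:
        return []
    placeholders = ",".join("?" * len(codes))
    sql = (
        "select page_id from soundex_index "
        f"where code in ({placeholders}) "
        "group by page_id having count(distinct code) = ? "
        "order by page_id"
    )
    return [row[0] for row in conn.execute(sql, codes + [len(codes)])]
```

With this in place, a page containing the OCR error "goverment" should come back for a search for "government", since both words reduce to the same code. The trade-off is that soundex is coarse, so unrelated words that share a code will match too.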