simonw / sfms-history

The sfms-history project
https://sfms-history.vercel.app

Try Matt's soundex() trick #12

Open simonw opened 2 years ago

simonw commented 2 years ago

Suggestion from @mattb: our OCR'd text contains lots of flaky, badly spelled words. Try indexing a version of the text in which every word has been run through soundex(), then try a search implementation that runs soundex() against the search terms as well.
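A minimal sketch of that idea in Python, to make the two halves (index the soundex codes, then soundex the query) concrete. Everything here is an assumption for illustration: the `page_codes` table, the `build_index` and `fuzzy_search` helpers and the sample page text are made up, and the `soundex` function is a hand-written version of standard American Soundex, which may not agree exactly with SQLite's built-in soundex() (itself only available in builds compiled with SQLITE_SOUNDEX).

```python
import re
import sqlite3

# Standard American Soundex digit groups.
_CODES = {
    ch: str(digit)
    for digit, letters in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1
    )
    for ch in letters
}
_WORD = re.compile(r"[A-Za-z]+")


def soundex(word):
    """Return the 4-character Soundex code for word ('' if it has no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # vowels reset prev, so a repeated code is kept after them
    return ("".join(out) + "000")[:4]


def build_index(conn, pages):
    """Store one row per distinct Soundex code per page (hypothetical schema)."""
    conn.execute("create table page_codes (page_id integer, code text)")
    for page_id, text in pages.items():
        codes = {soundex(w) for w in _WORD.findall(text)}
        conn.executemany(
            "insert into page_codes values (?, ?)",
            [(page_id, code) for code in codes],
        )


def fuzzy_search(conn, query):
    """Return ids of pages containing every query term, compared by Soundex code."""
    codes = sorted({soundex(w) for w in _WORD.findall(query)})
    if not codes:
        return []
    placeholders = ",".join("?" * len(codes))
    rows = conn.execute(
        f"select page_id from page_codes where code in ({placeholders}) "
        "group by page_id having count(distinct code) = ? order by page_id",
        codes + [len(codes)],
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
build_index(conn, {
    1: "The San Fransisco Microscopical Socety met",  # made-up OCR-style typos
    2: "Annual report of the treasurer",
})
print(fuzzy_search(conn, "Francisco Society"))  # [1]
```

Soundex keeps the first letter and is forgiving mostly about vowels and similar-sounding consonants, so it would miss OCR errors that corrupt the first character or swap in a consonant from a different group.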

mattb commented 2 years ago

If an initial soundex version shows promise, this seems like it might be a more powerful transform to try: https://www.sqlite.org/spellfix1.html

simonw commented 2 years ago

Could try this: https://github.com/karlb/sqlite-spellfix

simonw commented 2 years ago

Interesting challenge with soundex: is there a smart way to display highlighted snippets?

Doing so would require somehow mapping back from the soundex indexed text to the plain text, which seems difficult.

Easiest alternative: don't offer search match snippets in "fuzzy" (aka soundex) mode.
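For comparison, here is a rough sketch of one way the mapping back could work: rather than searching a separate soundex-only copy of the text, walk the words of the original text, compare each word's soundex code with the codes of the query terms, and wrap the matching words. This is only an assumption about a possible approach, not something tried in this project; the `highlight` helper and its bracket markers are hypothetical, and `soundex` is a hand-written standard American Soundex rather than SQLite's.

```python
import re

_CODES = {
    ch: str(digit)
    for digit, letters in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1
    )
    for ch in letters
}
_WORD = re.compile(r"[A-Za-z]+")


def soundex(word):
    """Return the 4-character Soundex code for word ('' if it has no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code
    return ("".join(out) + "000")[:4]


def highlight(text, query, marks=("[", "]")):
    """Wrap every word of text whose Soundex code matches a query term."""
    wanted = {soundex(w) for w in _WORD.findall(query)}
    pieces, pos = [], 0
    for match in _WORD.finditer(text):
        if soundex(match.group()) in wanted:
            # match.start()/end() are offsets into the original plain text
            pieces.append(text[pos:match.start()])
            pieces.append(marks[0] + match.group() + marks[1])
            pos = match.end()
    pieces.append(text[pos:])
    return "".join(pieces)


print(highlight("The San Fransisco Socety met", "Francisco Society"))
# The San [Fransisco] [Socety] met
```

This recomputes soundex for every word of each page being displayed, which is probably acceptable for a page of results but does not reuse the search index's own snippet support.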

simonw commented 2 years ago

Here's how I ended up creating a table of words for spellfix1:

insert into spell(word)
select distinct lower(json_extract(value, '$.word'))
from pages, json_each(
    regexp_matches('(?P<word>\w+)(?P<b>\b)', text)
)

This uses the regexp_matches() function from datasette-rure.
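For anyone without datasette-rure installed, the same word extraction can be sketched in plain Python. This is an illustration under assumptions, not the query that was actually run: `load_words` is a hypothetical helper, the sample rows are made up, and `spell` here is an ordinary table standing in for the spellfix1 virtual table so the example runs without loading the extension.

```python
import re
import sqlite3


def load_words(conn, table="spell"):
    """Insert each distinct lowercased word from pages.text into table."""
    words = set()
    for (text,) in conn.execute("select text from pages"):
        # \w+ also matches digits and underscores, like the SQL version
        words.update(w.lower() for w in re.findall(r"\w+", text or ""))
    conn.executemany(
        f"insert into {table}(word) values (?)",
        [(w,) for w in sorted(words)],
    )
    return len(words)


conn = sqlite3.connect(":memory:")
conn.execute("create table pages (text text)")
# Plain table standing in for: create virtual table spell using spellfix1
conn.execute("create table spell (word text)")
conn.executemany(
    "insert into pages values (?)",
    [("Minutes of the Socety, 1893",), ("minutes approved",)],
)
print(load_words(conn))  # 6
```

One small difference from the SQL: Python's lower() also lowercases non-ASCII letters, whereas SQLite's built-in lower() only handles ASCII by default.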