Closed by jonadsimon 2 years ago
Not enough to limit to Levenshtein distance > 1; need to also exclude words that share too high a proportion of their letters (or something equivalent to this)
Nearest neighbors for "oligodendrocyte" include:
- oligodendrocytes: excluded by substring-matching only if we include the seed word
- oligodendrocytic: not excluded
- Oligodendrocyte: excluded by downcasing equivalence with oligodendrocyte
- oligodendroglial: excluded as superstring of oligodendroglia
- oligodendrogenesis: not excluded
- oligodendroglia: not excluded

Fixed for simple cases via addition of Levenshtein constraint + 70% matching constraint. However, it still wouldn't work for the long suffixes on "oligodendrocytes". Marking as fixed until otherwise noted
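The thread doesn't spell out how the 70% matching constraint is computed. One possible reading, sketched here purely as an assumption, is the fraction of letters two words share (counted as multisets) relative to the longer word. The names `letter_overlap` and `shares_too_many_letters` are made up for illustration and are not from the actual code.

```python
from collections import Counter


def letter_overlap(a: str, b: str) -> float:
    """Fraction of letters shared by a and b, relative to the longer word.

    Letters are compared as multisets, so position is ignored.
    This is an assumed stand-in for the "70% matching" rule.
    """
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    # Counter intersection keeps the minimum count of each shared letter.
    shared = sum((Counter(a) & Counter(b)).values())
    return shared / max(len(a), len(b))


def shares_too_many_letters(a: str, b: str, threshold: float = 0.7) -> bool:
    """True if the two words overlap by at least `threshold` (assumed 70%)."""
    return letter_overlap(a, b) >= threshold
```

Under this reading, "oligodendrocyte" vs "oligodendrocytic" scores 14/16 = 0.875 and would be excluded, while "oligodendrocyte" vs "oligodendrogenesis" scores 12/18 and falls just under the threshold.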
Still has issues with near-identical short words: "marked" vs "marking", "time" vs "timing"
Use an out-of-the-box stemmer to identify base words and remove duplicates that way
Add word-length constraint to Levenshtein distance so it only applies to words of length ≥5. Currently it's misfiring on e.g. "time" vs "tide"
Added Levenshtein min-length and stemming
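A rough sketch of how the min-length rule and stemming could combine, not the actual implementation. `toy_stem` is a deliberately crude stand-in for an out-of-the-box stemmer (which one was used isn't stated here), and `is_near_duplicate` and `min_len` are hypothetical names.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def toy_stem(word: str) -> str:
    """Crude suffix stripper; a placeholder for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word


def is_near_duplicate(a: str, b: str, min_len: int = 5, stem=toy_stem) -> bool:
    """Same stem, or edit distance <= 1 when both words have >= min_len letters."""
    if stem(a) == stem(b):
        return True
    if min(len(a), len(b)) >= min_len:
        return levenshtein(a.lower(), b.lower()) <= 1
    return False
```

With these assumptions "time" vs "tide" is no longer flagged (both shorter than 5 letters, different stems), while "marked" vs "marking" and "time" vs "timing" are caught by the stem comparison. A real stemmer can be swapped in through the `stem` argument.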
Don't want multiple near-identical words appearing in the word search
Already take one step in this direction by disallowing words that are superstrings of other words
However this does not capture cases of subtly different spelling e.g. color/colour, ebonise/ebonize, etc
Should add a Levenshtein-distance check enforcing that the edit distance between any two words be > 1
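For illustration, a minimal sketch of what such a filter might look like, combining the existing superstring rule with an edit-distance > 1 requirement. Since only "distance ≤ 1" matters here, it uses a linear one-edit check rather than a full distance computation. `within_one_edit` and `filter_words` are hypothetical names, and the greedy keep-first ordering (dropping a word if it contains, or is contained in, an already kept word) is an assumption.

```python
def within_one_edit(a: str, b: str) -> bool:
    """True if a and b are equal or one insertion/deletion/substitution apart."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) word
    i = j = 0
    edited = False
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
            continue
        if edited:
            return False  # second mismatch: distance is at least 2
        edited = True
        if len(a) == len(b):
            i += 1  # treat as a substitution
        j += 1      # otherwise skip the extra letter in the longer word
    return True


def filter_words(candidates):
    """Greedily keep words that are not near-identical to any kept word."""
    kept = []
    for word in candidates:
        w = word.lower()
        if any(w in k or k in w or within_one_edit(w, k) for k in kept):
            continue
        kept.append(w)
    return kept
```

For example, `filter_words(["color", "colour", "ebonise", "ebonize", "colorful"])` keeps only "color" and "ebonise".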