Disallow near-identical words

jonadsimon / wonder-words-generator

Generates WonderWords puzzles

Apache License 2.0

Disallow near-identical words #4

Closed jonadsimon closed 2 years ago

jonadsimon commented 2 years ago

Don't want multiple near-identical words appearing in the word search

Already take one step in this direction by disallowing words that are superstrings of other words

However this does not capture cases of subtly different spelling e.g. color/colour, ebonise/ebonize, etc

Should add a Levenshtein-distance check requiring the edit distance between any two words to be > 1
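
A minimal sketch of what such a check could look like, assuming a plain-Python edit distance and a greedy pass over the candidate list. The names `levenshtein` and `filter_near_identical` are hypothetical and not taken from the repository:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def filter_near_identical(words, min_distance=2):
    """Keep a word only if it is at least min_distance edits from every kept word."""
    kept = []
    for word in words:
        if all(levenshtein(word, other) >= min_distance for other in kept):
            kept.append(word)
    return kept


print(filter_near_identical(["color", "colour", "ebonise", "ebonize", "puzzle"]))
```

The greedy pass keeps whichever spelling appears first, so the result depends on input order; that is an assumption of this sketch, not a stated requirement.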

jonadsimon commented 2 years ago

Not enough to limit to Levenshtein-distance > 1, need to also exclude words that share too high a proportion of their letters (or something equivalent to this)

Nearest neighbors for oligodendrocyte include:

oligodendrocytes : excluded by substring-matching only if we include seed word
oligodendrocytic : not excluded
Oligodendrocyte : excluded by downcasing equivalence with oligodendrocyte
oligodendroglial : excluded as superstring with oligodendroglia
oligodendrogenesis : not excluded
oligodendroglia : not excluded
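
One possible reading of the "too-high proportion of shared letters" idea, sketched with the standard library's `difflib.SequenceMatcher`. The thread does not say how the proportion is computed, so the similarity measure, the `too_similar` helper, and the 0.7 default threshold are all assumptions:

```python
from difflib import SequenceMatcher


def shared_proportion(a: str, b: str) -> float:
    """Similarity in [0, 1]: 2 * matched characters / total characters."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def too_similar(a: str, b: str, threshold: float = 0.7) -> bool:
    """Flag a pair whose shared proportion exceeds the (assumed) threshold."""
    return shared_proportion(a, b) > threshold


# Edit distance between these two is 2, so a distance > 1 rule alone lets them through.
print(too_similar("oligodendrocyte", "oligodendrocytic"))
print(too_similar("oligodendrocyte", "puzzle"))
```

Lowercasing before comparison also folds in the case-equivalence rule noted for "Oligodendrocyte".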

jonadsimon commented 2 years ago

Fixed for simple cases via addition of Levenshtein constraint + 70% matching constraint. However, this still wouldn't work for the long suffixes on "oligodendrocytes". Marking as fixed until otherwise noted

jonadsimon commented 2 years ago

Still has issues with near-identical short words: "marked" vs "marking", "time" vs "timing"

Use an out-of-the-box stemmer to identify base words and remove duplicates that way
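
A rough sketch of stem-based deduplication. The thread does not name the stemmer, so `naive_stem` below is a deliberately crude placeholder (hypothetical, not from the repository) that only strips a few suffixes; an off-the-shelf stemmer such as NLTK's `PorterStemmer` would take its place in practice:

```python
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Toy stand-in for a real stemmer: strip one common suffix and a trailing 'e'."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word


def dedupe_by_stem(words):
    """Keep the first word seen for each stem and drop later words with the same stem."""
    seen = set()
    kept = []
    for word in words:
        stem = naive_stem(word)
        if stem not in seen:
            seen.add(stem)
            kept.append(word)
    return kept


print(dedupe_by_stem(["marked", "marking", "time", "timing", "tide"]))
```

Because the comparison is on stems rather than on edit distance, "time" and "tide" stay distinct while "time" and "timing" collapse.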

jonadsimon commented 2 years ago

Add word-length constraint to Levenshtein distance so it only applies to words of length ≥5. Currently it's misfiring on e.g. "time" vs "tide"
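
A sketch of how the length gate might wrap the distance rule. Since the rule only needs to know whether two words are within one edit, this uses a direct one-edit test rather than a full distance computation. The function names are hypothetical, and applying the gate only when both words have length ≥5 is an assumption, since the thread does not say how mixed-length pairs are handled:

```python
def within_one_edit(a: str, b: str) -> bool:
    """True if a and b are equal or differ by one insertion, deletion, or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # a is now the shorter (or equal-length) word
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substitution at position i
    return a[i:] == b[i + 1:]          # one extra letter in b at position i


def levenshtein_rule_rejects(a: str, b: str, min_length: int = 5) -> bool:
    """Apply the distance > 1 rule only when both words have at least min_length letters."""
    if len(a) < min_length or len(b) < min_length:
        return False  # short words are exempt, so "time" vs "tide" passes
    return within_one_edit(a, b)


print(levenshtein_rule_rejects("time", "tide"))
print(levenshtein_rule_rejects("color", "colour"))
```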

jonadsimon commented 2 years ago

Added Levenshtein min-length and stemming