timpalpant / LittleBoxes

A crossword solver
GNU General Public License v3.0
1 stars 0 forks source link

Fuzzy and quantitative ClueDB lookup #4

Closed timpalpant closed 8 years ago

timpalpant commented 8 years ago

The ClueDB currently provides an API for getting potential answers of a given length for an exact clue match. The length is a hard constraint, but matching clues should not be.

A more general version would be useful that allows fuzzy / partial matches of some kind, and returns an estimated confidence/likelihood for each potential answer (based on the number of times that answer has occurred, the number of possible choices, the closeness of the clue match). Ideally this more general ClueDB would have a tuning parameter that controls the strictness of "matches", with one extreme being "exact matches only" (the current behavior) and the other extreme being "all words of the correct length".

Brute-force lookup for clues within a certain Levenshtein distance might work. This paper may or may not have some other useful ideas:

jpalpant commented 8 years ago

I might take this on next once I finish serialization - since I just wrote ClueDBSolver (and it works well for the exact-match case) a good next step would be to add near-matches.

  1. Find nearly-matching answers and evaluate how closely they match
  2. Use nearly-matching answers to solve crosswords, ignoring how well they matched and treating them all as perfect matches (this should happen almost automatically)
  3. Take into account the goodness of each possible answer when constructing crossword solutions
timpalpant commented 8 years ago

Here is one library for fuzzy matching: https://pythonhosted.org/ngram/ngram.html

but it uses an extra ~3 GB of memory for our ClueDB.

Others:

timpalpant commented 8 years ago

Closing for now. The fuzzy N-gram lookup is slow and needs improvement, but it's done once at the beginning so it's workable.

Another important follow-up will be to investigate whether the fuzzy lookup is better or worse than exact match at producing good solutions.