Cluedb dfs solve - Githubissues

jpalpant commented 8 years ago

Not actually a DFS-based solver anymore, but ClueDBSolver is now able to effectively find potential solutions using exact matches in the ClueDB it uses. It gives each solution a confidence of 1.0, but the solutions are returned sorted from best to worst in terms of number of clues with answers.

Generally, ClueDBSolver finds a LOT of solutions (hundreds to thousands), and it doesn't have a good way to determine how good a solution is since as far as it was concerned, every word it played was an exact match for the clue it was given. That could use some work. But for a first step it's quick and smart and reasonably memory efficient, finding it's thousands of solutions in less than a second on most of my tests.

Take a look! (You'll probably have to change solve_xword.py like I did. I also recommend puzzle daily-2015-11-27 for a puzzle with only a few solutions, and daily-2016-01-19 for a puzzle with many)

jpalpant commented 8 years ago

Slight problem with the solver - it's written to be exhaustive, so it takes every possible set of solutions to across clues and every possible set of solutions to down clues and takes the product of those two sets. If the ClueDB doesn't have many ideas on what to do for the clues, that's fine - but sometimes, it seems, it has lots of potential words:

INFO:root:Loading dictionary INFO:root:Solving puzzle INFO:littleboxes.solver.ClueDBSolver:Conflicts map build in 0.0759859085083 seconds INFO:littleboxes.solver.ClueDBSolver:Size of across fills: 13271040 INFO:littleboxes.solver.ClueDBSolver:Size of down fills: 640

(Puzzle daily-1994-04-07)

This one would have had 8,493,465,600 sets of words to check for conflicts and then for uniqueness. That's far too many, and who knows what other puzzles might have. I might need to do something about the fact that the time and space complexity expands with the product of the number of solutions the ClueDB returns for each clue, but I don't have any ideas on how to reduce that yet, since that's just asking the ClueDB for perfect matches.

timpalpant commented 8 years ago

I'm not sure I fully understand the algorithm yet, but might it be missing solutions in which some across answers win conflicts and other down answers win conflicts? Imagine there are two disjoint parts of the puzzle (say top-left and bottom-right corners). Both corners have across/down conflicts, but you might have a solution in which the across answers are used in the top-left and the down answers used in the bottom-right.

Also, you might want solutions in which not all of the possible answers are played (in case one of them is incorrect).

This doesn't help the combinatorial explosion problem though :)

jpalpant commented 8 years ago

That is definitely the case, it would miss all the mixed-conflict resolution solutions wouldn't it. Which increases the combinatorial explosion by another factor of something factorial, I bet. I'll think about that while I'm thinking about updating for the new Crossword class and thinking about not loading all my iterators into memory (my three tasks for the time being).

timpalpant commented 8 years ago

Here's an idea that may or may not be useful. The ClueDB gives us a set of possible answers for some clues in the crossword puzzle. (The Dictionary does too). Some of these answers are mutually exclusive (they indicate different letters for a single square), and we want to enumerate all valid (non-conflicting) solutions that result from placing a subset of the possible answers.

Label the set of possible answers returned from the ClueDB as A(1,1), A(1,2), D(2,1), etc., where A(1,1) indicates a possible answer for 1-across, A(1,2) indicates a second, different answer for 1-across, D(2,1) indicates a possible answer for 2-down, etc. Consider each possible answer a vertex in a graph, with edges between all pairs of vertices that are compatible (non-conflicting):

graph

All answers for the same clue conflict (e.g. A(1,1) and A(1,2)).
All answers for different across clues are connected (e.g. A(1,1) and A(2,1)). Similarly for down clues.
Some across answers will conflict with some down answers (e.g. A(1,1) and D(2,1)).

Valid, non-conflicting solutions are cliques of this graph: all of the selected answers must be compatible/connected to all other answers. We want to enumerate all of the cliques of this graph. There are at most 3^(n/3) maximal cliques (n: number of vertices), which can be listed with the Bron-Kerbosch algorithm (Hopefully with the above restrictions the actual number is smaller). Then all subgraphs of the maximal cliques are also (smaller) cliques. (In the solver, we probably want to yield maximal cliques first and then smaller solutions).

jpalpant commented 8 years ago

That makes sense and looks like a good thing to try! Especially because it does extend so well to the DictionarySolver (which was something I was thinking about while doing ClueDB). One good thing to think about is - do we really need to return non-maximal cliques? It'll increase the size of the set of returned solutions by a huge factor. If we're going to remove words from maximal solutions where all words are perfect matches in our database, we should do it selectively, like if removing a word is expected to open up a large branch of possible solutions or if a word is determined to be unlikely (both of which would be fun heuristics to write).

jpalpant commented 8 years ago

It bothers me that the branch name doesn't reflect what is actually going on, so I deleted this branch and pull request to make a new branch name that is more general. It closes the pull request, apparently, but the ClueDB branch wasn't near ready so I'll make a new one once cluedb_solver (the new branch) uses the new crossword class and I'll give a few tries at the graph method.

timpalpant / LittleBoxes

Cluedb dfs solve #5