Decompose crossword into mostly independent subgraphs

Break down the entire crossword into smaller blocks where fills can be enumerated more exhaustively, then combine the best solutions from these blocks.

For example, in this past Sunday's crossword, you can isolate a block in the top-left corner by making only two cuts:

xword2

The two long words that are cut are most-likely phrases, and have no possible words in the dictionary anyway.

This could be a Solver that first decomposes the puzzle into blocks by finding minimum node cuts, passes each block to a sub-solver, then merges the results from the sub-solvers into a complete solution.

I have several thoughts on this, mostly about how to structure the problem so that it has a solution and what is trying to be found. My thinking is that we are trying to find a region of the puzzle that is 1) small enough to be easily solvable, 2) large enough to be useful, and 3) does not depend on the solution to the rest of the puzzle.

I've thought of two structures that could describe this problem. The first is a graph where nodes are clues and edges exist where there are potential conflicts or dependencies between clues. In the above example, the nodes 27A and 42A would represent a boundary between the subgraph in the top left and the rest of the puzzle. This structure is easy to implement - we have clues that can be used as nodes already, and the graph is just a conflict map of clues, instead of clue fills. It would also be static for the crossword puzzle.

We would then pass the subgraph consisting of {1D, 2D, 3D, 4D, 19D, 1A, 18A, 23A, 31A, 36A} to a subsolver while putting the nodes cut {27A, 42A} into a "do not solve" list. We don't want to solve for these clues because removing those clues is how we remove the dependency between the subgraph and the rest of the puzzle. The trouble comes in that I don't know how the minimum_node_cut algorithm of networkx works - will it attempt to find large subgraphs? The way it's written I think it will do goals 1 and 3, but I'm not sure about 2.

The other structure is a graph with grid squares as nodes, where edges are adjacencies between squares. Edges could be labelled by the clues that control them, so the first cut drawn above would be an edge cut controlled by the clue 27A, between squares (4,3) and (5,3). The problem is then to search for edge cuts that can be made to isolate communities we want to solve for, then to pass that community of boxes to a subsolver. The subsolver would either solve the clues needed to fill those boxes (excluding any clues whose edges needed to be cut in order to sever the community), or it could simply attempt to fill the boxes without considering the clues.

The second structure lends itself to the graph cut problems I've been looking at, which usually identify edges to cut, not nodes. We might want more than a minimum cut - we don't want to isolate box (0,0) by cutting edges 1D and 1A, that community is too small, but it is a valid minimum edge cut. The sparsest cut problem might be relevant, but it sounds difficult and I don't see many algorithms for it. Community detection is something I've read a lot about, and there are a few good heuristics for it listed on stackoverflow but that wonderful masterpost is about igraph, not NetworkX. NetworkX doesn't seem to have much in the way of community detection though.

Thoughts? I like the first structure more, but am not sure about the minimum node cut algorithm. We could maybe write the walktrap detection community detection method though, it sounds simple. fastgreedy requires computing other things like modularity, and I don't think NetworkX has that.

I took a foray into igraph-python, which is a Python wrapper around a C graph library. It's fast (that link basically just says "C is ~100x faster than Python") and seems very complete, even moreso than NetworkX. The documentation isn't as easy to read, and the pip install takes a bit longer on account of compiling. But it can, for instance, do this:

I took the first approach I described above, with a graph of clues with edges between possible conflicts. I only did the top-left corner of the graph, from 1A to down to about 58A and over to 8D. I then piped that graph into a few stock community detection algorithms in igraph: infomap and walktrap.

This is what I got: Walktrap:

Infomap: infomap

Walktrap did slightly better in this case, with the cluster you identified highlighted in blue, but both are comparable, and there's a slew more to try out if they don't perform well on the full puzzle, but I suspect they would.

Thoughts on this solution and/or changing the NetworkX dependency out for igraph? I could rewrite the clique algorithm and the ClueDBCliqueSolver trivially.

Looks pretty great! I think it's worth seeing how it breaks down the full puzzle. Feel free to add igraph and/or swap it out for NetworkX if you think it's better, too.

Your analysis of the 3 criteria we need is very well put. I'm not sure which of the two approaches will work better (node-cuts or edge-cuts), but it's probably not too hard to try both. I played around a little with the minimum_node_cut algorithm, and you're right that blindly applying it does not lead to good solutions because it does not care how large the subgraphs are. The related function k_components is more like what we want, since it tries to find maximal subgraphs requiring at least k cuts to isolate. I did not expect when starting LittleBoxes to spend so much time in the graph theory sections of Wikipedia :)

Another important problem will be to figure out how to merge the solutions of the subgraphs. If they're perfectly compatible it will be easy, but otherwise it's not clear what the best way to resolve conflicts will be.

I've been playing around with igraph more and find it 1) frustrating and not very Pythonic and 2) very powerful. It's fast but since the interface is so clunky I don't think I want to do it. That said, I might look at the source and see how some of the algorithms it uses work because I tried out what I was doing before with a whole puzzle.

I decided to stick with walktrap because it was so intuitive and had good performance before. It's still instantaneous with the larger graph, but without parameters it tends to make communities too large: daily-2016-03-20-walktrap4

The one parameter it takes is a random-walk distance, the number of edges it should traverse in the random walks it uses to find communities. The default is 4, and I think that's too many, so I tried with three: daily-2016-03-20-walktrap3 Which surprisingly is the same as with 4. Probably means something, but regardless I didn't really like that community structure.

I also tried two: daily-2016-03-20-walktrap2

The size of the communities shrunk as the walk distance got smaller, and I think smaller communities are better here.

As far as conflict resolution goes: I was thinking of dropping clues and not solving for ones that might cause conflicts between the subgraphs. It means no conflict resolution issues, but also that you aren't solving for some clues. But I had good ideas on how to select which clues to drop that I want to implement.

In the meantime though, I think I might try to write a good walktrap using NetworkX. It'll be slower, but these graphs are small (~70 nodes and ~200 edges) and these algorithms are mostly free when it's that small. NetworkX just resembles Python much more.

timpalpant / LittleBoxes

Decompose crossword into mostly independent subgraphs #30