matsen / pplacer

Phylogenetic placement and downstream analysis
http://matsen.fredhutch.org/pplacer/
GNU General Public License v3.0

`compress` command to combine pqueries that are within a certain distance #157

Closed matsen closed 13 years ago

matsen commented 13 years ago

This one comes after #156.

A cutoff c is specified via a command line flag. We will be merging pairs of pqueries that have KR distance between them less than c.

We will need to put the pqueries in an equivalently-ordered array so that we can go from indices to actual pqueries. Because we will be wanting to go back and forth between the actual pqueries and entries in the matrix, I strongly suggest just using the indices instead of the actual pqueries below. I'll still call them pqueries or nodes for convenience.
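A minimal sketch of the index-based setup, assuming the pairwise KR distances are already available as a symmetric matrix indexed in the same order as the pquery array. The name `threshold_graph` and the toy numbers are made up for illustration; they are not part of pplacer.

```python
from typing import List, Set


def threshold_graph(dist: List[List[float]], cutoff: float) -> List[Set[int]]:
    """Return adjacency sets over indices: i and j are joined when dist[i][j] < cutoff."""
    n = len(dist)
    adj: List[Set[int]] = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Strict inequality, matching "distance less than c".
            if dist[i][j] < cutoff:
                adj[i].add(j)
                adj[j].add(i)
    return adj


# Toy distances between four pqueries (made-up numbers, not real KR output).
toy_dist = [
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.8, 0.9],
    [0.2, 0.8, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.0],
]
toy_adj = threshold_graph(toy_dist, cutoff=0.5)
```

Everything downstream can then work on integer indices and only look up the actual pqueries at the very end.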

Finding a vertex cover

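Treat the pqueries as nodes of a graph, with an edge between two pqueries whenever their KR distance is less than c. The goal is to find a set S of nodes such that each edge has at least one of its vertices in S. We will say that an edge e is covered by a node w if one of the vertices of e is w.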

Maintain a list of thus-far selected nodes, as well as a count for every node. This count is the number of additional edges that will be covered if that node is added. For example, say a node w touches three edges, and that the other endpoints of these edges are x, y, and z. Say x is already part of our list of thus-far selected nodes. Then the count for w would be two, because adding w would cover the edges for y and z.
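The count for a single node can be checked with a few lines. This is only an illustration of the w, x, y, z example above; `additional_cover` is a hypothetical helper name, and string labels are used here instead of indices just for readability.

```python
def additional_cover(adj, selected, w):
    """Number of edges at w whose other endpoint is not yet selected."""
    return sum(1 for v in adj[w] if v not in selected)


# Node names follow the example in the text: w touches x, y and z.
example_adj = {"w": {"x", "y", "z"}, "x": {"w"}, "y": {"w"}, "z": {"w"}}

# x is already selected, so adding w only newly covers the edges to y and z.
count_w = additional_cover(example_adj, selected={"x"}, w="w")
```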

The algorithm is as follows. The selected node set starts out empty, and the count array C is initialized for node w with the number of edges touching w. On every iteration:

Repeat until there are no (strictly) positive values in the count array C.

This isn't a solution to the minimum vertex cover, of course, but what we really want is to pull together the pqueries that form real clusters. The ones with high degree are thus natural to pick first, which is exactly what we are doing here.
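Here is a rough sketch of how the greedy selection might look. The per-iteration steps are not spelled out above, so this assumes the usual greedy rule: pick the node with the largest count, add it to the selected set, zero its count, and decrement the count of each not-yet-selected neighbor. The function name and the tie-breaking (lowest index wins) are assumptions as well.

```python
from typing import List, Set


def greedy_cover(adj: List[Set[int]]) -> List[int]:
    """Greedily select nodes until no node would cover any additional edge."""
    n = len(adj)
    # C[w] starts as the number of edges touching w.
    count = [len(adj[w]) for w in range(n)]
    selected: List[int] = []
    chosen: Set[int] = set()
    while True:
        # Node with the largest count; ties go to the lowest index (assumption).
        w = max(range(n), key=lambda i: count[i], default=None)
        if w is None or count[w] <= 0:
            break  # no strictly positive values left in the count array
        selected.append(w)
        chosen.add(w)
        for v in adj[w]:
            if v not in chosen:
                # The edge (v, w) is now covered, so v would cover one edge fewer.
                count[v] -= 1
        count[w] = 0
    return selected


# A star centered on node 0, and a path 0-1-2-3.
star = [{1, 2, 3}, {0}, {0}, {0}]
path = [{1}, {0, 2}, {1, 3}, {2}]
star_cover = greedy_cover(star)
path_cover = greedy_cover(path)
```

On the star only the center is picked, which matches the intent of choosing high-degree pqueries first.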

Merging the pqueries

Each original pquery (=node) will then get merged into one of the selected pqueries. This will happen as follows. Maintain a set of unmerged pqueries, and a set of pairs (w, d(w)), where w is a selected pquery and d(w) is the degree of w in the graph.

Stop when the unmerged pquery set is empty. If the pair-set is exhausted but the unmerged pquery set is not, that's reason for an error.
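One possible reading of the merge step, sketched below. The order of processing is an assumption: selected pqueries are taken in order of decreasing degree, each one keeps itself, and each absorbs its still-unmerged, non-selected neighbors. `merge_into_selected` is a hypothetical name, and the result is just a map from a selected index to the set of indices merged into it.

```python
from typing import Dict, List, Set


def merge_into_selected(adj: List[Set[int]], selected: List[int]) -> Dict[int, Set[int]]:
    """Assign every node to a selected node, highest degree first (assumed order)."""
    unmerged: Set[int] = set(range(len(adj)))
    selected_set = set(selected)
    # Pairs (w, d(w)), sorted so that the highest-degree selected node goes first.
    pairs = sorted(((w, len(adj[w])) for w in selected), key=lambda p: -p[1])
    clusters: Dict[int, Set[int]] = {}
    while unmerged and pairs:
        w, _degree = pairs.pop(0)
        # w keeps itself and takes any neighbor that is neither merged nor selected.
        members = {w} | {v for v in adj[w] if v in unmerged and v not in selected_set}
        clusters[w] = members
        unmerged -= members
    if unmerged:
        # Pair-set exhausted while pqueries remain: treated as an error, as stated above.
        raise ValueError(f"pqueries left unmerged: {sorted(unmerged)}")
    return clusters


# Path 0-1-2-3 with selected nodes 1 and 2 (both of degree 2).
path_clusters = merge_into_selected([{1}, {0, 2}, {1, 3}, {2}], [1, 2])
```

Under this reading a pquery with no neighbors within the cutoff is never selected and never absorbed, so it would hit the error; a real implementation would presumably need to handle such pqueries separately.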

Note that this merging could be done at the same time as the vertex cover step. That's probably a better design, so let's refine this.

Completion

Will include some simple unit tests, showing that it correctly merges pqueries that are close, and not things that aren't.

matsen commented 13 years ago

Please use `--cutoff` rather than `-c`, as `-c` is always reserved for the reference package.

habnabit commented 13 years ago

Reopening to fix issues described offline.