matsen / pplacer

Phylogenetic placement and downstream analysis
http://matsen.fredhutch.org/pplacer/
GNU General Public License v3.0
75 stars 18 forks source link

--rep-edges for splitify subcommand #235

Closed matsen closed 12 years ago

matsen commented 12 years ago

It turns out that many edges have identical splitified columns, and this is one way of picking representatives from closely related sets.

This is a filter for after doing splitification that will only deliver a subset of the columns. In order to make this work, we need to figure out a way to adapt the splitify machinery to work with an array that only specifies results for a subset of the edges. It seems to me that adding in a map from the new array indices to the original edge numbers would work fine.

The --rep_edges takes a positive float argument, which I will call max_edge_d.

Define the distance between edges to be the Euclidean distance between their columns in the splitification matrix.

We are going to partition the edges of the tree into connected regions, such that all pairs of edges within a region have distance max_edge_d between them. We are then going to take the most distal representatives of these connected regions as follows.

The routine will be a depth-first recursion, that keeps an edge list rep_edges and an edge set curr_edges.

Base case at a leaf is that rep_edges is empty and curr_edges contains the pendant edge leading to that leaf.

At an internal node, say e is the edge directly above. Let the new rep_edges be the concatenation of the rep_edge list for the subtrees. Then let possible_curr_edges be the union of the curr_edges for the subtrees. Calculate the distance between each edge of that set and e. The new curr_edges is the set of edges that are less than max_edge_d away from e, and then add any edge that is greater than max_edge_d away from e to the rep_edge list. If curr_edges is empty, make it the single-edge set consisting of e. Recur.

@metasoarous could you please put together a little unit test jplace file for this?

metasoarous commented 12 years ago

Okay, I've put together test data files and computed the expected results. The file below summarizes the results and points to all of the files you will need:

/home/matsengrp/working/csmall/rep_edges/notes.txt
habnabit commented 12 years ago

Implemented! It still needs a description of what --rep-edges does and a sanity check, but it looks good to me.

matsen commented 12 years ago

Hm, this doesn't seem to work on microbiome demo without a rep-edges argument:

rhino2 microbiome-demo/src ‹dev› » eguppy pca --prefix test *jplace
Uncaught exception: Not_foundFatal error: exception

Not_foundRaised at file "map.ml", line 118, characters 16-25Called from file "array.ml", line 131, characters 31-51Called from file "src/batPervasives.ml", line 110, characters 27-30 Called from file "src/batList.ml", line 103, characters 17-20 Re-raised at file "printexc.ml", line 71, characters 10-11 Called from file "", line 0, characters 0-0

With no --rep-edges flag, it should default to previous behavior.

On Thu, Mar 22, 2012 at 6:49 PM, Aaron Gallagher reply@reply.github.com wrote:

Implemented! It still needs a description of what --rep-edges does and a sanity check, but it looks good to me.


Reply to this email directly or view it on GitHub: https://github.com/matsen/pplacer/pull/235#issuecomment-4652185

Frederick "Erick" Matsen, Assistant Member Fred Hutchinson Cancer Research Center http://matsen.fhcrc.org/

habnabit commented 12 years ago

Fixed.