Closed matsen closed 12 years ago
Okay, I've put together test data files and computed the expected results. The file below summarizes the results and points to all of the files you will need:
/home/matsengrp/working/csmall/rep_edges/notes.txt
Implemented! It still needs a description of what --rep-edges does and a sanity check, but it looks good to me.
Hm, this doesn't seem to work on microbiome demo without a rep-edges argument:
rhino2 microbiome-demo/src ‹dev› » eguppy pca --prefix test *jplace
Uncaught exception: Not_foundFatal error: exception
Not_foundRaised at file "map.ml", line 118, characters 16-25Called from file "array.ml", line 131, characters 31-51Called from file "src/batPervasives.ml", line 110, characters 27-30 Called from file "src/batList.ml", line 103, characters 17-20 Re-raised at file "printexc.ml", line 71, characters 10-11 Called from file "", line 0, characters 0-0
With no --rep-edges
flag, it should default to previous behavior.
On Thu, Mar 22, 2012 at 6:49 PM, Aaron Gallagher reply@reply.github.com wrote:
Implemented! It still needs a description of what --rep-edges does and a sanity check, but it looks good to me.
Reply to this email directly or view it on GitHub: https://github.com/matsen/pplacer/pull/235#issuecomment-4652185
Frederick "Erick" Matsen, Assistant Member Fred Hutchinson Cancer Research Center http://matsen.fhcrc.org/
Fixed.
It turns out that many edges have identical splitified columns, and this is one way of picking representatives from closely related sets.
This is a filter for after doing splitification that will only deliver a subset of the columns. In order to make this work, we need to figure out a way to adapt the splitify machinery to work with an array that only specifies results for a subset of the edges. It seems to me that adding in a map from the new array indices to the original edge numbers would work fine.
The
--rep_edges
takes a positive float argument, which I will callmax_edge_d
.Define the distance between edges to be the Euclidean distance between their columns in the splitification matrix.
We are going to partition the edges of the tree into connected regions, such that all pairs of edges within a region have distance
max_edge_d
between them. We are then going to take the most distal representatives of these connected regions as follows.The routine will be a depth-first recursion, that keeps an edge list
rep_edges
and an edge setcurr_edges
.Base case at a leaf is that
rep_edges
is empty andcurr_edges
contains the pendant edge leading to that leaf.At an internal node, say e is the edge directly above. Let the new
rep_edges
be the concatenation of therep_edge
list for the subtrees. Then letpossible_curr_edges
be the union of thecurr_edges
for the subtrees. Calculate the distance between each edge of that set and e. The newcurr_edges
is the set of edges that are less thanmax_edge_d
away from e, and then add any edge that is greater thanmax_edge_d
away from e to therep_edge
list. Ifcurr_edges
is empty, make it the single-edge set consisting of e. Recur.@metasoarous could you please put together a little unit test jplace file for this?