matsen / pplacer

Phylogenetic placement and downstream analysis
http://matsen.fredhutch.org/pplacer/
GNU General Public License v3.0

Prefiltering of constant edges for edgePCA #238

Closed metasoarous closed 12 years ago

metasoarous commented 12 years ago

For large trees, a significant portion of the edges have constant columns in the resulting splits matrices: filled entirely with -1s or entirely with +1s. Cutting out these edge columns has no effect on the PCA results, since these edges have zero variance, and it can reduce the number of columns in the splits matrix by an order of magnitude.
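To make the idea concrete, here is a minimal sketch in plain Python (not pplacer's actual OCaml code). The helper names `constant_columns` and `drop_columns` and the toy matrix are made up for illustration; rows are samples and columns are edges.

```python
def constant_columns(matrix):
    """Return indices of columns whose entries are identical in every row."""
    if not matrix:
        return []
    first = matrix[0]
    return [j for j in range(len(first))
            if all(row[j] == first[j] for row in matrix)]


def drop_columns(matrix, drop):
    """Return a copy of the matrix without the given column indices."""
    drop = set(drop)
    return [[v for j, v in enumerate(row) if j not in drop] for row in matrix]


# Toy splits matrix: columns 0 and 2 are constant (-1 and +1 respectively).
splits = [
    [-1.0,  0.2, 1.0, -0.5],
    [-1.0,  0.7, 1.0,  0.1],
    [-1.0, -0.3, 1.0,  0.4],
]

const = constant_columns(splits)
reduced = drop_columns(splits, const)
```

A constant column contributes only zeros to the covariance matrix once the column mean is subtracted, which is why dropping it leaves the principal components of the remaining columns unchanged.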

So, before running PCA, we should cut out these edges, just as we do with --rep-edges. On that note, this prefiltering will have to play nicely with the --rep-edges filtering. The order of operations should be:

FilterNonConstantEdges -> FilterRepEdges -> ePCA -> MergeRepEdges -> MergeNonConstantEdges

Since there is no loss of information here, we should always run this before doing PCA. As with --rep-edges, just fill in zeros for the edge columns that get filtered out.
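The filter/merge ordering above could be sketched roughly like this in Python. Everything here is hypothetical: `column_variances` is only a stand-in for the real ePCA step, `keep_rep` is a made-up rep-edges selection, and none of the names correspond to functions in pplacer.

```python
def filter_columns(matrix, keep):
    """Keep only the listed column indices, in order."""
    return [[row[j] for j in keep] for row in matrix]


def merge_columns(vector, keep, n_cols):
    """Expand a per-column result back to n_cols entries, with zeros
    in the positions that were filtered out."""
    full = [0.0] * n_cols
    for value, j in zip(vector, keep):
        full[j] = value
    return full


def non_constant(matrix):
    """Indices of columns that vary across rows."""
    first = matrix[0]
    return [j for j in range(len(first))
            if any(row[j] != first[j] for row in matrix)]


def column_variances(matrix):
    """Stand-in for ePCA: returns one number per remaining column."""
    n = len(matrix)
    out = []
    for j in range(len(matrix[0])):
        col = [row[j] for row in matrix]
        mean = sum(col) / n
        out.append(sum((v - mean) ** 2 for v in col) / n)
    return out


splits = [
    [-1.0, 0.0, 1.0, -1.0, 1.0],
    [-1.0, 1.0, 1.0,  1.0, 0.0],
]

keep_nc = non_constant(splits)              # FilterNonConstantEdges
stage1 = filter_columns(splits, keep_nc)
keep_rep = [0, 2]                           # FilterRepEdges (hypothetical choice)
stage2 = filter_columns(stage1, keep_rep)
result = column_variances(stage2)           # ePCA stand-in
merged_rep = merge_columns(result, keep_rep, len(stage1[0]))    # MergeRepEdges
merged_all = merge_columns(merged_rep, keep_nc, len(splits[0]))  # MergeNonConstantEdges
```

The merges run in the reverse order of the filters because `keep_rep` indexes into the already-filtered matrix, so it has to be undone first to get back to the original column numbering.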

metasoarous commented 12 years ago

For splitify, filtering on non-constant edges should only happen if an epsilon is explicitly passed. Otherwise, trouble can come up when trying to run guppy heat and such on split files if the number of columns doesn't match. Defaulting to an epsilon of zero is fine for epca though.
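One way to picture the two defaults, as a Python sketch under assumptions: the issue doesn't define epsilon precisely, so here a column counts as constant when its range (max minus min) is at most epsilon. The function name and the use of `None` for "no epsilon passed" are illustrative only.

```python
def non_constant_columns(matrix, epsilon=None):
    """Return indices of columns to keep.

    epsilon=None : no filtering at all (the proposed splitify default),
                   so the column count always matches the input.
    epsilon=0.0  : drop exactly-constant columns (the proposed epca default).
    epsilon>0    : also drop columns whose range is within epsilon
                   (assumed meaning of epsilon).
    """
    n_cols = len(matrix[0])
    if epsilon is None:
        return list(range(n_cols))
    keep = []
    for j in range(n_cols):
        col = [row[j] for row in matrix]
        if max(col) - min(col) > epsilon:
            keep.append(j)
    return keep


splits = [
    [-1.0, 0.5,    1.0],
    [-1.0, 0.5001, 1.0],
]

splitify_default = non_constant_columns(splits)            # keeps everything
epca_default = non_constant_columns(splits, epsilon=0.0)   # keeps column 1
loose = non_constant_columns(splits, epsilon=0.01)         # keeps nothing
```

Keeping the "no epsilon" case as a true no-op means split files written without the flag keep one column per edge, which is what downstream tools reading them would expect.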