rrnewton / PhyBin

Binning (Newick) Phylogenetic Trees by Topology
http://people.csail.mit.edu/newton/phybin

Phybin prune option #10

Open JosieReinhardt opened 9 years ago

JosieReinhardt commented 9 years ago

The --prune option does not appear to work properly. On my 22-taxon dataset, binning succeeds when I run the full dataset, but fails in one of two ways when I use --prune.

First, the program errors out and produces no output trees/clusters if I specify more than 5 taxa with --prune; this seems to happen for any combination of taxa.

Second, if I specify 5 or fewer taxa, the program completes, but the output doesn't make sense. A single cluster is produced regardless of the edit distance I specify, and the consensus tree output file includes taxa that were not specified with --prune, whereas the alltrees file includes only the taxa specified with --prune.

Output from two runs demonstrating these issues is pasted below:

First issue (crash when > 5 taxa are specified with --prune)

$ phybin --complete ./ntXraxml_nttree/*.* -v --editdist=10 --prune="TdKND2 TdGND2 TdGD TdKD TdLD TDALM" -o ./phybin_comp/
Cleaning away previous phybin outputs...
Parsing 489 Newick tree files.
Total unique taxa (22):
  DA EA SBECC SE DM DS TQUIN TwGD TwKD TwKND TwGND TdKND2 TdGD TdKD TdLD TdKND1 TDALM TdL TP TdGND2 TdS TW
Note: defaulting to expecting ALL 6 to be present..
..................
 WARNINGs....
....
Number of input tree files: 489
PRUNING trees to just these taxa: ["TdKND2","TdGND2","TdGD","TdKD","TdLD","TDALM"]
Number of bad/unreadable input tree files: 58
Number of VALID trees (correct # of leaves/taxa): 431
Total tree nodes contained in valid trees: 2586
Average branch len over valid trees: 0.4148210083130796
Max/Min branch lengths: (133.7779417995337,0.0)
 Using HashRF-style algorithm...
 Built matrix for dim 431
Time to compute distance matrix: 0.09973s
Clustering using method CompleteLinkage
 [finished] Wrote full dendrogram to file dendrogram.txt
Sanity checked dendrogram of size: 431
Combining all clusters at distance less than or equal to 10
 [async] writing dendrogram as a graph to dendrogram.dot
After flattening, cluster sizes are: [431]
 Outcome: 1 clusters found, 1 non-singleton, top bin sizes: [431]
  Up to first 30 bin sizes, excluding singletons:
  * cluster#1, members 431, 
 [finished] Wrote contents of each cluster to cluster<N>_<size>.txt
 [finished] Wrote representative (consensus) trees to cluster<N>_<size>_consensus.tr
NOT creating processes to build per-cluster .pdf visualizations. (Not asked to.)
Waiting for 2 asynchronous tasks to finish...
phybin: bipsToTree: Internal error!  No match for bip: fromList [11,16] out is
 [(fromList [0],NTLeaf () 0),(fromList [1],NTLeaf () 1),(fromList [2],NTLeaf () 2),(fromList [3],NTLeaf () 3),(fromList [4],NTLeaf () 4),(fromList [5],NTLeaf () 5)]
 and remaining bips 2
 when processing orig bip set:
  fromList [fromList [0,1,2,3,4,5],fromList [11,16],fromList [12,14]]
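
One possible reading of the crash message, offered as a hypothesis rather than a confirmed diagnosis: if the integer labels in the error follow the order of the "Total unique taxa (22)" line, then the unmatched bipartitions [11,16] and [12,14] name taxa from the --prune list, while the leaves being assembled are numbered 0 to 5. The sketch below decodes the indices under that assumption. `TAXA`, `PRUNED`, and `decode` are illustrative names, not PhyBin identifiers.

```python
# Assumption: phybin's integer labels follow the order of the
# "Total unique taxa (22)" line in the log above.
TAXA = ("DA EA SBECC SE DM DS TQUIN TwGD TwKD TwKND TwGND TdKND2 "
        "TdGD TdKD TdLD TdKND1 TDALM TdL TP TdGND2 TdS TW").split()
PRUNED = {"TdKND2", "TdGND2", "TdGD", "TdKD", "TdLD", "TDALM"}


def decode(indices):
    """Map integer labels back to taxon names under the assumption above."""
    return [TAXA[i] for i in indices]


# Bipartitions the error reports as unmatched.
unmatched = [decode(b) for b in ([11, 16], [12, 14])]
# Leaf labels the error lists as available (NTLeaf () 0 .. NTLeaf () 5).
available = decode(range(6))

print(unmatched)   # [['TdKND2', 'TDALM'], ['TdGD', 'TdLD']]
print(available)   # ['DA', 'EA', 'SBECC', 'SE', 'DM', 'DS']
print(all(name in PRUNED for pair in unmatched for name in pair))  # True
```

Under this reading, the bipartitions keep the original 22-taxon numbering while the leaves were renumbered after pruning, so no leaf can match them. It would also be consistent with the consensus file in the second example, whose taxa (DA, EA, SBECC, SE, DM) are the first five entries of the full taxa list.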

Second issue (weird output when <= 5 taxa are specified with --prune)

$ phybin --complete ./ntXraxml_nttree/*.phy -v --editdist=10 --prune="TdKND2 TdGND2 TdGD TdKD TdLD" -o ./phybin_comp/
Cleaning away previous phybin outputs...
Parsing 489 Newick tree files.
Total unique taxa (22):
  DA EA SBECC SE DM DS TQUIN TwGD TwKD TwKND TwGND TdKND2 TdGD TdKD TdLD TdKND1 TDALM TdL TP TdGND2 TdS TW
Note: defaulting to expecting ALL 5 to be present..
..................
 WARNINGs...
...
Number of input tree files: 489
PRUNING trees to just these taxa: ["TdKND2","TdGND2","TdGD","TdKD","TdLD"]
Number of bad/unreadable input tree files: 58
Number of VALID trees (correct # of leaves/taxa): 431
Total tree nodes contained in valid trees: 2155
Average branch len over valid trees: 0.46325352648325513
Max/Min branch lengths: (133.7779417995337,0.0)
 Using HashRF-style algorithm...
 Built matrix for dim 431
Time to compute distance matrix: 0.011019s
Clustering using method CompleteLinkage
 [finished] Wrote full dendrogram to file dendrogram.txt
Sanity checked dendrogram of size: 431
Combining all clusters at distance less than or equal to 10
 [async] writing dendrogram as a graph to dendrogram.dot
After flattening, cluster sizes are: [431]
 Outcome: 1 clusters found, 1 non-singleton, top bin sizes: [431]
  Up to first 30 bin sizes, excluding singletons:
  * cluster#1, members 431, 
Dendrogram graph size: 1
 [finished] Wrote contents of each cluster to cluster<N>_<size>.txt
 [finished] Wrote representative (consensus) trees to cluster<N>_<size>_consensus.tr
NOT creating processes to build per-cluster .pdf visualizations. (Not asked to.)
 [async] Next, plot dendrogram.pdf
Waiting for 2 asynchronous tasks to finish...
 [finished] Writing dendrogram diagram (0.108006s)
Phybin completed.
$ cat ./phybin_comp/cluster1_431_consensus.tr 
(DA, EA, SBECC, SE, DM);
$ head -4 ./phybin_comp/cluster1_431_alltrees.tr 
(TdKND2, (TdKD, ((TdLD, TdGD), TdGND2)));
((TdLD, (TdGD, TdKD)), (TdGND2, TdKND2));
(((TdLD, (TdKD, TdGD)), TdGND2), TdKND2);
(((TdLD, TdGD), TdKD), (TdGND2, TdKND2));
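
A side note on the single cluster in this particular run: the log mentions a HashRF-style algorithm, which suggests the edit distance is a Robinson-Foulds (RF) distance. For unrooted binary trees on n taxa the RF distance is at most 2(n - 3), so 4 for five taxa, which is below --editdist=10. The sketch below checks this on the four trees shown above. It assumes plain RF on unrooted topologies, which may not match PhyBin's exact definition, and all function names are illustrative.

```python
from itertools import combinations


def parse_newick(text):
    """Parse a topology-only Newick string into nested lists of leaf names."""
    s = "".join(text.split()).rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1                      # consume "("
            children = [node()]
            while s[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1                      # consume ")"
            return children
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        return s[start:pos]

    return node()


def leaf_set(n):
    if isinstance(n, str):
        return frozenset([n])
    return frozenset().union(*(leaf_set(c) for c in n))


def internal_nodes(n):
    if not isinstance(n, str):
        yield n
        for c in n:
            yield from internal_nodes(c)


def splits(tree):
    """Non-trivial bipartitions, each stored as the side without a fixed reference taxon."""
    everything = leaf_set(tree)
    ref = min(everything)
    out = set()
    for n in internal_nodes(tree):
        side = leaf_set(n)
        if ref in side:
            side = everything - side
        if 2 <= len(side) <= len(everything) - 2:
            out.add(side)
    return out


def rf_distance(a, b):
    return len(splits(a) ^ splits(b))


trees = [parse_newick(t) for t in (
    "(TdKND2, (TdKD, ((TdLD, TdGD), TdGND2)));",
    "((TdLD, (TdGD, TdKD)), (TdGND2, TdKND2));",
    "(((TdLD, (TdKD, TdGD)), TdGND2), TdKND2);",
    "(((TdLD, TdGD), TdKD), (TdGND2, TdKND2));",
)]
distances = [rf_distance(a, b) for a, b in combinations(trees, 2)]
print(max(distances))  # 4
```

If that assumption holds, every pair of 5-taxon trees falls within distance 10, so one cluster of 431 is what this threshold would produce even with correct pruning. It does not explain a single cluster at small thresholds, and it says nothing about the mismatched taxa in the consensus file.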

rrnewton commented 9 years ago

Thanks for this report. We should be able to help fix this. We'll first try reproducing it with one of our own data sets. Otherwise, if there's a public dataset that gives the same error, that would be a great starting point.

JosieReinhardt commented 9 years ago

Thanks,

I ended up pre-pruning my dataset using another tool (tree_doctor), and then everything worked fine. But I figured you'd want to know anyway. I'm not sure about a public dataset, but if you do want mine to reproduce the error, I'd be happy to share an anonymized version.

Josie
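
For readers who want the same workaround without an external tool, pre-pruning can be sketched in a few lines. This is a minimal illustration, not what tree_doctor or PhyBin does internally: it drops branch lengths and support values, does not handle quoted labels, and the example tree and the names `parse_newick`, `prune`, and `to_newick` are made up for the example.

```python
def parse_newick(text):
    """Parse Newick into nested lists of leaf names, discarding branch lengths."""
    s = "".join(text.split()).rstrip(";")
    pos = 0

    def label():
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        return s[start:pos]

    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1                      # consume "("
            children = [node()]
            while s[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1                      # consume ")"
            label()                       # skip support value / branch length
            return children
        return label().split(":")[0]      # leaf name without ":length"

    return node()


def prune(tree, keep):
    """Keep only leaves in `keep`; collapse nodes left with a single child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [k for k in (prune(c, keep) for c in tree) if k is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return kids


def to_newick(tree):
    def fmt(n):
        return n if isinstance(n, str) else "(" + ",".join(fmt(c) for c in n) + ")"
    return fmt(tree) + ";"


# Hypothetical input tree, invented for illustration.
original = "((A:1,B:1):1,(C:1,(D:1,E:1):1):1);"
pruned = to_newick(prune(parse_newick(original), {"A", "C", "E"}))
print(pruned)  # (A,(C,E));
```

Running every input tree through such a step and passing the results to phybin without --prune mirrors the workaround described above. `prune` returns `None` when a tree contains none of the requested taxa, so such trees would need to be skipped before serializing.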

On Thu, Aug 6, 2015 at 1:17 PM, Ryan Newton notifications@github.com wrote:

Thanks for this report. We should be able to help fix this. We'll try reproducing with one of our data sets, first. Otherwise, if there's any public dataset that gives the same error that would be a great starting point.
