tjunier / newick_utils

shell tools for processing phylogenetic trees

Remove all leaves #17

Open dridk opened 7 years ago

dridk commented 7 years ago

Hi,

I'm working on the Greengenes tree available here: gg_13_5_otus_99_annotated.tree.gz http://greengenes.secondgenome.com/downloads/database/13_5

This file contains a tree built from bacterial 16S rRNA sequences. I would like to extract a simple relationship from it, for example the tree of g__Staphylococcus, g__Streptococcus and g__Enterococcus. Unfortunately, each species has many leaves labeled with numbers, which probably correspond to sequence IDs.

This command prints the labels of all nodes except leaves, which gives me all my taxonomy:

    nw_labels -L gg_13_5_otus_99_annotated.tree
    # Output:
    s__sedula
    g__Metallosphaera
    g__Acidianus
    g__Stygiolobus
    s__metallicus
    g__Sulfolobus
    g__Sulfolobus

This command prints only the leaf labels, which gives me an unwanted list of numbers:

    nw_labels -I gg_13_5_otus_99_annotated.tree
    # Output:
    550922
    1113159
    569299
    1106705
    1104518
    556057
    3119364

So I don't know how to remove all the leaves and keep only a tree of taxon names. I'm sure it's possible with the Newick Utilities, but I didn't find a way. Could you give me some clues?

tjunier commented 7 years ago

Hi Sacha,

Unfortunately, what you are trying to do is not currently possible with the Newick Utilities. The reason is that the current behaviour is not to keep nodes with a single child, which is what you would get when you remove the first leaf of a two-leaf node.

While this behaviour makes sense in a lot of situations, I see that in your case it would be better to just remove the leaves. I will see if I can add this functionality, but I can make no guarantee, as I have lots of other projects that also need my attention.

Cheers,

Thomas
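
For illustration, the "just remove the leaves" behaviour discussed above could be sketched outside the Newick Utilities, for instance in Python. This is only a minimal sketch, not part of newick_utils: the function names `parse`, `strip_leaves` and `to_newick` and the toy tree are hypothetical. It assumes a Newick string without `[...]` comments, and it treats each label together with its branch length as opaque text. Unlike the current pruning behaviour, it keeps inner nodes that end up with one child or none. The functions are recursive, so a tree as large as the Greengenes one may well exceed Python's default recursion limit and need an iterative rewrite.

```python
import re

# One token is a parenthesis, comma or semicolon, or a run of label text
# (label plus optional ":length"). A single-quoted label stays in one token.
_TOKEN = re.compile(r"\s*('(?:[^']|'')*'[^(),;]*|[(),;]|[^(),;]+)")
_PUNCT = {"(", ")", ",", ";"}


def parse(newick):
    """Parse a Newick string into nested [label, children] lists."""
    tokens = _TOKEN.findall(newick.strip())
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            children.append(node())
            while tokens[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
        label = ""
        if pos < len(tokens) and tokens[pos] not in _PUNCT:
            label = tokens[pos].strip()   # label text, kept verbatim
            pos += 1
        return [label, children]

    return node()


def strip_leaves(node):
    """Drop the original leaves; keep every inner node, even with 0 or 1 children."""
    label, children = node
    # A child with an empty child list is a leaf of the original tree.
    kept = [strip_leaves(child) for child in children if child[1]]
    return [label, kept]


def to_newick(node):
    """Serialise a [label, children] node back to Newick (without the final ';')."""
    label, children = node
    if children:
        return "(" + ",".join(to_newick(c) for c in children) + ")" + label
    return label


# Hypothetical toy tree: numeric leaves under labeled taxon nodes.
toy = "((1,2)g__A:0.1,((3,4)s__x,5)g__B:0.2)f__F;"
print(to_newick(strip_leaves(parse(toy))) + ";")
# prints (g__A:0.1,(s__x)g__B:0.2)f__F;
```

Note that `g__B` is left with a single child (`s__x`) in the result, which is exactly the kind of node the current behaviour would not keep.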