jeetsukumaran / DendroPy

A Python library for phylogenetic scripting, simulation, data processing and manipulation.
https://pypi.org/project/DendroPy/.
BSD 3-Clause "New" or "Revised" License
206 stars 62 forks source link

can node labels be parsed as support values for tree rooting purposes? #53

Closed jhcepas closed 8 years ago

jhcepas commented 8 years ago

Hi @jeetsukumaran and @mtholder! I have recently expanded the analysis by Czech and Stamatakis regarding the incorrect mapping of support values after tree rooting in tree viewers, and found that multiple toolkits (perhaps also dendropy) might be affected.

@lczech, @stamatak and I are now working on reporting/confirming the bug in more frameworks and tree viewers.

Current tests are here: https://github.com/jhcepas/test_branch_support_after_tree_rerooting

regarding dendropy, I don't know why the rooted topology returned looks as unrooted in the image. Is that normal? In any case, it seems that the support values are not mapped correctly (i.e. the support value of 20 does not correspond to the same partition as in the original tree).

Could you take a look and confirm that this is not a problem in our tests? tx!

jeetsukumaran commented 8 years ago

Jaime (and @mtholder, @lczech, @stamatak)

(1) As documented, DendroPy assumes unrooted unless noted otherwise. You will need a rooting token ("[&R]") or otherwise inform DendroPy at parse time that the tree is rooted:

tree = Tree.get(
         data=tree_str,
         schema="newick",
         rooting="force-rooted",
         )
outgroup_node = tree.find_node_with_taxon_label("X")
tree.to_outgroup_position(outgroup_node, update_bipartitions=False)
print(tree.as_string(schema='newick').strip())

(2) Note that in example you give, in no way is DendroPy doing ANY support mapping at all. It is simply retaining the labels associated with the node that they were originally associated with. Calling this "incorrect" is, in fact, incorrect. To DendroPy, these are just labels, and, moreover, labels associated with nodes and NOT bipartitions. When rerooting trees, this association is preserved simply with no attempt at remapping. There should be no expectation of anything different.

(2) The labels-on-nodes <=> support for bipartitions is fragile relationship, especially when using Newick format. The assumed equivalency fallacy between the two is the core of the issue. I think that the issue here is not so much tree viewers, but tree viewer USERS. Users need to be aware a node label is a node label, nothing more. Rerooting the tree shuffles the nodes around, and they take their label with them.

(3) DendroPy has a very rich and rigorous infrastructure for dealing with bipartitions and support for bipartitions on trees. All this is bypassed when labels on nodes are used. To DendroPy, a label on an node is just a label on a node. The fact that it has become a convention to sometimes express support values there is not just an artifact, but an idiosyncratic artifact of the Newick format.

(4) To do this operation correctly in DendroPy, after the tree has been parsed, (a) the bipartition hashes need to be calcualted; (b) the node labels need to be interpreted and assigned to the bipartitions; (c) the tree re-rooted; (d) the support values recalculated and expressed as node value based the bipartition supports. There is no ready-to-go function in the library to the do the above currently in DendroPy, as their has not been any demand for it (all support value mapping/manipulation is done on raw data through SumTrees and the underlying TreeArray class). However, this function will be trivial to write, and I can sketch a simple version up if you want.

(6) This will still, however, result in null/None value at the original old root, if the original program did not assign a support label there, or an incorrect value if the original program assigned a support value of 1.0. There is no way to get the support value for this original root without re-scoring the support.

(7) The reason for the above is what I think is a pervasive fallacy (IMHO) with the way people think about rooting. Most people think that re-rooting tree is a simple visual operation. In fact, it is a model-changing operation. Just like rooting an unrooted tree is a change of the tree model. In both cases by use of outside information rather than anything in the data or by recourse to statistical procedure. Sometimes the support values as scored under the original tree model can be carried across without modification. Because of this, and the ease/seamlessness that tree viewers do this, folks do not think re-rooting/rooting is a statistically-meaningful operation, considering it mostly aesthetic. I disagree.

(8) Of course, there is no reason that tree viewers need not up their game and change to conform to user's expectations: i.e., consider the node labels as mapping to a bipartition rather than a node, as I described when outlining how to do this operation correctly with DendroPy .

-- jeet

On 5/17/16 10:44 AM, Jaime Huerta-Cepas wrote:

Hi @jeetsukumaran https://github.com/jeetsukumaran and @mtholder https://github.com/mtholder! I have recently expanded the analysis by Czech and Stamatakis regarding the incorrect mapping of support values after tree rooting http://biorxiv.org/content/early/2015/12/25/035360 in tree viewers, and found that multiple toolkits (perhaps also dendropy) might also be affected.

@lczech https://github.com/lczech, @stamatak https://github.com/stamatak and I are now working on reporting/confirming the bug in more frameworks and tree viewers.

Current tests are here: https://github.com/jhcepas/test_branch_support_after_tree_rerooting

regarding dendropy, I don't know why the rooted topology returned looks as unrooted in the image. Is that normal? In any case, it seems that the support values are not mapped correctly (i.e. the support value of 20 does not correspond to the same partition as in the original tree).

Could you take a look and confirm that this is not a problem in our tests? tx!

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/jeetsukumaran/DendroPy/issues/53


Jeet Sukumaran

jeetsukumaran@gmail.com

Blog/Personal Pages: http://jeetworks.org/ GitHub Repositories: http://github.com/jeetsukumaran Photographs (as stream): http://www.flickr.com/photos/jeetsukumaran/ Photographs (by galleries):

http://www.flickr.com/photos/jeetsukumaran/sets/

mtholder commented 8 years ago

Hi, I agree that, by default, text after a close parentheses should be treated as a text label of the node (as the newick standard says). But it would be nice to have an easy way to push a node property into an edge data struct in dendropy. (and then display the edge property on in the ascii tree). there may be an easy way, I'm not sure...

jhcepas commented 8 years ago

@jeetsukumaran @mtholder, I agree on every point. I had the same problem in ETE and I included an option to parse those numbers as real support values, so they are not treated as labels and are correctly mapped to branches when re-rooting. I found it so pervasive that is the default method for reading newick strings in ETE.

jeetsukumaran commented 8 years ago

@mtholder, WRT to node vs edge labels. One of the big bugbearboo-boos with Newick. There is no shortage of ways if you are willing to come up and use with YAINFH (Yet Another Idiosyncratic Newick Format Hack). Anything from a symbol, such as a hash mark to indicate the label should be associated with the subtending edge rather than node, or something using comment metadata.

Or, serialize using a format that explicitly supports distinct expression of edge vs. node labels, such as NeXML, etc.

lczech commented 8 years ago

Wow, lot of interesting discussion already going on here, nice. It seems all of us had their problems with the Newick format... Let's try to fix this!

According to Phylip (which I consider the "standard" definition), there is no such thing as a "node label" in Newick, so all of this is a convention anyway. Problematically, every tool adheres to their own interpretation. Thus, I disagree that it is the users responsibility to keep track of the different conventions being used by all the tools they use. Instead, the tools should offer proper support for the data.

That implies the following. If for certain data (e.g., labels in Newick) there is no way to unambiguously distinguish their meaning (i.e., node vs branch label), the tool should not simply assign a (preferred) interpretation. It should instead ask the user what semantics he or she actually expects. The newer versions of Dendroscope (since 3.5.0) already do this nicely. This is the Principle of Least Astonishment.

Furthermore, if a particular tool internally only offers one of the interpretations (i.e., the internal data structures just use node labels, but do not have branch labels), the tool should still tell the user what is happening. This way, errors are prevented, while implicitly educating users about the issue.

By the way, I like "YAINFH", @jeetsukumaran ;-)

jeetsukumaran commented 8 years ago

@jhcepas and @lczech

The luxury of a restricted set of use cases allows this approach: i.e. a default interpretation of those labels. However, DendroPy is a library rather than a tool. It has a very broad range of use cases. And those poor, bloated, used-and-abused, overdetermined Newick node-and-edge-simultaneously labels get used for a LOT of things apart from support values by lots of different people: internal node taxa, fossil calibration times, population sizes, trait states, etc. etc. Assuming one of these as a default is going to Astonish a whole lot of people. Even "support values" in the labels often are more complex than just numeric: e.g., many people have support values as established under multiple different approaches concatenated together in the labels. So, we default to treating the label as just a label, and provide the infrastructure to interpret these as needed. You are right that this needs to be clearly documented, though. And we do that! Typically in multiple places (e.g., the tutorials as well as library API documentation). The problem is, as I am sure you are aware, the documentation is rarely visited by users! Each of whom assumes that their use case is the only use case that makes sense and wants to library to work naturally (by default) based on their needs, perspectives, expectations, without taking into account the broader range of use cases that may impose very different natural defaults.

lczech commented 8 years ago

Hm, I don't fully understand what you want to say.

You state

Assuming one of these as a default is going to Astonish a whole lot of people.

which is exactly my point: don't assume - ask the user!

Further, you say

So, we default to treating the label as just a label, and provide the infrastructure to interpret these as needed.

This is IMO the best way to handle this. But, if I understood correctly, you still assume them to be node labels, right? That's the crux. I think, this also should be up to the user to decide - particularly because they often do not read the manual!

That means, when reading data in Newick, there should be a way to distinguish between the interpretations. AFAIK, ETE does this by offering a flag paramter in the reading function, right @jhcepas?! Another way would be to offer multiple reading functions, or (in languages like C++, where you have strong types), read into a data structure that has the desired kind of member variables.

This way, the user will automatically stumble upon the issue, because function parameters etc are parts of the manual that one needs to read in order to use the function.

So, in summary, I think we are on the same page. Our views just seem to differ on how to incorporate the "user choice" into software. What do you think?

Also, as a side comment, this topic would be more appropriately discussed in some bioinformatics forum, because it affects more then just DendroPy. This is not meant to be critique on your software, but a more general discussion. Hope you don't mind discussing this here.

jhcepas commented 8 years ago

@lczech yeap, we use a flag when loading trees, but by no means has to be the default in other toolkits.

I have changed the title of this issue, as I think we are discussing different things. The question here would be if/how dendropy can parse node labels as branch support values, as this is very common in some scenarios. I think I understood from @jeetsukumaran that there is, indeed, a way. (?)

jeetsukumaran commented 8 years ago

You state

Assuming one of these as a default is going to Astonish a whole lot of people.

which is exactly my point: don't assume - ask the user!

I was referencing ETE interpreting the values as support values as default.

Also note that, as DendroPy is a programming library, not a tool, there is no real way to ask the user. The user has to understand the API sufficiently well to use the correct functions/arguments to get what they want. And, as Python is Turing-complete, they can get anything they want ... ;)

Further, you say

So, we default to treating the label as just a label, and provide

the infrastructure to interpret these as needed.

This is IMO the best way to handle this. But, if I understood correctly, you still assume them to be node labels, right? That's the crux. I think, this also should be up to the user to decide.

In a sense, the user can "decide" this, by simply writing:

for nd in tree:
     nd.edge.label = nd.label

But you are saying that we should offer the option of the above association being available to the user at parse time of NEWICK.

Fair enough: it's easy to do. But it still needs a default, and defaulting to assume the label is associated with the node is convention that goes back a long way and is fairly pervasive.

Note that there is a difference between assuming a label is associated with a node or edge vs. a label should be interpreted as a support value, population size, or whatever.

The first is a syntactic assumption, and, if properly documented or at least consistently enforced, can be be considered a format specification whether. The role and responsibility for "reading" this lies with the parser. The second is a semantic assumption. The role and responsibility for this, as far as DendroPy is concerned, lies with the programmer. I recognize that for more end-user oriented tools like others being discussed here, the tool itself may take up the responsibility.

-- jeet

On 5/17/16 1:05 PM, Lucas Czech wrote:

Hm, I don't fully understand what you want to say.

You state

Assuming one of these as a default is going to Astonish a whole lot
of people.

which is exactly my point: don't assume - ask the user!

Further, you say

So, we default to treating the label as just a label, and provide
the infrastructure to interpret these as needed.

This is IMO the best way to handle this. But, if I understood correctly, you still assume them to be node labels, right? That's the crux. I think, this also should be up to the user to decide - particularly because they often do not read the manual!

That means, when reading data in Newick, there should be a way to distinguish between the interpretations. AFAIK, ETE does this by offering a flag paramter in the reading function, right @jhcepas https://github.com/jhcepas?! Another way would be to offer multiple reading functions, or (in languages like C++, where you have strong types), read into a data structure that has the desired kind of member variables.

This way, the user will automatically stumble upon the issue, because function parameters etc are parts of the manual that one needs to read in order to use the function.

So, in summary, I think we are on the same page. Our views just seem to differ on how to incorporate the "user choice" into software. What do you think?

Also, as a side comment, this topic would be more appropriately discussed in some bioinformatics forum, because it affects more then just DendroPy. This is not meant to be critique on your software, but a more general discussion. Hope you don't mind discussing this here.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/jeetsukumaran/DendroPy/issues/53#issuecomment-219785255


Jeet Sukumaran

jeetsukumaran@gmail.com

Blog/Personal Pages: http://jeetworks.org/ GitHub Repositories: http://github.com/jeetsukumaran Photographs (as stream): http://www.flickr.com/photos/jeetsukumaran/ Photographs (by galleries):

http://www.flickr.com/photos/jeetsukumaran/sets/

jeetsukumaran commented 8 years ago

there is no such thing as a "node label" in Newick,

Ah yes, but there IS in NEXUS:

http://sysbio.oxfordjournals.org/content/46/4/590.full.pdf+html

Here, note how the label is explicitly used to label the NODE, not the edge.

THIS is the basis of DendroPy's syntactic (not semantic) assumption that the label is associated with the node rather than the edge.

You are right that it would be convenient to have an option when parsing NEWICK to allow for the label association to be selected by the user. However, given the above, I think it is a sensible default to have the label be associated with the node, whether or not this astonishes the user.

-- jeet

On 5/17/16 12:27 PM, Lucas Czech wrote:

According to Phylip http://evolution.genetics.washington.edu/phylip/newicktree.html (which I consider the "standard" definition), so all of this is a convention anyway.


Jeet Sukumaran

jeetsukumaran@gmail.com

Blog/Personal Pages: http://jeetworks.org/ GitHub Repositories: http://github.com/jeetsukumaran Photographs (as stream): http://www.flickr.com/photos/jeetsukumaran/ Photographs (by galleries):

http://www.flickr.com/photos/jeetsukumaran/sets/

jeetsukumaran commented 8 years ago

@jhcepas I see the test data. Perhaps it would be useful to have an "expected output" tree string as well?

I am also curious as to what sort of value you reasonably expect to see in the old (original) root.

jeetsukumaran commented 8 years ago

@jhcepas

As the following code should make clear, IMHO, relying on node labels to store support values is a fragile, dangerous, and, in the long run, unsustainable hope. Once the tree gets re-rooted (i.e., you change the model of the tree from the model that was originally scored and to which, strictly speaking, is the only model to which the support values can be applied without additional data), you get "new" internal nodes for which no information is available.

Am I missing anything?

#! /usr/bin/env python

import dendropy

def get_tree():
    tree_str = "((C:1.0,D:1.0)10:0.001,(A:1.0, (B:1.0,X:1.0)30:0.01)20:0.1,E:1.0)0:0.0;"
    tree = dendropy.Tree.get(
            data=tree_str,
            schema="newick",
            rooting="force-rooted",
            )
    return tree

def naive_method():
    tree = get_tree()
    outgroup_node = tree.find_node_with_taxon_label("X")
    new_root = outgroup_node.parent_node
    tree.reseed_at(new_root)
    # print(tree.as_string("newick").strip())
    return tree

def support_aware_method():
    tree = get_tree()
    benc = tree.encode_bipartitions()
    support_values = {}
    for nd in tree:
        support_values[nd.bipartition] = float(nd.label) if nd.label is not None else 1.0

    outgroup_node = tree.find_node_with_taxon_label("X")
    new_root = outgroup_node.parent_node
    tree.reseed_at(new_root)

    tree.encode_bipartitions()
    for nd in tree:
        nd.label = support_values.get(nd.bipartition, "Not_specified_in_original_tree")
    return tree

print("\n[Original Tree:]\n    ")
print(get_tree().as_string("newick").strip())
print("\n[Naive Tree:]\n    ")
print(naive_method().as_string("newick").strip())
print("\n[Support-Aware Tree:]\n")
print(support_aware_method().as_string("newick").strip())

Results in:

[Original Tree:]

[&R] ((C:1.0,D:1.0)10:0.001,(A:1.0,(B:1.0,X:1.0)30:0.01)20:0.1,E:1.0)0:0.0;

[Naive Tree:]

[&R] (B:1.0,X:1.0,(A:1.0,((C:1.0,D:1.0)10:0.001,E:1.0)0:0.1)20:0.01)30:0.0;

[Support-Aware Tree:]

[&R] (B:1.0,X:1.0,(A:1.0,((C:1.0,D:1.0)10.0:0.001,E:1.0)'Not_specified_in_original_tree':0.1)'Not_specified_in_original_tree':0.01):0.0;
jhcepas commented 8 years ago

for the ETE API (also not a tool), users can use a flag when loading node labels (the text string close to the perenthesis) as 1) node.support or 2) node.name.

After loading, node.support is always treated as the statistical branch support, so it is correctly re-mapped when rooting. For the "new" node without support value created after rooting, ETE uses the default support value, which is, 1. That's also not good and I should find a way to allow for missing support values.

The rest of node attributes (node.name and anything else loaded using the extended newick format) are used as in dendropy, simply as node labels. IMO, also not totally good, as many labels are actually branch-labels that should be cleared or moved with branches when the tree is rearranged. Or at least warn the user to avoid misinterpretations. For instance, internal taxids, population sizes, etc should be probably cleared, as they might loose their meaning after re-rooting. But that would require a proper newick standard, or using NExML or similar.

jhcepas commented 8 years ago

at this point I am even wondering if the distinction between nodes and branches makes any sense :)

jeetsukumaran commented 8 years ago

at this point I am even wondering if the distinction between nodes and branches makes any sense :)

In the sense that each node has unique edge subtending it and only it ... no.

But in the sense that each edge ALSO represents a split/bipartition, the value of which may change during tree restructuring even if the particular node that it subtends does not ... YES!

mtholder commented 8 years ago

A gist: https://gist.github.com/mtholder/8c5a4079d04a0129c652e431575add01 tweaks @jeetsukumaran's example. The code change is to allow the label to be treated as a support statement for the unrooted bipartition.

I also changed the branch lengths and support values to make them easier to trace on the tree (not just 0s and 1s). I get:

[Original Tree:]

[&R] ((C:1.3,D:4.0)34:0.034,(A:1.1,(B:1.2,X:1.6)26:0.026)12:0.0126,E:1.5);

[Labels-attached-to-nodes Tree:]

[&R] (B:1.2,X:1.6,(A:1.1,((C:1.3,D:4.0)34:0.034,E:1.5):0.0126)12:0.026)26;

[Labels-attached-to-unrooted-bipartition-method Tree:]

[&U] (B:1.2,X:1.6,(A:1.1,((C:1.3,D:4.0)34.0:0.034,E:1.5)12.0:0.0126)26.0:0.026)1.0;

[labels-attached-to-rooted-bipartition-method Tree:]

[&R] (B:1.2,X:1.6,(A:1.1,((C:1.3,D:4.0)34.0:0.034,E:1.5)'Not_specified_in_original_tree':0.0126)'Not_specified_in_original_tree':0.026)1.0;

where Labels-attached-to-unrooted-bipartition-method is how users should move labels around if they represent split support for an unrooted tree.

lczech commented 8 years ago

Thanks for the clarifications!

The distinction between syntactic and semantic interpretation is indeed important. So far, I have (carelessly) used the latter, because it implies the former. But you are right that this should be clearly separated. Sorry for the confusion.

Also, defaulting to the syntactic interpretation that the data belongs to the nodes seems reasonable. Remark: by "asking the user" in case of a library, I meant something like offer a flag or the like.

As for the question whether the distinction between nodes and edges makes sense (i.e., is important at all), I'd say yes for the same reason as @jeetsukumaran: bipartitions - but also for edge-related data. See the referenced issue #54:

But really it is a waste of an edge object, because all the information associated with the "edge" can just be dumped onto the node, as there is a strict one-to-one correspondence between edges and the subtended nodes

That very much depends on your use case and your internal tree structure. I don't know how DendroPy stores trees, but I guess they are always rooted. Then, there is a one-to-one correspondence. However, and this was one of the points of our preprint, this is not the case for unrooted trees. For those, it is more natural to have actual edge data.

Also, speaking from a software modelling point of view, I always found it simpler to separate node and edge data. It makes usage clearer. (Maybe this is biased because the internal structure in my library http://genesis-lib.org/ is always unrooted, so that I need that distinction anyway.)

So, to finally come to the (current) issue title: "Can node labels be parsed as support values for tree rooting purposes?". You probably don't want to rewrite all the node/edge data handling in DendroPy (and shouldn't). But what do you think could be done to avoid problems related to the interpretation of those data?

jeetsukumaran commented 8 years ago

That very much depends on your use case and your internal tree structure. I don't know how DendroPy stores trees, but I guess they are always rooted. Then, there is a one-to-one correspondence. However, and this was one of the points of our preprint, this is not the case for unrooted trees. For those, it is more natural to have actual edge data.

(1) DendroPy handles both rooted and unrooted trees. When dealing with formats such as NEWICK and NEXUS, unless it is is explicitly told that the tree is rooted, it reads it as unrooted by default. This causes a lot of confusion with sloppy bioinformaticians who drop the "[&R]", do not read the documentation to use the "rooting='force-rooted'", and assume/need rooted trees. Unfortunately, not only has this behavior been the default since 1.0 and would be too late to change for legacy reasons, but it is sooooooo clearly documented in multiple places including the tutorial:

http://dendropy.org/primer/treemanips.html#rooting-derooting-and-rerooting

that the astonishment is, the sole responsibility of the astonished.

(2) You are absolutely right for unrooted trees, edges are important. But here you do NOT have the case where there is a clear, unambiguous one-to-one correspondence between edges and nodes. Here edges represent connections between bipartitions. This situation, then, is the second of the two perspectives that I describe, and, as I note, edges are absolutely meaningful.

Also, speaking from a software modelling point of view, I always found it simpler to separate node and edge data. It makes usage clearer. (Maybe this is biased because the internal structure in my library http://genesis-lib.org/ is always unrooted, so that I need that distinction anyway.)

Yes, but my point is in the case of one-to-one correspondence between edges and nodes, the distinction between edges and nodes can be abstracted away in the implementation with no detrimental effect or loss of information in the domain.

So, to finally come to the (current) issue title: "Can node labels be parsed as support values for tree rooting purposes?". You probably don't want to rewrite all the node/edge data handling in DendroPy (and shouldn't). But what do you think could be done to avoid problems related to the interpretation of those data?

The answer is "yes", in a number of different ways, though not in the one specific way you suggest above. And there is no need to rewrite all the node/edge data handling to do so. DendroPy already does natively.

(1) The ideal solution, but alas, very unrealistic one, is simply to move away from NEWICK. To DendroPy NEWICK is not the end-all-and-be-all of tree formats. It is, in fact, not only Just Another Tree Format, but a particularly sloppy and troublesome one. Unfortunately, it, or its superset NEXUS, is so widespread in phylogenetics to the point of being practically the only tree format in this domain despite more compelling alternatives.

There better formats out there, but the push to use this is not going to come from users. It is going to have to come from software developers who recognize the shortcomings and care enough to demand something better. Tools and software should continue to support NEWICK/NEXUS, of course, but as a secondary "import-only" format, with the primary format being something else. BEAST does this with their data.

Of course, if the choice of primary format is fragmented, then this is doomed because NEWICK/NEXUS will end up being the lingua franca format for various different programs that all have a different "better" format of their choice. And this, I think, is almost certainly going to be the case. So, this solution has very depressing non-fate in the near future unless things change radically :( . (And, this change, again, HAS to come from software developers by common agreement amongst themselves; users really don't care and will user whatever works and is most convenient for them).

(2) DendroPy also supports a very rich system of metadata annotations. This, too, can be used to communicate the information association and even interpretation. The limitations here are that with NEWICK/NEXML, there is no real standard in how the metadata should be expressed: there are numerous conventions, some more popular than others, and new ones coming up all the time. DendroPy currently supports comment-based metadata [support=1.0,color=red,population_size=301] in various flavors (e.g., NHX, BEAST) as well as a few others. The problem here is the expression of information is limited, and there is no standard set of fields. DendroPy uses "support" in its SumTrees, but that is a personal choice.

(3) Of course, as I note previously, it is incredibly trivial for a user to reassign node labels to edges:

     for nd in t: nd.edge.label = nd.label

Assuming a competent programmer, the above is no real hardship, but it is inefficient and irritating. So I think I agree that a simple flag at parse time, e.g. assign_internal_labels_to_edges=True will help a lot. I will implement this today itself.

(4) Providing built-in support for reading the node labels as SUPPORT values specifically is easily done. But I think this is a very narrow viewpoint. Those labels can be anything, of which support values are just ONE interpretation. Internal node taxon labels are also a very common use case. As are trait states. I myself frequently use them to express geographical distribution. The list goes on and on and on. I see no reason to elevate "support" to a special status requiring special code to handle and not any of the others. I am not sure everyone will agree. In fact, I know they don't. I get a lot of requests from users whose particular use case and needs so dominate their perspective, that they do not understand why DendroPy's defaults (such as unrooted trees, or not automatically reading internal labels as taxa) are what they are and want them changed. Given the wide variety of semantic interpretation of those labels, and given how easy DendroPy makes it for users to manipulate and operate on those labels once they have been parsed, I think that special-casing treatment of support is not needed, especially given the following ...

(5) DendroPy currently, lives much more upstream in a typical bioinformatics/evoinformatics pipeline than the other tools discussed here. So while it easily can, as noted above, CONSUME trees with support values, it is more often used to PRODUCE them. In fact, one might argue that this latter is by far the most popular usage of DendroPy, through its bundled application/tool SumTrees. DendroPy has a very, very, very, very extensive and rich infrastructure for calculating, summarizing, manipulating, and/or querying bipartitions and support of bipartitions on trees. And it gets these values from the raw data, not any labels. From this perspective, the "labels" on nodes are just a clumsy way of expressing the final result. And while DendroPy supports different ways of re-reading these results correctly (again, as shown above), this is not a common use case simply because most people at this point (i.e., after the support values have been calculated and expressed as node/edge labels) want to visualize trees. It is not as common for someone to want to carry out calculations on the processed support values expressed as labels (as opposed to calculating the support values from the raw data and then carrying out further calculations on them), and, in those rare cases, the few extra lines of code to correctly interpret/use those labels as bipartition support values, as shown above, is required on the part of the programmer.

-- jeet

On 5/18/16 8:38 AM, Lucas Czech wrote:

Thanks for the clarifications!

The distinction between syntactic and semantic interpretation is indeed important. So far, I have (carelessly) used the latter, because it implies the former. But you are right that this should be clearly separated. Sorry for the confusion.

Also, defaulting to the syntactic interpretation that the data belongs to the nodes seems reasonable. Remark: by "asking the user" in case of a library, I meant something like offer a flag or the like.

As for the question whether the distinction between nodes and edges makes sense (i.e., is important at all), I'd say yes for the same reason as @jeetsukumaran https://github.com/jeetsukumaran: bipartitions - but also for edge-related data. See the referenced issue #54 https://github.com/jeetsukumaran/DendroPy/issues/54:

But really it is a waste of an edge object, because all the
information associated with the "edge" can just be dumped onto the
node, as there is a strict one-to-one correspondence between edges
and the subtended nodes

That very much depends on your use case and your internal tree structure. I don't know how DendroPy stores trees, but I guess they are always rooted. Then, there is a one-to-one correspondence. However, and this was one of the points of our preprint, this is not the case for unrooted trees. For those, it is more natural to have actual edge data.

Also, speaking from a software modelling point of view, I always found it simpler to separate node and edge data. It makes usage clearer. (Maybe this is biased because the internal structure in my library http://genesis-lib.org/ is always unrooted, so that I need that distinction anyway.)

So, to finally come to the (current) issue title: "Can node labels be parsed as support values for tree rooting purposes?". You probably don't want to rewrite all the node/edge data handling in DendroPy (and shouldn't). But what do you think could be done to avoid problems related to the interpretation of those data?

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/jeetsukumaran/DendroPy/issues/53#issuecomment-220013361


Jeet Sukumaran

jeetsukumaran@gmail.com

Blog/Personal Pages: http://jeetworks.org/ GitHub Repositories: http://github.com/jeetsukumaran Photographs (as stream): http://www.flickr.com/photos/jeetsukumaran/ Photographs (by galleries):

http://www.flickr.com/photos/jeetsukumaran/sets/

jhcepas commented 8 years ago

Assuming a competent programmer, the above is no real hardship, but it is inefficient and irritating. So I think I agree that a simple flag at parse time, e.g. assign_internal_labels_to_edges=True will help a lot. I will implement this today itself.

Thanks @jeetsukumaran - IMHO that would be very useful for people analyzing trees returned by Phyml, Fasttree, Raxml, IQTree and many other, which tend to produce bootstrap/support values as node labels (or plain comments) by default.

lczech commented 8 years ago

Thanks @jeetsukumaran for the detailed answer - I think we are getting to a solution here.

To (1): Unfortunately, no one reads manuals... So we can only hope to solve this by explicit APIs. To (1, the second one): Getting rid of those old, incomplete formats like Newick would be nice, I fully agree! But that is not the scope of our current project. To (2-4): Of course, support values are only an example of values that belong the branches. The point that @jhcepas and I were aiming at was actually about the distinction between node and edge data, not between possible usage of those. Sorry if that was not clear enough so far.

So, here is my summary of this issue, with a possible solution:

When working with phylogenetic trees, users often want to annotate metadata. Often, this data belongs to either the nodes or the branches of the tree. Independently of the interpretation of this data (which is up to the programmer), there is a problem with many file formats (particularly Newick) when storing such data. As we (unfortunately) cannot get rid of those formats easily, software still needs to support them - even if it is just for import. And particularly this import is a problem, because of the missing syntactic information whether certain metadata belongs to the nodes or edges of the tree. Thus, in order to aid the user, software should offer a way to distinguish between those two cases (again, independently of how the data is used in the end). For stand-alone applications, a user dialogue when reading such a file is a good way to do this. For libraries, the API should make it very clear where the metadata is going to end up (because no one reads the manual except for looking at function signatures). For example, by offering a flag in the reading function, or by offering different versions of the reading function.

My suggestion for DendroPy is thus, to incorporate such a flag (with default to store data in nodes, so that old code doesn't break).

Do you agree?

jeetsukumaran commented 8 years ago

To (1): Unfortunately, no one reads manuals... So we can only hope to solve this by explicit APIs.

Imagine this was a molecular sequencing protocol. And a biologist running it complained that they got no DNA because they were using it without reading the manual. Would this be acceptable? I do not see why programmers, biologist or otherwise, are absolved of the responsibility of not reading the manual.

The API is explicit.

It is explicitly described in the manual.

To (1, the second one): Getting rid of those old, incomplete formats like Newick would be nice, I fully agree! But that is not the scope of our current project. To (2-4): Of course, support values are only an example of values that belong the branches. The point that @jhcepas and I were aiming at was actually about the distinction between node and edge data, not between possible usage of those. Sorry if that was not clear enough so far. When working with phylogenetic trees, users often want to annotate metadata. Often, this data belongs to either the nodes or the branches of the tree. Independently of the interpretation of this data (which is up to the programmer), there is a problem with many file formats (particularly Newick) when storing such data. As we (unfortunately) cannot get rid of those formats easily, software still needs to support them - even if it is just for import. And particularly this import is a problem, because of the missing syntactic information whether certain metadata belongs to the nodes or edges of the tree. Thus, in order to aid the user, software should offer a way to distinguish between those two cases (again, independently of how the data is used in the end). For stand-alone applications, a user dialogue when reading such a file is a good way to do this. For libraries, the API should make it very clear where the metadata is going to end up (because no one reads the manual except for looking at function signatures). For example, by offering a flag in the reading function, or by offering different versions of the reading function.

(1) There is a problem only with the NEWICK (and, by extension, NEXUS) format.

(2) The real solution is to move away from the NEWICK format.

(3) You discount this solution, and instead decide to invest a lot of effort and worry in seeing how this format can be salvaged. Analogy: keeping financial accounts on wet tissue paper as a medium is fraught with dangers. A better solution is to use dry paper.

(4) The solution that you suggest is fragile and only applies in one restricted use case: i.e., where the metadata is ONLY applicable to the edges. What happens if there is a mix of metadata, e.g. some, like support values, applying to edges, but others, like trait states or internal taxon labels, applying to nodes? NEWCIK simply cannot handle that without a lot of contrivance.

(5) I also note that the problem you describe is a very small use case in the broad range of use cases that DendroPy deals with: i.e., rooting an unrooted tree with support expressed as internal labels. And it really is NOT a "problem" as such for DendroPy (in the sense that the correct results cannot be obtained). As I describe above, most people using DendroPy to calculate support simply (a) use SumTrees or directly using the underlying classes/methods/functions to calculate support from the raw data; or (b) are programmers who are aware of the limitations of the NEWICK format, and can easily switch the metadata association as shown above. Please see the code that I posted for switching the labels to branches or, even better, the code/results that @mtholder posted, which shows correct support values being recovered even when support values are read off node labels.

(6) That last point is worth repeating: DendroPy handles the use case of "rooting a tree with support expressed as labels" correctly, as long as the programer understands the limitations of the NEWICK format and how trees work. It is indeed good to raise awareness of the problem, but I do not agree that we should bend over backwards to change library code make up for programmers who do not due diligence in understanding the API, do not understand NEWICK, or do not understand how trees work.

(7) The second to last point, which I raised before but seems to have been missed, is also worth repeating: "rooting a tree with support expressed as labels" is not a primary use case from DendroPy's perspective. I can see how it is important or maybe even a dominant use case from a tree visualization tool perspective, or even a library that is tree visualization centric. But really, really, really: most people use DendroPy to generate trees for visualization tools to consume, and the support values are calculated/analyzed from the raw data.

My suggestion for DendroPy is thus, to incorporate such a flag (with default to store data in nodes, so that old code doesn't break). Do you agree?

Yes, I do agree. Thank you for the suggestion. I will do so. I was planning to do so yesterday, but did not have the time. Hopefully I will do so today this weekend. I think it is a useful addition that will make things a bit more efficient, a bit more convenient and a bit less less irritating in some use cases (by avoiding the need for the programmer to reassign the values themselves).

jeetsukumaran commented 8 years ago

Also, @lczech and @jhcepas, could I suggest that you update the code for the DendroPy tests that you link to so as to use the code that @mtholder posted, which, does, indeed, give the correct results?

lczech commented 8 years ago

The API is explicit.

Again, I am not criticizing DendroPy here. Just saying that in my opinion it is generally better for a library (any, not just DendroPy) to e.g. name functions and their parameters (flags) so that it is clear what they do. And of course, users are not absolved of the responsibility to read the manual. They just usually don't... Unfortunate as it is, we need to face this.

to (1-2): Agreed.

to (3): I discounted it because it is not easily done. That would need to come from the whole community, not just one library or one author. We could start some initiative in the future ;-)

to (4): There still seems to be confusion. If there is a mix of metadata, then of course, the programmer (user of the library) needs to do some more work. A simple flag ("write metadata to nodes or edges") is then no sufficient solution, but still more flexible then always storing it at nodes. However, if there is really such a case where some of the numbers in inner parts of the tree are intended for nodes and some others for edges, then there needs to be some other mechanism to distinguish that, thus Newick would be a really bad choice in the first place. So, users who would actually want to do that would have a really weird use case and would need to deal with this on their own anyway. In most cases however, the metadata is intended for either nodes or edges, so a flag should do.

to (5-7): Agreed, rerooting is a specific use case. However, we only used it as a practical example to show that the association of edges to nodes is fragile whenever the convention "edges are associated with the nodes at their distal end (the end away from the root)" is used. This convention is most prominent in Newick, but I have also seen it in actual tree data structures, so it might also be an internal invariant of programs and libraries. And as such, it might break in cases like rerooting, if not properperly treated (again, not criticizing DendroPy here - just a general caveat). Our paper was intended to raise awareness of this general issue. Maybe we should have stressed that more.

Yes, I do agree. Thank you for the suggestion. I will do so. I was planning to do so yesterday, but did not have the time. Hopefully I will do so today this weekend. I think it is a useful addition that will make things a bit more efficient, a bit more convenient and a bit less less irritating in some use cases (by avoiding the need for the programmer to reassign the values themselves).

Nice. That was our intention all along - and not to criticize DendroPy or its handling of tree metadata. Glad that we found consensus ;-)

jhcepas commented 8 years ago

I have updated the tests. They include now the rooted_bipartition_method. https://github.com/jhcepas/test_branch_support_after_tree_rerooting#test_dendropypy Let me know if this eventually becomes a flag in dendropy newick parser. thanks!

jeetsukumaran commented 8 years ago

Implemented in development-master: https://github.com/jeetsukumaran/DendroPy/blob/development-master/dendropy/dataio/newickreader.py#L184-L185