jeetsukumaran / DendroPy

A Python library for phylogenetic scripting, simulation, data processing and manipulation.
https://pypi.org/project/DendroPy/
BSD 3-Clause "New" or "Revised" License

Correctly load newick node annotations containing nested lists #145

Open willdumm opened 1 year ago

willdumm commented 1 year ago

Thanks for the great tool!

We have some trees from BEAST in extended newick format, and each of their nodes has an annotation containing a nested list. Here's a minimal example of a tree containing a single node, with the sort of annotations I'm talking about:

'1:[&rate=1.0,mutations=3.0,history_all={{57,0.08,C,T},{134,0.079,A,G},{4,0.07,C,T}}]1;'

DendroPy loads this tree without errors, but parses the value of the history_all field incorrectly:

>>> import dendropy
>>> t = dendropy.Tree.get(data='1:[&rate=1.0,mutations=3.0,history_all={{57,0.08,C,T},{134,0.079,A,G},{4,0.07,C,T}}]1;', schema='newick')
>>> t.seed_node.annotations.get_value('history_all')
['{57', '0.08', 'C', 'T']

As you can see, only the first sublist is parsed (up to the first closing }), and its first item still contains the opening brace of that sublist. The other two sublists are dropped entirely.

I know I can pass the Tree.get method the keyword argument extract_comment_metadata=False, and parse the resulting node.comments string myself. That's a nice workaround, but I'm wondering if there's a way I'm not seeing to provide a custom annotation string parser, or if there would be some other easy fix for this behavior?
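For reference, here is a minimal sketch of the kind of manual parsing that workaround would need. It is not part of DendroPy; the function names (split_top_level, parse_value, parse_annotations) are hypothetical, and it assumes the comment text looks like the example above, with or without the enclosing square brackets. It splits on commas only at brace depth zero, so nested {...} lists stay intact, and it leaves every scalar as a string.

```python
def split_top_level(text, sep=","):
    """Split text on sep, ignoring separators nested inside {...}."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == sep and depth == 0:
            # A separator outside all braces ends the current item.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_value(text):
    """Turn '{a,{b,c}}' into nested lists; scalars stay as strings."""
    text = text.strip()
    if text.startswith("{") and text.endswith("}"):
        inner = text[1:-1]
        if not inner:
            return []
        return [parse_value(item) for item in split_top_level(inner)]
    return text


def parse_annotations(comment):
    """Parse a comment such as '&rate=1.0,history_all={{...},{...}}'."""
    body = comment.strip()
    # Assumption: the comment may or may not still carry its brackets.
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    body = body.lstrip("&")
    result = {}
    for field in split_top_level(body):
        key, _, value = field.partition("=")
        result[key.strip()] = parse_value(value)
    return result


example = ("&rate=1.0,mutations=3.0,"
           "history_all={{57,0.08,C,T},{134,0.079,A,G},{4,0.07,C,T}}")
annotations = parse_annotations(example)
print(annotations["history_all"])
# [['57', '0.08', 'C', 'T'], ['134', '0.079', 'A', 'G'], ['4', '0.07', 'C', 'T']]
```

With extract_comment_metadata=False, something like this could presumably be applied to each string in node.comments, assuming those strings hold the raw annotation text. It works, but it would be much nicer not to have to maintain it ourselves.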

cc @matsen