AlesBucek closed this issue 5 months ago
Able to reproduce. With tree.txt containing
((1_Species+1_sp,1_Species2_sp),1-Species+sp);
running
python3 applications/sumtrees/sumtrees.py tree.txt
produces
...
BEGIN TAXA;
DIMENSIONS NTAX=3;
TAXLABELS
'1 Species+1 sp'
1_Species2_sp
'1-Species+sp'
;
END;
BEGIN TREES;
TREE 1 = [&U] ('1 Species+1 sp':0.0[&support=1.0][&length_mean=0.0,length_median=0,length_sd=inf,length_hpd95=None,length_quant_5_95={0,0},length_range={0,0}],1_Species2_sp:0.0[&support=1.0][&length_mean=0.0,length_median=0,length_sd=inf,length_hpd95=None,length_quant_5_95={0,0},length_range={0,0}],'1-Species+sp':0.0[&support=1.0][&length_mean=0.0,length_median=0,length_sd=inf,length_hpd95=None,length_quant_5_95={0,0},length_range={0,0}])1.00000000:0.0[&support=1.0][&length_mean=0.0,length_median=0,length_sd=inf,length_hpd95=None,length_quant_5_95={0,0},length_range={0,0}];
END;
...
edit: this is not the case, disabling these lines does not affect outcome
Suspected culprit is the interaction of one of these lines
https://github.com/jeetsukumaran/DendroPy/blob/c31c1d986b0e6d72aa90e4fef61707e055ce3779/applications/sumtrees/sumtrees.py#L1385
https://github.com/jeetsukumaran/DendroPy/blob/c31c1d986b0e6d72aa90e4fef61707e055ce3779/applications/sumtrees/sumtrees.py#L1808
with library deserialization logic that assumes preserve_underscores is handled internally and treats strings with and without spaces separately, like (but not literally)
Actual explanation (I think): with preserve_underscores=False, as is the case for sumtrees.py, a label that contains a "-" character is written with spaces instead of underscores, as illustrated below.
>>> import dendropy
>>>
>>> s1 = "(A,(b-_b,c_c));"
>>> tree1 = dendropy.Tree.get(
... data=s1,
... schema="newick",
... preserve_underscores=False,
... )
>>>
>>> print(tree1.as_string("newick"))
(A,('b- b',c_c));
>>>
>>> for node in tree1:
... print(node)
...
<Node object at 0x7f172df8fd90: 'None' (None)>
<Node object at 0x7f172df8fd60: 'None' (<Taxon 0x7f172df8fca0 'A'>)>
<Node object at 0x7f172df8fc40: 'None' (None)>
<Node object at 0x7f172df8fbe0: 'None' (<Taxon 0x7f172df8fac0 'b- b'>)>
<Node object at 0x7f172df8fa60: 'None' (<Taxon 0x7f172df8fa00 'c c'>)>
It looks like this comes down to a technicality of the NEXUS format, with which the Newick output shares logic. https://doi.org/10.1093/sysbio/46.4.590
The relevant section is at the end:
Word.—Except for special cases involving quotes or comments, a NEXUS word is any string of text characters that is bounded by whitespace or punctuation and that does not contain whitespace or punctuation. If the first character of a word is a single quote, then the word ends with the next single quote (unless that single quote is in a pair of consecutive single quotes; if so, then the word ends at the first unpaired single quote). Any character, including punctuation and whitespace, may be contained within a quoted word. A word cannot consist of only whitespace and punctuation. On each of the following lines is a single legal word:
- Bembidion
- B._zephyrum
- 'John''s sparrow (eastern)'
Underscores are considered equivalent to blank spaces, except that underscores are dark characters and blank spaces are whitespace. Thus, a program encountering B._zephyrum and 'B. zephyrum' should judge them to be identical.
Punctuation.—Any of the following text characters are considered punctuation at some times: ( ) [ ] { } / \ , ; : = * ' " " + - < > The following punctuation marks have special properties: [ ] do not break a word; + and - are allowed as state symbols, but none of the rest are allowed; - is considered punctuation except where it is the minus sign in a negative number.
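The equivalence rule quoted above can be sketched in a few lines of plain Python. This is only an illustration of the rule as quoted, not DendroPy code: the helper names normalize_nexus_word and same_nexus_word are hypothetical, and the sketch assumes that underscores inside a quoted word are kept literally.

```python
def normalize_nexus_word(word: str) -> str:
    """Return the label a NEXUS word stands for (hypothetical helper)."""
    if len(word) >= 2 and word.startswith("'") and word.endswith("'"):
        # Quoted word: drop the outer quotes, collapse doubled single quotes.
        return word[1:-1].replace("''", "'")
    # Unquoted word: underscores are equivalent to blank spaces.
    return word.replace("_", " ")


def same_nexus_word(a: str, b: str) -> bool:
    """True if two words should be judged identical under the rule above."""
    return normalize_nexus_word(a) == normalize_nexus_word(b)


print(same_nexus_word("B._zephyrum", "'B. zephyrum'"))  # True
```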
As an aside, Newick does allow "-" characters and several others in words, just not blanks, parentheses, square brackets, single quotes, colons, semicolons, or commas. Newick also considers "_" equivalent to " " (space). https://evolution.genetics.washington.edu/phylip/newick_doc.html
I think the solution here will be to create specialized logic for the (less restrictive) escaping of Newick words and to replace the current calls to escape_nexus_token with it.
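A rough sketch of what such Newick-specific escaping might look like, assuming only the restrictions listed in the Newick description linked above. The name escape_newick_label is hypothetical and is not an existing DendroPy function. The sketch also assumes that a literal underscore in a label has to be quoted, since an unquoted one would be read back as a space.

```python
# Characters that may not appear in an unquoted Newick label, per the
# Newick description (blanks are handled separately below).
NEWICK_FORBIDDEN = set("()[]':;,")


def escape_newick_label(label: str) -> str:
    """Hypothetical, less restrictive alternative to NEXUS-style escaping."""
    if "_" in label or any(ch in NEWICK_FORBIDDEN for ch in label):
        # Quote the label and double any embedded single quotes.
        return "'" + label.replace("'", "''") + "'"
    # Characters such as "-" and "+" are fine unquoted; blanks become "_".
    return label.replace(" ", "_")


print(escape_newick_label("b- b"))            # b-_b
print(escape_newick_label("1 Species+1 sp"))  # 1_Species+1_sp
print(escape_newick_label("a,b"))             # 'a,b'
```

With this approach a label such as "b- b" would round-trip as b-_b instead of 'b- b'. Whitespace other than plain spaces is not handled here.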
Looking into this further, it seems that this occurs because of this rule in NEXUS.
Underscores are considered equivalent to blank spaces, except that underscores are dark characters and blank spaces are whitespace. Thus, a program encountering B._zephyrum and 'B. zephyrum' should judge them to be identical.
So, when quoting is forced by other special characters, "_" must be rewritten as " " (space) when it occurs inside the quotes. As strange as it may be, this behavior is consistent with NEXUS and is as expected.
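To make the writing side concrete, here is a minimal sketch of that behavior. It is not DendroPy's actual implementation: write_nexus_label is a hypothetical name, and the set of characters that force quoting is an assumption based on the NEXUS punctuation list.

```python
import re

# Assumption: any of these punctuation characters forces quoting.
_FORCES_QUOTES = re.compile(r"""[()\[\]{}/\\,;:=*'"`+\-<>]""")


def write_nexus_label(label: str) -> str:
    """Serialize an in-memory label (spaces already restored on reading)."""
    if _FORCES_QUOTES.search(label):
        # Inside quotes a blank is written literally, so no "_" appears.
        return "'" + label.replace("'", "''") + "'"
    # No quoting needed: blanks are written as underscores.
    return label.replace(" ", "_")


for label in ["1 Species+1 sp", "1 Species2 sp", "1-Species+sp"]:
    print(write_nexus_label(label))
# '1 Species+1 sp'
# 1_Species2_sp
# '1-Species+sp'
```

The three labels are what the reader holds in memory after converting underscores to spaces, and the printed forms match the TAXLABELS block in the reproduction.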
For reference, you can get more fine-grained control when serializing/deserializing through the tree interface:
>>> import dendropy
>>> newick_str = "((1_Species+1_sp,1_Species2_sp),1-Species+sp);"
>>> tree = dendropy.Tree.get(
... data=newick_str,
... schema="newick",
... preserve_underscores=True,
... )
>>> tree.as_string(schema="newick").strip()
"(('1_Species+1_sp','1_Species2_sp'),'1-Species+sp');"
Hi, tip labels containing a special character (tested with "-") will be quoted in the output tree of sumtrees.py. However, if such a tip label also contains the character "_", it will be replaced with a space. Otherwise, "_" is not replaced with a space. This seems like a hard-to-predict renaming pattern: the underscore should be retained and not renamed to a space, or there should be an option to choose how "_" is treated.
Example: input file "trees.txt" with two trees:
command:
sumtrees.py --summary-target=consensus --min-clade-freq=1 trees.txt