jeetsukumaran / DendroPy

A Python library for phylogenetic scripting, simulation, data processing and manipulation.
https://pypi.org/project/DendroPy/
BSD 3-Clause "New" or "Revised" License

Special characters in input tree tip labels causes unpredictable renaming #144

Closed AlesBucek closed 6 months ago

AlesBucek commented 2 years ago

Hi, tip labels containing a special character (tested with "-") are quoted in the output tree of sumtrees.py. However, if such a tip label also contains the character _, each _ is replaced with a space. Otherwise, _ is not replaced with a space. This renaming pattern is hard to predict: underscores should be retained rather than renamed to spaces, or there should be an option for choosing how _ is treated.

Example: input file "trees.txt" with two trees:

((1_Species-1_sp,1_Species2_sp),1_Species3_sp);
((1_Species-1_sp,1_Species2_sp),1_Species3_sp);

command: sumtrees.py --summary-target=consensus --min-clade-freq=1 trees.txt

mmore500 commented 1 year ago

Able to reproduce,

tree.txt

((1_Species+1_sp,1_Species2_sp),1-Species+sp);

python3 applications/sumtrees/sumtrees.py tree.txt produces

...
BEGIN TAXA;
    DIMENSIONS NTAX=3;
    TAXLABELS
        '1 Species+1 sp'
        1_Species2_sp
        '1-Species+sp'
  ;
END;

BEGIN TREES;
    TREE 1 = [&U] ('1 Species+1 sp':0.0[&support=1.0][&length_mean=0.0,length_median=0,length_sd=inf,length_hpd95=None,length_quant_5_95={0,0},length_range={0,0}],1_Species2_sp:0.0[&support=1.0][&length_mean=0.0,length_median=0,length_sd=inf,length_hpd95=None,length_quant_5_95={0,0},length_range={0,0}],'1-Species+sp':0.0[&support=1.0][&length_mean=0.0,length_median=0,length_sd=inf,length_hpd95=None,length_quant_5_95={0,0},length_range={0,0}])1.00000000:0.0[&support=1.0][&length_mean=0.0,length_median=0,length_sd=inf,length_hpd95=None,length_quant_5_95={0,0},length_range={0,0}];
END;
...

mmore500 commented 1 year ago

Edit: this turned out not to be the case; disabling these lines does not affect the outcome.

Original suspicion: the culprit is the interaction of one of these lines

https://github.com/jeetsukumaran/DendroPy/blob/c31c1d986b0e6d72aa90e4fef61707e055ce3779/applications/sumtrees/sumtrees.py#L1385 https://github.com/jeetsukumaran/DendroPy/blob/c31c1d986b0e6d72aa90e4fef61707e055ce3779/applications/sumtrees/sumtrees.py#L1808

with library deserialization logic that assumes preserve_underscores is handled internally and treats strings with and without spaces separately, like (but not literally)

https://github.com/jeetsukumaran/DendroPy/blob/c31c1d986b0e6d72aa90e4fef61707e055ce3779/src/dendropy/dataio/nexusprocessing.py#L475

mmore500 commented 1 year ago

Actual explanation (I think):

  1. when deserializing, all underscores are replaced with spaces
    • if preserve_underscores=False, as is the case for sumtrees.py
  2. the internal representation has spaces instead of underscores, with or without a - character in the label
  3. re-serialization differs depending on whether the label contains a - character

as illustrated below

>>> import dendropy
>>> 
>>> s1 = "(A,(b-_b,c_c));"
>>> tree1 = dendropy.Tree.get(
...          data=s1,
...          schema="newick",
...          preserve_underscores=False,
... )
>>> 
>>> print(tree1.as_string("newick"))
(A,('b- b',c_c));

>>> 
>>> for node in tree1:
...   print(node)
... 
<Node object at 0x7f172df8fd90: 'None' (None)>
<Node object at 0x7f172df8fd60: 'None' (<Taxon 0x7f172df8fca0 'A'>)>
<Node object at 0x7f172df8fc40: 'None' (None)>
<Node object at 0x7f172df8fbe0: 'None' (<Taxon 0x7f172df8fac0 'b- b'>)>
<Node object at 0x7f172df8fa60: 'None' (<Taxon 0x7f172df8fa00 'c c'>)>

mmore500 commented 1 year ago

It looks like this comes down to a technicality of the NEXUS format, whose escaping logic the Newick output shares. https://doi.org/10.1093/sysbio/46.4.590

The relevant section is at the end

Word.—Except for special cases involving quotes or comments, a NEXUS word is any string of text characters that is bounded by whitespace or punctuation and that does not contain whitespace or punctuation. If the first character of a word is a single quote, then the word ends with the next single quote (unless that single quote is in a pair of consecutive single quotes; if so, then the word ends at the first unpaired single quote). Any character, including punctuation and whitespace, may be contained within a quoted word. A word cannot consist of only whitespace and punctuation. On each of the following lines is a single legal word:

  • Bembidion
  • B._zephyrum
  • 'John''s sparrow (eastern)'

Underscores are considered equivalent to blank spaces, except that underscores are dark characters and blank spaces are whitespace. Thus, a program encountering B._zephyrum and 'B. zephyrum' should judge them to be identical.

Punctuation.—Any of the following text characters are considered punctuation at some times: ( ) [ ] { } / \ , ; : = * ' " ` + - < > The following punctuation marks have special properties: [ ] do not break a word; + and - are allowed as state symbols, but none of the rest are allowed; - is considered punctuation except where it is the minus sign in a negative number.

As an aside, Newick does allow - and several other characters in unquoted words, just not blanks, parentheses, square brackets, single quotes, colons, semicolons, or commas. Newick also treats _ in an unquoted label as equivalent to a space. https://evolution.genetics.washington.edu/phylip/newick_doc.html
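To make the quoted rules concrete, here is a simplified model of the read/write round trip. This is a sketch, not DendroPy's actual code: `read_unquoted`, `write_nexus_style`, and `NEXUS_PUNCTUATION` are names made up for illustration, and the punctuation set only contains the characters listed in the passage above.

```python
# Simplified model of NEXUS-style label handling; NOT DendroPy's implementation.
# The punctuation set is an assumption based on the quoted NEXUS passage.
NEXUS_PUNCTUATION = set("()[]{}/\\,;:=*'\"+-<>")


def read_unquoted(token):
    """Model reading an unquoted token with preserve_underscores=False."""
    return token.replace("_", " ")


def write_nexus_style(label):
    """Model writing an internal label under the NEXUS word rules."""
    if any(ch in NEXUS_PUNCTUATION for ch in label):
        # Punctuation forces quoting. Inside quotes a space stays a space,
        # and a literal single quote is doubled.
        return "'" + label.replace("'", "''") + "'"
    # No punctuation: spaces are written as underscores, no quotes needed.
    return label.replace(" ", "_")


for token in ["1_Species+1_sp", "1_Species2_sp", "1-Species+sp"]:
    print(token, "->", write_nexus_style(read_unquoted(token)))
```

Under these assumptions the three labels come out as '1 Species+1 sp', 1_Species2_sp, and '1-Species+sp', which matches the TAXLABELS block in the reproduction above.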

mmore500 commented 1 year ago

I think the solution here will be to create specialized logic for the (less restrictive) escaping of newick words and replace the current calls to escape_nexus_token
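A rough sketch of what such Newick-specific escaping could look like. `escape_newick_label` is a hypothetical helper, not an existing DendroPy function, and the forbidden-character set is my reading of the Newick description linked above. It also assumes that underscores inside a quoted label are taken literally, so a label with a real underscore has to be quoted.

```python
# Hypothetical helper; not part of DendroPy.
# Characters that cannot appear in an unquoted Newick label (assumed from the
# Newick description: blanks, parentheses, square brackets, single quotes,
# colons, semicolons, commas).
NEWICK_FORBIDDEN = set(" \t\n()[]':;,")


def escape_newick_label(label):
    """Quote a label only when Newick itself requires it."""
    # A plain space can be written as an underscore, so by itself it does not
    # force quoting. A literal underscore does, since an unquoted underscore
    # would be read back as a space.
    needs_quotes = "_" in label or any(
        ch in NEWICK_FORBIDDEN and ch != " " for ch in label
    )
    if needs_quotes:
        return "'" + label.replace("'", "''") + "'"
    return label.replace(" ", "_")


print(escape_newick_label("1 Species+1 sp"))
print(escape_newick_label("b- b"))
print(escape_newick_label("John's sparrow"))
```

With this rule, - and + no longer force quoting, so the internal label "1 Species+1 sp" would be written back as 1_Species+1_sp, the same text as the input.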

mmore500 commented 6 months ago

Looking into this further, it seems that this occurs because of this rule in NEXUS.

Underscores are considered equivalent to blank spaces, except that underscores are dark characters and blank spaces are whitespace. Thus, a program encountering B._zephyrum and 'B. zephyrum' should judge them to be identical.

So, when quoting is forced by other special characters, _ must be rewritten as a space when it occurs inside the quotes. As strange as it may seem, this behavior is consistent with NEXUS and is as expected.
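If you need to match tip labels between the input and output files, one option is to compare them under this equivalence rule rather than as raw text. A minimal sketch: `canonical_label` is a made-up helper, not a DendroPy API, and it assumes that underscores inside quotes are literal.

```python
def canonical_label(token):
    """Map a raw NEXUS/Newick token to the label it denotes (sketch only)."""
    if len(token) >= 2 and token[0] == "'" and token[-1] == "'":
        # Quoted token: drop the outer quotes and un-double inner quotes.
        # Spaces and underscores are kept as written (assumption).
        return token[1:-1].replace("''", "'")
    # Unquoted token: underscores stand for spaces.
    return token.replace("_", " ")


# The input label and the quoted output label denote the same taxon.
print(canonical_label("1_Species+1_sp") == canonical_label("'1 Species+1 sp'"))
print(canonical_label("B._zephyrum") == canonical_label("'B. zephyrum'"))
```

Both comparisons print True, so the renamed labels in the sumtrees.py output can still be mapped back to the original tips.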

For reference, you can get more fine-grained control when serializing/deserializing through the tree interface:

>>> import dendropy
>>> newick_str = "((1_Species+1_sp,1_Species2_sp),1-Species+sp);"
>>> tree = dendropy.Tree.get(
...     data=newick_str,
...     schema="newick",
...     preserve_underscores=True,
... )
>>> tree.as_string(schema="newick").strip()
"(('1_Species+1_sp','1_Species2_sp'),'1-Species+sp');"