Closed erikrikarddaniel closed 1 year ago
Hi Daniel,
I am not entirely sure in which way these trees are corrupt, or rather, what exactly you would expect instead. In essence, it seems to me a mixture of a shortcoming of the newick file format, and a workaround in my code that might cause the trouble here:
"B4F2Z1_PROMH_(B4F2Z1)":0.158438
for your tree. So, to solve this, I'd be interested in the following:
Cheers and so long Lucas
I use Dendroscope, which might of course be the culprit.
On closer inspection, it seems the trees generated by gappa are identical except for the sequence names, and that it's the parentheses in the names that cause the problems. (I should have thought about that!) Gappa seems to have replaced spaces with underscores.
Is replacing parentheses something you'd want to implement?
Ahh haha right, the parentheses are also what causes genesis to put this in quotation marks in the first place... Sure, happy to implement that. What exactly would be good/expected behavior there? Replace all illegal characters with underscores?
That's what I always do except, apparently, this time! Which characters are illegal tends to differ, though.
For Newick, I'd typically consider `:;()[],` to be special, as well as space, and if jplace files are involved, also `{}`. Other than that, all printable characters should work.
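To make that character list concrete, here is a minimal Python sketch that reports which of those characters occur in a given name. The function name `find_special_chars` and the `jplace` switch are made up for illustration; this is not how gappa or genesis do it internally.

```python
# Characters treated as special in this sketch, taken from the list above:
# Newick structural characters, the space, and the braces used in jplace files.
NEWICK_SPECIAL = set(":;()[],")
JPLACE_SPECIAL = set("{}")


def find_special_chars(name, jplace=False):
    """Return the sorted list of special characters that occur in a taxon name.

    Hypothetical helper for illustration only.
    """
    special = NEWICK_SPECIAL | {" "}
    if jplace:
        special |= JPLACE_SPECIAL
    return sorted(set(name) & special)


if __name__ == "__main__":
    # The name from this issue contains parentheses.
    print(find_special_chars("B4F2Z1_PROMH_(B4F2Z1)"))
```

A check like this could be run over the sequence names before tree inference, so that problematic names are caught early.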
According to http://evolution.genetics.washington.edu/phylip/newicktree.html:
A name can be any string of printable characters except blanks, colons, semicolons, parentheses, and square brackets.
Well, they forgot to mention commas there... That source also contains the reference for why I replaced spaces with underscores:
Because you may want to include a blank in a name, it is assumed that an underscore character ("_") stands for a blank; any of these in a name will be converted to a blank when it is read in.
So, in summary, all the above illegal characters can either be put in quotation marks, which is non-standard, or replaced by underscores, which is also non-standard, but at least should work with downstream tools and tree viewers... I'll implement both as options in gappa. Not sure when I'll get to do this though :-(
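The two options could look roughly like the following Python sketch. The helper names `replace_special` and `quote_if_needed` are hypothetical, the input name with spaces is an assumed example, and the quoting variant does not handle names that themselves contain quotation marks. It illustrates the idea only and is not the gappa implementation.

```python
# Characters considered illegal in this sketch: the Newick structural
# characters, the braces used in jplace files, and the space.
SPECIAL = set(":;()[],{} ")


def replace_special(name, replacement="_"):
    """Option 1: replace every special character with an underscore."""
    return "".join(replacement if c in SPECIAL else c for c in name)


def quote_if_needed(name):
    """Option 2: wrap the name in double quotes if it contains a special character.

    Simplification: embedded quotation marks are not escaped.
    """
    if any(c in SPECIAL for c in name):
        return '"' + name + '"'
    return name


if __name__ == "__main__":
    # Assumed example name with spaces and parentheses.
    print(replace_special("B4F2Z1 PROMH (B4F2Z1)"))
    print(quote_if_needed("B4F2Z1_PROMH_(B4F2Z1)"))
```

Replacing loses the original characters but tends to be safe for downstream viewers; quoting keeps the name intact but relies on the reader supporting quoted labels.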
Thanks!
Okay, I've added a flag to the Newick output, see for example the graft command. As of now, this is only available when compiling the latest source code. It will be part of the next release version (either v0.8.3 or v0.9.0, depending on how much else will change by then), once I've fixed the issue with conda not updating automatically any more.
Stay tuned for that. Closing the issue for now, but feel free to re-open should you have trouble with the new option.
Cheers Lucas
I ran RAxML-NG on three test sequences that happened to have spaces in their names, then gappa examine graft, but the tree got corrupted somehow (see attached files spaces.*). After modifying the files to get rid of the spaces, the tree looks fine (nospaces.*).
Attached: spaces.newick.txt spaces.jplace.txt nospaces.newick.txt nospaces.jplace.txt