lczech / gappa

A toolkit for analyzing and visualizing phylogenetic (placement) data
GNU General Public License v3.0

Spaces in sequence names lead to corrupt trees from gappa examine graft #19

Closed erikrikarddaniel closed 1 year ago

erikrikarddaniel commented 1 year ago

I ran RAxML-NG on three test sequences that happened to have spaces in their names, then ran gappa examine graft, but the resulting tree was corrupted somehow (see attached files spaces.*). After modifying the files to get rid of the spaces, the tree looks fine (nospaces.*).

spaces.newick.txt spaces.jplace.txt nospaces.newick.txt nospaces.jplace.txt

lczech commented 1 year ago

Hi Daniel,

I am not entirely sure in which way these trees are corrupt, or rather, what exactly you would expect instead. In essence, it seems to me to be a mixture of a shortcoming of the Newick file format and a workaround in my code that might cause the trouble here:

So, to solve this, I'd be interested in the following:

Cheers and so long
Lucas

erikrikarddaniel commented 1 year ago

I use Dendroscope, which might of course be the culprit.

On closer inspection, it seems the trees generated by gappa are identical except for the sequence names, and that it's the parentheses in the names that cause the problems. (I should have thought about that!) Gappa seems to have replaced spaces with underscores.

Is replacing parentheses something you'd want to implement?

lczech commented 1 year ago

Ahh haha right, the parentheses are also what causes genesis to put this in quotation marks in the first place... Sure, happy to implement that. What exactly would be good/expected behavior there? Replace all illegal characters with underscores?

erikrikarddaniel commented 1 year ago

That's what I always do except, apparently, this time! Which characters are illegal tends to differ though.

lczech commented 1 year ago

For Newick, I'd typically consider the characters :;()[] and the comma to be special, as well as the space, and, if jplace files are involved, also {}. Other than that, all printable characters should work.

According to http://evolution.genetics.washington.edu/phylip/newicktree.html:

A name can be any string of printable characters except blanks, colons, semicolons, parentheses, and square brackets.

Well, they forgot to mention commas there... That source also contains the reference for why I replaced spaces with underscores:

Because you may want to include a blank in a name, it is assumed that an underscore character ("_") stands for a blank; any of these in a name will be converted to a blank when it is read in.

So, in summary, all the above illegal characters can either be put in quotation marks, which is non-standard, or replaced by underscores, which is also non-standard, but at least should work with downstream tools and tree viewers... I'll implement both as options in gappa. Not sure when I'll get to do this though :-(
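To make the two options concrete, here is a minimal Python sketch of what they could look like. This is not the genesis/gappa code; the function names are made up for illustration, the character set is simply the one listed above, and the choice of double quotes for the quoting variant is an assumption. The quoting variant also does not handle names that already contain a quotation mark.

```python
# Characters treated as special in this thread: Newick structure characters,
# whitespace, and (only when jplace files are involved) curly braces.
NEWICK_SPECIAL = set(":;()[],")
JPLACE_SPECIAL = set("{}")


def _is_special(ch, jplace):
    """Return True if ch would need quoting or replacing in a taxon name."""
    return ch.isspace() or ch in NEWICK_SPECIAL or (jplace and ch in JPLACE_SPECIAL)


def replace_special(name, jplace=True):
    """Option 1 (hypothetical helper): replace each special character with '_'."""
    return "".join("_" if _is_special(ch, jplace) else ch for ch in name)


def quote_if_needed(name, jplace=True):
    """Option 2 (hypothetical helper): wrap the name in double quotes if needed.

    Assumption: double quotes are used, and embedded quotes are not escaped.
    """
    if any(_is_special(ch, jplace) for ch in name):
        return '"' + name + '"'
    return name


if __name__ == "__main__":
    name = "Escherichia coli (K-12)"
    print(replace_special(name))   # Escherichia_coli__K-12_
    print(quote_if_needed(name))   # "Escherichia coli (K-12)"
```

Note that the underscore variant is lossy: a space, a parenthesis, and a comma all end up as the same character, so two different names can collide after replacement.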

erikrikarddaniel commented 1 year ago

Thanks!

lczech commented 1 year ago

Okay, I've added a flag to the Newick output, see for example the graft command. As of now, this is only available when compiling the latest source code. It will be part of the next release version (either v0.8.3 or v0.9.0, depending on how much else will change by then), once I've fixed the issue with conda not updating automatically anymore.

Stay tuned for that. Closing the issue for now, but feel free to re-open should you have trouble with the new option.

Cheers
Lucas