Open grahamgower opened 3 years ago
Sigh - I think this basically means conversion will never work for ms. Unless we have sufficient precision on the branch lengths the whole strategy fails because we rely on being able to uniquely identify nodes by their times. If we only have three digits of precision, then that's not going to work.
Yeah, it does seem problematic. Annoyingly, the ms sources have this hardcoded to 3 decimal places. In most cases, I think mspms -p xx is a pretty reasonable substitute, since it applies the precision to all output. The diff below is for when you really need to test with ms. But as you say, there's no solution if you already have a ton of ms output lying around and want to do something useful with it.
$ diff -u ms.c.bak ms.c
--- ms.c.bak 2021-07-12 08:28:08.726221170 +0200
+++ ms.c 2021-07-12 08:28:12.682762440 +0200
@@ -919,7 +919,7 @@
double time ;
if( descl[noden] == -1 ) {
- printf("%d:%5.3lf", noden+1, (ptree+ ((ptree+noden)->abv))->time - (ptree+noden )->time ); /* adna */
+ printf("%d:%5.13lf", noden+1, (ptree+ ((ptree+noden)->abv))->time - (ptree+noden )->time ); /* adna */
}
else{
printf("(");
@@ -929,7 +929,7 @@
if( (ptree+noden)->abv == 0 ) printf(");\n");
else {
time = (ptree + (ptree+noden)->abv )->time - (ptree+noden)->time ;
- printf("):%5.3lf", time );
+ printf("):%5.13lf", time );
}
}
}
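To make the effect of the format change concrete, here is a small sketch in Python, whose `%` operator follows the same printf conventions for `f` conversions as the C code in the diff. The branch length value is made up for illustration and is not taken from real ms output.

```python
# Illustrative branch length in coalescent units (hypothetical value).
branch_length = 0.123456789012345

# Stock ms formats branch lengths with "%5.3lf"; the patch uses "%5.13lf".
# Python's "%f" conversion corresponds to C's "%lf" in printf.
three_dp = "%5.3f" % branch_length
thirteen_dp = "%5.13f" % branch_length

print(three_dp)
print(thirteen_dp)

# Error introduced when the printed text is parsed back into a float.
error_3dp = abs(float(three_dp) - branch_length)
error_13dp = abs(float(thirteen_dp) - branch_length)
print(error_3dp, error_13dp)
```

With three decimal places the parsed value can be off by up to 0.0005, which is large compared with the gaps between coalescence times in a big sample.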
> we rely on being able to uniquely identify nodes by their times
Maybe I'm missing something unique regarding converting these ms newicks, but the new parser based on the newick module places no such constraint on node times, so it should be possible to use it to parse these?
This is a separate issue to the ultrametricity thing @benjeffery.
The assumption we're making about nodes (in the tree sequence sense) in ms output @benjeffery is that they're uniquely identified by their times. There can only be one coalescence event at any one time in the coalescent, so this is a good assumption (for ms at least, probably not for other programs using ms output). If the branch lengths aren't output with high precision, then this won't work. We'll have no way of identifying the tree sequence nodes in different trees, so we'd end up with a JBOT (if we supported it).
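The failure mode can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual conversion code: `assign_node_ids` is a hypothetical helper, and the lists of times stand in for internal-node times recovered from the Newick branch lengths of two adjacent trees.

```python
def assign_node_ids(trees, ndigits=None):
    """Give each distinct node time a single node ID shared across trees.

    `trees` is a list of lists of internal-node times (hypothetical input).
    If `ndigits` is set, times are rounded first, mimicking low-precision
    output such as the 3 decimal places printed by stock ms.
    """
    ids = {}
    for times in trees:
        for t in times:
            key = t if ndigits is None else round(t, ndigits)
            # Reuse the ID if this time was already seen, else allocate one.
            ids.setdefault(key, len(ids))
    return ids


# Two adjacent trees that share their two oldest nodes but differ in
# the most recent coalescence (0.1231 vs 0.1234). Values are made up.
tree_a = [0.1231, 0.4567, 1.2042]
tree_b = [0.1234, 0.4567, 1.2042]

full = assign_node_ids([tree_a, tree_b])
rounded = assign_node_ids([tree_a, tree_b], ndigits=3)

print(len(full))     # distinct nodes found at full precision
print(len(rounded))  # distinct nodes found after rounding to 3 places
```

At full precision the four distinct nodes are kept apart, while rounding to three places merges the two different coalescence events at 0.1231 and 0.1234 into one. The opposite error is also possible: when node times are summed from rounded branch lengths, the same node can come out at slightly different times in different trees.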
Yuck! I get it now. Not much point in putting any effort into from_ms then.
Well, ms output may be problematic, but there are a range of other programs that output ms format. In particular, mspms -p 12 ... seems to work just fine. I guess that scrm output would also be usable.
It seems the precision of the branch lengths is not sufficient. The "precision" -p parameter has no effect on this.