NEXUS output format tweaks.

SimonGreenhill commented 7 years ago

Looking at the NEXUS template in wordlist.py I can see a few tweaks that can be made.

I'm happy to make the changes and make a pull request, just need confirmation/thoughts where I've tagged you below.

The format declaration needs the SYMBOLS attribute, which should list the symbols in use in the data. For PAPS this should just be 0 and 1, I think (@LinguList)? We could hard-code this as:

FORMAT DATATYPE=STANDARD GAP=- MISSING={2} SYMBOLS="01" interleave=yes;

... unless there's an easy way to get a list of symbols in use in the PAPS output (@xrotwang?)
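One way to collect the symbols, as a minimal sketch: it assumes the PAPS data is available as a list of rows whose cells are either a single state or a list of states. The function name `symbols_in_use` and the `missing`/`gap` defaults are made up for illustration and are not part of wordlist.py.

```python
def symbols_in_use(matrix, missing="?", gap="-"):
    """Return the distinct state symbols in a character matrix as a sorted string.

    Hypothetical helper: `matrix` is assumed to be a list of rows, where each
    cell is a single state or a list/tuple of states.
    """
    symbols = set()
    for row in matrix:
        for cell in row:
            # Treat a bare state like a one-element list of states.
            states = cell if isinstance(cell, (list, tuple)) else [cell]
            symbols.update(str(state) for state in states)
    # Missing and gap markers are declared separately in FORMAT.
    symbols -= {missing, gap}
    return "".join(sorted(symbols))


matrix = [
    [[0], [1], [0, 1]],
    [[0], [1, 0], [0]],
]
format_line = 'FORMAT DATATYPE=STANDARD GAP=- MISSING=? SYMBOLS="{0}";'.format(
    symbols_in_use(matrix)
)
```

If the PAPS output only ever contains 0 and 1, this gives the same result as the hard-coded "01", so hard-coding may well be enough.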

The NEXUS format can label each character using a CHARSTATELABELS command, which looks like this:


BEGIN DATA;
    DIMENSIONS NTAX=3 NCHAR=3;
    FORMAT DATATYPE=STANDARD MISSING=? GAP=- SYMBOLS="01";
    CHARSTATELABELS
        1 char1,
        2 char2,
        3 char3
    ;
    MATRIX

This could replace/augment the [PAPS-REFERENCE] thing at the end of the template, unless PAPS-REFERENCE is being used for anything (I suspect it's just for documentation @LinguList?)
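A rough sketch of how the CHARSTATELABELS command could be generated from a list of character labels. The function name `charstatelabels` and the indentation are illustrative assumptions, not existing lingpy code.

```python
def charstatelabels(labels):
    """Build a CHARSTATELABELS command from an ordered list of character labels.

    Hypothetical helper: characters are numbered from 1 in list order.
    """
    entries = [
        "        {0} {1}".format(index, label)
        for index, label in enumerate(labels, start=1)
    ]
    # Entries are comma-separated; the command ends with a semicolon.
    return "\n".join(["    CHARSTATELABELS", ",\n".join(entries), "    ;"])


block = charstatelabels(["char1", "char2", "char3"])
```

Labels containing whitespace or punctuation would probably need quoting or cleaning before being written; that is not handled here.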

"ntax" should be "NTAX" for consistency (and interleave -> INTERLEAVE). The NEXUS format doesn't care if it's upper or lower case, but we should be consistent. Any preference?

LinguList commented 7 years ago

I have written a new nexus-export function, which I'll push to my lingulist/lingpy in a second. It does a bit more, based on a list of taxonomic units and a matrix for characters. It also includes, e.g., (a,b) states, which are useful for handling uncertainty, also in terms of borrowings, I think. This version is much more powerful than the current one, and I also have the idea of adding certain things simply into a lingpy-block, like:

BEGIN LINGPY;

character information
;

Similarly, I know now that MrBayes has a principled way to define character blocks (e.g., semantic units), and it would be cool to be able to have the same, say, for BEAST. I'm thinking of having one export for MrBayes, as it is still straightforward to use for quick analyses, and one for BEAST.

And reading your point 2 now: also a very nice idea. I'll show my code in a minute, and I suppose you could make a PR, and we then discuss the tweaks?

SimonGreenhill commented 7 years ago

BEAST2 now has these too:

begin assumptions;
        charset I_me = 1-20;
        charset alcoholliquor = 21-61;
        charset alivelive = 62-97;
        charset ant = 98-127;
        charset arrow = 128-159;
    ...
end;

...which is useful because it helps set up more complex analyses and documents the data better ("assumptions" is the accepted way to specify partitions in nexus data)
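As a sketch of how such a block could be produced: the function below assumes the characters are ordered by concept and that we know how many characters each concept contributes. `assumptions_block` and its input format are made up for illustration.

```python
def assumptions_block(concept_sizes):
    """Build an assumptions block with one charset per concept.

    Hypothetical helper: `concept_sizes` is an ordered list of
    (concept_name, number_of_characters) pairs, and the characters of each
    concept are assumed to be contiguous in the matrix.
    """
    lines = ["begin assumptions;"]
    start = 1
    for name, size in concept_sizes:
        end = start + size - 1
        lines.append("    charset {0} = {1}-{2};".format(name, start, end))
        start = end + 1
    lines.append("end;")
    return "\n".join(lines)


block = assumptions_block([("I_me", 20), ("alcoholliquor", 41), ("alivelive", 36)])
```

With these sizes the ranges match the first three charsets in the example above (1-20, 21-61, 62-97).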

And yes, every program seems to have its own block, so why not lingpy too!

LinguList commented 7 years ago

Here's the mrbayes template, and here is the idea for a more flexible write_nexus code.

Ideally, we would have the same code, but by changing the template we would get either good nexus for beast, or good nexus for mrbayes, etc., maybe even for paup (this is the nexus multistate format).

Right now, the paps-matrix can have the form:

[
    [[0], [1], [0, 1]],
    [[0], [1, 0], [0]]
]

This is then rendered as:

01(01)
0(10)0

And this is the same format as expected in multistate paup.
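The rendering step can be sketched roughly as follows. This is not the actual write_nexus code, just a minimal illustration of the mapping described above; `render_row` is a made-up name.

```python
def render_row(row):
    """Render one row of a paps-matrix as a NEXUS character string.

    Each cell is a list of states: a single state is written as-is,
    several states are written in parentheses without separators.
    """
    rendered = []
    for states in row:
        if len(states) == 1:
            rendered.append(str(states[0]))
        else:
            rendered.append("(" + "".join(str(state) for state in states) + ")")
    return "".join(rendered)


matrix = [
    [[0], [1], [0, 1]],
    [[0], [1, 0], [0]],
]
rows = [render_row(row) for row in matrix]
```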

LinguList commented 7 years ago

As to the beast blocks: this is cool, as the current code already computes this behaviour (only the "assumptions" needs to be changed).

And the lingpy-block might also be useful to allow for unicode input (I have not tested this yet, but I know that you can't use unicode for charset in mrbayes, and I suppose beast is similar). If other programs just ignore a lingpy block, this would be even nicer. If not, we should use the html-way to automatically convert from unicode, as this will be uncontroversial, I guess.

But having a lingpy block, one might even fantasize of reading nexus files in lingpy in the future, in order to integrate results written to nexus by other programs...

SimonGreenhill commented 7 years ago

Looks good! In terms of adding beast/mrbayes blocks, I think that can wait until further down the track, when we learn what people want to do. I'd stick with the core capabilities of CHARSTATELABELS and ASSUMPTIONS (which I think MrBayes, BEAST, PAUP* and Mesquite can handle, and they're most of the programs that people use).

One comment on a PAPS of (0,1) -- this is equivalent to "-", so you could just do that? I know from painful experience a lot of phylogenetic programs do not like these (10) things -- some crash, some treat "(10)" as state 10, some treat it as state "1" and state "0". Some programs (Mesquite?) want "{1,0}". 💩
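Since programs disagree on the notation, one option is to make the rendering of multistate cells a parameter. The sketch below is only an illustration: `render_cell`, the style names, and the defaults are assumptions, and collapsing to "-" simply follows the suggestion above for a cell that contains every state.

```python
def render_cell(states, style="paren", all_symbols="01", collapse_to="-"):
    """Render one cell (a list of states) in a chosen multistate notation.

    Hypothetical styles:
      "paren"    -> (10)
      "brace"    -> {1,0}
      "collapse" -> `collapse_to` when the cell contains every symbol,
                    otherwise falls back to the "paren" form
    """
    if len(states) == 1:
        return str(states[0])
    if style == "collapse" and {str(s) for s in states} == set(all_symbols):
        return collapse_to
    if style == "brace":
        return "{" + ",".join(str(s) for s in states) + "}"
    return "(" + "".join(str(s) for s in states) + ")"


examples = [render_cell([1, 0], style=s) for s in ("paren", "brace", "collapse")]
```

Which style a given program accepts would still need to be checked against that program.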

And that sounds like a nice future plan!

LinguList commented 7 years ago

Yep, sounds good to use the charstatelabels and assumptions, as you mentioned. And regarding paps, this will have to be handled differently using some parameters, I'm afraid. But mrbayes assumes you have polymorphisms, or uncertainties, and this is nice, as it may be used for uncertain borrowings. For Paup, there is another multi-state export in lingpy, where there is no comma, so if you pass a matrix with multistates (a, b, c), the current write_nexus would handle that correctly. But we'll need to see, as you say, what users (or we, as the primary users ;-) might want, and we will implement on this basis, I suppose, that is: when we feel the need. Right now, BEAST and MrBayes seem quite useful as targets.

SimonGreenhill commented 5 years ago

This is fixed now, right?

LinguList commented 5 years ago

Yes, and lingpy3 is on hold anyway...

lingpy / lingpy3

NEXUS output format tweaks. #10