lmaurits / BEASTling

A linguistics-focussed command line tool for generating BEAST XML files.
BSD 2-Clause "Simplified" License
Is handling of multiple values correct? #225

Open Anaphory opened 5 years ago

Anaphory commented 5 years ago

My synonym handling code implies a certain difference between ? and -. I would like to know, for various combinations of unknown, missing, absent, and present values, what the binarization and the single-value summary should look like. Ideally, this would be in a format that could be turned into a list of test cases.

F1: a ternary feature (ABC), to be binarized.

Lg,Ft,V
l1,F1,A
l2,F1,C
l3,F1,A
l3,F1,B
l4,F1,-
l6,F1,?
l7,F1,A
l7,F1,?

should in my opinion give the following alignments:

    taxon="l1" value="100"
    taxon="l2" value="001"
    taxon="l3" value="110"
    taxon="l4" value="000"
    taxon="l5" value="000"
    taxon="l6" value="???"
    taxon="l7" value="1??"

I'm not sure about l5, though. I think treating it as absent is the easier option to implement, but a value not being listed in the data might equally mean that it is unknown.
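
To make the table above checkable, here is a minimal sketch of the proposed rules as a test case. It is not BEASTling's actual implementation: the function `binarize`, the constant `STATES`, and the `unlisted` parameter are hypothetical names introduced for illustration only. The assumed rules are: a listed state gives 1, `-` gives all zeros, `?` turns every state not known to be present into `?`, and a language with no row at all is treated as absent by default (pass `unlisted="unknown"` for the other reading).

```python
import csv
import io
from collections import defaultdict

STATES = ["A", "B", "C"]  # assumed state order for the ternary feature F1

DATA = """Lg,Ft,V
l1,F1,A
l2,F1,C
l3,F1,A
l3,F1,B
l4,F1,-
l6,F1,?
l7,F1,A
l7,F1,?
"""


def binarize(rows, languages, states=STATES, unlisted="absent"):
    """Binarize (language, value) pairs of one feature into 0/1/? strings.

    Hypothetical helper: `unlisted` decides whether a language without any
    row is coded as all-absent ("absent") or all-unknown ("unknown").
    """
    observed = defaultdict(list)
    for lang, value in rows:
        observed[lang].append(value)

    alignment = {}
    for lang in languages:
        values = observed.get(lang)
        if values is None:
            # No row at all for this language.
            filler = "0" if unlisted == "absent" else "?"
            alignment[lang] = filler * len(states)
            continue
        present = {v for v in values if v in states}
        # Any "?" makes every state not known to be present unknown;
        # otherwise (including an explicit "-") those states are absent.
        filler = "?" if "?" in values else "0"
        alignment[lang] = "".join(
            "1" if state in present else filler for state in states
        )
    return alignment


rows = [
    (row["Lg"], row["V"])
    for row in csv.DictReader(io.StringIO(DATA))
    if row["Ft"] == "F1"
]
languages = [f"l{i}" for i in range(1, 8)]
result = binarize(rows, languages)
for lang in languages:
    print(f'taxon="{lang}" value="{result[lang]}"')
```

Calling `binarize(rows, languages, unlisted="unknown")` changes only l5, from `000` to `???`, which is exactly the open question.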

lmaurits commented 5 years ago

Re: l5, I would say that "explicit is better than implicit". If we know that a language lacks a feature, we should say so by putting - in the data. If a value is not listed in the data, we should assume as little as possible and treat it as missing data (?).

Re: l7, what's a real world situation where you would code something like this?

Anaphory commented 5 years ago

> Re: l7, what's a real world situation where you would code something like this?

“Our wordlist, which was cognate-coded by an expert on l7, contains two forms for l7. The expert said that the first form is in class A, but forgot to code the second form.”

There may be better use cases.