Aligning Proto-Forms against their reflexes

LinguList commented 9 years ago

Now that I have added the proto-forms as "simple languages" (language *PT in the Edictor), all proto-forms should be aligned to their reflexes. In this way, we can later check and model how the proto-forms changed into the reflexes. We can use this to test:

how well the proto-language can predict the daughter languages,
which sound changes occur frequently in the data, and
which of the sound changes needed to model the data are problematic.

For all of this, we'll need the alignments.
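To make the second point concrete, here is a minimal sketch of how sound correspondences could be counted once proto-forms and reflexes are aligned. The function name `count_correspondences` and the toy data are hypothetical and not part of the Edictor or of our dataset; the sketch only assumes that each alignment is a pair of equally long token lists, with "-" marking a gap.

```python
from collections import Counter


def count_correspondences(aligned_pairs):
    """Count (proto sound, reflex sound) pairs over aligned token lists."""
    counts = Counter()
    for proto, reflex in aligned_pairs:
        # Aligned sequences must have one cell per column.
        if len(proto) != len(reflex):
            raise ValueError("aligned sequences must have equal length")
        counts.update(zip(proto, reflex))
    return counts


# Hypothetical toy data: (proto-form, reflex) pairs, already aligned.
pairs = [
    (["X", "Y", "Z"], ["X", "Y", "W"]),
    (["X", "Y", "Z"], ["X", "Y", "-"]),
    (["A", "Z"], ["A", "W"]),
]
correspondences = count_correspondences(pairs)
```

Sorting the resulting counter by frequency would give a first overview of which changes are regular and which are rare enough to deserve a closer look.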

thiagochacon commented 9 years ago

Great. I think I can add here the suggestion I made by email. I suggested we should try to "hierarchize" the proto-form with the descendant forms. This could be helpful for the main problematic cases in the alignment: metathesis, and phonological splits and mergers.

If we work with some sort of hierarchy, we could link the particular reflexes with a proto-form cell (i.e., a reconstructed sound). The normal/unmarked situation could be handled by the alignment proper. Otherwise, we could link a particular reflex to one or more proto-form cells.

Suppose we have the following scenario:

Proto-L XYZ
L1 XYW
L2 XZY
L3 XYAB
L4 XT

In this scenario, L1 W would be aligned, and thus automatically linked, with PL Z. L2 Z would be linked to PL Z. L3 AB would be linked to PL Z. L4 T would be linked to PL YZ.

Do you think this would be a good idea? How close are we to managing that with the current state of the alignment tech?

LinguList commented 9 years ago

The easiest and most straightforward approach here is to add another column containing the "linking". This would start from the proto-form in its tokenized representation (that is, the "TOKENS" column). We could then use some easy-to-define markup in which the relation to the proto-form is defined for each reflex. This would come close to the solution Paul presented.

A possible example for markup would be:

PROTO X/1 Y/2 Z/3
L1 X/1 Y/2 W/3
L2 X/1 Z/3 Y/2
L3 X/1 Y/2 A/3 B/3
L4 X/1 T/2,3

Here, the numbers in the reflexes refer to the numbers defined in the proto-form, so T/2,3 in L4 means that T corresponds to both Y and Z.
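To check that such a markup is easy to process, here is a rough sketch of a parser for the lines above. The names `parse_links` and `link_table` are made up for illustration, and the sketch assumes that every cell has the form SEGMENT/INDEX (or SEGMENT/INDEX,INDEX for a reflex that covers several proto-segments) and that each proto-segment carries exactly one number.

```python
def parse_links(line):
    """Parse 'L4 X/1 T/2,3' into ('L4', [('X', [1]), ('T', [2, 3])])."""
    name, *cells = line.split()
    parsed = []
    for cell in cells:
        segment, _, refs = cell.partition("/")
        parsed.append((segment, [int(r) for r in refs.split(",")]))
    return name, parsed


def link_table(proto_line, reflex_line):
    """Map each proto index to (proto segment, list of linked reflex segments)."""
    _, proto = parse_links(proto_line)
    _, reflex = parse_links(reflex_line)
    # Assumption: every proto segment has exactly one index.
    table = {idxs[0]: (segment, []) for segment, idxs in proto}
    for segment, idxs in reflex:
        for idx in idxs:
            table[idx][1].append(segment)
    return table


proto = "PROTO X/1 Y/2 Z/3"
split_case = link_table(proto, "L3 X/1 Y/2 A/3 B/3")
merger_case = link_table(proto, "L4 X/1 T/2,3")
metathesis_case = link_table(proto, "L2 X/1 Z/3 Y/2")
```

Keying the table by the proto index rather than by the proto segment keeps the lookup unambiguous when the same sound occurs twice in one proto-form. The order of the reflex segments, which matters for metathesis, is still available from `parse_links`.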

I could also write a tool similar to the alignment editor which would display these internal formats nicely or allow for quick editing.

But before starting to work on technical solutions here, I suggest we use this issue to collect the cognate sets where such a representation is actually needed. If, in the end, it is only two cases or so, we might come up with an easier solution. If not, the examples will help us to identify which functionality we need in the end.
