lexibank / pylexibank

The python curation library for lexibank
Apache License 2.0

Grouped Sounds Notation in Lexibank and other libraries #258

Open LinguList opened 2 years ago

LinguList commented 2 years ago

Based on past work on automatic reconstruction and on tests with individual datasets in EDICTOR, I have realized that we can avoid problematic alignments by introducing a "grouped-sounds notation" for sequences. If two sounds should form a unit, I no longer separate them by a space, but by a dot. This allows me to match, e.g., k.j against ts. We can also sidestep many diphthong vs. monophthong decisions if we allow a u to be written as a.u where we are not sure. I am writing a short article that shows how helpful this can be in many approaches, specifically in alignments, where it avoids gaps. Gaps are always a problem, as they are often unmotivated: aligning k j a ŋ with ts ã involves two gaps, but none if we resort to k.j a.ŋ for the former.
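The gap argument can be made concrete with a small sketch. It assumes that sequences are written as space-separated strings, and it relies on the fact that aligning two sequences needs at least as many gaps as their lengths differ. The helper `min_gaps` is hypothetical and not part of any library.

```python
def min_gaps(seq_a, seq_b):
    """Lower bound on the gaps needed to align two space-separated sequences."""
    # Each space-separated token is one alignment unit; a grouped
    # token such as "k.j" therefore counts as a single unit.
    return abs(len(seq_a.split()) - len(seq_b.split()))


print(min_gaps("k j a ŋ", "ts ã"))  # 2
print(min_gaps("k.j a.ŋ", "ts ã"))  # 0
```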

In orthography profiles, this notation can be introduced directly in the profile. We can even introduce it only implicitly by (ab)using the slash notation, writing k .j/j a .ŋ/ŋ, which can be converted to k.j a.ŋ with a very short function:

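def group_sounds(segments):
    """Merge slash-annotated segments such as ".j/j" into the preceding segment."""
    out = []
    for segment in segments:
        if "/" in segment:
            # Only the part before the slash is used; split once so that
            # a stray second slash does not break the unpacking.
            one, _ = segment.split("/", 1)
            if one.startswith(".") and out:
                # A leading dot marks grouping: "k", ".j/j" -> "k.j".
                out[-1] += one
            else:
                out.append(one)
        else:
            out.append(segment)
    return out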

In Lexibank, we can add a GroupedSegments column to the FormTable, in which the ungrouped Segments are grouped. Grouping can even be done later, on the fly.
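As a rough sketch of what on-the-fly grouping could look like, the following treats a form as a plain dictionary. It assumes that the Segments value is a list in which grouping is still marked via the slash notation; `add_grouped_segments` is a hypothetical helper and not a pylexibank function.

```python
def add_grouped_segments(row):
    """Return a copy of a form row with a GroupedSegments entry added.

    Assumes row["Segments"] is a list of ungrouped segments in which
    grouping is marked via the slash notation (e.g. ".j/j").
    """
    grouped = []
    for segment in row["Segments"]:
        # Keep only the part before the slash, if there is one.
        source = segment.split("/", 1)[0]
        if source.startswith(".") and grouped:
            # A leading dot attaches the sound to the previous segment.
            grouped[-1] += source
        else:
            grouped.append(source)
    # The original row is left untouched; a new dict is returned.
    return {**row, "GroupedSegments": grouped}


row = {"ID": "example-1", "Segments": ["k", ".j/j", "a", ".ŋ/ŋ"]}
print(add_grouped_segments(row)["GroupedSegments"])  # ['k.j', 'a.ŋ']
```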

In the long run, once this has been properly tested, I would however suggest making this part of the normal Segments and checking CLTS compatibility for each grouped element individually rather than for the group as a whole, which would require modifying the pylexibank code.
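Checking the grouped elements individually could look roughly like this. The sketch splits every segment on the dot and looks up each component separately; `KNOWN_SOUNDS` is a toy stand-in for an actual CLTS lookup, and `unknown_components` is a hypothetical name, so neither reflects the real pylexibank or CLTS API.

```python
# Toy stand-in for a CLTS lookup (an assumption for illustration only).
KNOWN_SOUNDS = {"k", "j", "a", "ŋ", "ts"}


def unknown_components(segments, known=KNOWN_SOUNDS):
    """Return the components of (grouped) segments that are not in `known`."""
    return [
        sound
        for segment in segments
        for sound in segment.split(".")  # "k.j" -> "k", "j"
        if sound not in known
    ]


print(unknown_components(["k.j", "a.ŋ"]))  # []
print(unknown_components(["k.j", "a.x"]))  # ['x']
```

With this kind of check, k.j passes as long as k and j are both known, even though k.j itself is not listed as a sound.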