- if a directory etc/orthography/ exists, all *.tsv files in it will be considered orthography profiles, and a dict mapping filename stem to tokenizer will be available. Tokenizer selection can be controlled in two ways (see the sketch after this list):
  - passing a keyword profile=FILENAME_STEM in Dataset.tokenizer() calls;
  - providing an orthography profile for each language and letting Dataset.tokenizer choose the tokenizer by item['Language_ID'].
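Roughly, the two modes could look like this inside a dataset's cmd_makecldf; apart from the profile= keyword and the Language_ID lookup described above, the class skeleton, the exact call signature, and the example row are assumptions for illustration, not the actual pylexibank API:

```python
from pylexibank import Dataset as BaseDataset


class Dataset(BaseDataset):
    id = "mydataset"  # hypothetical dataset id

    def cmd_makecldf(self, args):
        # hypothetical raw row; real rows would come from self.raw_dir
        item = {"Language_ID": "Maya", "Form": "example"}

        # (1) explicit selection: pass the filename stem of the profile,
        #     here the one stored as etc/orthography/Maya.tsv
        segments = self.tokenizer(item, item["Form"], profile="Maya")

        # (2) implicit selection: without the keyword, the tokenizer is
        #     chosen via item['Language_ID'], if a matching profile exists
        segments = self.tokenizer(item, item["Form"])
```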
@LinguList Is this still the preferred way of implementing multiple orthography profiles? Do I need to add any information to the lexibank script, or will the assignment proceed automatically once the directory and the respective profiles exist?
My preferred way is now: etc/orthography/*.tsv as separate files, named after the language ID. But you can also start directly and use the two in combination, i.e., one etc/orthography.tsv along with etc/orthography/Maya.tsv.
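For illustration, such a per-language profile is just a tab-separated Grapheme/IPA mapping that gets turned into a tokenizer; a minimal sketch using the segments package, where the profile path and the test form are assumptions:

```python
from segments import Tokenizer

# hypothetical per-language profile with Grapheme and IPA columns
tokenizer = Tokenizer('etc/orthography/Maya.tsv')

# segment a raw form and map its graphemes to their IPA values
print(tokenizer("tz'ikin", column='IPA'))
```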
Would you consider sharing that script, so that we could, for example, create an individual orthography file for each language involved in the conversion of this repository? This would make it a lot easier to combine the data with data from other varieties and to derive all of them in a single CLDF conversion instead of splitting everything across repositories. I guess I could just copy the file 15 times and rename it, but if there is a script, why not use it. We are currently aiming to finish the data collection by the end of March.
Another small question that came to my mind: in order to filter out any borrowings for a phylogenetic model, I guess the easiest way is to load the wordlist into LingPy and create a subset of entries that includes only non-borrowed words, correct? You used a similar approach to filter for a specific subgroup in the LingPy tutorial, and I don't see why it wouldn't work for borrowings as well, or is there another recommended workflow?
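A minimal sketch of that kind of filtering, assuming the wordlist has been exported to TSV and carries a column flagging borrowings; the filename and the column name "borrowing" are assumptions about the dataset, not a fixed LingPy convention:

```python
from lingpy import Wordlist

wl = Wordlist('chibchan.tsv')  # hypothetical export of the CLDF wordlist

# rebuild the wordlist, keeping only rows whose (assumed) "borrowing"
# column is empty; row 0 must hold the header
subset = {0: [c for c in wl.columns]}
for idx in wl:
    if not wl[idx, 'borrowing']:
        subset[idx] = [wl[idx, c] for c in wl.columns]

wl_clean = Wordlist(subset)
wl_clean.output('tsv', filename='chibchan-no-borrowings')
```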
Sure, that was of course implied ;) but I'd suggest doing so only AFTER you have a first draft profile for all languages. If this is the case, we can start immediately.
```python
from collections import defaultdict
from unicodedata import normalize

from csvw.dsv import UnicodeDictReader
from clldutils.text import strip_brackets

# run this from a subdirectory next to cldf/ and etc/ (note the relative paths):
# read the CLDF forms table and the existing joint orthography profile
with UnicodeDictReader('../cldf/forms.csv') as reader:
    data = [row for row in reader]

profile = {}
with UnicodeDictReader('../etc/orthography.tsv', delimiter='\t') as reader:
    for row in reader:
        profile[normalize('NFC', row['Grapheme'])] = row['IPA']

languages = {row['Language_ID'] for row in data}
profiles = {language: defaultdict(int) for language in languages}
errors = {}
lexemes = {}

for row in data:
    # count each grapheme per language; graphemes missing from the joint
    # profile are marked with a leading '?'
    for char in row['Graphemes'].split():
        char = normalize('NFC', char)
        profiles[row['Language_ID']][char, profile.get(char, '?' + char)] += 1

    # move tone letters from before a syllable to after it ("swap tones")
    if any(x in row['Value'] for x in '˩˨˧˦˥'):
        tone = False
        out = []
        for char in strip_brackets(row['Form']).split(' '):
            char = normalize('NFC', char)
            if any(x in char for x in '˩˨˧˦˥'):
                if tone:
                    out[-1] += char
                else:
                    out += [char]
                    tone = True
            else:
                tone = False
                try:
                    out[-1] = char + out[-1]
                except IndexError:
                    out += [char]
        lexemes[row['Value']] = ''.join(out)

# write one draft profile per language, collecting unknown graphemes
for language in languages:
    with open('../etc/orthography/' + language + '.tsv', 'w', encoding='utf-8') as f:
        f.write('Grapheme\tIPA\tFrequency\n')
        for (char, ipa), freq in profiles[language].items():
            f.write('{0}\t{1}\t{2}\n'.format(char, ipa, freq))
            if ipa.startswith('?'):
                errors[char] = ipa[1:]

# forms with swapped tones, and graphemes missing from the joint profile
with open('../etc/lexemes2.tsv', 'w', encoding='utf-8') as f:
    for a, b in lexemes.items():
        f.write(a + '\t' + b + '\tswap tones\n')

with open('addons.tsv', 'w', encoding='utf-8') as f:
    for a, b in errors.items():
        f.write(a + '\t' + b + '\n')
```
BTW: this is pasted from another script I used, so you may need to adjust variable and file names (e.g., "lexemes2.tsv"), but since your data doesn't have tones, and this is only to create the first draft profile, you should be able to use it as is.
Thanks, that worked like a charm (in the heggartyandean repo)!
For some context: this Chibchan data is kind of a playground right now for me to try some things, explore topics in historical linguistics, and present some of it at StuTS, but the real work goes into the Quechua project. A lot of it is useful for both projects, however, such as the orthography profiles. If everything works out as planned, we can send you the Quechua data soon for uploading to EDICTOR.
If I were to collect more data in order to expand the wordlist, it would be great to find a way for a single CLDF conversion that includes different orthography profiles or, in case that's possible, an orthography profile with language-specific segments.