- if a directory etc/orthography/ exists, all *.tsv files in it will be considered orthography profiles, and a dict mapping filename stem to tokenizer will be available. Tokenizer selection can be controlled in two ways (see the sketch after this list):
  - passing a keyword profile=FILENAME_STEM in Dataset.tokenizer() calls;
  - providing an orthography profile for each language and letting Dataset.tokenizer choose the tokenizer by item['Language_ID'].
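Roughly, the two modes could look like this inside a dataset's cmd_makecldf; apart from the profile= keyword and the Language_ID lookup described above, the class skeleton, the exact call signature, and the example row are assumptions for illustration, not the actual pylexibank API:

```python
from pylexibank import Dataset as BaseDataset


class Dataset(BaseDataset):
    id = "mydataset"  # hypothetical dataset id

    def cmd_makecldf(self, args):
        # hypothetical raw row; real rows would come from self.raw_dir
        item = {"Language_ID": "Maya", "Form": "example"}

        # (1) explicit selection: pass the filename stem of the profile,
        #     here the one stored as etc/orthography/Maya.tsv
        segments = self.tokenizer(item, item["Form"], profile="Maya")

        # (2) implicit selection: without the keyword, the tokenizer is
        #     chosen via item['Language_ID'], if a matching profile exists
        segments = self.tokenizer(item, item["Form"])
```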
@LinguList Is this still the preferred way of implementing multiple orthography profiles? Do I need to add any information to the lexibank script, or will the assignment proceed automatically once the directory and the respective profiles exist?
My preferred way is now: etc/orthography/*.tsv as separate files, named after the language ID. But you can also start directly and use the two in combination, i.e., one etc/orthography.tsv along with etc/orthography/Maya.tsv.
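For illustration, such a per-language profile is just a tab-separated Grapheme/IPA mapping that gets turned into a tokenizer; a minimal sketch using the segments package, where the profile path and the test form are assumptions:

```python
from segments import Tokenizer

# hypothetical per-language profile with Grapheme and IPA columns
tokenizer = Tokenizer('etc/orthography/Maya.tsv')

# segment a raw form and map its graphemes to their IPA values
print(tokenizer("tz'ikin", column='IPA'))
```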
Would you consider sharing that script, so that we could, for example, create an individual orthography file for each language involved in the conversion of this repository? This would make it a lot easier to combine the data with data from other varieties and to derive all of them in a single CLDF conversion instead of splitting everything across repositories. I guess I could just copy the file 15 times and rename it, but if there is a script, why not use it. We are currently aiming to finish the data collection by the end of March.
Another small question that came to my mind: in order to filter out any borrowings for a phylogenetic model, I guess the easiest way is to load the wordlist into LingPy and create a subset of entries that includes only non-borrowed words, correct? You used a similar approach to filter for a specific subgroup in the LingPy tutorial, and I don't see why it wouldn't work for borrowings as well, or is there another recommended workflow?
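A minimal sketch of that kind of filtering, assuming the wordlist has been exported to TSV and carries a column flagging borrowings; the filename and the column name "borrowing" are assumptions about the dataset, not a fixed LingPy convention:

```python
from lingpy import Wordlist

wl = Wordlist('chibchan.tsv')  # hypothetical export of the CLDF wordlist

# rebuild the wordlist, keeping only rows whose (assumed) "borrowing"
# column is empty; row 0 must hold the header
subset = {0: [c for c in wl.columns]}
for idx in wl:
    if not wl[idx, 'borrowing']:
        subset[idx] = [wl[idx, c] for c in wl.columns]

wl_clean = Wordlist(subset)
wl_clean.output('tsv', filename='chibchan-no-borrowings')
```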
Sure, that was of course implied ;) but I'd suggest doing so only AFTER you have a first draft profile for all languages. If this is the case, we can start immediately.
```python
from collections import defaultdict
from unicodedata import normalize

from csvw.dsv import UnicodeDictReader
from clldutils.text import strip_brackets

# run this from a subdirectory next to cldf/ and etc/ (note the relative paths):
# read the CLDF forms table and the existing joint orthography profile
with UnicodeDictReader('../cldf/forms.csv') as reader:
    data = [row for row in reader]

profile = {}
with UnicodeDictReader('../etc/orthography.tsv', delimiter='\t') as reader:
    for row in reader:
        profile[normalize('NFC', row['Grapheme'])] = row['IPA']

languages = {row['Language_ID'] for row in data}
profiles = {language: defaultdict(int) for language in languages}
errors = {}
lexemes = {}

for row in data:
    # count each grapheme per language; graphemes missing from the joint
    # profile are marked with a leading '?'
    for char in row['Graphemes'].split():
        char = normalize('NFC', char)
        profiles[row['Language_ID']][char, profile.get(char, '?' + char)] += 1

    # move tone letters from before a syllable to after it ("swap tones")
    if any(x in row['Value'] for x in '˩˨˧˦˥'):
        tone = False
        out = []
        for char in strip_brackets(row['Form']).split(' '):
            char = normalize('NFC', char)
            if any(x in char for x in '˩˨˧˦˥'):
                if tone:
                    out[-1] += char
                else:
                    out += [char]
                    tone = True
            else:
                tone = False
                try:
                    out[-1] = char + out[-1]
                except IndexError:
                    out += [char]
        lexemes[row['Value']] = ''.join(out)

# write one draft profile per language, collecting unknown graphemes
for language in languages:
    with open('../etc/orthography/' + language + '.tsv', 'w', encoding='utf-8') as f:
        f.write('Grapheme\tIPA\tFrequency\n')
        for (char, ipa), freq in profiles[language].items():
            f.write('{0}\t{1}\t{2}\n'.format(char, ipa, freq))
            if ipa.startswith('?'):
                errors[char] = ipa[1:]

# forms with swapped tones, and graphemes missing from the joint profile
with open('../etc/lexemes2.tsv', 'w', encoding='utf-8') as f:
    for a, b in lexemes.items():
        f.write(a + '\t' + b + '\tswap tones\n')

with open('addons.tsv', 'w', encoding='utf-8') as f:
    for a, b in errors.items():
        f.write(a + '\t' + b + '\n')
```
BTW: this is pasted from another script I used, so you may need to adjust variable and file names (e.g., "lexemes2.tsv"), but since your data doesn't have tones, and this is only to create the first draft profile, you should be able to use it as is.
Thanks, that worked like a charm (in the heggartyandean repo)!
For some context: this Chibchan data is kind of a playground right now for me to try some things, explore topics in historical linguistics, and present some of it at StuTS, but the real work goes into the Quechua project. A lot of it is useful for both projects, however, such as the orthography profiles. If everything works out as planned, we can send you the Quechua data soon for uploading to EDICTOR.
If I were to collect more data in order to expand the wordlist, it would be great to find a way for a single CLDF conversion that includes different orthography profiles or, in case that's possible, an orthography profile with language-specific segments.