check problematic lexemes and orthography

LinguList commented 4 years ago

Okay, the final step, to run you through our workflow, @Maunus, is to convert your specific orthographies to IPA.

Step 1: as our format does not allow comments inside the form itself, we have an automated way of identifying those. I have written them to the file etc/lexemes.tsv. You have the form, a description of the error, and in second position a question mark. There, the form without the comment (and without modifying it further) needs to be provided. Basically, this is just copy-pasting and deleting the PUA, SUA (similar to what you did in the wordlist.tsv file, although this file is automatically generated and will be overwritten, so it should not be edited now).
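To make the idea of Step 1 concrete, here is a minimal Python sketch of how such a replacement table could be applied. The column names (LEXEME, REPLACEMENT), the two sample rows, and the helper names are assumptions made for illustration; they are not taken from the actual etc/lexemes.tsv.

```python
import csv
import io

# Hypothetical excerpt of etc/lexemes.tsv. The column names and both rows
# are made up for illustration; "?" marks a replacement not yet provided.
LEXEMES_TSV = (
    "LEXEME\tREPLACEMENT\n"
    "kwa (PUA)\tkwa\n"
    "tla (SUA)\t?\n"
)


def load_replacements(text):
    """Map each commented form to its cleaned form, skipping open '?' rows."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {
        row["LEXEME"]: row["REPLACEMENT"]
        for row in reader
        if row["REPLACEMENT"] != "?"
    }


def clean_form(form, replacements):
    """Return the replacement if one is listed, else the form unchanged."""
    return replacements.get(form, form)


replacements = load_replacements(LEXEMES_TSV)
```

Rows that still carry the question mark are simply ignored here, so an unfinished table does not change any form.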

Step 2: in the file etc/orthography.tsv, I have automatically assembled a first conversion of your orthography to plain IPA (which will also segment the sounds automatically). The easiest cases are all cases with y, where the IPA equivalent is j. But open questions remain: geminates should be written with a length marker, not with two consonants, same for long vowels, etc. The file has been automatically generated and already contains a lot of suggestions.
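As a rough illustration of what an orthography profile does, the following Python sketch converts a form by greedy longest-match lookup. It is a simplified stand-in, not the code the workflow actually runs; the profile entries and the test word "yatta" are invented, and the angle-bracket marking of unknown characters is an assumption of this sketch.

```python
# Hypothetical mini-profile: grapheme (as in the source data) -> IPA.
# Doubled letters are mapped to one segment with a length marker.
PROFILE = {
    "y": "j",
    "a": "a",
    "t": "t",
    "tt": "tː",
    "aa": "aː",
}


def convert(form, profile):
    """Segment a form by greedy longest match and return the IPA segments."""
    longest = max(len(grapheme) for grapheme in profile)
    segments = []
    i = 0
    while i < len(form):
        # Try the longest possible grapheme first, then shorter ones.
        for size in range(min(longest, len(form) - i), 0, -1):
            chunk = form[i:i + size]
            if chunk in profile:
                # An IPA value containing a space yields several segments.
                segments.extend(profile[chunk].split())
                i += size
                break
        else:
            # Unknown character: keep it visible so it can be added later.
            segments.append("<" + form[i] + ">")
            i += 1
    return segments


print(" ".join(convert("yatta", PROFILE)))
```

Because the longest grapheme wins, "tt" is read as one long consonant rather than as two separate t segments.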

Let me know once you have corrected the cases; then I'll run the code again, and you'll see that we may even be able to align the data.

Maunus commented 4 years ago

I will take a look. I already corrected some segmentation issues in the main file where kw was segmented as k w and tl as t l for example - and where vowel and vowel/semivowel sequences were all interpreted as diphthongs. There is also a difference between actual geminates and double consonants.

Maunus commented 4 years ago

How do I correct it? For example hw and lw are consonant sequences not phonemes (same with several others).

Maunus commented 4 years ago

Another set of difficult ones are the vowels with tone marks and the vowels with glottalization in Cora (o'o, u'u etc.). I think they should ideally be considered a single long vowel with glottalization.

LinguList commented 4 years ago

Yes, I saw this. The new orthography is better for this purpose, as it is independent of the actual data, and you only correct each segment once. If segment combinations are not found there, you should add them in an extra row (tl, for example), and provide an IPA equivalent (kw is probably k+superscript-w, right?)

LinguList commented 4 years ago

If you see "lw" proposed as one sound, but you want to have it as two, you can either delete the line, or you can write "l w" (with a space) in the IPA column.
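The two options can be compared in a small Python sketch. The lookup function and the three toy profiles are hypothetical, and "lʷ" is only a placeholder for whatever single-sound reading was proposed automatically.

```python
def segments_for(sequence, profile):
    """Return IPA segments for a short sequence.

    If the whole sequence has its own row, use that row and split its
    IPA value on spaces; otherwise fall back to the single characters.
    """
    if sequence in profile:
        return profile[sequence].split()
    return [profile[char] for char in sequence]


# Three hypothetical states of the same profile.
one_sound = {"l": "l", "w": "w", "lw": "lʷ"}     # automatic proposal
with_space = {"l": "l", "w": "w", "lw": "l w"}   # IPA column edited
row_deleted = {"l": "l", "w": "w"}               # line removed
```

In this sketch both corrections give the same two segments; the explicit "l w" row has the advantage that the decision stays visible in the profile.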

LinguList commented 4 years ago

A glottalized vowel in IPA should be written with a glottalization mark (see here, our reference list of permissible IPA sequences).

LinguList commented 4 years ago

So oˀ for example, with length marker if it is long.

Maunus commented 4 years ago

Ok, how about tone? Also is the verbal description strictly necessary? With the sequences it becomes complicated.

Maunus commented 4 years ago

Can I basically delete all the lines that are sequences?

Maunus commented 4 years ago

Hmm, some of them are ambiguous, so kw can be either /kw/ or /kʷ/ and ty can be either /tʲ/ or /tj/. And acute accent can be tone or accent depending on the language.

Maunus commented 4 years ago

Also sometimes ai is a diphthong and sometimes a sequence depending on language (and sometimes depending on the word).

LinguList commented 4 years ago

We need the first two columns, nothing more. The other columns are a reference, so you understand what the computer makes of the sounds it is given. So you can see that "y" is interpreted as a vowel, although it is a consonant. This helps you to correct it. You can also see that we autocorrect errors resulting from wrong IPA symbols. You write a:, but you use the colon ":", although the length mark is another Unicode character: "aː". This is all for consistency, and we'll double-check. For now, however, just the first two columns need to be provided.
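The colon issue is easy to see in code. This short Python sketch replaces the ASCII colon with the IPA length mark (U+02D0); the function name is made up, and the real autocorrection in the workflow may cover more cases than this.

```python
import unicodedata

LENGTH_MARK = "\u02d0"  # the IPA length mark, which looks like a colon


def fix_length_mark(segment):
    """Replace an ASCII colon used as a length marker by U+02D0."""
    return segment.replace(":", LENGTH_MARK)


# The two characters look alike but have different Unicode names.
print(unicodedata.name(":"))
print(unicodedata.name(LENGTH_MARK))
print(fix_length_mark("a:"))
```

A segment that already uses the correct length mark passes through unchanged.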

LinguList commented 4 years ago

We recommend annotating tone properly, not on the vowel, by using Chao numbers, but you can leave the tone on the sounds as well; it will then be rendered as "XXX vowel with high tone". You can also write í/i, so you say: í is the i with tone in my data, but the computer should only read the "i", as this is the thing without tone that the computer will accept.
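The í/i notation can be read as "written form / form the computer should use". A minimal Python sketch of that reading follows; the function name is hypothetical and the actual tools may parse the value differently.

```python
def read_value(ipa_value):
    """Split a profile value like 'í/i' into (as written, as read).

    Values without a slash are read exactly as they are written.
    """
    if "/" in ipa_value:
        written, read = ipa_value.split("/", 1)
        return written, read
    return ipa_value, ipa_value


written, read = read_value("í/i")
```

The part before the slash keeps the tone mark from the data, and the part after it is the plain vowel that gets interpreted.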

LinguList commented 4 years ago

I prefer you tell explicitly that they are two sounds, by inserting a " " (space).

LinguList commented 4 years ago

Well, if things are ambiguous, that already shows why this is in fact important to do, and why it is useful to use IPA as a reference and not the original orthographies. For now, you could provide two lines and mark the language in the column for "languages". Then we can see if we can just disambiguate them.
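Two lines per ambiguous grapheme, each tagged with a language, could be resolved along the lines of this Python sketch. The language labels are placeholders, since the thread does not say which language uses which reading, and the row layout is an assumption.

```python
# Hypothetical profile rows: (grapheme, IPA, language).
# "LanguageA" and "LanguageB" are placeholders, not real assignments.
ROWS = [
    ("kw", "kʷ", "LanguageA"),
    ("kw", "k w", "LanguageB"),
    ("ty", "tʲ", "LanguageA"),
    ("ty", "t j", "LanguageB"),
]


def lookup(grapheme, language, rows):
    """Return the IPA segments for a grapheme in one specific language."""
    for row_grapheme, ipa, row_language in rows:
        if row_grapheme == grapheme and row_language == language:
            return ipa.split()
    raise KeyError((grapheme, language))
```

For example, `lookup("kw", "LanguageB", ROWS)` gives two segments, while the same grapheme in "LanguageA" gives one labialized consonant.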

LinguList commented 4 years ago

For diphthongs, etc., I recommend: if you have ambiguous words, count them in your data. If there are many, place them in "lexemes.tsv", where you can convert them. So you take the value in your original cell, and provide the form that we should use (unsegmented).

LinguList commented 4 years ago

Also sometimes ai is a diphthong and sometimes a sequence depending on language (and sometimes depending on the word).

Take the word that is ambiguous and uses ai not as a diphthong from your original spreadsheet (Word document). Copy-paste it into "lexemes.tsv" (in etc/). Then provide a replacement, where you add a dot to the diphthong, like a.i. This will do the trick.

LinguList commented 4 years ago

Same can be done for k.w.
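The effect of the dot can be sketched in Python as follows. The inventory and the example forms "pai" and "kwa" are invented, and treating the dot as a hard boundary before matching is an assumption about how the replacement is interpreted.

```python
import re

# Hypothetical units a segmenter would recognise, including the
# diphthong "ai" and the sequence "kw" treated as one unit.
INVENTORY = {"p", "k", "w", "a", "i", "ai", "kw"}


def segment(form, inventory):
    """Split a form into units, longest match first; '.' forces a break."""
    ordered = sorted(inventory, key=len, reverse=True)
    pattern = "|".join(re.escape(unit) for unit in ordered)
    segments = []
    for chunk in form.split("."):
        # Characters outside the inventory are silently skipped here.
        segments.extend(re.findall(pattern, chunk))
    return segments
```

Without the dot, "pai" comes out as p plus the diphthong ai; written as "pa.i", the same word is split into three segments, and "k.wa" is split the same way.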

Maunus commented 4 years ago

Unfortunately for a few of the languages the sources are old enough that they don't give a good phonological analysis - for example we don't really know if Huichol and Cora have tone, and how it works if they do. And it is not consistently noted. Same for questions of vowel sequences and diphthongs etc. That is why I have used the original orthography in the data (also because I wasn't planning to do any phonological analysis).

Maunus commented 4 years ago

Ok.

Maunus commented 4 years ago

Oh, should I always add a period instead of a space when I reinterpret sequences? Does it screw up the alignment if I add spaces between phonemes in the second row?

LinguList commented 4 years ago

Second column (cell), right? Alignments depend on the segmentation. Your decisions have a direct impact on this. E.g., if you say "pf" in German are two sounds, all our alignments with Germanic are screwed up. But you best learn that when you see the alignments. So for now, I say it does not really make a big difference, as long as it is consistent. And uncertainty doesn't bother us: distinctivity counts in structural linguistics, right? So one should not lose distinctivity, but if something is misinterpreted, we can live with that: original data are there, an interpretation can be revised, etc.

So I'd say: for a first test here, we need a pragmatic first version, so you see the principle of the orthography profiles. You can even test it yourself. Just go to digling.org/calc/profile/, paste the current profile (file orthography.tsv) in the text field, press okay, and you can type in text from your data and see how it converts.

Maunus commented 4 years ago

Is there a way I can test it on the entire document (i.e. producing a new version of wordlist.tsv)? I think I have got it now and want to see what it produces.

LinguList commented 4 years ago

Do you want to run the full workflow? That would require:

having Python 3 (version 3.5 or higher) installed
having access to a terminal application (on Windows, you can now also run Python from the terminal, but you will need to fiddle around with it a bit; on Mac it is easier, but they often have broken Python implementations)
being prepared to learn how to make a virtual environment, to make sure that you only use the most appropriate versions for the code
installing git (as a major requirement for getting a specific version of a specific piece of code, again, easier on Mac)
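A quick way to see which of these requirements are already met is a small Python check like the one below. It is only a sketch: the function name and the dictionary keys are made up, and it tests nothing beyond the Python version, a git executable on the PATH, and the standard venv module for virtual environments.

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    """Report which basic requirements are available on this machine."""
    return {
        "python>=3.5": sys.version_info >= (3, 5),
        "git": shutil.which("git") is not None,
        "venv": importlib.util.find_spec("venv") is not None,
    }


for name, available in check_prerequisites().items():
    print(name, "ok" if available else "missing")
```
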

If it's just for another round of checking, I can also run the code again and point you to some things I observe, etc.

Maunus commented 4 years ago

Ah, ok, I didn't realize all that would be necessary, that is perhaps a bit more than I could realistically do myself.

lexibank / pharaocoracholaztecan

check problematic lexemes and orthography #9