lexibank / kochtukanoan

raw csv data from koch-grunberg's desano-yupua-yahuna-koretu word list
Creative Commons Attribution 4.0 International

affricates are not rendered well in the orthoprofiles #13

Closed LinguList closed 2 years ago

LinguList commented 2 years ago

If you inspect the edictor file, you find many t s not rendered together as ts. These need to be added to the profile. So I suggest you thoroughly inspect the edictor wordlist and refine the orthoprofile at the same time.

MottaAM commented 2 years ago

I missed some other symbols when I made the first conversion to IPA, so I added them to preprocess-sounds.tsv. I also added ts to account for the affricates.

MottaAM commented 2 years ago

What I don't know what to do with is tx̯, as it looks like it should be either or . However, alone is ç, since Koch-Grünberg describes it as being the same sound as in 'ich'. Can the 'ch' in 'ich' be pronounced as [ʃ]? @LinguList There was an article I read that was published in the same journal Koch-Grünberg published in (Anthropos); it talks about the graphemes used around that time to transcribe languages. That article says is an anterior palatal fricative sound and k̯x is an anterior palatal affricate sound. In the chart it uses to show the graphemes, would be located in the same place as ç in an IPA chart. What should I do? Do the Tukanoan languages he studied have a sound? @thiagochacon

here are some examples:

2075  tx̯óe̠i̯ --> t ç ɔ ɛ j

1739  uaxtx̯udi --> u a x t ç u d i

MottaAM commented 2 years ago

Koch-Grünberg uses both and y or and w, sometimes in the same word. However, given the description he gave for the diacritic he is using, there shouldn't be a difference, since both and y should represent the same sound.

In the case of and , I think we should maintain the vowels with that same diacritic in the conversion to IPA, since the IPA uses the same diacritic to mark diphthongs. What should I do about and ? @thiagochacon

Here are some examples:

1662 ānoi̯ye̠ --> aː n ɔ j j ɛ

2080 houo̯úo --> h ɔ w o u ɔ

1371 kāgakáue̯ --> k aː g a k a w e

180  dipau̯ía --> d i p a w i a

LinguList commented 2 years ago

Well, I would stay with ç, but the affricate must be written as tɕ, so the correct handling would be:

2075  tx̯óe̠i̯ --> tɕ ɔ ɛi̯
1739  uaxtx̯udi --> w a x tɕ u d i

Cases like e̯ and o̯ are fine, but I do not really trust these are different from i̯ and u̯, I rather think people exaggerate. Note that e̯ o̯ are judged to be vowels in our system, and that they are best kept with the vowels they attach to to form diphthongs.

Ideally then, for your examples, the orthoprofile modifies them as follows:

1662 ānoi̯ye̠ --> aː n ɔi̯ j ɛ
2080 houo̯úo --> h ɔ w o u ɔ
1371 kāgakáue̯ --> k aː g a k a w e
180  dipau̯ía --> d i p a w i a

This is based on my experience with alignments. The rule is: three vowels, which we refuse to handle as one sound in CLTS, often consist of two syllables. Something like oi̯a is then o j a. On the other hand, if the non-syllabic vowels are part of a diphthong, they are typically an off-glide or an on-glide. Here, I suggest marking on-glides in their own slot, separating them, like i̯a -> j a, while keeping off-glides as diphthongs: ai̯ -> ai̯ or ai.
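
The on-glide/off-glide rule described above can be sketched in Python. This is only an illustration under stated assumptions, not the orthography-profile mechanism itself: the function names, the vowel inventory, and the glide mapping (i̯ to j, u̯ to w) are hypothetical choices made for the example.

```python
import unicodedata

NON_SYLLABIC = "\u032f"  # combining inverted breve below, as in i̯
VOWELS = set("aeiouɔɛɨ")  # assumed vowel inventory for this sketch
ON_GLIDES = {"i": "j", "u": "w"}  # assumed mapping for on-glides


def units(text):
    """Split text into base characters with their combining marks attached."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and out:
            out[-1] += ch
        else:
            out.append(ch)
    return out


def is_vowel(unit):
    """A unit counts as a vowel if its base character is in VOWELS."""
    return unit[0] in VOWELS


def split_glides(text):
    """Put on-glides in their own slot; keep off-glides inside a diphthong."""
    segs = units(text)
    result = []
    for i, seg in enumerate(segs):
        non_syllabic = NON_SYLLABIC in seg
        nxt = segs[i + 1] if i + 1 < len(segs) else ""
        if non_syllabic and nxt and is_vowel(nxt) and seg[0] in ON_GLIDES:
            # On-glide (a vowel follows): write it as a glide in its own slot.
            result.append(ON_GLIDES[seg[0]])
        elif non_syllabic and result and is_vowel(result[-1]):
            # Off-glide (no vowel follows): attach it to the preceding vowel.
            result[-1] += seg
        else:
            result.append(seg)
    return " ".join(result)


print(split_glides("oi̯a"))  # o j a
print(split_glides("i̯a"))   # j a
print(split_glides("ai̯"))   # ai̯
```

In an actual profile the same decisions are written out as explicit grapheme rules; the sketch only makes the decision order visible.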

LinguList commented 2 years ago

This all needs to be addressed in the profile. We can, however, pull out the cases where it happens.

thiagochacon commented 2 years ago

Yes, when transcribing Tukanoan languages in general we could use [tʃ] or [c]. Only one author uses [c], while the others use [tʃ]. It seems that the distinction between [tʃ], [c] and [tɕ] is not clear in Tukanoan, and I think we should normalize it to the broad-transcription phone [tʃ].

thiagochacon commented 2 years ago

I agree with Mattis to convert i̯ and u̯ to [j] and [w] when there are three-vowel sequences. Otherwise, we can keep them as full vowels [i] and [o]. The idea is that Tukanoan languages allow glides as onsets of syllables, but they do not have proper diphthongs; they have vowel clusters instead (sequences of two tautosyllabic [almost full] vowels).

LinguList commented 2 years ago

Vowel clusters can also be handled -- this would for now be experimental, but I'd like to go for it -- by adding a dot between two vowels that cluster (i.e., build an "evolving unit"). So you don't need to resolve the question on diphthong or two vowels, but would explicitly emphasize that these two vowels evolve TOGETHER. This would result in writing something like "a.i" instead of "ai". And "a.a" for things we'd otherwise mark as long vowels.

LinguList commented 2 years ago

Lexibank will throw an error here for these instances, but we can handle it by writing them in an extra column for now. If you guys work out that kind of representation, I'd gladly implement the hack to the Python code that we need so it is formally accepted.

thiagochacon commented 2 years ago

so basically we have to: 1) convert all VVV sequences to VGV (V vowel G glide) 2) recode every V1V2 (where V1 and V2 are different vowel graphemes) to V1.V2

Is that it?
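
The two steps summarized above can be sketched in Python on an already segmented form. This is an illustration only: the function name, the vowel inventory, and the choice of which vowel becomes which glide are assumptions, and runs of more than three vowels are deliberately left unchanged because the thread treats them case by case.

```python
VOWELS = set("aeiouɔɛɨ")  # assumed vowel inventory for this sketch
# Assumed mapping from the middle vowel of a VVV run to a glide.
GLIDE_FOR = {"i": "j", "e": "j", "ɛ": "j", "ɨ": "j",
             "u": "w", "o": "w", "ɔ": "w"}


def recode(segments):
    """Apply VVV > VGV and V1V2 > V1.V2 to a list of segments."""
    out = []
    i = 0
    while i < len(segments):
        # Collect the maximal run of vowels starting at position i.
        j = i
        while j < len(segments) and segments[j] in VOWELS:
            j += 1
        run = segments[i:j]
        if len(run) == 3 and run[1] in GLIDE_FOR:
            # VVV > VGV: the middle vowel becomes a glide.
            out.extend([run[0], GLIDE_FOR[run[1]], run[2]])
        elif len(run) == 2 and run[0] != run[1]:
            # V1V2 > V1.V2: two different vowels form one dotted unit.
            out.append(run[0] + "." + run[1])
        elif run:
            # Single vowels, identical pairs, and longer runs stay as they are.
            out.extend(run)
        else:
            # Not a vowel: copy the segment and move on.
            out.append(segments[i])
            j = i + 1
        i = j
    return out


print(" ".join(recode("ɔ ɔ ɛ k a".split())))  # ɔ w ɛ k a
print(" ".join(recode("n ɔ k ɔ a".split())))  # n ɔ k ɔ.a
```

In the dataset itself these rewrites would be expressed as lines in the orthography profile rather than as a post-processing step.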

LinguList commented 2 years ago

Yes, and this is done in the orthography profile. You can even do it semi-automatically, using this tool here: https://digling.org/calc/profiler/

@MottaAM, in this tool, you can generate orthography profile lines. The rule generator for orthography profiles allows you to make rules out of character lists.

So you type in:

[1 a e i o u][2 m n] > [1 a e i o u][2 ◌̃ ◌̃]

and you get:

am  ã
an  ã
em  ẽ
en  ẽ
im  ĩ
in  ĩ
om  õ
on  õ
um  ũ
un  ũ

In the same way, you can use all vowel symbols you identified there and make these "all possible combination" rules.
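
The "all possible combinations" idea can also be sketched in a few lines of Python. This is not the profiler tool's implementation, only an assumed equivalent for the vowel-plus-nasal example above; the function name `nasal_rules` is hypothetical.

```python
import unicodedata
from itertools import product

TILDE = "\u0303"  # combining tilde


def nasal_rules(vowels, nasals):
    """Pair every vowel with every nasal and map the pair to a nasalized vowel."""
    rules = []
    for vowel, nasal in product(vowels, nasals):
        # NFC composes vowel + combining tilde into one character where possible.
        target = unicodedata.normalize("NFC", vowel + TILDE)
        rules.append((vowel + nasal, target))
    return rules


# Print tab-separated lines of the kind listed above.
for grapheme, ipa in nasal_rules("aeiou", "mn"):
    print(f"{grapheme}\t{ipa}")
```

The same pattern extends to vowel-vowel combinations by passing the vowel list twice and changing how the target is built.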

LinguList commented 2 years ago

These rules can then be added to the orthoprofile. You can then run the code and check the results by converting the file to edictor-tsv and loading it into edictor.

MottaAM commented 2 years ago

@LinguList I'll make the conversion to the vowel clusters then. Where do I add the extra column? In the orthography.tsv?

MottaAM commented 2 years ago

Thank you! I'll start working on it right away

MottaAM commented 2 years ago

I'm not sure exactly how to group the vowels. Just to confirm that I got it right:

if VVV > VGV and V1V2 > V1.V2, then

1662 ānoi̯ye̠ --> a.a n ɔ.i j ɛ
2080 houo̯úo --> h ɔ.u ɔ.u ɔ
1371 kāgakáue̯ --> k a.a g a k a w e
180 dipau̯ía --> d i p a w i a

I'm just not sure what to do in a case like 2080 houo̯úo, since there is a sequence of five vowels.

thiagochacon commented 2 years ago

Let's do it this way: 2080 houo̯úo --> h ɔ.w ɔ.w ɔ

LinguList commented 2 years ago

For these cases, which are clear diphthongs, I suggest not grouping vowels with glides. I would rather interpret it as a semi-vowel and use, for example, i̯ instead of j. So I'd write ai̯, or ai.

But essentially, as our grouping gives the linguists all freedom they want, you can of course just group o.w. As long as it is consistent, it is just fine.

MottaAM commented 2 years ago

I've been testing different profiles and I've made one where the only error it gives me is related to the vowel clusters (as expected). I ran the following command to check for errors:

cldfbench lexibank.check_profile lexibank_kochtukanoan

and these two to check whether the conversion was working:

cldfbench lexibank.makecldf lexibank_kochtukanoan
edictor wordlist --dataset=cldf/cldf-metadata.json --addon=cogid_cognateset_id:cogid -n wordlist

There are a few words that are not good, but the profile worked for the majority of them. Can I fix them manually?

LinguList commented 2 years ago

You could. But we should rather document this also in our workflow. Can you show some examples, let us say, 5 to 10 of these manual fixes here, with the original form? I'd then recommend how to do the manual fix in the code already.

MottaAM commented 2 years ago

I tried some other profiles, but I'm still having the same problem. There are long sequences of vowels that I would have to work through case by case to solve properly. Sometimes, when I make a change to fix a problem, some other problem appears in another word. I have to keep the VVV > VGV and V1V2 > V1.V2 rules consistent and keep in mind that the syllable structure of those languages is CVV. Here are the examples:

original | generated by the profile | manual fix
díai̯yi | d i.a j j i | d i.a i j i
yeé̥ | j j ɨ | j e.ɨ
oóe̠ka | ɔ.ɔ ɛ k a | ɔ w ɛ ka
nóḳoa | n ɔ k w a | n ɔ k ɔ.a
siuī́re̥ | s i.u i.i ɾ ɨ | s i w i.i ɾ ɨ
ihíui̯tsia | i h i.u j ts j a | i h i w i ts i.a
nuxhoá | n u h w a | n u h ɔ.a

LinguList commented 2 years ago

Well, the problem is that the method is greedy. It starts with the largest solution. So what I do in these situations is to provide longer rules for cases like you show there. In the worst case, you use ^díai̯yi -> d i ai j i, so you write the full word as a rule to the orthoprofile.

The case of nóḳoa points to another rule that is wrong and should be modified, namely oa -> w a.
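
The greedy behaviour described above can be illustrated with a minimal longest-match segmenter in Python. This is a sketch under assumptions, not the code that lexibank actually runs: the `segment` function, the toy profiles, and the simplified form "nokoa" (standing in for nóḳoa without diacritics) are all hypothetical.

```python
def segment(word, profile):
    """Greedy longest-match segmentation of a word against a profile dict."""
    longest = max(map(len, profile))
    out = []
    i = 0
    while i < len(word):
        # Try the longest possible chunk first, then shorter ones.
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in profile:
                out.append(profile[chunk])
                i += size
                break
        else:
            # No rule matched: emit a placeholder and skip one character.
            out.append("<?>")
            i += 1
    return " ".join(out)


# Toy profile containing the problematic rule oa -> w a.
profile = {"n": "n", "o": "ɔ", "k": "k", "a": "a", "oa": "w a"}
print(segment("nokoa", profile))  # n ɔ k w a

# A longer, more specific rule wins over the shorter one.
profile["koa"] = "k ɔ.a"
print(segment("nokoa", profile))  # n ɔ k ɔ.a

# The same mechanism keeps t and s together once ts is in the profile.
print(segment("tsa", {"t": "t", "s": "s", "a": "a", "ts": "ts"}))  # ts a
```

Because the longest match always wins, adding a longer rule (up to the full word) is a local override that leaves the shorter rules intact for other words.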

MottaAM commented 2 years ago

@LinguList I finished making the orthography profile and opened a pull request. We need the hack to be able to properly use the notation with the dot between the vowels (a.a). Besides that, I think we are done here with this repository.

MottaAM commented 2 years ago

My mistake, I need to update the concepts file to match the one we corrected in the concepticon repo.

Now we are done