lexibank / abvdoceanic

Creative Commons Attribution 4.0 International

Should long vowels be counted for Xaracuu / xara1243 (and other New Caledonian languages) #43

Closed antipodite closed 2 years ago

antipodite commented 2 years ago

Angela asked in the CoOL channel whether the vowel count of 30 for Xaracuu / xara1243 is correct, given that its sister languages in the dataset have much lower reported vowel counts.

This seems to be an issue of differing phonemic analyses. I looked at some of the references for these languages, and it is claimed by several authors such as Moyse-Faurie 1989, Osumi 1995 and Zetterberg 2021 that these languages have a phonemic long vowel counterpart to each short oral and nasal vowel which is listed for them.

I looked at McCracken 2019's discussion of this issue in her grammar of Belep (a northern language). She notes that the traditional approach to NC languages treats long vowels as separate phonemes instead of sequences of short vowels, as has been done for Xaracuu. For Belep, however, she says that although there are minimal pairs that seem to bear this out, there is always a syllable boundary between the two "halves" of a long vowel, and speakers have no intuitions about vowel length but reliably segment words such as /aawa/ into a.a.wa. Like Xaracuu, Aijie, etc., Belep displays hiatus in sequences of short vowels, i.e. they are realised as heterosyllabic sequences of vowels.

Russell also pointed out in the chat that when there is reduplication of a syllable containing a long vowel in Xaracuu, only "half" of the long vowel is reduplicated, indicating that "long" vowels are not necessarily perceived as a single unit. This also appears to be the case in the closely related language Tiri / Tinrin / tiri1258 (Osumi 1995).

Zetterberg says of Aijie, following de la Fontinelle, that "Vowel length is phonemically distinct in Aijie" and that each oral and nasal vowel can appear as short or long, but then goes on to say that "this is analysed as an aspect of syllable structure", which seems to imply something similar to what is reported for Belep above.

Coming back to Angela's comment, it seems the issue is that different authors have chosen different analyses (long vowels or vowel sequences, with seemingly more justification for the latter), and that using different sources for different languages shows up as a discrepancy in the number of vowels in the data. It seems like the best thing to do is not to count long vowels separately for the languages at issue, but it would possibly also be worth going through the other NC languages in the dataset to make sure the same issue isn't hiding in some of the languages with lower vowel counts. I get the impression that this issue is relevant for NC as a whole from McCracken's comments.

angela-mc commented 2 years ago

I will add a couple of figures to this issue. I went through the Oceanic tree, pruned to the languages for which we have phoneme inventory data, and extracted pairs of sister languages (those sharing the same parent node). I then looked at the difference in #vowels within each pair and noticed that Canala (glottocode xara1244) and Aragur (abvdoceanic-Aragur, glottocode xara1243) have a huge difference in #vowels compared to all other sister pairs (Canala is listed with 30 vowels and Aragur with 12), despite the branch length connecting them not being particularly long. Of course this can happen, but I thought I would flag the example in case it reflects a discrepancy in how #vowels are counted. VowelDifferences.pdf
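As an illustration of the sister-pair comparison described above, here is a minimal Python sketch. It is not the script behind the attached figures: the child-to-parent encoding of the tree, the function names, and the `OtherA`/`OtherB` languages with their counts are assumptions made for the example. Only the Canala (30) and Aragur (12) vowel counts come from this thread.

```python
from collections import defaultdict


def sister_pairs(parent_of, languages):
    """Return pairs of languages that are the only two children of one node."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)
    pairs = []
    for kids in children.values():
        # Keep a node only if it has exactly two children and both are languages.
        if len(kids) == 2 and all(kid in languages for kid in kids):
            pairs.append(tuple(sorted(kids)))
    return sorted(pairs)


def vowel_differences(pairs, n_vowels):
    """Absolute difference in vowel count for each sister pair."""
    return {pair: abs(n_vowels[pair[0]] - n_vowels[pair[1]]) for pair in pairs}


# Hypothetical toy tree, encoded as child -> parent.
parent_of = {
    "Canala": "node1", "Aragur": "node1",
    "OtherA": "node2", "OtherB": "node2",
    "node1": "root", "node2": "root",
}
# Canala and Aragur counts are from the thread; the others are invented.
n_vowels = {"Canala": 30, "Aragur": 12, "OtherA": 5, "OtherB": 6}

pairs = sister_pairs(parent_of, set(n_vowels))
diffs = vowel_differences(pairs, n_vowels)
for pair, diff in sorted(diffs.items(), key=lambda item: -item[1]):
    print(pair, diff)
```

Sorting by the difference puts candidate outliers such as the Canala/Aragur pair at the top of the printed list.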

SimonGreenhill commented 2 years ago

ok, so Angela's analysis here makes it seem like (Canala, Aragur) are a pathological pair. Can we get a histogram on the overall number of vowels and number of consonants and see if we can spot other outliers?

angela-mc commented 2 years ago

VowelDiff.pdf VowelNumber.pdf

These are histograms for the #vowels: total number and differences between sister languages. I've spelled out the right tails on each plot. I've been very generous here: I would say that for differences in #vowels only 18 and 10 are outliers, but I thought I might as well flag generously for illustration/exploration. For the differences I've written the pair and then, in brackets, the number of vowels for each language.

angela-mc commented 2 years ago

ConsonantDifferences.pdf ConsonantNumber.pdf (same as above, but with number of consonants)

angela-mc commented 2 years ago

I will attach the .csv file with the numbers too. The columns are: Language1 and Language2 (sister languages), BranchLength (length of the branches connecting the parent to each of the languages; the tree I am using is ultrametric, so these branches have the same length), vowel1 and vowel2 (number of vowels for each of the languages), DiffVowels (difference in number of vowels), consonants1 and consonants2 (number of consonants for each of the languages), DiffConsonants (difference in number of consonants), DiffVowelsTime (difference in number of vowels between the two sister languages divided by branch length), DiffConsonantsTime (difference in number of consonants between the two sister languages divided by branch length). surori_phoneme_inventories.csv

Looking at the difference in number of consonants or vowels divided by branch length is also indicative, I think, in the sense that we could expect sister languages that had more time to diverge to differ more in their number of vowels/consonants. I am attaching two histograms corresponding to DiffVowelsTime and DiffConsonantsTime from the dataframe. The clear outlier for DiffVowelsTime is the pair Mortlockese_Lukunosh_923 and Satawalese_343 (also an outlier for the raw difference in vowels, with a difference of 10 vowels). DiffPerTime.pdf
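To make the per-time measure concrete, here is a small Python sketch of how DiffVowelsTime could be derived from the columns listed above. The sample rows, language names, and the function name are hypothetical, and it assumes that DiffVowels is an absolute difference and that BranchLength is strictly positive. It is not the code that produced the attached CSV.

```python
import csv
import io

# Hypothetical rows using a subset of the column names described above.
SAMPLE_CSV = """Language1,Language2,BranchLength,vowel1,vowel2
LangA,LangB,0.5,30,12
LangC,LangD,2.0,5,7
"""


def add_rate_columns(rows):
    """Add DiffVowels and DiffVowelsTime (difference / branch length)."""
    out = []
    for row in rows:
        diff = abs(int(row["vowel1"]) - int(row["vowel2"]))
        length = float(row["BranchLength"])  # assumed > 0
        out.append({**row, "DiffVowels": diff, "DiffVowelsTime": diff / length})
    return out


rows = add_rate_columns(csv.DictReader(io.StringIO(SAMPLE_CSV)))
# The pair with the largest difference per unit of branch length.
top = max(rows, key=lambda r: r["DiffVowelsTime"])
print(top["Language1"], top["Language2"], top["DiffVowelsTime"])
```

The same function applies to the consonant columns if the field names are swapped.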

SimonGreenhill commented 2 years ago

Hmm, it looks like the outlier Ulithian should have 19 consonants not 45.

It also worries me that Nimoa and Sudest have a difference of 15 despite being sisters.

I think we need to have some formal evaluation of how close we are to something like Phoible. Angela -- can you do a comparison between N Consonants & Vowels in our data and that listed in Phoible? If you get stuck I can do it but have a full month coming up.

angela-mc commented 2 years ago

Sure! Do we have some expectations about the differences in #consonants or #vowels between sister-species?

I am asking because this is helpful for setting the priors for the Fabric Model, which has a prior on the differences. By default it is a Weibull distribution with a shape and scale that translate to a left-tailed distribution (scale = 1.1, shape = 1.5). I think a left-tailed Weibull makes sense (i.e. we expect small changes in general from ancestor to descendant, but we allow for some bigger changes to happen as well). I am trialling priors that simply mirror the distribution of differences in the contemporary languages, but I was wondering if there are any intuitions we might have based on linguistic knowledge?
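For reference, a minimal Python sketch of what a Weibull with scale = 1.1 and shape = 1.5 looks like. It assumes the standard two-parameter form of the density; whether the Fabric Model parameterises its prior in exactly this way is an assumption, and the function names are made up for the example.

```python
import math


def weibull_pdf(x, shape, scale):
    """Density of the standard two-parameter Weibull (zero for x < 0)."""
    if x < 0:
        return 0.0
    z = x / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))


def weibull_mean(shape, scale):
    return scale * math.gamma(1 + 1 / shape)


def weibull_mode(shape, scale):
    # This closed form holds for shape > 1; for shape <= 1 the mode is 0.
    return scale * ((shape - 1) / shape) ** (1 / shape)


shape, scale = 1.5, 1.1
print("mode:", round(weibull_mode(shape, scale), 3))
print("mean:", round(weibull_mean(shape, scale), 3))
for x in (0.5, 1, 2, 5, 10):
    print(x, round(weibull_pdf(x, shape, scale), 5))
```

With these parameters most of the mass sits at small values and the density falls off quickly for larger differences, which matches the "mostly small changes, occasionally bigger ones" intuition.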

HedvigS commented 2 years ago

I found 17 languages that are in both abvdoceanic and PHOIBLE. Here are scatterplots comparing their numbers of consonants and vowels. Color reflects the absolute difference.

phoible_vs_oabvdoceanic_cons phoible_vs_oabvdoceanic_vowels

I matched over language level ID. Languages in abvdoceanic or PHOIBLE that didn't have glottocodes at all were ignored.

The code for making the comparison is here: https://github.com/angela-mc/LinguisticDisparity/blob/hedvig_help_phoible/Rscripts/compare_abvdoceanic_phoible.R

If you know a way of better matching PHOIBLE entries to glottocodes or any other way of increasing the overlap, I'm all ears.
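For readers who do not want to open the R script, the matching step can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a port of the linked script: the record layout, field names, and function names are hypothetical, and it assumes every glottocode has already been resolved to its language-level ID. The maor1246 and hawa1245 counts are the ones reported in this thread; the record without a glottocode is invented.

```python
def index_by_glottocode(records):
    """Index records by glottocode, ignoring entries that have none."""
    return {r["glottocode"]: r for r in records if r.get("glottocode")}


def compare(abvd_records, phoible_records):
    """Absolute consonant and vowel differences for languages in both sources."""
    abvd = index_by_glottocode(abvd_records)
    phoible = index_by_glottocode(phoible_records)
    rows = []
    for code in sorted(abvd.keys() & phoible.keys()):
        rows.append({
            "glottocode": code,
            "consonant_diff": abs(abvd[code]["consonants"] - phoible[code]["consonants"]),
            "vowels_diff": abs(abvd[code]["vowels"] - phoible[code]["vowels"]),
        })
    return rows


abvd_records = [
    {"glottocode": "maor1246", "consonants": 11, "vowels": 10},
    {"glottocode": "hawa1245", "consonants": 9, "vowels": 6},
    {"glottocode": None, "consonants": 20, "vowels": 5},  # invented, ignored
]
phoible_records = [
    {"glottocode": "maor1246", "consonants": 10, "vowels": 10},
    {"glottocode": "hawa1245", "consonants": 9, "vowels": 10},
]

for row in compare(abvd_records, phoible_records):
    print(row)
```

Anything that increases the overlap (for example mapping dialect-level codes to their parent language) would have to happen before this join.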

HedvigS commented 2 years ago

The highest diff is motl1237, where PHOIBLE says 15 consonants and abvdoceanic says 25.

HedvigS commented 2 years ago

Table for your browsing.

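| Language_level_ID | Consonants_PHOIBLE | Vowels_PHOIBLE | Consonants_abvdoceanic | Vowels_abvdoceanic | vowels_diff | consonant_diff |
| --- | --- | --- | --- | --- | --- | --- |
| maor1246 | 10 | 10 | 11 | 10 | 0 | 1 |
| hawa1245 | 9 | 10 | 9 | 6 | 4 | 0 |
| koko1269 | 22 | 5 | 18 | 6 | 1 | 4 |
| mana1295 | 13 | 5 | 22 | 6 | 1 | 9 |
| motl1237 | 15 | 7 | 25 | 8 | 1 | 10 |
| dobu1241 | 19 | 5 | 20 | 8 | 3 | 1 |
| matb1237 | 15 | 5 | 22 | 7 | 2 | 7 |
| seim1238 | 11 | 17 | 12 | 10 | 7 | 1 |
| siar1238 | 15 | 7 | 19 | 7 | 0 | 4 |
| sout2856 | 15 | 5 | 15 | 5 | 0 | 0 |
| sata1237 | 29 | 13 | 29 | 12 | 1 | 0 |
| yape1248 | 28 | 16 | 26 | 14 | 2 | 2 |
| woge1237 | 15 | 5 | 22 | 5 | 0 | 7 |
| loni1238 | 16 | 7 | 20 | 6 | 1 | 4 |
| bari1286 | 12 | 5 | 14 | 6 | 1 | 2 |
| biak1248 | 14 | 10 | 17 | 7 | 3 | 3 |
| hoav1238 | 16 | 5 | 17 | 7 | 2 | 1 |

HedvigS commented 2 years ago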

For what it's worth, I think @angela-mc's approach of comparing sisters is really smart, especially since there are so few matches between PHOIBLE and abvdoceanic. I don't know whether what should be done is to re-evaluate the underlying data, the way the phoneme counts are extracted, the orthography profiles, or all three. I'll leave that up to you all.

I also thought we already talked about this very same issue several months ago? Haha or am I dreaming again? What was the solution proposed then?

maryewal commented 2 years ago

RE: Xaracuu, according to Claire Moyse-Faurie, length is definitely phonemic without question (this was apparently also the analysis of Jean-Claude Rivierre when he went to check on her work). She has lots of minimal pairs to support this. Essentially, there are 11 plain short vowels + 6 nasal vowels, all with a length distinction (as we know is reported in the literature). The choice to write long vowels as 2 short vowels was due to the overwhelming diacritic marking already on vowels in the language. She says the orthography in this case should not be taken to indicate that long vowels are underlyingly sequences of short vowels. There are a few rare cases where a seemingly long vowel is actually 2 short vowels, but these instances have clear historical trajectories (consonant loss) and are separate from the issue of phonemic length.

maryewal commented 2 years ago

RE: "It seems like the best thing to do is not to count long vowels separately for the languages at issue, but it would possibly also be worth going through the other NC languages in the dataset to make sure the same issue isn't hiding in some of the languages with lower vowel counts. I get the impression that this issue is relevant for NC as a whole from McCracken's comments"

I disagree with the first part of this. I think we should count long vowels as phonemic when they are reported as such. New Caledonian scholars agree that these languages are unusual in having very large vowel inventories, and we don't have enough to counter their wealth of expertise and peer-reviewed publication on the matter. The issue of phonemic length here may indeed be worth questioning further and would be an excellent study that I am certainly interested in pursuing together, but until we do that, we have to rely on what is published/reported by experts.

I agree with the second part, though: we should make sure all the NC languages are accounted for in the way the consensus in the available literature suggests. My conversations with Claire today make me think that we should actually be seeing higher counts in other nearby NC languages as well. The Northern languages may indeed be different.

maryewal commented 2 years ago

Regarding the stark difference between Aragur and Canala: the issue may be due to the source. Leenhardt is our source for Aragur, which has a red flag on it concerning reliability. Canala is coming from George Grace's work, so it is going to be more complete. But another question: we also have sets from CMF for both of these languages, both with decent coverage.

1576 | Xârâcùù | xara1244 | Claire MOYSE-FAURIE & Marie-Adèle NECHERÖ-JOREDIE | 250

1609 | Xârâgurè | xara1243 | Claire Moyse-Faurie | 198

Are these showing the same discrepancy? Or have we chosen the Leenhardt/Grace sets over CMF's for the analysis? @angela-mc @SimonGreenhill

maryewal commented 2 years ago

> Hmm, it looks like the outlier Ulithian should have 19 consonants not 45.

@SimonGreenhill We dealt quite a bit with Ulithian already (https://github.com/lexibank/abvdoceanic/issues/16), but maybe we need to check it through again to be sure the changes we made stuck? @antipodite should we take another look?

maryewal commented 2 years ago

> Sure! Do we have some expectations about the differences in #consonants or #vowels between sister-species?

@angela-mc This is an excellent question. My thinking is that we would expect very closely related languages to resemble each other in numbers of both consonants and vowels (among Oceanic languages). @SimonGreenhill, thoughts?

SimonGreenhill commented 2 years ago

Great, can we get @antipodite to look at the mismatches in the languages that overlap between phoible and abvd -- what are we getting wrong and why?

Angela: I don't think there are any intuitions, but could we use phoible to construct an empirical Weibull prior, e.g. get a distribution of differences between sisters in phoible and construct a Weibull to mimic that?
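One way the empirical prior could be constructed is a maximum-likelihood Weibull fit to the observed sister differences. The Python sketch below is only an illustration of that idea: the function name and the bisection bounds are assumptions, and it drops zero differences because the Weibull log-likelihood is undefined at 0, which is itself a modelling choice that would need discussion (an offset would be an alternative). It also assumes the remaining values are not all identical.

```python
import math


def fit_weibull(values, lo=0.05, hi=50.0, iterations=200):
    """Maximum-likelihood shape and scale for strictly positive values.

    Solves the standard profile-likelihood equation for the shape by
    bisection, then plugs the shape into the closed form for the scale.
    """
    xs = [x for x in values if x > 0]  # assumption: zeros are dropped
    logs = [math.log(x) for x in xs]
    mean_log = sum(logs) / len(xs)

    def score(k):
        weighted = sum((x ** k) * lx for x, lx in zip(xs, logs))
        total = sum(x ** k for x in xs)
        return weighted / total - 1.0 / k - mean_log

    # score(k) increases with k, so bisection finds its single root.
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = (lo + hi) / 2
    scale = (sum(x ** shape for x in xs) / len(xs)) ** (1 / shape)
    return shape, scale


# Hypothetical sister-pair differences, for illustration only.
example_diffs = [0, 1, 1, 2, 1, 3, 0, 2, 1, 4, 1, 2, 7]
shape, scale = fit_weibull(example_diffs)
print(round(shape, 2), round(scale, 2))
```

The fitted shape and scale could then be compared with the default (scale = 1.1, shape = 1.5) to see how far the empirical distribution departs from it.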

antipodite commented 2 years ago

> Hmm, it looks like the outlier Ulithian should have 19 consonants not 45.
>
> @SimonGreenhill We dealt quite a bit with Ulithian already (#16), but maybe we need to check it through again to be sure the changes we made stuck? @antipodite should we take another look?

IIRC the issue with Ulithian was that the data contained words from two sources with different orthographies. I thought we had already changed the orthography profile to correct this, but I'll have another look.

I'll investigate the mismatches between phoible and ABVD. I also have my own data on inventories for New Caledonia, so I can compare against that as well.

antipodite commented 2 years ago

Alright, I ran Simon's phoible to abvdoceanic comparison script again, comparing phoible inventories with inventories for the latest version of this repository.

(sorry, accidentally closed the issue) So we have some languages that have more phonemes in Phoible than in ABVD and some that have more in ABVD. These languages have more phonemes in ABVD than in Phoible, ordered by difference:

Of these:

Here is the data from the Phoible/ABVDOceanic comparison script:

| name | glottocode | total_abvd | total_phoible | delta | tokens_abvd | tokens_phoible |
| --- | --- | --- | --- | --- | --- | --- |
| Malo (North) | malo1243 | 35 | 21 | -14 | a b bʷ c d e i j k l m n o p pʰ pʷ r s sʷ t ts tʃ tʰ u w x ŋ ɔ ɛ ɣ β βʷ ⁿb ⁿbʷ ⁿd | a e i k l m mb mbʷ mʷ n nd nɟ o r s t u x ŋ β βʷ |
| Anejom (Aneityum) | anei1239 | 37 | 26 | -11 | a aː b dʒ dʒʱ e f h hʷ i iː j k kʷ l lʷ m n nʰ nʷ o oː p pː r s t ts tʰ u v w ŋ ɣ ɣʰ ɲ θ | a cç f h j k l m mʷ n p pʷ s t v w ŋ ɔ ɛ ɣ ɪ ɲ ɾ ʊ ʔ θ |
| Mwotlap | motl1237 | 32 | 22 | -10 | a b d h i j k kʰ l lʷ m mʷ n o p pʰ pʷ s t tʰ tʷ u w ŋ ɔ ɛ ɣ ɣʰ ɪ ɰ ʊ β | a e h i j kpʷ l m mb n nd o s t u v w ŋ ŋmʷ ɣ ɪ ʊ |
| Manam | mana1295 | 27 | 18 | -9 | a b bʷ d dʒ e g gʷ i j k l m mʷ n o p pʷ r s t u w z ŋ ɪ ʔ | a b d e i k l m n o p s t u z ŋ ɡ ɾ |
| Marquesan | nort2845 | 30 | 22 | -8 | a aː d e eː f g h hː i iː ì j k kː m n nː o oː p pʰ r t tː u uː v w ʔ | a aː e eː f h i iː k m n o oː p r s t u uː v ç ʔ |
| Lenakel | lena1238 | 29 | 21 | -8 | a aː e eː h i iː k l lʰ m mʰ mʷ n nʰ o p pʷ r s t u uː v vʰ w ŋ ŋʰ ə | a e̞ f h i k l m mʷ n o̞ p pʷ s t u w ŋ ə ɰ ɾ |
| Roro | waim1251 | 22 | 14 | -8 | a ã b e h i iː ĩ j k l m n o p r s t u v w ʔ | a b e̞ h i k m n̪ o̞ p t̪ u ɾ̪ ʔ |
| Wogeo | woge1237 | 27 | 20 | -7 | a b bʷ d dʒ e f g i j k kʷ l m mʷ n o p r s t u v w x ŋ ɲ | a b d̪ e f i j k l̪ m n̪ o r s t̪ u v ŋ ɡ ɲ |
| Kwaio | kwai1243 | 26 | 21 | -5 | a aː b d e eː f g gʷ i iː k kʷ l m n o oː r s t u uː w ŋ ʔ | a e̞ i l m mb n nd o̞ s t u w x xʷ ŋ ŋɡ ŋɡʷ ŋʷ ɸ ʔ |
| Ponapean | pohn1238 | 24 | 20 | -4 | a d e eː g h i j k l m mʷ mː n o p pʷ r s t u w ŋ ɛ | a j k l̪ m mʷ n̪ o p pʷ r s t̪ u w ŋ ɔ ɛ ɪ ʈʂ |
| Hoava | hoav1238 | 24 | 21 | -3 | a b d dʒ e g h i k l m n o p r s t u v z ŋ ɔ ɛ ɣ | a b d d̠ʒ h i k l m n p r s t u ŋ ɔ ɛ ɡ ɣ β |
| Siar | siar1238 | 25 | 22 | -3 | a b d e f g h i j k l m n nː o p r rʷ s t u v w ŋ ə | b d e e̝ i j k l m n o o̝ p r s t u w ŋ ɑ ɡ ɸ |
| Bariai | bari1286 | 20 | 17 | -3 | a b d e g i k l m n o p r s t u uː v w ŋ | b d e i k l m n o p r s t u ŋ ɑ ɡ |
| Sursurunga | surs1246 | 24 | 21 | -3 | a b d e g h i j k l lʰ lʷ m n o p r s t u v w ŋ ə | b d e h i j k l m n o p r s t u w ŋ ɐ ə ɡ |
| Dobu | dobu1241 | 27 | 24 | -3 | a b bʷ d e g gʷ i j k kʷ l m mʷ n o oː p pʷ r s t u uː w ʔ ʔʷ | a b bʷ d i j k kʷ l m mʷ n p pʷ s t u w ɔ ɛ ɡ ɡʷ ʔ ʔʷ |
| Poai | fwai1237 | 38 | 35 | -3 | a aː c dʒ e ẽ f g i k l m n o oː p r sʰ ts t̪ u v w ø ŋ ɛ ɣʷ ɲ ʔ ʝ ʰk ʰm ʰn ʰp ʰv ʰʝ ⁿb ⁿd | a aɨ ă e f h i ĭ j k kʰ l̪\|l m n̪\|n ŏ p pʰ t̠ʃ t̠ʃʰ t̪\|t t̪ʰ\|tʰ u ŭ v ŋ ɔ ɔ̆ ə ɛ ɛ̆ ɨ ɨ̆ ɬ̪\|ɬ ʃ ʔ |
| Loniu | loni1238 | 25 | 23 | -2 | a c d e h i j k l m mʷ n o p pʷ r s t ts u w ŋ ɪ ɲ ʔ | a e h i j k l m mʷ n o p pʷ r s t t̠ʃ u w ŋ ɔ ɛ ɲ |
| Tongan | tong1325 | 19 | 17 | -2 | a aː e eː f h i k l m n o p s t u v ŋ ʔ | a e f h i k m n̪ o p s t̪ u v ŋ ɺ ʔ |
| Maori | maor1246 | 21 | 20 | -1 | a aː e eː g h i iː k m n o oː p r t u uː w wʰ wʱ | f h i iː k m n oː o̞ p t u uː w ŋ ɑ ɑː ɛ ɛː ɾ |
| Mafea | mafe1237 | 25 | 24 | -1 | a e f h i k l m mʰ mʰː n nʰ o p r s t u v vʱ w ŋ ʈ ʔ β | e i k l lː m mː m̼ n nː o p p̼ r rː s t u v v̼ ŋ ɑ ɖ ɰ |

Just checking some other ones, more soon
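To see where a given mismatch comes from, the two token lists can be compared as sets. The Python sketch below does this for the Manam row above. The function name is hypothetical and the approach is only an assumption about how one might inspect the rows, not part of the comparison script. Tokens are compared as exact strings, so notational variants count as different segments.

```python
def inventory_diff(tokens_abvd, tokens_phoible):
    """Split two space-separated inventories into shared and unique tokens."""
    abvd = set(tokens_abvd.split())
    phoible = set(tokens_phoible.split())
    return {
        "only_abvd": sorted(abvd - phoible),
        "only_phoible": sorted(phoible - abvd),
        "shared": sorted(abvd & phoible),
        "delta": len(phoible) - len(abvd),
    }


# Manam (mana1295), copied from the comparison output above.
manam_abvd = "a b bʷ d dʒ e g gʷ i j k l m mʷ n o p pʷ r s t u w z ŋ ɪ ʔ"
manam_phoible = "a b d e i k l m n o p s t u z ŋ ɡ ɾ"

result = inventory_diff(manam_abvd, manam_phoible)
print("only in ABVD:   ", " ".join(result["only_abvd"]))
print("only in Phoible:", " ".join(result["only_phoible"]))
print("delta:", result["delta"])
```

A listing like this separates differences that look notational (for example the same sound written with different characters in the two sources) from segments that only one source reports, such as the labialised series in the ABVD tokens.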