lexibank / bowernpny

CLDF dataset derived from Bowern and Atkinson's "Internal Structure of Pama-Nyungan" from 2012
Creative Commons Attribution 4.0 International

Orthography profile: non meaningful contrasts ? #12

Open XachaB opened 3 years ago

XachaB commented 3 years ago

Hi,

I have the impression that there might be spurious contrasts in the orthography profile, in particular voicing contrasts in occlusives (p/b, g/k, t/d). I also suspect that the ɹ/r contrast is maybe not meaningful.

I suggest that we figure out precisely which contrasts are due to variation in descriptive practice and which are genuine contrasts attributable to sound change etc., and then neutralize the meaningless ones.

As to how to normalize, we have three other datasets with languages from these families, and should make sure we are using the same notations:

| Lexibank dataset | Sounds found |
| --- | --- |
| bowernpny | + _ a aː ã b bː c cʷ cː d dʒ dʱ dː d̪ e eː f g gʷ gː h i iː j k kʷ kː l lʷ lː l̪ m mː n nʲ nː n̪ n̪ː o oː p pː q qː r rː s t tʃ tʲ tː t̪ t̪ʷ t̪ː u uː ũ v w x yː z æ ð ø ŋ œ ɐ ɑː ɔ ɖ ə ɛ ɛː ɜ ɣ ɤ ɨ ɪ ɭ ɲ ɳ ɹ ɽ ɾ ʀ ʈ ʊ ʒ ʔ ʔʲ ˀb ˀd ˀdʒ ˀk ˀm ˀn ˀr ˀt ˀt̪ ˀw ˀɭ β θ |
| johanssonsoundsymbolic | + a aː c i j k l l̪ m n n̪ p r rː t t̪ u w ŋ ɭ ɲ ɳ ɽ ʎ |
| joophonosemantic | a aː i j k l l̻ m n n̻ p r t t̻ u uː w ŋ ɭ ɳ ɽ ʈ |
| wold | + _ a g i iː j k l m n p r t u w y ŋ ɲ |

As you can see, the other ones use p-k-t, not b-g-d, and have a single /r/ sound. https://github.com/lexibank/wold does have a k/g contrast (if it is also meaningless, we should change it there).
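To make the proposal concrete, here is a minimal sketch of what neutralizing a voicing contrast at the inventory level could look like. The mapping `VOICING_MAP` and the function `neutralize` are hypothetical names for illustration, not part of any Lexibank tooling, and whether these symbols should be collapsed at all is exactly what needs to be decided per language.

```python
# Hypothetical mapping: collapse voiced stop symbols onto their voiceless
# counterparts. Whether this is appropriate is an open question per language.
VOICING_MAP = {"b": "p", "d": "t", "g": "k"}


def neutralize(inventory, mapping=VOICING_MAP):
    """Return the inventory with every mapped symbol replaced by its target."""
    return {mapping.get(sound, sound) for sound in inventory}


# Stop and rhotic symbols found for wold (Gurindji) in the table above.
wold_gurindji = {"g", "k", "p", "r", "t"}
print(sorted(neutralize(wold_gurindji)))  # ['k', 'p', 'r', 't']
```

Working on sets keeps the comparison between datasets simple, but it discards where each symbol occurs, so it cannot tell an orthographic variant from a positional allophone.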

@erichround, @chirila, could you chime in on whether these contrasts should be kept here? Are there other contrasts in the list above that should be neutralized? @tresoldi, it looks like the orthography profile came from you; do you remember if there were specific motivations for these contrasts?

For a closer look, the list of sounds with counts can be found in the TRANSCRIPTION file: https://github.com/lexibank/bowernpny/blob/master/TRANSCRIPTION.md

Non-meaningful contrasts cause problems for downstream analyses of the data, especially the sound correspondence study.

XachaB commented 3 years ago

In case it helps assess the (potential) problem, here is a (long) detailed list of which sets of sounds among p/k/t/b/d/g/ɹ/r etc. are present in each language:

dataset Glottocode Language_ID sounds_contrasts
bowernpny pall1243 Pallanganmiddang b d g k p r t
bowernpny ande1247 kalk1246 maln1239 mart1256 pitj1243 walm1241 wanm1242 warl1255 west2441 Antekerrepenhe Kalkatungu Malngin MartuWangka Pitjantjatjara SouthernWalmajarri Warlmanpa Warnman WesternArrarnta k p r t ɹ
bowernpny kanj1260 umpi1239 Kaanju Umpila k p r t
bowernpny alya1239 djuw1238 guga1239 kurr1243 mang1383 ngaa1240 nort2753 nort2754 nyan1301 pint1250 pint1251 warl1254 Alyawarr Jiwarliny Kukatj Kurrama Ngaanyatjarra Ngardily NorthernMangarla NorthernNyangumarta Nyangumarta PintupiLuritja Wangkatja Warlpiri g k p r t ɹ
bowernpny wang1289 Wangkayutyuru d k p r t ɹ
bowernpny wikm1247 WikMungkan d g k p r t ɹ
bowernpny yulp1238 Yulparija b k p r t ɹ
bowernpny kart1247 waru1265 Kartujarra Warumungu b g k p r t ɹ
bowernpny guri1247 paka1251 yany1243 yarl1238 Gurindji Pakanh Yanyuwa Yarluyandi b d k p r t ɹ
bowernpny kara1476 KarajarriNW b d g ɹ
bowernpny kung1258 Gunggari b d g r t ɹ
bowernpny duun1241 gudj1237 wulg1239 Duungidjawu Gudjal Wulguru b d g k r ɹ
bowernpny dhur1239 yaga1256 Dhurga Durubul b d g k r t ɹ
bowernpny kuku1280 ngad1257 KuguNganhcara Ngadjuri b d g k p t ɹ
bowernpny warr1255 Wargamay b d g k p r ɹ ɽ
bowernpny baty1234 bili1250 kumb1268 waru1264 Batyala Bilinarra Gumbaynggir Warungu b d g k p r ɹ
bowernpny gurd1238 Kurtjar b d g k p r t ɹ ʀ
bowernpny kala1380 muru1266 yind1248 Garlali Muruwari Yindjilandji b d g k p r t ɹ ɾ
bowernpny dyir1250 Dyirbal b d g k p r t ɹ ɽ
bowernpny adny1235 angu1242 arab1267 awab1243 badi1246 badj1244 band1358 bang1339 bayu1240 bidy1243 biri1256 birr1241 bula1255 bung1264 cola1237 darl1243 dayi1244 dhal1245 dhan1270 dhar1247 dhar1248 dhud1236 dier1241 djam1256 djap1238 djin1253 djiw1241 flin1247 gami1243 gang1268 gath1234 gugu1255 guma1253 gund1249 guny1241 gupa1247 gure1255 guwa1242 guya1249 hawk1239 jaru1254 kala1377 kala1379 kara1476 kari1304 karr1236 kaur1267 kera1256 kuka1246 kuku1273 kula1275 kuun1236 leni1238 lowe1402 madh1244 malg1242 maly1234 marg1253 maya1280 mayi1234 mayi1235 mayi1236 mbab1239 minj1242 mith1236 narr1259 naru1238 ngad1258 ngam1284 ngar1235 ngar1287 ngar1296 ngar1297 ngaw1240 ngun1277 nhan1238 nhir1234 nort2760 nyam1271 nyun1247 pany1241 pirr1240 pitt1247 rirr1238 rita1239 sydn1236 thur1254 wadi1249 wadi1260 waga1260 waja1257 waju1234 wang1290 wang1291 ward1248 wari1262 warl1256 wath1238 wira1262 wira1265 woiw1238 wong1246 yabu1234 yaga1262 yala1262 yand1253 yann1237 yawa1258 yidi1250 yind1247 yira1239 yort1237 yuga1244 yuwa1242 Adnyamathanha Arabana Awabakal Badimaya Badjiri Bandjalang Biri Birrpayi Bularnu Bunganditj Colac Darkinyung Dhangu Dharawal Dharuk Dharumbal Dhayyi Dhudhuroa Diyari Djambarrpuyngu Djapu Djinang FlindersIsland Gamilaraay Gangulu GoorengGooreng Gumatj Gunditjmara Gundungurra Gunya Gupapuyngu GuuguYimidhirr Guwa Guyani Iyora Jaru Jiwarli Kamilaroi Karajarri Kariyarra Karuwali Katthang Kaurna Keramin Kukatja KukuYalanji Kungkari Kurnu Linngithigh Mabuiag Malgana Malyangapa Margany MathiMathi MayiKulan MayiKutuna MayiThakurti MayiYapi Mbabaram Mbakwithi Minjungbal Mirniny Mithaka Narrungga Ngadjumaya Ngaiawang Ngamini Ngarigu Ngarinyman Ngarla Ngarluma Ngarrindjeri Ngawun Ngiyambaa Ngunawal Nhanta Nhirrpi Nyamal Nyungar Paakantyi Panyjima Parnkala Payungu Piangil Pirriya PittaPitta Rirratjingu Ritharrngu Thalanyji Tharrgari Wajarri Wakaya Wangkangurru Wangkumara Wardandi Warluwarra Warriyangga Wathawurrung Wathiwathi Watjuk Wiradjuri Wirangu Woiwurrung 
YabulaYabula Yagara Yalarnnga Yandruwandha Yannhangu Yawarrawarrka Yidiny Yindjibarndi Yiningay Yirandali YortaYorta Yugambeh Yuwaalaraay b d g k p r t ɹ
bowernpny dyaa1242 Djabugay b d g p r t ɹ
bowernpny kuuk1238 KuukuYau k p t ɹ
johanssonsoundsymbolic joophonosemantic ngar1287 pitj1243 Ngarluma Pitjantjatjara k p r t ɽ
wold guri1247 Gurindji g k p r t

Quite a few seem to have g/k, maybe a clue that it is sometimes contrastive ?
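For reference, a table like the one above can be produced by intersecting each language's segments with the symbols of interest and grouping languages that share the same result. This is only a sketch: `TARGETS`, `group_by_contrasts` and the toy segment lists are assumptions for illustration (the lists mirror three rows of the table, they are not actual forms from the dataset).

```python
from collections import defaultdict

# Symbols of interest: stops and rhotics discussed in this issue.
TARGETS = {"p", "b", "t", "d", "k", "g", "r", "ɹ", "ɽ", "ɾ", "ʀ"}


def group_by_contrasts(segments_by_language, targets=TARGETS):
    """Group languages by the subset of target symbols they use."""
    groups = defaultdict(list)
    for language, segments in segments_by_language.items():
        key = " ".join(sorted(set(segments) & targets))
        groups[key].append(language)
    return dict(groups)


# Toy input, NOT real forms: segment lists consistent with three table rows.
toy = {
    "Kaanju": ["k", "a", "p", "r", "t"],
    "Umpila": ["t", "a", "k", "p", "r"],
    "Yulparija": ["b", "k", "p", "r", "t", "ɹ", "a"],
}
for contrasts, languages in group_by_contrasts(toy).items():
    print(contrasts, "<-", ", ".join(languages))
```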

LinguList commented 3 years ago

I suggest also running a direct comparison of this dataset against the inventories as they are provided in https://github.com/cldf-clts/clts/issues/76

As we integrate all data now, we could then directly compare basic similarities, etc.

tresoldi commented 3 years ago

Some differences might be explained by different concept sets, others by the sample: wold has only Gurindji, while bowernpny has almost 200 varieties. I remember that my main sources, along with references listed in Glottolog which I could track down online, were the inventories provided in Phoible (which, for Pama-Nyungan languages, are in almost all cases given by @erichround).

Now, there are probably errors or at least questionable transcriptions, undoubtedly -- I am no expert in Pama-Nyungan by any means, and this is an old profile (from when we had a single one per dataset). Nonetheless, I suppose it is in part also related to the solutions different authors employ for rendering in the transcription not only contrasts in stop series, but surface differences. First, it might not be necessarily what we'd "expect" in terms of modal voice vs. fully open glottis -- after all, it is true that in most languages of Australia (with the exception of some Northern ones) we don't expect a voicing distinction and, even more, the strong phonological similarities are one of the main reasons for determining it as a family. There are cases where the consonants are only semivoiced, but the IPA graphemes for voiced consonants are used (this is the case of Hercus in her grammar of Wirangu, here).

Second, and more important, they are not necessarily phonological in terms of a global contrastive correspondence. If you look at the original files in the history, where I also included the (automatic) alignments that I used for studying the dataset, you can see patterns like k-g-k everywhere. I remember there are even a handful of synonyms which were just expressing k/g or t/d as allophones. But take the example from Alpher (2004) cited by Miceli (2005) (here, and see the example of Wirangu discussed above):

| Language | Form | Meaning |
| --- | --- | --- |
| PPN | *kampa- | ‘cook in earth oven’ |
| Uradhi | aβa- | ‘cover with sand’ |
| Wik-Mungknh | ka:mp- | ‘cook in earth oven’ |
| Djabugay | gampa(:) | ‘cook in earth oven’ |
| Wirangu | gamba- | ‘cook, eat’ |
| Kaytetye | ampe- | ‘burn’ |
| Manjiljarra | kampa | ‘cook, burn’ |
| Warlpiri | kampa- | ‘be burning – of fire; burn it – of fire’ |
| Walmajarri | kampa | ‘cook it’ |
| Nyangumarta | kampa- | ‘cook it’ (tr), ‘burn’ (intr) |
| Martuthunira | kampa | ‘be burning, be cooking’ |
| Jiwarli | kampa- | ‘cook, burn’ |
| Yingkarta | kampa-ñi | ‘be burning, be cooking’ |

Djabugay and Wirangu have a word-initial g- for k-, and Wirangu has a -b- for -p-. There are many things going on: in some cases it looks like an orthographic preference (the "descriptive practice" you mention), in others it is articulatory information that is not contrastive (i.e., it is phonetic rather than phonological), in some cases it looks like free variation while in others it is positional, and so on.

While bowernpny, the Lexibank dataset, would surely benefit from reviewed individual profiles, we would still face these "issues". I am not sure how you should treat them for the study on correspondences.

LinguList commented 3 years ago

Please check against my latest commit: I have already cleaned up the data further, since there were many invalid segments.

LinguList commented 3 years ago

I had this in another branch and have now merged it into the master branch.

XachaB commented 3 years ago

Do you mean commit https://github.com/lexibank/bowernpny/commit/3e48d897e456a607c4091c5999a8b308a2a0d08b? It is called "update to get rid of bad forms", but it does not change forms.csv, so I am not sure I understand what the changes are.

erichround commented 3 years ago

Hi all,

It’s late here in Aus so I’ll be brief.

It would be good to see an updated list similar to what Sacha sent, but with Mattis’s committed fixes. The list Sacha sent contains very many languages with voiced and voiceless symbols where there’s no phonemic contrast (cf the Phoible inventories you mentioned, Tiago). Little of that variation will be due to careful phonetic transcriptions in the originals, rather it’ll be mostly idiosyncratic orthographic decisions by linguists that aren’t consistent across languages.

In almost every Australian language there are at least two rhotics, so you can trust that the ɹ/r contrast is mainly correct.

A quick question which will help us with the paper on sound correspondences: is Lexibank in general (beyond Australia) also a mix of allophones in some languages and phonemes in others? Or would it be overwhelmingly phonemic?

Best, Erich

LinguList commented 3 years ago

Check the last version of lexibank script and orthography profile.

XachaB commented 3 years ago

Here is the updated table after checking out the current master version:

dataset Glottocode Language_ID sounds_contrasts
bowernpny pall1243 Pallanganmiddang b d g k p r t
bowernpny adny1235 angu1242 arab1267 awab1243 badi1246 badj1244 band1358 bang1339 bayu1240 bidy1243 biri1256 birr1241 bula1255 bung1264 cola1237 darl1243 dayi1244 dhal1245 dhan1270 dhar1247 dhar1248 dhud1236 dier1241 djam1256 djap1238 djin1253 djiw1241 flin1247 gami1243 gang1268 gath1234 gugu1255 guma1253 gund1249 guny1241 gupa1247 gure1255 guwa1242 guya1249 hawk1239 jaru1254 kala1377 kala1379 kara1476 kari1304 karr1236 kaur1267 kera1256 kuka1246 kuku1273 kula1275 kuun1236 leni1238 lowe1402 madh1244 malg1242 maly1234 marg1253 maya1280 mayi1234 mayi1235 mayi1236 mbab1239 minj1242 mith1236 narr1259 naru1238 ngad1258 ngam1284 ngar1235 ngar1287 ngar1296 ngar1297 ngaw1240 ngun1277 nhan1238 nhir1234 nort2760 nyam1271 nyun1247 pany1241 pirr1240 pitt1247 rirr1238 rita1239 sydn1236 thur1254 wadi1249 wadi1260 waga1260 waja1257 waju1234 wang1290 wang1291 ward1248 wari1262 warl1256 wath1238 wira1262 wira1265 woiw1238 wong1246 yabu1234 yaga1262 yala1262 yand1253 yann1237 yawa1258 yidi1250 yind1247 yira1239 yort1237 yuga1244 yuwa1242 Adnyamathanha Arabana Awabakal Badimaya Badjiri Bandjalang Biri Birrpayi Bularnu Bunganditj Colac Darkinyung Dhangu Dharawal Dharuk Dharumbal Dhayyi Dhudhuroa Diyari Djambarrpuyngu Djapu Djinang FlindersIsland Gamilaraay Gangulu GoorengGooreng Gumatj Gunditjmara Gundungurra Gunya Gupapuyngu GuuguYimidhirr Guwa Guyani Iyora Jaru Jiwarli Kamilaroi Karajarri Kariyarra Karuwali Katthang Kaurna Keramin Kukatja KukuYalanji Kungkari Kurnu Linngithigh Mabuiag Malgana Malyangapa Margany MathiMathi MayiKulan MayiKutuna MayiThakurti MayiYapi Mbabaram Mbakwithi Minjungbal Mirniny Mithaka Narrungga Ngadjumaya Ngaiawang Ngamini Ngarigu Ngarinyman Ngarla Ngarluma Ngarrindjeri Ngawun Ngiyambaa Ngunawal Nhanta Nhirrpi Nyamal Nyungar Paakantyi Panyjima Parnkala Payungu Piangil Pirriya PittaPitta Rirratjingu Ritharrngu Thalanyji Tharrgari Wajarri Wakaya Wangkangurru Wangkumara Wardandi Warluwarra Warriyangga Wathawurrung Wathiwathi Watjuk Wiradjuri Wirangu Woiwurrung 
YabulaYabula Yagara Yalarnnga Yandruwandha Yannhangu Yawarrawarrka Yidiny Yindjibarndi Yiningay Yirandali YortaYorta Yugambeh Yuwaalaraay b d g k p r t ɹ
bowernpny dyir1250 Dyirbal b d g k p r t ɹ ɽ
bowernpny kala1380 muru1266 yind1248 Garlali Muruwari Yindjilandji b d g k p r t ɹ ɾ
bowernpny gurd1238 Kurtjar b d g k p r t ɹ ʀ
bowernpny baty1234 bili1250 kumb1268 waru1264 Batyala Bilinarra Gumbaynggir Warungu b d g k p r ɹ
bowernpny warr1255 Wargamay b d g k p r ɹ ɽ
bowernpny kuku1280 ngad1257 KuguNganhcara Ngadjuri b d g k p t ɹ
bowernpny dhur1239 yaga1256 Dhurga Durubul b d g k r t ɹ
bowernpny duun1241 gudj1237 wulg1239 Duungidjawu Gudjal Wulguru b d g k r ɹ
bowernpny dyaa1242 Djabugay b d g p r t ɹ
bowernpny kung1258 Gunggari b d g r t ɹ
bowernpny kara1476 KarajarriNW b d g ɹ
bowernpny guri1247 paka1251 yany1243 yarl1238 Gurindji Pakanh Yanyuwa Yarluyandi b d k p r t ɹ
bowernpny kart1247 waru1265 Kartujarra Warumungu b g k p r t ɹ
bowernpny yulp1238 Yulparija b k p r t ɹ
bowernpny wikm1247 WikMungkan d g k p r t ɹ
bowernpny wang1289 Wangkayutyuru d k p r t ɹ
wold guri1247 Gurindji g k p r t
bowernpny alya1239 djuw1238 guga1239 kurr1243 mang1383 ngaa1240 nort2753 nort2754 nyan1301 pint1250 pint1251 warl1254 Alyawarr Jiwarliny Kukatj Kurrama Ngaanyatjarra Ngardily NorthernMangarla NorthernNyangumarta Nyangumarta PintupiLuritja Wangkatja Warlpiri g k p r t ɹ
bowernpny kanj1260 umpi1239 Kaanju Umpila k p r t
bowernpny ande1247 kalk1246 maln1239 mart1256 pitj1243 walm1241 wanm1242 warl1255 west2441 Antekerrepenhe Kalkatungu Malngin MartuWangka Pitjantjatjara SouthernWalmajarri Warlmanpa Warnman WesternArrarnta k p r t ɹ
johanssonsoundsymbolic joophonosemantic ngar1287 pitj1243 Ngarluma Pitjantjatjara k p r t ɽ
bowernpny kuuk1238 KuukuYau k p t ɹ

XachaB commented 3 years ago

For example, for Pallanganmiddang, Phoible gives the sounds: "a e i j k l l̪ m n n̪ o p r t t̪ u w ŋ ȴ ȵ ȶ ɭ ɳ ɻ ʈ", but here we have (among others): "b d g k p r t". The Phoible inventory (this one is from @erichround) does not distinguish k/g, p/b, or t/d.

Is there a way for us to automatically compare, language by language, the set of sounds given in Phoible and the set of sounds used in a dataset? We might catch a lot of small variations in notation by doing that systematically.
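As a rough idea of what I mean, here is a minimal sketch using plain set differences on the Pallanganmiddang symbols quoted above. The function name `compare_inventories` is hypothetical, and the dataset side only contains the subset of sounds listed in the table, so only the "only in dataset" direction is meaningful here.

```python
def compare_inventories(dataset_sounds, reference_sounds):
    """Report symbols found on only one side of the comparison."""
    dataset, reference = set(dataset_sounds), set(reference_sounds)
    return {
        "only_in_dataset": sorted(dataset - reference),
        "only_in_reference": sorted(reference - dataset),
    }


# Phoible inventory for Pallanganmiddang, as quoted above.
phoible = "a e i j k l l̪ m n n̪ o p r t t̪ u w ŋ ȴ ȵ ȶ ɭ ɳ ɻ ʈ".split()
# Subset of the bowernpny sounds for the same language (stops and rhotics only).
bowernpny_subset = "b d g k p r t".split()

report = compare_inventories(bowernpny_subset, phoible)
print(report["only_in_dataset"])  # ['b', 'd', 'g']
```

Run over every language with a full inventory on both sides, the two lists would flag both real mismatches and pure differences in glyph choice, which would still need to be told apart by hand.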

LinguList commented 3 years ago

My PR to lexicore essentially enumerates all the features. If you wait until I have finished the work on pylexicore, you can do this very easily.

XachaB commented 3 years ago

Amazing ! That's excellent. Looking forward to it !

LinguList commented 3 years ago

Please see here for an example of how this can be done.

chirila commented 3 years ago

It's possible that there are some, but you can't look at this through the dataset as a whole, since these are points on which the languages vary. For example, some languages have only two rhotic phonemes (IPA ɹ and r, practical orthography r and rr); others have three, either ɹ, r and ɾ (or ɽ), sometimes written as something else. Likewise with voicing, where some languages only have a marginal contrast. I'm happy to look at Mattis' examples but I don't have access to lexibank. If you want to give me access I'm happy to take a look through. Claire

On Thu, Nov 26, 2020 at 8:15 AM Johann-Mattis List notifications@github.com wrote:

Please see here https://github.com/lexibank/pylexicore/blob/main/examples/pamanyungan/report.md for an example of how this can be done.

--

Claire Bowern Professor Editor: Diachronica Department of Linguistics, Yale University she/her or they/them

LinguList commented 3 years ago

You were sent an invitation now.

chirila commented 3 years ago

Pallanganmiddang is probably not the best source for an example as it's reconstituted from old sources. Blake and Reid (1999) write both voiced and voiceless segments, since both occur both initially and medially, but it's not too clear if they actually contrast. For vowel length, the spellings in the old sources suggest both long and short vowels but the material is not systematic enough to be sure.

On Thu, Nov 26, 2020 at 7:18 AM Sacha notifications@github.com wrote:

For example, for Pallanganmiddang, Phoible gives the sounds: "a e i j k l l̪ m n n̪ o p r t t̪ u w ŋ ȴ ȵ ȶ ɭ ɳ ɻ ʈ", but here we have (among others): "b d g k p r t". Phoible (this one is from @erichround https://github.com/erichround ), does not distinguish k/g p/b t/d. Is vowel length distinctive in Pallanganmiddang ?

chirila commented 3 years ago

thanks, got it.

It looks like there are a fair number of just different glyph choices here. e.g. ɽ or ɻ vs ɹ; ʈ vs ƫ, ţ, or ȶ (though these of course mean different things). For Adnyamathanha v is equivalent to β (alternative representations of the same segment). Phoible in general goes for maximal specificity of representation whereas both @erichround and I have gone more for comparability.
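A minimal sketch of how such glyph equivalences could be recorded per language rather than for the whole dataset, since the same symbol can be a mere variant in one language and a separate phoneme in another. The names `GLYPH_ALIASES` and `normalize_segments` are hypothetical, and choosing β rather than v as the target for Adnyamathanha is an arbitrary assumption made only for the example.

```python
# Per-language alias tables (hypothetical). Only the Adnyamathanha v/β
# equivalence comes from the discussion; the direction v -> β is an assumption.
GLYPH_ALIASES = {
    "Adnyamathanha": {"v": "β"},
}


def normalize_segments(segments, language, aliases=GLYPH_ALIASES):
    """Replace alias glyphs with the chosen representative for one language."""
    table = aliases.get(language, {})
    return [table.get(segment, segment) for segment in segments]


# Toy segment sequence, not a real form from the dataset.
print(normalize_segments(["a", "v", "a"], "Adnyamathanha"))  # ['a', 'β', 'a']
# A language without an alias table is left untouched.
print(normalize_segments(["a", "v", "a"], "Wirangu"))  # ['a', 'v', 'a']
```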

On Fri, Nov 27, 2020 at 11:59 AM Johann-Mattis List < notifications@github.com> wrote:

You were sent an invitation now.

XachaB commented 3 years ago

Thanks! How could we go about getting to a better representation for this dataset? I understand the first step is to have separate profiles for each language, but beyond that, I am not sure how to decide on a specific transcription. Should we follow Phoible, and if not, how can we get more comparable representations? In its current state, I think the transcription here follows neither Phoible nor a more comparable inventory set.