phoible / dev

PHOIBLE data and development.
https://phoible.org/
GNU General Public License v3.0

voiceless aspiration diacritic on voiced base glyphs (d, n, r) in EA #346

Closed davidainman closed 2 years ago

davidainman commented 2 years ago

I have discovered voiced stops with a superscript h (see comment in closed issue #102). These segments also have incorrect feature arrays which make them appear voiceless aspirated rather than voiced breathy.

drammock commented 2 years ago

Thanks for the report! I confirm that we do seem to have some incorrect diacritics. However, reopening #102 is not the right move... that issue was specifically about a palatal stop in the Ramaswami data source. If I do

git grep dʰ

on the current repository I don't get any hits inside raw-data/RA; they all seem to be from the EA data source. Looks like we can just update this line:

https://github.com/phoible/dev/blob/master/raw-data/EA/EA_IPA_correspondences.tsv#L190

and it will fix them all (the feature vectors are auto-computed nowadays, so fixing the diacritic should fix the features too). A similar fix will apply to nʰ and rʰ (both only show up in languages from EA). Are there any other cases of voiceless-h-diacritic-on-voiced-glyph you've noticed @davidainman? Or is it just those three?
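The two diacritics look almost identical in many fonts, so inspecting code points can help when hunting for further cases. The following is a minimal Python sketch, not part of the PHOIBLE tooling; the `VOICED_BASES` set and the helper name are illustrative assumptions that cover only the three base glyphs named in this issue.

```python
import unicodedata

ASPIRATION = "\u02b0"  # ʰ MODIFIER LETTER SMALL H (voiceless aspiration)
BREATHY = "\u02b1"     # ʱ MODIFIER LETTER SMALL H WITH HOOK (breathy release)

# Assumption: only the three voiced base glyphs named in this issue.
VOICED_BASES = {"d", "n", "r"}


def has_voiceless_h_on_voiced_base(segment: str) -> bool:
    """Return True if a voiced base glyph is directly followed by ʰ.

    Limitation: a combining diacritic between base and ʰ is not handled.
    """
    return any(
        base in VOICED_BASES and mark == ASPIRATION
        for base, mark in zip(segment, segment[1:])
    )


for seg in ["d\u02b0", "d\u02b1", "n\u02b0", "t\u02b0"]:
    names = [unicodedata.name(ch) for ch in seg]
    print(seg, names, has_voiceless_h_on_voiced_base(seg))
```

Running a check like this over every distinct segment would flag other voiced bases once they are added to the set.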

drammock commented 2 years ago

Digging a little deeper, this is perhaps not so straightforward. EA/EA_IPA_correspondences.tsv has entries for both dʰ and dʱ, so presumably they're meant to represent different sounds; I can't blindly map dʰ to dʱ. I mean, I could, because there aren't any languages that have both:

phoible %>%
    group_by(InventoryID) %>%
    filter("dʰ" %in% Phoneme & "dʱ" %in% Phoneme) %>%
    select(InventoryID, Glottocode, LanguageName, Source) %>%
    distinct()
# A tibble: 0 x 4
# Groups:   InventoryID [0]
# … with 4 variables: InventoryID <int>, Glottocode <chr>, LanguageName <chr>, Source <chr>

But I'd prefer to honor the intent and figure out what a proper phoible-style-IPA representation of dʰ ought to be.

FYI, here are the affected inventories:

phoible %>%
    group_by(InventoryID) %>%
    filter("dʰ" %in% Phoneme | "nʰ" %in% Phoneme | "rʰ" %in% Phoneme) %>%
    select(InventoryID, Glottocode, LanguageName, Source) %>%
    distinct()
# # A tibble: 9 x 4
# # Groups:   InventoryID [9]
#   InventoryID Glottocode LanguageName         Source
#         <int> <chr>      <chr>                <chr> 
# 1        2357 bhoj1244   Bhojpuri             ea    
# 2        2454 maga1260   Magahi               ea    
# 3        2463 dhim1246   Dhimal               ea    
# 4        2498 east2304   Eastern Hill Balochi ea    
# 5        2528 kork1243   Korku                ea    
# 6        2544 east2347   Tamang               ea    
# 7        2550 sant1410   Santali              ea    
# 8        2559 mala1464   Malayalam            ea    
# 9        2598 assa1263   Assamese             ea

davidainman commented 2 years ago

Okay. This was affecting work I was doing with some colleagues using PHOIBLE data. Should I wait for now (do you have time to fix this in the next few weeks) or do an internal transform/ignore these segments?

drammock commented 2 years ago

@davidainman I've opened a pull request to fix this issue. In the end I did end up "collapsing" things like dʰ and dʱ (and also d̤ʱ) on the grounds that there were no inventories where more than one of those was present, and after checking the sources of several of the affected languages I didn't find anything to suggest that anything other than breathy voicing was intended (it's possible I'm wrong here; I didn't check every single source). I think this is a case of @macleginn (compiler of EA) preserving the symbols used in the original resources, and us not applying our symbol conventions properly when we imported the EA data. But if I'm wrong about this, @macleginn, please chime in!
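For anyone who needs a local workaround in the meantime, the collapsing described above can be sketched roughly as follows. This is an illustration only, not the code from the pull request: the thread does not state which representation the PR settled on, so `TARGET` is a placeholder assumption, and the function names are hypothetical.

```python
# Assumption: "dʱ" stands in for whatever target the PR actually chose;
# the thread does not state the final PHOIBLE representation.
TARGET = "d\u02b1"  # dʱ

# Variants named in the comment above, all mapped to the single target.
COLLAPSE = {
    "d\u02b0": TARGET,        # dʰ
    "d\u02b1": TARGET,        # dʱ
    "d\u0324\u02b1": TARGET,  # d̤ʱ (d + COMBINING DIAERESIS BELOW + ʱ)
}


def can_collapse(inventory):
    """Collapsing loses no contrast only if at most one variant is present."""
    return sum(seg in COLLAPSE for seg in set(inventory)) <= 1


def collapse(inventory):
    """Map each variant to TARGET, leaving all other segments unchanged."""
    return [COLLAPSE.get(seg, seg) for seg in inventory]


inventory = ["d\u02b0", "t", "n"]
if can_collapse(inventory):
    print(collapse(inventory))
```

The `can_collapse` guard mirrors the reasoning in the comment: merging the symbols is only safe because no inventory contains more than one of them.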

@davidainman if you can't wait for the PR to be merged, you can access the "fixed" data CSV from that PR here: https://raw.githubusercontent.com/drammock/phoible/fix-ea-breathy/data/phoible.csv
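To confirm that a downloaded copy of the CSV no longer contains the three reported segments, something like the sketch below could be used. It assumes the file has the `InventoryID`, `LanguageName`, and `Phoneme` columns used in the queries earlier in this thread; the function name and the in-memory sample are hypothetical stand-ins for a real file.

```python
import csv
import io

# Segments reported in this issue.
SUSPECT = {"d\u02b0", "n\u02b0", "r\u02b0"}  # dʰ, nʰ, rʰ


def affected_inventories(csv_file):
    """Return sorted (InventoryID, LanguageName) pairs with a suspect segment."""
    reader = csv.DictReader(csv_file)
    hits = {
        (int(row["InventoryID"]), row["LanguageName"])
        for row in reader
        if row["Phoneme"] in SUSPECT
    }
    return sorted(hits)


# Hypothetical sample standing in for a downloaded phoible.csv.
sample = io.StringIO(
    "InventoryID,LanguageName,Phoneme\n"
    "2357,Bhojpuri,d\u02b0\n"
    "2357,Bhojpuri,t\n"
    "9999,Example,d\u02b1\n"
)
print(affected_inventories(sample))
```

With a real file, pass `open("phoible.csv", encoding="utf-8", newline="")` instead of the sample; an empty result would indicate that none of the three segments remain.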

macleginn commented 2 years ago

Yes, I allowed some variation in the notation and treated "murmured" as "voiced aspirated", but I should've normalised /dʱ/ to /dʰ/, I think. Breathy voice, however, is treated as a separate category; I am not sure to what extent breathy-voiced stops and murmured stops are the same thing.