Closed davidainman closed 2 years ago
Thanks for the report! I confirm that we do seem to have some incorrect diacritics. However, reopening #102 is not the right move... that issue was specifically about a palatal stop in the Ramaswami data source. If I do
git grep dʰ
on the current repository I don't get any hits inside raw-data/RA; they all seem to be from the EA data source. Looks like we can just update this line:
https://github.com/phoible/dev/blob/master/raw-data/EA/EA_IPA_correspondences.tsv#L190
and it will fix them all (the feature vectors are auto-computed nowadays, so fixing the diacritic should fix the features too). A similar fix will apply to nʰ and rʰ (both only show up in languages from EA). Are there any other cases of voiceless-h-diacritic-on-voiced-glyph you've noticed @davidainman? Or is it just those three?
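For anyone who wants to look for further cases, here is a rough sketch of a glyph-based scan. It is only an illustration: `phonemes` is a toy vector, and `voiced_bases` is a hypothetical, deliberately incomplete list of voiced base glyphs, not anything taken from the PHOIBLE conventions. The scan is glyph-based because the feature columns of the affected segments are themselves what is in question. A prefix test like this would also flag segments such as a devoiced n̥ʰ, so the hits need checking by eye.

```r
# Toy vector of segment strings; real input would be the Phoneme column.
phonemes <- c("dʰ", "tʰ", "nʰ", "kʰ", "rʰ", "bʱ", "a")

# Hypothetical, incomplete list of voiced base glyphs (an assumption for this sketch).
voiced_bases <- c("b", "d", "g", "m", "n", "r", "l")

# TRUE where the segment starts with any of the voiced base glyphs.
has_voiced_base <- Reduce(`|`, lapply(voiced_bases, function(g) startsWith(phonemes, g)))

# Suspect segments: voiced base glyph plus the voiceless aspiration diacritic ʰ.
suspect <- phonemes[has_voiced_base & endsWith(phonemes, "ʰ")]
print(suspect)
```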
Digging a little deeper, this is perhaps not so straightforward. EA/EA_IPA_correspondences.tsv has entries for both dʰ and dʱ, so presumably they're meant to represent different sounds; I can't blindly map dʰ to dʱ. I mean, I could, because there aren't any languages that have both:
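library(dplyr)  # provides %>%, group_by(), filter(), select(), distinct()
# `phoible` is assumed to be the PHOIBLE data frame, already loaded
phoible %>%
group_by(InventoryID) %>%
filter("dʰ" %in% Phoneme & "dʱ" %in% Phoneme) %>%
select(InventoryID, Glottocode, LanguageName, Source) %>%
distinct()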
# A tibble: 0 x 4
# Groups: InventoryID [0]
# … with 4 variables: InventoryID <int>, Glottocode <chr>, LanguageName <chr>, Source <chr>
But I'd prefer to honor the intent and figure out what a proper phoible-style-IPA representation of dʰ ought to be.
FYI, here are the affected inventories:
phoible %>%
group_by(InventoryID) %>%
filter("dʰ" %in% Phoneme | "nʰ" %in% Phoneme | "rʰ" %in% Phoneme) %>%
select(InventoryID, Glottocode, LanguageName, Source) %>%
distinct()
# # A tibble: 9 x 4
# # Groups:   InventoryID [9]
#   InventoryID Glottocode LanguageName         Source
#         <int> <chr>      <chr>                <chr>
# 1        2357 bhoj1244   Bhojpuri             ea
# 2        2454 maga1260   Magahi               ea
# 3        2463 dhim1246   Dhimal               ea
# 4        2498 east2304   Eastern Hill Balochi ea
# 5        2528 kork1243   Korku                ea
# 6        2544 east2347   Tamang               ea
# 7        2550 sant1410   Santali              ea
# 8        2559 mala1464   Malayalam            ea
# 9        2598 assa1263   Assamese             ea
Okay. This was affecting work I was doing with some colleagues using PHOIBLE data. Should I wait for now (do you have time to fix this in the next few weeks) or do an internal transform/ignore these segments?
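If an internal transform is the way to go in the meantime, a minimal sketch could look like the following. The mapping from ʰ to ʱ on these three segments is an assumption based on the discussion above, not a confirmed PHOIBLE convention, and `phoible_toy`, `breathy_fix`, and `recode_breathy` are hypothetical names invented for the example. Note that this only rewrites the Phoneme column; any feature columns for those rows would still carry the old values.

```r
# Toy stand-in for the PHOIBLE table (made-up inventory IDs, two columns only).
phoible_toy <- data.frame(
  InventoryID = c(1, 1, 1, 2, 2),
  Phoneme     = c("dʰ", "tʰ", "a", "nʰ", "rʰ"),
  stringsAsFactors = FALSE
)

# Assumed mapping: voiced glyph + ʰ is rewritten with the breathy diacritic ʱ.
breathy_fix <- c("dʰ" = "dʱ", "nʰ" = "nʱ", "rʰ" = "rʱ")

# Replace only exact matches; every other segment is returned unchanged.
recode_breathy <- function(phonemes, mapping = breathy_fix) {
  hit <- phonemes %in% names(mapping)
  phonemes[hit] <- unname(mapping[phonemes[hit]])
  phonemes
}

phoible_toy$Phoneme <- recode_breathy(phoible_toy$Phoneme)
print(phoible_toy)
```

The alternative of ignoring the segments would be a one-line filter on the same `names(breathy_fix)` set instead of a recode.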
@davidainman I've opened a pull request to fix this issue. In the end I did end up "collapsing" things like dʰ and dʱ (and also d̤ʱ) on the grounds that there were no inventories where more than one of those was present, and after checking the sources of several of the affected languages I didn't find anything to suggest that anything other than breathy voicing was intended (it's possible I'm wrong here, I didn't check every single source). I think this is a case of @macleginn (compiler of EA) preserving the symbols used in the original resources, and us not applying our symbol conventions properly when we imported the EA data. But if I'm wrong about this @macleginn please chime in!
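The "no inventory has more than one variant" condition can be sketched roughly as below. This is an illustration on toy data, not the code from the pull request: the data frame `toy` is made up, and the target symbol dʱ is an assumption (the PR may have settled on a different representation).

```r
# The notational variants being collapsed.
variants <- c("dʰ", "dʱ", "d̤ʱ")

# Toy data with made-up inventory IDs; each inventory uses at most one variant.
toy <- data.frame(
  InventoryID = c(1, 1, 2, 2, 3),
  Phoneme     = c("dʰ", "a", "dʱ", "t", "d̤ʱ"),
  stringsAsFactors = FALSE
)

is_variant <- toy$Phoneme %in% variants

# Number of distinct variants found in each inventory that has any of them.
n_variants <- tapply(
  toy$Phoneme[is_variant],
  toy$InventoryID[is_variant],
  function(p) length(unique(p))
)

# Collapsing loses no contrast only if no inventory uses two or more variants.
safe_to_collapse <- all(n_variants <= 1)

if (safe_to_collapse) {
  toy$Phoneme[is_variant] <- "dʱ"  # assumed target symbol
}
print(toy)
```

If `safe_to_collapse` came out FALSE for some inventory, that inventory would contrast two of the symbols and would need to be checked against its source before merging anything.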
@davidainman if you can't wait for the PR to be merged, you can access the "fixed" data CSV from that PR here: https://raw.githubusercontent.com/drammock/phoible/fix-ea-breathy/data/phoible.csv
Yes, I allowed some variation in the notation and treated "murmured" as "voiced aspirated", but I should've normalised /dʱ/ to /dʰ/, I think. Breathy voice, however, is treated as a separate category; I am not sure to what extent breathy-voiced stops and murmured stops are the same thing.
I have discovered voiced stops with a superscript h (see comment in closed issue #102). These segments also have incorrect feature arrays which make them appear voiceless aspirated rather than voiced breathy.