phoible / dev

PHOIBLE data and development.
https://phoible.org/
GNU General Public License v3.0

Najdi Arabic phoneme inventory is missing items #332

Open camoverride opened 3 years ago

camoverride commented 3 years ago

Compare the ground truth (Wikipedia) with the PHOIBLE Najdi Arabic inventory page.

I also confirmed this by inspecting the appropriate lines in data/phoible.csv

I discovered this with the following SQL query, where I searched for languages lacking phonemes with the + nasal feature (Najdi Arabic will be the last entry returned by this query):

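-- Count, per inventory, the phonemes whose nasal feature is anything other than '-'.
SELECT x.LanguageName, SUM(x.nasal) AS num_nasals
FROM (SELECT InventoryID, LanguageName,
             CASE WHEN nasal = '-' THEN 0 ELSE 1 END AS nasal
      FROM phoible) AS x
GROUP BY x.InventoryID, x.LanguageName
-- Inventories with the fewest nasals come first.
ORDER BY num_nasals ASC
LIMIT 16;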

('Najdi Arabic', 0)
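For anyone who wants to try the query without loading the full CSV, here is a minimal sketch that runs it against an in-memory SQLite table. The table name `phoible` and the columns `InventoryID`, `LanguageName`, and `nasal` come from the query above; the `Phoneme` column, the function name, and all rows are invented for illustration and are not real PHOIBLE data. The sketch also adds `InventoryID` as a tie-breaker in `ORDER BY`, which the original query does not have, so that the result order is deterministic.

```python
import sqlite3

# Toy stand-in for data/phoible.csv: one row per (inventory, phoneme).
# These rows are invented for illustration; they are NOT real PHOIBLE data.
ROWS = [
    (1, "Language A", "m", "+"),
    (1, "Language A", "p", "-"),
    (2, "Language B", "t", "-"),
    (2, "Language B", "k", "-"),
    (3, "Language C", "n", "+"),
    (3, "Language C", "ŋ", "+"),
]

QUERY = """
    SELECT x.LanguageName, SUM(x.nasal) AS num_nasals
    FROM (SELECT InventoryID, LanguageName,
                 CASE WHEN nasal = '-' THEN 0 ELSE 1 END AS nasal
          FROM phoible) AS x
    GROUP BY x.InventoryID, x.LanguageName
    ORDER BY num_nasals ASC, x.InventoryID ASC  -- tie-breaker added in this sketch
    LIMIT ?
"""


def inventories_by_nasal_count(rows, limit=16):
    """Return (LanguageName, num_nasals) pairs, fewest nasals first."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE phoible "
            "(InventoryID INTEGER, LanguageName TEXT, Phoneme TEXT, nasal TEXT)"
        )
        conn.executemany("INSERT INTO phoible VALUES (?, ?, ?, ?)", rows)
        return conn.execute(QUERY, (limit,)).fetchall()
    finally:
        conn.close()


counts = inventories_by_nasal_count(ROWS)
for name, num_nasals in counts:
    print(name, num_nasals)
```

Inventories that report zero nasals sort to the top, which is how an entry like the Najdi Arabic one stands out.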

bambooforest commented 3 years ago

@camoverride -- thanks for pointing this out and sending some reproducible code. i'll look into it.

drammock commented 3 years ago

@camoverride I'm chiming in here to provide a clarification: "ground truth" in this case is not Wikipedia, but rather Ingram 1994 (https://phoible.org/sources/67053). Phoneme inventories in PHOIBLE are not meant to represent a language but rather a particular instance of language documentation. It certainly can happen that we make a mistake in converting the analysis in Ingram 1994 into a PHOIBLE entry, but if the "mistake" here is that Ingram disagrees with other scholars about Najdi Arabic's phonology, that is a disagreement that we're interested in preserving.

bambooforest commented 3 years ago

This issue is a bit more complicated than that, I think. I looked into the grammar by Ingram and indeed it does not list nasals among its consonants, but you nevertheless find them in the word forms in the grammar.

After some discussion with @macleginn (this particular inventory comes from an earlier version of EURPhon, https://eurphon.info/languages/html?lang_id=135), he told me that he had encountered systematic omission of nasals (as well as liquids and rhotics) in descriptions of Arabic dialects.

He put it rather succinctly to me: "There are doculects that demand some amount of hermeneutics, unfortunately."

A point for discussion.

camoverride commented 3 years ago

Gotcha, Ingram as ground truth definitely makes sense - and it's reasonable not to want to play around with the source material too much.

However, have you all developed a general strategy for tracking known "errors" in Ingram? Maybe it could act as a secondary source to augment phoible?

xrotwang commented 3 years ago

Just a somewhat technical note regarding a "secondary source to augment phoible": I think that's a good idea - some sort of curated errata. And that's exactly one of the use cases we had in mind when designing CLDF to allow for easy merging: such an errata list could be distributed in the same overall format as PHOIBLE, and then be transparently used to override PHOIBLE data in specified cases.

However, in this particular case, I'd be a bit hesitant. I think the strength of PHOIBLE lies in its being principled and complete. So for any use case that looks at all of PHOIBLE, an augmented PHOIBLE would also have to be complete so as not to diminish that strength. "Some amount of hermeneutics" doesn't really sound like systematic errata that can be fixed wholesale.
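To make the errata idea concrete, here is a rough sketch of an override step in plain Python. It does not use any CLDF tooling, and everything in it is an assumption made for illustration: the row layout, the `Action` and `Origin` fields, the function name `apply_errata`, and the inventory IDs and phonemes.

```python
# Hypothetical rows: real PHOIBLE/CLDF tables carry many more columns.
base_rows = [
    {"InventoryID": "100", "Phoneme": "b"},
    {"InventoryID": "100", "Phoneme": "t"},
    {"InventoryID": "200", "Phoneme": "m"},
]

# Hypothetical errata list in a similar row format, with an explicit action.
errata_rows = [
    {"InventoryID": "100", "Phoneme": "m", "Action": "add"},
    {"InventoryID": "200", "Phoneme": "m", "Action": "remove"},
]


def apply_errata(base, errata):
    """Return new rows with errata applied; the base rows are left untouched."""
    # Key rows by (inventory, phoneme) and record where each row came from.
    merged = {
        (row["InventoryID"], row["Phoneme"]): dict(row, Origin="source")
        for row in base
    }
    for entry in errata:
        key = (entry["InventoryID"], entry["Phoneme"])
        if entry["Action"] == "add":
            # Only add if the source does not already list this phoneme.
            merged.setdefault(
                key,
                {
                    "InventoryID": entry["InventoryID"],
                    "Phoneme": entry["Phoneme"],
                    "Origin": "errata",
                },
            )
        elif entry["Action"] == "remove":
            merged.pop(key, None)
        else:
            raise ValueError(f"unknown errata action: {entry['Action']!r}")
    return list(merged.values())


augmented = apply_errata(base_rows, errata_rows)
```

Keeping an `Origin` marker on every row means a user can still recover the unmodified doculect by filtering on `Origin == "source"`, which matters if the source's own analysis is what should be preserved.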

bambooforest commented 3 years ago

@camoverride -- as for tracking errors, it depends on what one means by errors. As @drammock notes above, inventories in PHOIBLE reflect doculects, and in this case, at least for the missing nasals in Ingram's grammar, we would still be true to the original source because it does not list them among the consonants.

We have always had the issue of systematic gaps in full database sources, e.g. UPSID purposely contains no tones. But since this inventory is from EURPhon, if it gets updated by their editors to address systematic gaps in certain areal linguistic practice (e.g. some Semitic language descriptions systematically leaving out nasals, etc.) then EURPhon becomes more like UPSID in the sense that multiple doculects may be used for a single inventory and some typologicalization may occur.

I think we will need to be clearer about such cases in our documentation moving forward, especially if some source editors identify systematic gaps and fill them without attributing multiple doculects.

xrotwang commented 3 years ago

Yes, PHOIBLE already being a second-level aggregator makes things trickier. And since one of the big advantages of PHOIBLE is machine-readable data, it would be nice if documentation about systematic gaps, along the lines of "no tones in UPSID", were machine-readable too. But I don't really have an idea how to do that. There doesn't seem to be established terminology for "complete inventory", "complete inventory without loans", or "complete inventory without tones" that could serve as a basis for some sort of ontology.
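As a purely illustrative starting point, machine-readable gap documentation could be as small as a controlled list of gap categories plus a per-source declaration. The category names below are lifted from the phrases in this comment, the UPSID entry reflects the "no tones in UPSID" remark in this thread, and everything else (the `Gap` enum, `DECLARED_GAPS`, `sources_covering`, and the `ExampleSource` entry) is hypothetical and not an existing PHOIBLE or CLDF vocabulary.

```python
from enum import Enum


class Gap(Enum):
    """Hypothetical categories of content a source systematically leaves out."""

    TONES = "tones"
    LOANS = "loan phonemes"


# Per-source declarations of known systematic gaps.
DECLARED_GAPS = {
    "UPSID": frozenset({Gap.TONES}),  # "no tones in UPSID", as noted in this thread
    "ExampleSource": frozenset(),     # hypothetical source with no declared gaps
}


def sources_covering(needed, declared_gaps=DECLARED_GAPS):
    """Return, sorted by name, the sources declaring no gap in any needed category."""
    needed = frozenset(needed)
    return sorted(
        name for name, gaps in declared_gaps.items() if not (gaps & needed)
    )


tone_safe = sources_covering({Gap.TONES})
loan_safe = sources_covering({Gap.LOANS})
```

An empty declaration only means that no gap has been recorded, not that the source is known to be complete, so a real scheme would probably need a separate "not assessed" state.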

bambooforest commented 3 years ago

Sounds like an ontology is in order. :)

xrotwang commented 3 years ago

@bambooforest you're the aggregator, you get to choose the categories :)

bambooforest commented 3 years ago

@xrotwang sounds good. And you have / will have a place for them in CLDF :)