[Orthography] Ulithian: many duplicate forms from two different sources

LinguList commented 3 years ago

Screenshot 2021-08-25 at 13-51-26 EDICTOR

LinguList commented 3 years ago

The problem with this outlier in the data (45 consonants!) is that it seems to have been compiled from two different sources using two different alphabets. It seems this can only be resolved manually.

LinguList commented 3 years ago

Should we exclude it for now from the data?

maryewal commented 3 years ago

Do we have two separate lists for it or are these sources combined into one set? If there is just one set, it probably is okay to simply exclude - we have decent coverage of other Micronesian languages and, as far as I can tell, this one is not super unique phonologically.

SimonGreenhill commented 3 years ago

It's all from a single source: https://abvd.shh.mpg.de/austronesian/language.php?id=1180. I suspect what has happened is that there is a mix of IPA and orthographic forms. I don't have this dictionary on hand, but Wikipedia says:

Ulithian has eight vowels which is a large amount for a Pacific language. They are /i/, /u/, /e/, /ə/, /ɔ/, /æ/, /ɐ/, /a/. They are spelled i, u, e, oe or ȯ, o, ae or ė, oa or a, a or ȧ.

...so maybe forms with 'oe', 'ȯ', 'ae' could be removed, e.g. the first entry here:

48 | to sleep | maesoer |   | 1 |  
48 | to sleep | mawsur |   | 1 |
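A minimal sketch of the filter suggested above, assuming that the spellings 'oe', 'ȯ' and 'ae' reliably mark an orthographic form. That assumption is untested, and the function name `looks_orthographic` is made up for illustration:

```python
import unicodedata

# Spellings named in the comment above. Treating them as markers of an
# orthographic form is an assumption, not a verified rule for Ulithian.
ORTH_MARKERS = tuple(unicodedata.normalize("NFC", m) for m in ("oe", "ȯ", "ae"))


def looks_orthographic(form: str) -> bool:
    """Return True if the form contains one of the assumed orthographic spellings."""
    normalized = unicodedata.normalize("NFC", form.lower())
    return any(marker in normalized for marker in ORTH_MARKERS)


# The two forms listed for "to sleep" in the source.
forms = ["maesoer", "mawsur"]
flagged = [form for form in forms if looks_orthographic(form)]
```

Here `flagged` would contain only `maesoer`. Forms that use none of these spellings would need a manual check.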

It would be good not to exclude it (the more languages covered the better for the phylogeny).

maryewal commented 3 years ago

I see, so there are essentially double forms for every concept? In that case, it probably wouldn't take too long to just manually exclude one of the two, since the IPA forms will be pretty obvious. @LinguList do you think that would solve the issue?

LinguList commented 3 years ago

@maryewal, I just discussed a potential way to proceed with @antipodite, which consists in adding a new file etc/ignore.tsv, in which we list the language (by ID), and the forms with their original VALUE to be excluded. We can then blacklist the entries.

What would this entail? It is in fact not very difficult, I think:

open cldf/forms.csv in Excel or LibreOffice
copy-paste only the forms for Ulithian
extract only the two columns with Value and Language_ID
quickly go over the data manually and kick out the word forms we want to retain (the good ones), so that only the forms to exclude remain
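The first three steps could also be scripted rather than done in a spreadsheet. This is a rough sketch, assuming the forms table has `Language_ID` and `Value` columns and that Ulithian's language ID is `Ulithian`; the helper name `candidate_rows` and the sample rows (apart from the two "to sleep" forms) are invented for illustration:

```python
import csv
import io


def candidate_rows(handle, language_id):
    """Yield (Language_ID, Value) pairs for one language from a forms table."""
    for row in csv.DictReader(handle):
        if row["Language_ID"] == language_id:
            yield row["Language_ID"], row["Value"]


# Stand-in for open("cldf/forms.csv", newline="", encoding="utf-8").
# The IDs and the last row are placeholders, not real data.
sample = io.StringIO(
    "ID,Language_ID,Parameter_ID,Value\n"
    "1,Ulithian,sleep,maesoer\n"
    "2,Ulithian,sleep,mawsur\n"
    "3,OtherLanguage,sleep,placeholder\n"
)

rows = list(candidate_rows(sample, "Ulithian"))
```

The resulting pairs still have to be reviewed by hand to decide which form of each pair goes into the ignore list.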

@antipodite, would you be able to do a quick check of this very language to see how long this takes? If it takes no more than 20-30 minutes, I think it is worth it, and we could later even outsource this work to student assistants.

LinguList commented 3 years ago

Ah, just to add this: since we discussed this with @antipodite on another matter, it means that this is not the only case, so worth checking how well that workflow works.

antipodite commented 3 years ago

OK, on it. So placing the non-IPA forms in ignore.tsv as we discussed

antipodite commented 3 years ago

Not obvious to me which is which. @maryewal can you have a look at this? Screenshot 2021-08-26 at 14 04 16

maryewal commented 3 years ago

yep, not as obvious as I'd hoped without clear IPA. Simon is definitely right that one entry is probably orthography and the other some sort of pronunciation guide. This is because the data is based on a 2010 dictionary for students. It is partially online https://www.yumpu.com/en/document/read/11736907/ulithian-english-dictionary-habele, where we can see a pronunciation guide is the second entry in parentheses. So, perhaps get rid of all values that correspond to the orth. entries in the dictionary.

The full book does have a section on "orthography" and another on "spelling and pronunciation" but I can't find access to it. In many cases, we will probably be able to guess the right sound from what is written for pronunciation (eg. ngal is probably ŋal), but I'm not totally comfortable making assumptions for the whole set...

LinguList commented 3 years ago

I'd say that even if you make wrong decisions, as long as you preserve only one form, you enhance the data in many ways. There are obvious markers of certain pronunciation distinctions, like two vowels oo or th etc., so singling out these cases first, then reordering, etc., should help to narrow this down.

maryewal commented 3 years ago

Noted, @LinguList - let's see how far we can get, then! @antipodite do you want to do a first removal of the "orthographic" forms, based on the dictionary? Meanwhile, I can come up with likely distinctions.

antipodite commented 3 years ago

OK, I filter forms.csv to Ulithian and then sort ascending by ID. Now it looks like we have pairs (mostly, some triples) where the first element is the orthographic form and the second is the pronunciation guide. I will shortly attach a modified Ulithian spreadsheet with my judgement of orth. vs pronunciation guide forms marked in a new column, so you can check them too, @maryewal. Then I can just filter out the ones we don't want and put them in ignore.tsv

antipodite commented 3 years ago

Here it is. 1 in the "Orth." column means this is orthography, empty cell means either pronunciation guide or orth and pronunciation guide are the same. I cross checked with the dictionary. Note that some have multiple guide pronunciations

ulithian-orthography-judgments.csv

LinguList commented 3 years ago

So which is the one you'd retain? The orth?

LinguList commented 3 years ago

Ah, @antipodite, in order to get this rolling, can you now post the words to REMOVE to a file etc/ignore.tsv, where you give me three values plus a comment, as discussed before, e.g., for a form gumchiu (which I just invented):

Language_ID	Parameter_Name	Value	Comment
Ulithian	hand	gumchiu	duplicate

This TSV file would then serve as the basis to exclude entries from being listed as "normal" entries.
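As a rough sketch, a file in this layout could be written with Python's standard `csv` module. The helper name `write_ignore_table` and the fixed comment `duplicate` are assumptions for illustration, and the single entry is the invented example from above:

```python
import csv
import io

COLUMNS = ["Language_ID", "Parameter_Name", "Value", "Comment"]


def write_ignore_table(handle, entries, comment="duplicate"):
    """Write (language, concept, value) triples as a tab-separated ignore list."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for language_id, parameter_name, value in entries:
        writer.writerow([language_id, parameter_name, value, comment])


# Stand-in for open("etc/ignore.tsv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_ignore_table(buffer, [("Ulithian", "hand", "gumchiu")])
```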

antipodite commented 3 years ago

@mattis: Done, check pull requests. I put the pronunciation guide forms in for now, I think this is the better option to ignore as often English words are used as part of the pronunciation guide which would probably screw with the orth profile algo. Regardless I think the orthography profile will need quite a bit of manual correction, as the orthography of this language is somewhat quirky: d -> [θ] or [ð], e -> [i], some consonants seem to be written but not pronounced, etc

antipodite commented 3 years ago

want me to have a go at plumbing in ignore.tsv? Seems like you would just filter against ignore.tsv in the def cmd_makecldf(self, args): fn definition in lexibank_abvdoceanic.py

maryewal commented 3 years ago

@antipodite, based on what you say, it seems sensible to remove the pronunciation guide forms. Do you still want me to have a look at this?

LinguList commented 3 years ago

So we insert an if-else check before this line:

https://github.com/lexibank/abvdoceanic/blob/8fb24c8d9c455e4e2e502b0b2bfe8154941463d8/lexibank_abvdoceanic.py#L105-L117

@antipodite, before this line, you can check for same value, language ID and concept id:

lid = slug(wl.language.name, lowercase=False)
# `ignored` is keyed by (language, concept, value) tuples, so test membership with a tuple
if (lid, cid, entry.name) in ignored:
    continue

Before, e.g., right after def cmd_makecldf you load the ignored list:

ignored = {(row[0], row[1], row[2]): row[3] for row in self.etc_dir.read_csv("ignore.tsv", delimiter="\t")}

You have to play around with this, as I did not test it, but it should work along those lines, and I will gladly review it if you make another PR and assign me as a reviewer.
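For trying the idea out independently of the dataset script, here is a self-contained sketch using only the standard library. It assumes the four-column layout of ignore.tsv proposed earlier and matches on the concept name rather than a concept ID; the helper names `load_ignored` and `keep_entries` are hypothetical and not part of pylexibank:

```python
import csv
import io


def load_ignored(handle):
    """Map (Language_ID, Parameter_Name, Value) to the Comment column."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        (row["Language_ID"], row["Parameter_Name"], row["Value"]): row["Comment"]
        for row in reader
    }


def keep_entries(entries, ignored):
    """Drop every (language, concept, value) triple listed in the ignore table."""
    return [entry for entry in entries if entry not in ignored]


# Stand-in for the contents of etc/ignore.tsv (invented example entry).
ignore_tsv = io.StringIO(
    "Language_ID\tParameter_Name\tValue\tComment\n"
    "Ulithian\thand\tgumchiu\tduplicate\n"
)
ignored = load_ignored(ignore_tsv)

entries = [("Ulithian", "hand", "gumchiu"), ("Ulithian", "to sleep", "mawsur")]
kept = keep_entries(entries, ignored)
```

Because `csv.DictReader` consumes the header row, the header never ends up as a key in the lookup table.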

antipodite commented 3 years ago

@maryewal I think we should just go ahead with removing the guide forms. No need to look at it again. @mattis cool, I'll have a look some time tomorrow. I guess it would be worth checking the generated profiles for other Micronesian languages too, as I recall Pohnpeian, Woleaian etc. have similarly quirky orthographies

maryewal commented 2 years ago

Can we close this?

antipodite commented 2 years ago

yup, we sorted this iirc

lexibank / abvdoceanic

[Orthography] Ulithian: many duplicate forms from two different sources #16