sigmorphon / 2020

SIGMORPHON 2020 Shared Task: Grapheme-to-Phoneme, Unsupervised Induction of Morphology, and Typologically Diverse Morphological Inflection

Task 1: Inconsistent use of IPA symbols in Georgian #8

besou closed this issue 3 years ago

besou commented 4 years ago

I have noticed that in the training (and dev) data for Georgian, there are three phonemes that are each represented by two IPA symbols without consistency (and distributed roughly 50/50): i ~ ɪ; x ~ χ; ɣ ~ ʁ

These differences are not phonemic. For instance, none of the phoneme inventories on Phoible assumes a phonemic contrast between these sounds (see https://phoible.org/languages/nucl1302). However, some inventories on Phoible use [i] as a phoneme and others [ɪ], and similarly for the other pairs. Notably, none of the inventories lists the other IPA symbol of a pair (the one not chosen as the phoneme) as an allophone. For instance, inventory 'SPA 5' has [ɪ̠] as an allophone of [ɪ], but [i] appears nowhere in the inventory. This indicates that these symbols are "allographs" for mapping Georgian to IPA rather than actual allophones in Georgian.

It is of course possible that the data on Phoible is incomplete and that these sounds are still actual allophones in Georgian. However, there is strong evidence in the training and dev data that the distribution of these IPA symbols is at least close to random, as the following (non-exhaustive) examples from the training set illustrate (note that the Georgian character behind the IPA symbol in question is always the same for both putative allophones):

ɣ ~ ʁ
აღნიშვნა ɑ ʁ n ɪ ʃ v n ɑ
აღნიშნული ɑ ɣ n i ʃ n u l i
დღეები d ɣ ɛ ɛ b i
დღემ d ʁ ɛ m
დღეს d ɣ ɛ s
ღირს ɣ i r s
ღირსება ʁ ɪ r s ɛ b ɑ
ღირსეულად ʁ ɪ r s ɛ u l ɑ d
ღირსეული ɣ i r s ɛ u l i
ღრმა ɣ r m ɑ
ღრმად ʁ r m ɑ d

x ~ χ
ახალ ɑ χ ɑ l
ახალგაზრდა ɑ x ɑ l ɡ ɑ z r d ɑ
ახალგაზრდობა ɑ x ɑ l ɡ ɑ z r d ɔ b ɑ
ახალგაზრდული ɑ χ ɑ l ɡ ɑ z r d u l ɪ
ახლა ɑ x l ɑ
ახლად ɑ χ l ɑ d
ახლანდელი ɑ χ l ɑ n d ɛ l ɪ
ახლახან ɑ χ l ɑ χ ɑ n
გადაიხდიან ɡ ɑ d ɑ ɪ χ d ɪ ɑ n
გადაიხდის ɡ ɑ d ɑ i x d i s
განსხვავება ɡ ɑ n s x v ɑ v ɛ b ɑ
განსხვავებით ɡ ɑ n s χ v ɑ v ɛ b ɪ tʰ
განსხვავებული ɡ ɑ n s x v ɑ v ɛ b u l i
ვერცხლი v ɛ r t s x l i
ვერცხლით v ɛ r t s χ l ɪ tʰ
ვერცხლის v ɛ r t s χ l ɪ s
თანახმა tʰ ɑ n ɑ x m ɑ
თანახმად tʰ ɑ n ɑ χ m ɑ d
იხდი ɪ χ d ɪ
იხდიდა ɪ χ d ɪ d ɑ
იხდის i x d i s
მოსახლეობა m ɔ s ɑ x l ɛ ɔ b ɑ
მოსახლეობამ m ɔ s ɑ χ l ɛ ɔ b ɑ m
მოსახლეობას m ɔ s ɑ χ l ɛ ɔ b ɑ s
მოსახლეობის m ɔ s ɑ χ l ɛ ɔ b ɪ s

i ~ ɪ
ბიბლიოთეკა b i b l i ɔ tʰ ɛ kʼ ɑ
ბიბლიოთეკარი b ɪ b l ɪ ɔ tʰ ɛ kʼ ɑ r ɪ
ბიძა b i d z ɑ
ბიძინა b ɪ d z ɪ n ɑ
გაბრაზებული ɡ ɑ b r ɑ z ɛ b u l i
გაბრუებული ɡ ɑ b r u ɛ b u l ɪ
გაკეთებული ɡ ɑ kʼ ɛ tʰ ɛ b u l i
გაკვირვებული ɡ ɑ kʼ v ɪ r v ɛ b u l ɪ
ინგლისელი ɪ n ɡ l ɪ s ɛ l ɪ
ინგლისი i n ɡ l i s i
ინგლისის ɪ n ɡ l ɪ s ɪ s
ირლანდია i r l ɑ n d i ɑ
ირლანდიური ɪ r l ɑ n d ɪ u r ɪ
ისტორია i s tʼ ɔ r i ɑ
ისტორიამ ɪ s tʼ ɔ r ɪ ɑ m
ისტორიას ɪ s tʼ ɔ r ɪ ɑ s
ისტორიით ɪ s tʼ ɔ r ɪ ɪ tʰ
იხდი ɪ χ d ɪ
იხდიდა ɪ χ d ɪ d ɑ
იხდის i x d i s

About 75% of the training samples contain at least one of the IPA symbols in question, which makes this a serious problem. If the choice between the two symbols of a pair is not conditioned on context (and the examples above clearly show that it is not), it is impossible to predict them better than chance.
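For anyone who wants to reproduce these numbers, here is a minimal sketch of how the variant counts and the share of affected samples could be computed. It assumes the data has already been loaded as (word, pronunciation) pairs with space-separated IPA segments; the name `variant_stats` and the tiny sample list (taken from the examples in this issue) are only illustrative, not part of the task tooling.

```python
from collections import Counter

# The three symbol pairs flagged in this issue.
PAIRS = [("i", "ɪ"), ("x", "χ"), ("ɣ", "ʁ")]


def variant_stats(samples):
    """Return (per-symbol token counts, share of samples with a flagged symbol).

    `samples` is an iterable of (word, pronunciation) pairs, where the
    pronunciation is a string of space-separated IPA segments.
    """
    flagged = {sym for pair in PAIRS for sym in pair}
    counts = Counter()
    affected = 0
    total = 0
    for _word, pron in samples:
        total += 1
        # Compare whole segments, so "tʰ" is never confused with a flagged symbol.
        hits = [seg for seg in pron.split() if seg in flagged]
        counts.update(hits)
        if hits:
            affected += 1
    share = affected / total if total else 0.0
    return counts, share


# A handful of entries copied from this issue (hypothetical mini data set).
samples = [
    ("ღირს", "ɣ i r s"),
    ("ღირსება", "ʁ ɪ r s ɛ b ɑ"),
    ("ახლა", "ɑ x l ɑ"),
    ("ახლად", "ɑ χ l ɑ d"),
    ("დღემ", "d ʁ ɛ m"),
    ("ცოტას", "t s ɔ tʼ ɑ s"),
]
counts, share = variant_stats(samples)
for left, right in PAIRS:
    print(f"{left}: {counts[left]}  {right}: {counts[right]}")
print(f"share of affected samples: {share:.2f}")
```

Run on the full training file, the per-pair counts would show whether the split really is close to 50/50.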

One beautiful thing about Georgian is that it has a completely phonemic alphabet. There is exactly one character for each phoneme, and hence a completely regular mapping is possible in principle (just like in Hungarian, where the baseline scores are already very strong because of its phonemic orthography). With the given dataset, Georgian looks like a hard language to tackle. That impression is misleading: it is in fact one of the easiest languages for the grapheme-to-phoneme task.
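To illustrate how regular the mapping is, here is a rough sketch of a pure lookup-table converter. The table covers only the letters attested in the examples in this issue, so it is deliberately incomplete, and it picks i, x and ɣ for the three inconsistent pairs, which is my assumption rather than a settled choice. The names `GEORGIAN_TO_IPA` and `g2p` are made up for this example.

```python
# Letter-to-IPA table, restricted to letters that occur in the examples in
# this issue. Letters not listed here are intentionally left out.
# Assumption: i, x and ɣ are used as the single symbol for each pair.
GEORGIAN_TO_IPA = {
    "ა": "ɑ", "ბ": "b", "გ": "ɡ", "დ": "d", "ე": "ɛ", "ვ": "v",
    "ზ": "z", "თ": "tʰ", "ი": "i", "კ": "kʼ", "ლ": "l", "მ": "m",
    "ნ": "n", "ო": "ɔ", "პ": "pʼ", "რ": "r", "ს": "s", "ტ": "tʼ",
    "უ": "u", "ქ": "kʰ", "ღ": "ɣ", "ყ": "qʼ", "შ": "ʃ",
    "ც": "t s", "ძ": "d z", "ხ": "x",
}


def g2p(word):
    """Convert a Georgian word to space-separated IPA segments by lookup."""
    segments = []
    for char in word:
        if char not in GEORGIAN_TO_IPA:
            # Fail loudly instead of guessing a letter the table does not cover.
            raise ValueError(f"no mapping for {char!r}")
        segments.append(GEORGIAN_TO_IPA[char])
    return " ".join(segments)


print(g2p("ვერცხლი"))  # v ɛ r t s x l i
print(g2p("ქვეყნად"))  # kʰ v ɛ qʼ n ɑ d
```

No context is needed at any point: each letter is converted on its own, which is why a consistent transcription should make this language easy for any model.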

I kindly ask you to clean the training (dev, test) data for Georgian by choosing only one of the IPA symbols above for each pair and replacing the other one with it. As you can see on Phoible and in any grammar book on Georgian, no phonemic contrast, and most likely not even an allophonic contrast, will get lost by doing this.
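The cleaning itself could be as simple as the following sketch. Which symbol of each pair to keep is an open choice; the `CANONICAL` table below (keeping i, x and ɣ) is just one possibility. I also assume here that each data line has the form word, tab, space-separated segments; `normalize_pron` and `normalize_line` are hypothetical helper names.

```python
# Assumed canonical choice per pair: variant on the left is rewritten to the
# symbol on the right. The opposite choice would work equally well.
CANONICAL = {"ɪ": "i", "χ": "x", "ʁ": "ɣ"}


def normalize_pron(pron):
    """Replace each non-canonical segment with its canonical counterpart."""
    # Work on whole segments so that multi-character segments stay untouched.
    return " ".join(CANONICAL.get(seg, seg) for seg in pron.split())


def normalize_line(line):
    """Normalize one 'word<TAB>pronunciation' line (assumed file layout)."""
    word, pron = line.rstrip("\n").split("\t")
    return f"{word}\t{normalize_pron(pron)}"


print(normalize_pron("ʁ ɪ r s ɛ b ɑ"))  # ɣ i r s ɛ b ɑ
print(normalize_line("ახლად\tɑ χ l ɑ d\n"))
```

After this pass, the two transcriptions of ღირს and ღირსება given above would begin with the same segments (ɣ i r s), as they should.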

PS: In addition to the topic above, there is one obscure sample in the training set which contains three phonemes (t, ʊ, ɾ) not found anywhere else in either the training or the dev set: თურქული t ʊ ɾ kʰ ʊ l i. This should be changed to tʰ u r kʰ u l i for consistency. In addition, the following samples either have a sound missing (first two) or contain one sound that should not be there (third and fourth):

| orthography | incorrect | corrected |
| --- | --- | --- |
| გილოცავთ | ɡ i l ɔ t s ɑ v | ɡ i l ɔ t s ɑ v tʰ |
| ცოტას | t s ɔ tʼ ɑ | t s ɔ tʼ ɑ s |
| უპრეცედენტო | u pʼ r ɛ t s ɛ n d ɛ n tʼ ɔ | u pʼ r ɛ t s ɛ d ɛ n tʼ ɔ |
| ქვეყნად | kʰ v ɛ qʼ ɑ n ɑ d | kʰ v ɛ qʼ n ɑ d |

kylebgorman commented 4 years ago

Thanks for this detailed report. If you'd like to coordinate this cleaning effort:

If you can get this done in the next few days (much beyond that, I think it will be too late to make the change), then we can re-run the scrape and percolate the changes down to the task 1 data sets.

I created an issue over at the Wikipron repo to track this.