besou closed this issue 3 years ago
Thanks for this detailed report. If you'd like to coordinate this cleaning effort:
1) run
wikipron --no-segment georgian > geo.tsv
(may take a half hour or so) to collect the full list of Wiktionary pronunciations as of today
2) add a third column for all the forms that need to be changed.
Then we can try to figure out how to do a bulk upload to Wiktionary using that list. If you can get this done in the next few days (much beyond that, I think it's too late to make the change), then we can re-run the scrape and percolate the changes down to the task 1 data sets.
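For step 2, here is a minimal sketch of how the third column could be filled in automatically. It assumes `geo.tsv` has two tab-separated columns (word, pronunciation). Which symbol of each pair to keep has not been decided in this thread, so the `CANONICAL` mapping below is an arbitrary placeholder, and the function names are hypothetical.

```python
# Assumption: keep i, x, ɣ and rewrite ɪ, χ, ʁ. The choice is arbitrary
# and only serves as an illustration; swap the mapping as needed.
CANONICAL = {"ɪ": "i", "χ": "x", "ʁ": "ɣ"}
_TABLE = str.maketrans(CANONICAL)


def normalize(pron):
    """Replace the non-canonical symbol of each pair with the canonical one."""
    return pron.translate(_TABLE)


def add_third_column(lines):
    """Return TSV rows with a third column holding the corrected form.

    The third column is left empty when the pronunciation needs no change.
    """
    out = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed rows
        word, pron = parts[0], parts[1]
        fixed = normalize(pron)
        out.append("\t".join([word, pron, fixed if fixed != pron else ""]))
    return out


def rewrite_file(src_path, dst_path):
    """Read a two-column TSV and write the three-column version."""
    with open(src_path, encoding="utf-8") as src:
        rows = add_third_column(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(rows) + "\n")
```

Calling `rewrite_file("geo.tsv", "geo_fixed.tsv")` would then produce the list, with only the rows that actually change carrying a value in the third column.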
I created an issue over at the Wikipron repo to track this.
I have noticed that in the training (and dev) data for Georgian, there are three phonemes that are each represented by two IPA symbols without consistency (and distributed roughly 50/50): i ~ ɪ; x ~ χ; ɣ ~ ʁ
These differences are not phonemic. For instance, none of the phoneme inventories on Phoible assumes a phonemic contrast between these sounds (see https://phoible.org/languages/nucl1302). However, some inventories on Phoible use [i] as a phoneme, and others [ɪ], and similarly for the other pairs. Notably, none of the inventories mentions the other IPA symbol of each pair (the one not chosen as a phoneme) as an allophone (for instance, inventory 'SPA 5' has [ɪ̠] as an allophone of [ɪ], but no [i], and there is no [i] anywhere in the inventory). This is an indication that these symbols are "allographs" for mapping Georgian to IPA rather than actual allophones in Georgian.
It might of course be possible that the data on Phoible is not complete, and that these sounds are still actual allophones in Georgian. However, there is strong evidence in the training and dev data that the distribution of these IPA symbols is at least close to random, as the following (non-exhaustive) examples from the training set illustrate (note that the Georgian character for the IPA symbol in question is always identical for both arguable allophones):
About 75% of the training samples contain at least one of the IPA symbols in question, which makes it a serious problem. If the allophones do not have a distribution conditioned on the context (and this is clearly the case in the examples above), it is impossible to predict them better than by chance.
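The counts behind these figures can be re-checked with something like the sketch below. It assumes the pronunciations are available as strings (for example the second column of the task TSV files) and simply counts the raw symbols; the function name is hypothetical.

```python
from collections import Counter

# The three symbol pairs discussed above.
PAIRS = [("i", "ɪ"), ("x", "χ"), ("ɣ", "ʁ")]
SYMBOLS = [s for pair in PAIRS for s in pair]


def symbol_stats(prons):
    """Count each symbol and the share of samples containing any of them."""
    counts = Counter()
    total = 0
    affected = 0
    for pron in prons:
        total += 1
        hit = False
        for sym in SYMBOLS:
            n = pron.count(sym)
            if n:
                counts[sym] += n
                hit = True
        if hit:
            affected += 1
    share = affected / total if total else 0.0
    return counts, share
```

Comparing `counts["i"]` with `counts["ɪ"]` (and likewise for the other two pairs) shows how evenly each pair is split, and `share` is the fraction of samples that contain at least one of the six symbols.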
One beautiful thing about Georgian is that it has a completely phonemic alphabet. There is exactly one character for each phoneme in the alphabet, and hence a completely regular mapping is possible in principle (just like in Hungarian, where the baseline scores are already very strong because of its phonemic orthography). With the given dataset, Georgian appears to be a hard language to tackle. This is not true: it is in fact one of the easiest languages for the grapheme-to-phoneme task.
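Since the mapping should be one character to one phoneme, the inconsistency can be made visible with a simple positional check. This is only a rough sketch: it assumes the pronunciations are space-separated phone sequences (as in the task data), aligns letters and phones by position, and skips any entry whose lengths differ. The function names are hypothetical.

```python
from collections import defaultdict


def phones_per_grapheme(pairs):
    """Collect the set of phones observed for each letter.

    pairs: iterable of (word, space-separated phones).
    Entries where the number of letters and phones differ are skipped,
    because a positional alignment is not possible for them.
    """
    seen = defaultdict(set)
    skipped = 0
    for word, pron in pairs:
        phones = pron.split()
        if len(word) != len(phones):
            skipped += 1
            continue
        for letter, phone in zip(word, phones):
            seen[letter].add(phone)
    return dict(seen), skipped


def ambiguous(seen):
    """Letters that are mapped to more than one phone."""
    return {g: sorted(p) for g, p in seen.items() if len(p) > 1}
```

With a fully regular mapping, `ambiguous(...)` would come back empty. The skipped entries are also worth a look by hand, since a length mismatch can point to a missing or extra sound.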
I kindly ask you to clean the training (dev, test) data for Georgian by choosing only one of the IPA symbols above for each pair, and replacing the other one with the chosen one. As you can see on Phoible and in any grammar book on Georgian, no phonemic contrast will get lost, and most likely not even an allophonic contrast, by doing this.
PS: In addition to the topic above, there is one obscure sample in the training set which contains three phonemes (t, ʊ, ɾ) not found anywhere else in either the training or the dev set: თურქული t ʊ ɾ kʰ ʊ l i. This should be changed to tʰ u r kʰ u l i for consistency. In addition, the following samples have either a sound missing (first two) or one sound that should not be there (third and fourth):