Inconsistency in Lithuanian data

sigmorphon / 2020

SIGMORPHON 2020 Shared Task: Grapheme-to-Phoneme, Unsupervised Induction of Morphology, and Typologically Diverse Morphological Inflection

Inconsistency in Lithuanian data #2

Open Nofenigma opened 4 years ago

Nofenigma commented 4 years ago

[Here I will omit my suspicions on Wiktionary data in general and on their transcriptions in particular.] I have noticed that the diphthong ie is treated somewhat oddly in the data, as yot (j) almost always appears there. Still, there is some inconsistency even there (lit_dev.tsv), cf. lietuvis (lʲ j ɛ t ʊ ʋʲ ɪ s) and pieštukus (pʲ i ɛ ʃ t ʊ k ʊ s).

See also the table from the comprehensive "Lithuanian Grammar" by Ambrazas et al. (1997):

Check also the differences between liepsna and lietuviškai, pietūs (pʲ ɪ ɛ t̪ uː s̪) and pieštukai (pʲ i ɛ ʃ t ʊ k ɐ j).

Moreover, I suspect that some diacritics in the data are used only occasionally, mainly underties for diphthongs. Is piratas = pʲ ɪ r ä̌ː t̪ ɐ s̪ ??? Why at all, when baltas = b aː l t ɐ s? (spoiler: the last syllables are really the same here). Are you sure that this is what you really want the model to predict, and that each case where just an undertie is lacking should be counted as an error by your scoring?

kylebgorman commented 4 years ago

I suspect the double transcription of glides is in error. It strikes me as very unlikely that any language contrasts /Cʲ j/ and /C j/, where C is some consonant. I suspect lietuvis is the error there.

I also think your suspicion about the underbar indicating dental (vs. alveolar) is probably right, but I don't know enough about the language to do anything about it.

If you or someone you know is knowledgeable about the language and willing to assist in some kind of correction effort, please get in touch with me off-thread and we can discuss it. In my lab, two students are trying to develop methods to correct deficiencies in the English data and ultimately get those corrections back onto Wiktionary. I am not sure if we can do this within the timeline of the shared task, though...TBD.

(Sidebar: for many languages Wiktionary provides a Help: page with information about how to map the orthography onto IPA. Here's the one for Lithuanian, which mentions offglides but not a dental/alveolar contrast: https://en.wikipedia.org/wiki/Help:IPA/Lithuanian)

antecedent commented 4 years ago

I speak Lithuanian natively. I'm also familiar with its various phonological analyses and the traditions of transcribing it. (I also don't have any academic credentials to date, but I guess I can have my professors double-check whatever I suggest.)

Really, if it's Task 1 we're speaking of, the data has much more noise than one would expect. I honestly think nobody's at fault here and we might as well have had to learn a lesson about Wiktionary's lack of consistency the hard way. But if we can still correct the data and make this competition a fairer one, I'd be glad to help.

@Nofenigma's suspicions are accurate. For another small but illustrative instance, the ⟨uo⟩ vowel is uniformly (see notes) /uə/, and is also a single segment. In the dev-data, however, we find only bi-segmental /u + ɔ/, /ʊ + o/, /u + o/, /ʊ + ɔ/ and /u + ə/.

I'll try to compile the inconsistencies into a list today.

Also, for an up-to-date description of the phonemic inventory, see Baltic linguistics – State of the art.

Edit: also, this: http://www.esparama.lt/es_parama_pletra/failai/ESFproduktai/2014__Theoretical_Foundations_of_Lithuanian_Phonology.pdf (authoritative and comprehensive, but uses a traditional phonetic alphabet instead of IPA).

Note 1: ⟨uo⟩ is also /ʊ + ɔ/ or /ʊ + oː/ whenever there's hiatus. There isn't a hiatus anywhere in the dev-data, though.

Note 2: /uə/, with a [ə], is what we've been doing for at least a decade already, but any other transcription that is uniform would obviously work as well.

Note 3: there's also a two-way tonal distinction in some varieties of Lithuanian, and /uə/ can be the tone-bearing vowel. One may suspect that the variety of transcriptions could be related to tone, but:

- it doesn't seem that way, especially since the data shows much more variation,
- any proposed links between the quality of the ⟨uo⟩ vowel and its tone are tenuous at best,
- as such, I've never seen them incorporated into an official transcription of anything,
- the data in general does not distinguish between the tones.
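
The leveling of ⟨uo⟩ described above could be sketched roughly as follows. This is only an illustration, not the script discussed later in the thread: the function name `merge_uo` is hypothetical, the target symbol "uə" simply follows Note 2, and the sketch assumes there are no hiatus cases (Note 1), so a real fix would also need to check the spelling for ⟨uo⟩ before merging.

```python
# Bi-segmental variants of <uo> reported for the dev data in this comment.
UO_VARIANTS = {("u", "ɔ"), ("ʊ", "o"), ("u", "o"), ("ʊ", "ɔ"), ("u", "ə")}


def merge_uo(segments):
    """Return a new list in which each variant pair becomes the single segment 'uə'.

    Assumption: every matching pair really spells <uo>; hiatus is not handled.
    """
    out = []
    i = 0
    while i < len(segments):
        pair = tuple(segments[i:i + 2])
        if pair in UO_VARIANTS:
            out.append("uə")  # one segment instead of two
            i += 2
        else:
            out.append(segments[i])
            i += 1
    return out
```

The input is a list of space-separated segments, as in the TSV files, e.g. `merge_uo("k ɐ m u ə lʲ iː s".split())`.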

antecedent commented 4 years ago

1. /n ~ ŋ/ and /nʲ ~ ŋʲ/ are allophone pairs: velars (ŋ, ŋʲ) before velars (k, g, x, kʲ, gʲ, xʲ), dentals (n, nʲ) elsewhere. The dataset should use either broad or narrow transcription in this case, but instead uses a mixture of both.
2. Palatalized consonants (Cʲ) and plain ones (C) harmonize locally; that is, a consonant cluster can be either all-palatalized or all-plain. Velars /k g x/ are optional blockers of this harmony, so C + C + g + Cʲ + Cʲ is OK (where the first two Cs are plain, the last two palatalized, and a velar stands in between). The blocking variants are uncommon, though.
3. /ts tʲsʲ tʃ tʲʃʲ dz dʲzʲ dʒ dʲʒʲ/ occur both as affricates (unisegmental) and clusters (bisegmental). The distinction is subtle and often ignored in transcription. Again, we should either use (3a) affricates everywhere, (3b) clusters everywhere or (3c) actually follow the distribution.
4. The ⟨uo⟩ issue; see previous comment.
5. ⟨ie⟩ is a unisegmental /iə/; the ⟨uo⟩ comment holds verbatim for the front counterpart.
6. /ɾ/ (the tap) phonologically equals /r/ (the trill). Analogously for /ɾʲ ~ rʲ/.
7. Ditto for /ʋ/ and /v/. Analogously for palatalized counterparts.
8. Tones: one word is transcribed /ɡʲ îː s l ɐ/ in lit_dev.tsv and another /k ɐ m u ə lʲ ǐː s̪/. The rising/falling diacritics are tones. Most words have them in tonal Lithuanian, but only these and a few others have them transcribed in the dataset. Should probably discard these altogether.
9. As mentioned, we should really drop the dental diacritic; it doesn't correspond to any phonemic distinction.
10. /o/ isn't a thing. It's either /oː/, /ɔ/ or a part of /uə/.
11. [ɫ] is just a non-palatalized (plain) /l/.

Instead of exemplifying the points, I guess I'll just submit a sample correction of the dev-dataset.
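
Several of the points above (the broad reading of the nasals, taps, /ʋ ~ v/, tones, the dental diacritic and [ɫ]) amount to symbol-level leveling, which might be sketched like this. The names `normalize_segment` and `normalize_pron` are hypothetical, and the direction of each merge (ɾ to r, ʋ to v, ŋ to n) is an arbitrary assumption; the thread only asks that the choice be uniform. The affricate, harmony and diphthong points are not covered.

```python
import unicodedata

# Combining marks to discard: dental bridge below, circumflex (falling tone),
# caron (rising tone).
DROP_MARKS = {"\u032A", "\u0302", "\u030C"}

# Allophones merged into one symbol each; the direction is an assumption.
SEGMENT_MAP = str.maketrans({"ɾ": "r", "ʋ": "v", "ɫ": "l", "ŋ": "n"})


def normalize_segment(seg):
    """Strip tone and dental marks from one segment and merge allophones."""
    # Decompose first so that precomposed letters such as 'î' expose their marks.
    decomposed = unicodedata.normalize("NFD", seg)
    kept = "".join(ch for ch in decomposed if ch not in DROP_MARKS)
    return unicodedata.normalize("NFC", kept).translate(SEGMENT_MAP)


def normalize_pron(pron):
    """Apply normalize_segment to a space-separated transcription."""
    return " ".join(normalize_segment(s) for s in pron.split())
```

Other diacritics, such as palatalization and length marks, pass through untouched, so a segment like ɾʲ comes out as rʲ.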

kylebgorman commented 4 years ago

Hi Ignas,

Transcription inconsistencies in the data (even to this degree) are something we expect (see, e.g., section 4.3 of the WikiPron paper: http://wellformedness.com/papers/lee-etal-2020.pdf).

That said, we would appreciate corrections from subject experts, but for these to be maximally useful, can I suggest they be expressed programmatically (and ideally, as a program that processes the TSV files)? It does us no good to make corrections to, say, train and dev, but not test. And a programmatic fix can also be applied back to Wiktionary itself, so that users of that website---not just participants in this shared task, or users of WikiPron---can benefit from your work.

We are also pursuing programmatic fixes for English at the moment.
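
A correction expressed as a program over the TSV files might take roughly this shape. This is a minimal sketch: `rewrite_tsv` and the `fix` callback are hypothetical names, and the sketch assumes each non-empty line holds exactly two tab-separated columns (wordform, then space-separated transcription), as in lit_dev.tsv.

```python
def rewrite_tsv(src_path, dst_path, fix):
    """Copy a word<TAB>pronunciation file, replacing each pronunciation
    with fix(word, pron). Blank lines are skipped."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.rstrip("\r\n")
            if not line.strip():
                continue
            word, pron = line.split("\t")  # assumes exactly two columns
            dst.write(f"{word}\t{fix(word, pron)}\n")


# Hypothetical usage: run the same fix over every split so that train, dev
# and test stay consistent (file names other than lit_dev.tsv are assumed).
# for split in ("train", "dev", "test"):
#     rewrite_tsv(f"lit_{split}.tsv", f"lit_{split}.fixed.tsv", my_fix)
```

Passing the wordform to `fix` as well as the transcription leaves room for fixes that consult the spelling.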

antecedent commented 4 years ago

I found that simply redoing the transcription from the spelling (basically running a rule-based grapheme-to-phoneme transducer) works best. There isn't a lot of orthographic depth, so even without a lexicon, the ambiguities that arise are few. The vast majority of them can be filled in from the Wiktionary transcriptions. This script (Python 3) attempts to do both the transduction and the disambiguation. Currently, the disambiguation step fails on about 3% of the entries, which the script "handles" by omitting them in the output.

For now, though, another issue is holding this back from production use: the script indiscriminately transcribes monophthongal ⟨o⟩ as /oː/. Some loanwords, especially Latinate, other (neo-)Classical, and English ones, should have /ɔ/ instead. This distinction could possibly be filled in from the Wiktionary data as well, but a lexicon might turn out to be necessary too.

By the way, the WER for the FST baseline dropped 3-fold once I had run the script.

antecedent commented 4 years ago

I've updated the script to tell apart the ⟨o⟩s, aside from other fixes. Aligning with the original transcriptions (the current strategy) yields only mediocre accuracy for the ⟨o⟩ problem, but at least the script doesn't introduce additional errors or inconsistencies now. The set of all errors that remain is a strict subset of the original errors, judging by a diff on the train and dev data. The 3% omission rate remains, which I believe can be compensated for by scraping more entries. The WER for the particular baseline model is still almost as low as when the ⟨o⟩s had been completely leveled.

Please consider applying this script or any better strategy that you might conceive of.

kylebgorman commented 4 years ago

Thanks @antecedent, this is great work. I think these changes should take place upstream, to Wiktionary itself, and we can port them downstream if time allows.

To do this I would take the data scraped by WikiPron here and create a list of corrections (as wordform, corrected pronunciation pairs---leave out the entries not affected). Then, these can be uploaded en masse (I assume) to Wiktionary. (Perhaps a Wiktionarian can help there---there must be some kind of API for that.) You also might want to start a discussion at the Wiktionary appendix on Lithuanian pronunciation.

I would like to know how many entries are affected and whether they would require further manual vetting.
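
For what it's worth, such a corrections list (and the count of affected entries) could be derived along these lines. This is only a sketch: `corrections` is a hypothetical helper, and it assumes one pronunciation per wordform, which may not hold for the full scrape.

```python
def corrections(original, corrected):
    """Return sorted (wordform, corrected pronunciation) pairs for entries
    whose pronunciation changed; unaffected entries are left out.

    Both arguments map wordform -> pronunciation. Words missing from
    `corrected` (e.g. omitted by the fixing script) are ignored.
    """
    return [
        (word, corrected[word])
        for word in sorted(original)
        if word in corrected and corrected[word] != original[word]
    ]
```

The length of the returned list is the number of affected entries.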

@lfashby @jacksonllee thoughts?

lfashby commented 4 years ago

I agree with Kyle's suggestions and would add that if the Wiktionarians like the changes suggested by @antecedent's script, we should suggest that they attempt to add the script (translated into Lua) as a module for generating Lithuanian pronunciations from spellings. Many Wiktionary languages have such modules; here is one for Serbo-Croatian. There was an attempt to build a pronunciation module for Lithuanian, but it does not look like it is being used currently.

antecedent commented 4 years ago

Thank you. I think this can be done.

As for the manual vetting, I guess it would be prudent to proofread the ⟨o⟩s, after all. Given that we now have an exhaustive list of candidates (that is, the full WikiPron scrape) for the Wiktionary changes, and it is not a very long one, it would probably be easiest to mark the relevant /ɔ/-loanword lemmas manually.

Another minor nuisance is that we would have to put the stress marks and syllable boundaries back in. I just ran a separate scrape that includes them; now, another script will be needed to reconcile the transcriptions in this respect.

Once I have that, I will provide the exact counts, but they seem to be large. Unfortunately, there seem to have been several independent initiatives to introduce IPA transcriptions for Lithuanian to Wiktionary, and they decided on many things differently.

kylebgorman commented 4 years ago

WikiPron can scrape entries with stress marks and syllable boundaries:

$ wikipron --phonetic --no-segment lit

We just don't distribute the scraped data with stress and syllable boundaries because in many languages, these are only marked occasionally and we haven't taken the time to vet all 167 yet.

If you just modify your script so that it passes through that information (and doesn't require segmentation), you should be good to go.

If you like, we can move this issue over to the WikiPron repo. We are doing a similar improvement project for English.

aryamanarora commented 3 years ago

I'm working on upstream improvements to Lithuanian on Wiktionary by improving the pronunciation module. It would be very useful if a native speaker (@Nofenigma @antecedent ?) could add testcases to the testcase list: https://en.wiktionary.org/wiki/Module:lt-pron/testcases. This way we can catch any tricky cases and improve the rule-based system as much as possible, which can then be deployed to every Lithuanian page and replace the ad-hoc transcriptions.

gailius-r commented 1 year ago

You can generate or check the validity of Lithuanian IPA transcriptions using this online service: https://kalbu.vdu.lt/mokymosi-priemones/tartis/#fonetinis-transkribuoklis

This tool is rule-based and was developed jointly by linguists and computer scientists. It has been extensively tested.