sigmorphon / 2020

SIGMORPHON 2020 Shared Task: Grapheme-to-Phoneme, Unsupervised Induction of Morphology, and Typologically Diverse Morphological Inflection

Inconsistent Bulgarian transcriptions #9

Open bpopeters opened 4 years ago

bpopeters commented 4 years ago

Similarly to the recent discussion of Georgian, I have noticed various inconsistencies in the Bulgarian data.

  1. Alveolar vs. dental t/d: [d], [d̪], [t], and [t̪] all occur in the data. There are some near-minimal pairs like туй -> [t̪ u j] and тук -> [t u k]. My Bulgarian-speaking consultant does not think the alveolar-dental pairs are allophones, and at any rate their usage seems to be random.
  2. Affricates: both joined forms (like t͡s) and separated ones (like t s) occur in the data.
  3. Light vs. dark L: according to this reference (http://www.personal.rdg.ac.uk/~llsroach/phon2/b_phon/b_phon.htm), [l] and [ɫ] are allophones, with [l] before front vowels and [ɫ] elsewhere. However, their occurrence in the data appears to be basically random.
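One way to put numbers on the three points above is to count the competing variants directly. The following is a minimal sketch, assuming the data is available as (word, transcription) pairs with space-separated phones, as in the examples quoted in this thread; the function name `variant_counts` and the sample rows are illustrative and not part of the task tooling.

```python
from collections import Counter

DENTAL = "\u032a"  # combining bridge below, as in t̪ / d̪
TIE = "\u0361"     # combining tie bar, as in t͡s
DARK_L = "\u026b"  # ɫ


def variant_counts(rows):
    """Count competing phone variants in (word, transcription) pairs.

    Each transcription is assumed to be a string of space-separated phones.
    """
    counts = Counter()
    for _word, pron in rows:
        phones = pron.split()
        for i, phone in enumerate(phones):
            base = phone.replace(DENTAL, "")
            # Coronal stops (plain or dental), laterals, and tied affricates.
            if base in ("t", "d") or phone in ("l", DARK_L) or TIE in phone:
                counts[phone] += 1
            # An untied stop followed by a sibilant may be a split affricate.
            if base == "t" and i + 1 < len(phones) and phones[i + 1] in ("s", "\u0283"):
                counts[phone + " " + phones[i + 1]] += 1
    return counts


# Sample rows taken from examples quoted in this thread.
rows = [
    ("туй", "t" + DENTAL + " u j"),
    ("тук", "t u k"),
    ("болница", "b ɔ l n i t" + TIE + "s ə"),
    ("акцент", "ə k t s ɛ n t"),
]
for phone, n in variant_counts(rows).most_common():
    print(phone, n)
```

Run over the full training file, a table like this would show how skewed each alternation is, which may help decide whether a variant is noise or a pattern.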

kylebgorman commented 4 years ago

Could you move this over to the WikiPron issue tracker as well (cf. https://github.com/kylebgorman/wikipron/issues/138) and follow the procedure I suggested there?

  1. We need to pick one as the UR. Does your informant have any opinions on the subject?
  2. Affricates should be done with the tie, I agree.
  3. Let's assume light /l/ is the UR.
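The phone-level part of these decisions could be applied with a small normalizer. This is a sketch under assumptions, not the WikiPron procedure: the function name `normalize` and the `coronal` parameter are hypothetical, and because the dental vs. alveolar choice is still open at this point, the default leaves t/d untouched. Joining affricates is not attempted here, since it depends on sequence context rather than on single phones.

```python
DENTAL = "\u032a"  # combining bridge below, as in t̪ / d̪
DARK_L = "\u026b"  # ɫ


def normalize(pron, coronal="keep"):
    """Normalize a space-separated transcription.

    Dark ɫ is always mapped to light l (point 3 above).
    coronal: "keep" leaves t/d as found, "dental" adds the dental
    diacritic to plain t/d, "alveolar" strips it from t̪/d̪.
    """
    if coronal not in ("keep", "dental", "alveolar"):
        raise ValueError("coronal must be 'keep', 'dental', or 'alveolar'")
    out = []
    for phone in pron.split():
        if phone == DARK_L:
            phone = "l"
        elif coronal == "dental" and phone in ("t", "d"):
            phone = phone + DENTAL
        elif coronal == "alveolar" and phone in ("t" + DENTAL, "d" + DENTAL):
            phone = phone[0]
        out.append(phone)
    return " ".join(out)


print(normalize("t u k", coronal="dental"))
print(normalize("t" + DENTAL + " u j", coronal="alveolar"))
```

Keeping the coronal choice as a parameter means the same pass can be rerun once a UR has been picked.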

If these inconsistencies can't be resolved in time, either they're truly random noise that will affect all participants equally, or there are some consistencies that systems will glom onto.

In the worst case we'll put out a post-competition revision of the data set.

It is fascinating to me how Phoible continues to be unhelpful in resolving these issues: https://phoible.org/languages/bulg1262.

bpopeters commented 4 years ago

Thanks, I'll open an issue on WikiPron.

My informant says it's alveodental. The reference I linked above labels them as dental, as does Klagstad (1958).

besou commented 4 years ago

Affricates: both joined forms (like t͡s) and separated ones (like t s) occur in the data.

Be careful with t + s: in some cases, it can also appear at morpheme boundaries, where it represents two phonemes (written as тс in Bulgarian orthography). Examples include хърватски h ə r v ɑ t s k i and отсъстваха o t̪ s ɤ s t̪ v ə x ə. When t + s corresponds to the letter ц, it is monophonemic and should probably be transcribed as t͡s. It is true that this is not applied consistently in the data. Examples of the monophonemic ц from the training data include абстракция a p s t r a k t͡s i j ə, болница b ɔ l n i t͡s ə, администрация ə d m i n i s t r a t s i j ə and акцент ə k t s ɛ n t.

Please also note that, interestingly, the letter ч is never represented as t͡ʃ in the training data, but always as t ʃ, although it is clearly a single phoneme. Examples include вечер v ɛ t ʃ ɛ r and чужденец t ʃ u ʒ d ɛ n ɛ t s. I suppose this might be because, unlike t + s, it does not contrast with a biphonemic sequence t + ʃ (i.e. тш in Bulgarian orthography): that is not a typical phoneme sequence in Bulgarian and does not appear even once in the training data.
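The orthographic cue described above (ц or ч for the affricate, тс for the cluster) can be turned into a conservative repair pass. This is only a sketch of one possible heuristic, with a hypothetical function name `join_affricates`: it pairs orthographic and phonetic events left to right, and returns None whenever the counts disagree so that the entry can be checked by hand. It also assumes the joined form is always written t͡s or t͡ʃ, which drops any dental diacritic on the stop.

```python
TIE = "\u0361"     # combining tie bar, as in t͡s
DENTAL = "\u032a"  # combining bridge below, as in t̪
STOPS = ("t", "t" + DENTAL)


def join_affricates(word, pron, letter="ц", digraph="тс", fricative="s"):
    """Tie stop + fricative where the spelling has `letter`.

    Sequences spelled with `digraph` stay split. Returns None when the
    spelling and the transcription cannot be paired one to one.
    """
    # Orthographic events, left to right: True = affricate letter,
    # False = two-letter cluster.
    ortho = []
    w = word.lower()
    i = 0
    while i < len(w):
        if w[i] == letter:
            ortho.append(True)
        elif w.startswith(digraph, i):
            ortho.append(False)
            i += 1  # skip the second letter of the digraph
        i += 1

    phones = pron.split()
    tied = "t" + TIE + fricative
    out, k, j = [], 0, 0
    while j < len(phones):
        p = phones[j]
        is_tied = p == tied
        is_split = p in STOPS and j + 1 < len(phones) and phones[j + 1] == fricative
        if is_tied or is_split:
            if k >= len(ortho):
                return None  # more phonetic events than spelling predicts
            if ortho[k]:
                out.append(tied)              # single letter: one phoneme
            elif is_tied:
                out.extend(["t", fricative])  # cluster spelled out: two phonemes
            else:
                out.extend([p, fricative])    # already split: keep as is
            k += 1
            j += 2 if is_split else 1
        else:
            out.append(p)
            j += 1
    return " ".join(out) if k == len(ortho) else None


print(join_affricates("акцент", "ə k t s ɛ n t"))
print(join_affricates("хърватски", "h ə r v ɑ t s k i"))
print(join_affricates("вечер", "v ɛ t ʃ ɛ r", letter="ч", digraph="тш", fricative="ʃ"))
```

Returning None rather than guessing keeps the pass from silently introducing new errors; flagged entries would still need manual review.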

Light vs. dark L: according to this reference, [l] and [ɫ] are allophones, with [l] before front vowels and [ɫ] elsewhere. However, their occurrence in the data appears to be basically random.

Regarding /l/, the situation is also not so easy. /l/ has two allophones, [ɫ] and [l]. In the syllable onset, [ɫ] is used before /u/, /ɔ/, /a/, and /ɤ/, and [l] is used before /i/ and /ɛ/. In addition, there is a palatal /lʲ/, which is a phoneme in its own right and can appear in the syllable onset before /u/, /ɔ/, and /a/. In the syllable coda, one would typically expect to find [ɫ] according to the rule mentioned in the initial post. However, in some cases, one finds [l] consistently in the coda, and these are words which have a palatal /lʲ/ in other Slavic languages. Examples include болница b ɔ l n i t͡s ə (compare to Russian больница), актуалност ə k t u a l n o s t (compare to Russian актуальность), and писател p i s a t ɛ l (compare to Russian писатель). To conclude, it seems that while the phoneme /l/ has the two allophones [ɫ] and [l], the phoneme /lʲ/ also has two allophones, namely [lʲ] and [l]. The phone [l] is thus an allophone of two different phonemes: it is an allophone of /l/ in syllable onsets, but an allophone of /lʲ/ in syllable codas. Since the assumed phonemic distinction between /l/ ([ɫ]) and /lʲ/ ([l]) in syllable codas is not reflected in writing, it cannot be easily checked whether the data is consistent here.
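The onset half of this rule is mechanical enough to check automatically. Below is a minimal sketch, assuming space-separated phones and treating any lateral immediately followed by a vowel as an onset; the function name `onset_lateral_violations` is hypothetical. Laterals before other vowels (for example reduced ə) and coda laterals are skipped, since, as noted above, the coda cases cannot be verified from spelling.

```python
DARK_L = "\u026b"                            # ɫ
BACK = {"u", "\u0254", "a", "\u0264"}        # u ɔ a ɤ: expect ɫ before these
FRONT = {"i", "\u025b"}                      # i ɛ: expect l before these


def onset_lateral_violations(pron):
    """Return (index, lateral, vowel) for laterals that break the onset rule."""
    phones = pron.split()
    bad = []
    for i in range(len(phones) - 1):
        phone, nxt = phones[i], phones[i + 1]
        if phone == "l" and nxt in BACK:
            bad.append((i, phone, nxt))   # light l where ɫ is expected
        elif phone == DARK_L and nxt in FRONT:
            bad.append((i, phone, nxt))   # dark ɫ where l is expected
    return bad


# The coda l in болница is not flagged, because it precedes a consonant.
print(onset_lateral_violations("b \u0254 l n i t\u0361s \u0259"))
```

Counting violations across the training set would show whether the onset distribution is really random or mostly follows the rule.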

kylebgorman commented 4 years ago

Affricates: both joined forms (like t͡s) and separated ones (like t s) occur in the data.

Be careful with t + s: in some cases, it can also appear at morpheme boundaries, where it represents two phonemes (written as тс in Bulgarian orthography). Examples include хърватски h ə r v ɑ t s k i and отсъстваха o t̪ s ɤ s t̪ v ə x ə. When t + s corresponds to the letter ц, it is monophonemic and should probably be transcribed as t͡s. It is true that this is not applied consistently in the data. Examples of the monophonemic ц from the training data include абстракция a p s t r a k t͡s i j ə, болница b ɔ l n i t͡s ə, администрация ə d m i n i s t r a t s i j ə and акцент ə k t s ɛ n t.

Please also note that, interestingly, the letter ч is never represented as t͡ʃ in the training data, but always as t ʃ, although it is clearly a single phoneme. Examples include вечер v ɛ t ʃ ɛ r and чужденец t ʃ u ʒ d ɛ n ɛ t s. I suppose this might be because, unlike t + s, it does not contrast with a biphonemic sequence t + ʃ (i.e. тш in Bulgarian orthography): that is not a typical phoneme sequence in Bulgarian and does not appear even once in the training data.

Sounds like manual human intervention will be required here. This may be out of scope for the task.

Light vs. dark L: according to this reference, [l] and [ɫ] are allophones, with [l] before front vowels and [ɫ] elsewhere. However, their occurrence in the data appears to be basically random.

Regarding /l/, the situation is also not so easy. /l/ has two allophones, [ɫ] and [l]. In the syllable onset, [ɫ] is used before /u/, /ɔ/, /a/, and /ɤ/, and [l] is used before /i/ and /ɛ/. In addition, there is a palatal /lʲ/, which is a phoneme in its own right and can appear in the syllable onset before /u/, /ɔ/, and /a/. In the syllable coda, one would typically expect to find [ɫ] according to the rule mentioned in the initial post. However, in some cases, one finds [l] consistently in the coda, and these are words which have a palatal /lʲ/ in other Slavic languages. Examples include болница b ɔ l n i t͡s ə (compare to Russian больница https://en.wiktionary.org/wiki/%D0%B1%D0%BE%D0%BB%D1%8C%D0%BD%D0%B8%D1%86%D0%B0), актуалност ə k t u a l n o s t (compare to Russian актуальность https://en.wiktionary.org/wiki/%D0%B0%D0%BA%D1%82%D1%83%D0%B0%D0%BB%D1%8C%D0%BD%D0%BE%D1%81%D1%82%D1%8C), and писател p i s a t ɛ l (compare to Russian писатель https://en.wiktionary.org/wiki/%D0%BF%D0%B8%D1%81%D0%B0%D1%82%D0%B5%D0%BB%D1%8C). To conclude, it seems that while the phoneme /l/ has the two allophones [ɫ] and [l], the phoneme /lʲ/ also has two allophones, namely [lʲ] and [l]. The phone [l] is thus an allophone of two different phonemes: it is an allophone of /l/ in syllable onsets, but an allophone of /lʲ/ in syllable codas. Since the assumed phonemic distinction between /l/ ([ɫ]) and /lʲ/ ([l]) in syllable codas is not reflected in writing, it cannot be easily checked whether the data is consistent here.

FWIW we have no desire to transcribe allophony for this task, but I do see the issue with the conditional merger into [l] in syllable codas.

Thank you for the reports on both issues.