marytts / marytts-lexicon-pl

Polish lexicon for MaryTTS
GNU Lesser General Public License v3.0

IPA mapping for Polish lexicon? #5

Closed brawer closed 6 years ago

brawer commented 6 years ago

What is the correct mapping from allophones in the MaryTTS Polish lexicon to IPA?

For Unicode’s Unilex project, I’m trying to convert the Polish pronunciation dictionary from MaryTTS to the International Phonetic Alphabet. (Thanks again for collaborating with Unicode, you really have nice data!) For Polish, MaryTTS appears to use a custom phonetic notation that is mostly like X-SAMPA, but with some modifications. For example, the X-SAMPA symbol y corresponds to IPA y, which isn’t a Polish vowel; MaryTTS probably uses y for the IPA vowel ɨ, but I’m not fully sure.

Here’s a first attempt at a mapping table. If you could kindly tell me the correct mappings (especially for the rows with *****), I’ll gladly volunteer to add an ipa attribute to allophones.pl.xml, similar to the pull requests I’ve sent you for Luxembourgish, German and French. Perhaps more importantly, we can then import your data into Unicode’s lexicon.

MaryTTS Count IPA
. 35427 .
o 8487 o
a 8429 a
e 7937 e
n 5625 n
r 5599 r
v 5382 v
t 5187 t
j 5067 j
aa 4850 *****
p 4687 p
s 4286 s
y 4046 *****
oo 3999 *****
m 3964 m
k 3783 k
ee 3712 *****
i 3347 i
d 3296 d
u 3226 u
l 3183 l
ni 2978 *****
w 2733 w
z 2585 z
g 2101 ɡ
Sz 2005 *****
ts 1988 t͡s
f 1883 f
b 1773 b
x 1580 x
ii 1527 *****
tSz 1501 *****
yy 1331 *****
uu 1322 *****
tsi 1273 *****
rZ 1165 *****
w_ 1158 *****
dzi 1004 *****
si 976 *****
c 772 *****
dz 421 d͡z
N_ 326 *****
zi 219 *****
gi 143 *****
j_ 109 *****
drZ 39 *****

psibre commented 6 years ago

@brawer The lexicon was assembled from resources provided by @jolabachan. I have to admit I'm not sure about the relation to IPA -- maybe @jolabachan can help?

jolabachan commented 6 years ago

Hi! I wasn't the creator of the notation, but I will do my best. The doubled vowel symbols (aa, ee, ii, oo, uu, yy) mark accented syllables. We don't have long and short vowels in Polish, but this is how accented syllables were marked in this notation. These are not always long vowels, but they are more prominent than the unaccented ones. The "." marks the end of a syllable. Here are some examples:

kazał => k aa . z a w
kazała => k a . z aa . w a
krzyczeć => k Sz yy . tSz e tsi

MaryTTS Count IPA
. 35427 .
o 8487 o
a 8429 a
e 7937 e
n 5625 n
r 5599 r
v 5382 v
t 5187 t
j 5067 j
aa 4850 ***** a
p 4687 p
s 4286 s
y 4046 ***** ɨ
oo 3999 ***** o
m 3964 m
k 3783 k
ee 3712 ***** e
i 3347 i
d 3296 d
u 3226 u
l 3183 l
ni 2978 ***** ɲ
w 2733 w
z 2585 z
g 2101 ɡ
Sz 2005 ***** ʃ
ts 1988 t͡s
f 1883 f
b 1773 b
x 1580 x
ii 1527 ***** i
tSz 1501 ***** t͡ʃ
yy 1331 ***** ɨ
uu 1322 ***** u
tsi 1273 ***** t͡ɕ
rZ 1165 ***** ʒ
w_ 1158 ***** w̃
dzi 1004 ***** d͡ʑ
si 976 ***** ɕ
c 772 ***** c [as in word 'kitel', palatalised /k/]
dz 421 d͡z
N_ 326 ***** ŋ
zi 219 ***** ʑ
gi 143 ***** ɟ [as in word 'gips', palatalised /g/]
j_ 109 ***** j̃
drZ 39 ***** d͡ʒ
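
For anyone who wants to apply the table above, here is a minimal sketch in Python. It assumes transcriptions are space-separated symbols with "." between syllables, as in the examples earlier in this comment. The names MARY_TO_IPA, STRESSED and to_ipa are made up for illustration and are not part of MaryTTS. The optional stress mark (ˈ before a syllable containing a doubled vowel) is also an assumption; the table itself maps doubled vowels to plain vowels.

```python
# Mapping copied from the table above (MaryTTS symbol -> IPA).
MARY_TO_IPA = {
    ".": ".", "o": "o", "a": "a", "e": "e", "n": "n", "r": "r", "v": "v",
    "t": "t", "j": "j", "aa": "a", "p": "p", "s": "s", "y": "ɨ", "oo": "o",
    "m": "m", "k": "k", "ee": "e", "i": "i", "d": "d", "u": "u", "l": "l",
    "ni": "ɲ", "w": "w", "z": "z", "g": "ɡ", "Sz": "ʃ", "ts": "t͡s",
    "f": "f", "b": "b", "x": "x", "ii": "i", "tSz": "t͡ʃ", "yy": "ɨ",
    "uu": "u", "tsi": "t͡ɕ", "rZ": "ʒ", "w_": "w̃", "dzi": "d͡ʑ", "si": "ɕ",
    "c": "c", "dz": "d͡z", "N_": "ŋ", "zi": "ʑ", "gi": "ɟ", "j_": "j̃",
    "drZ": "d͡ʒ",
}

# Doubled vowels mark the accented syllable in this notation.
STRESSED = {"aa", "ee", "ii", "oo", "uu", "yy"}


def to_ipa(transcription, mark_stress=False):
    """Convert a MaryTTS Polish transcription to an IPA string.

    Hypothetical helper: raises KeyError for symbols not in the table.
    """
    syllables = []
    for syllable in transcription.split("."):
        symbols = syllable.split()
        if not symbols:
            continue  # skip empty pieces, e.g. from a trailing "."
        ipa = "".join(MARY_TO_IPA[s] for s in symbols)
        # Assumption: prefix the IPA primary stress mark when requested.
        if mark_stress and any(s in STRESSED for s in symbols):
            ipa = "ˈ" + ipa
        syllables.append(ipa)
    return ".".join(syllables)


if __name__ == "__main__":
    for word, mary in [
        ("kazał", "k aa . z a w"),
        ("kazała", "k a . z aa . w a"),
        ("krzyczeć", "k Sz yy . tSz e tsi"),
    ]:
        print(word, to_ipa(mary), to_ipa(mary, mark_stress=True))
```

Symbols are looked up whole after splitting on spaces, so multi-character symbols such as tSz or dzi need no special handling.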

Good luck! :-)

psibre commented 6 years ago

@jolabachan Thanks for the helpful clarifications!