Closed brawer closed 6 years ago
@brawer The lexicon was assembled from resources provided by @jolabachan. I have to admit I'm not sure about the relation to IPA -- maybe @jolabachan can help?
Hi! I wasn't the creator of the notation, but I will do my best. The doubled vowel symbols mark accented syllables. We don't have long and short vowels in Polish, but this is how accented vowels were marked in this notation. The "." marks the end of a syllable. Here are some examples:

- kazał => k aa . z a w
- kazała => k a . z aa . w a
- krzyczeć => k Sz yy . tSz e tsi

These are not always long vowels, but they are more prominent than the unaccented ones.
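The conventions described above (space-separated symbols, "." as the syllable boundary, a doubled vowel symbol marking the accented syllable) can be sketched in a few lines of Python. This is only an illustration of the notation as explained in this thread. The names `split_syllables` and `stressed_syllable_index` are hypothetical and are not part of MaryTTS.

```python
# Doubled vowel symbols mark the accented (more prominent) vowel.
# The set is taken from the symbols listed in this thread.
STRESSED_VOWELS = {"aa", "ee", "ii", "oo", "uu", "yy"}


def split_syllables(transcription):
    """Split a space-separated transcription into syllables at each '.'."""
    syllables = [[]]
    for symbol in transcription.split():
        if symbol == ".":
            syllables.append([])  # '.' closes the current syllable
        else:
            syllables[-1].append(symbol)
    # Drop empty syllables, e.g. from a trailing '.'
    return [syllable for syllable in syllables if syllable]


def stressed_syllable_index(transcription):
    """Return the index of the first syllable with a doubled vowel, or None."""
    for index, syllable in enumerate(split_syllables(transcription)):
        if any(symbol in STRESSED_VOWELS for symbol in syllable):
            return index
    return None


print(split_syllables("k aa . z a w"))
print(stressed_syllable_index("k a . z aa . w a"))
```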
MaryTTS | Count | IPA |
---|---|---|
. | 35427 | . |
o | 8487 | o |
a | 8429 | a |
e | 7937 | e |
n | 5625 | n |
r | 5599 | r |
v | 5382 | v |
t | 5187 | t |
j | 5067 | j |
aa | 4850 | ***** a |
p | 4687 | p |
s | 4286 | s |
y | 4046 | ***** ɨ |
oo | 3999 | ***** o |
m | 3964 | m |
k | 3783 | k |
ee | 3712 | ***** e |
i | 3347 | i |
d | 3296 | d |
u | 3226 | u |
l | 3183 | l |
ni | 2978 | ***** ɲ |
w | 2733 | w |
z | 2585 | z |
g | 2101 | ɡ |
Sz | 2005 | ***** ʃ |
ts | 1988 | t͡s |
f | 1883 | f |
b | 1773 | b |
x | 1580 | x |
ii | 1527 | ***** i |
tSz | 1501 | ***** t͡ʃ |
yy | 1331 | ***** ɨ |
uu | 1322 | ***** u |
tsi | 1273 | ***** t͡ɕ |
rZ | 1165 | ***** ʒ |
w_ | 1158 | ***** w̃ |
dzi | 1004 | ***** d͡ʑ |
si | 976 | ***** ɕ |
c | 772 | ***** c [as in word 'kitel', palatalised /k/] |
dz | 421 | d͡z |
N_ | 326 | ***** ŋ |
zi | 219 | ***** ʑ |
gi | 143 | ***** ɟ [as in word 'gips', palatalised /g/] |
j_ | 109 | ***** j̃ |
drZ | 39 | ***** d͡ʒ |
Good luck! :-)
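As a rough sketch of how the table above could be applied, the following Python snippet copies the 46 rows into a dictionary and converts a transcription symbol by symbol. The function name `to_ipa` is hypothetical. Because the table maps doubled vowels to the plain IPA vowel, this sketch drops the accent information; whether and how to mark stress in the IPA output is left open here.

```python
# MaryTTS Polish symbol -> IPA, copied row by row from the table in this thread.
MARYTTS_TO_IPA = {
    ".": ".", "o": "o", "a": "a", "e": "e", "n": "n", "r": "r", "v": "v",
    "t": "t", "j": "j", "aa": "a", "p": "p", "s": "s", "y": "ɨ", "oo": "o",
    "m": "m", "k": "k", "ee": "e", "i": "i", "d": "d", "u": "u", "l": "l",
    "ni": "ɲ", "w": "w", "z": "z", "g": "ɡ", "Sz": "ʃ", "ts": "t͡s",
    "f": "f", "b": "b", "x": "x", "ii": "i", "tSz": "t͡ʃ", "yy": "ɨ",
    "uu": "u", "tsi": "t͡ɕ", "rZ": "ʒ", "w_": "w̃", "dzi": "d͡ʑ", "si": "ɕ",
    "c": "c", "dz": "d͡z", "N_": "ŋ", "zi": "ʑ", "gi": "ɟ", "j_": "j̃",
    "drZ": "d͡ʒ",
}


def to_ipa(transcription):
    """Convert a space-separated MaryTTS transcription to an IPA string.

    Raises KeyError for a symbol that is not in the table, so unmapped
    symbols are noticed rather than silently passed through.
    """
    return "".join(MARYTTS_TO_IPA[symbol] for symbol in transcription.split())


print(to_ipa("k aa . z a w"))
print(to_ipa("k Sz yy . tSz e tsi"))
```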
@jolabachan Thanks for the helpful clarifications!
What is the correct mapping from allophones in the MaryTTS Polish lexicon to IPA?
For Unicode’s Unilex project, I’m trying to convert the Polish pronunciation dictionary from MaryTTS to the International Phonetic Alphabet. (Thanks again for collaborating with Unicode, you really have nice data!) For Polish, MaryTTS seems to use a custom phonetic notation that is mostly like X-SAMPA, but with custom modifications. For example, the X-SAMPA symbol `y` would be `y` in IPA, which isn’t a Polish vowel; probably MaryTTS uses `y` for the IPA vowel `ɨ`, but I’m not fully sure.

Here’s a first attempt for a mapping table. If you could kindly tell me the correct mappings (especially for the rows with *****), I’ll gladly volunteer to add an `ipa` attribute to `allophones.pl.xml`, similar to the pull requests I’ve sent you for Luxembourgish, German and French. Perhaps more importantly, we can then import your data into Unicode’s lexicon.
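To illustrate what adding an `ipa` attribute could look like, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. It assumes that each allophone is an XML element carrying its symbol in a `ph` attribute; the inline sample document, the symbol `q`, and the function name `add_ipa_attributes` are made up for illustration and are not copied from the real `allophones.pl.xml`. The two mapping entries come from the table in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened stand-in for an allophones file.
# Assumption: every allophone element stores its symbol in a "ph" attribute.
SAMPLE = """<allophones name="sampa" xml:lang="pl">
  <vowel ph="y"/>
  <consonant ph="Sz"/>
  <consonant ph="q"/>
</allophones>"""

# Two entries from the mapping table; a real run would use the full table.
IPA = {"y": "ɨ", "Sz": "ʃ"}


def add_ipa_attributes(root, mapping):
    """Set ipa="..." on every element with a known "ph" value.

    Returns the list of "ph" values that had no mapping, so that gaps
    in the table can be reviewed by hand.
    """
    missing = []
    for element in root.iter():
        ph = element.get("ph")
        if ph is None:
            continue  # not an allophone element
        if ph in mapping:
            element.set("ipa", mapping[ph])
        else:
            missing.append(ph)
    return missing


root = ET.fromstring(SAMPLE)
missing = add_ipa_attributes(root, IPA)
print(ET.tostring(root, encoding="unicode"))
print("unmapped symbols:", missing)
```

Collecting the unmapped symbols instead of failing on the first one makes it easier to see which rows of the table still need a confirmed IPA value.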