Syllable `token` doesn't always match syllable phonemes

evgenykochetkov commented 1 year ago

Here are some examples from Shakespeare's sonnet 1:

>>> import prosodic as p
>>> from_first_sonnet = p.Text("thereby beauty's self-substantial cruel within niggarding glutton")
>>> for w in from_first_sonnet.ents(cls='Word'): print(w.children)
[<Syllable.the> ['ðɛr], <Syllable.reby> ['baɪ]]
[<Syllable.bea> ['bjʉː], <Syllable.uty's> [tɪz]]
[<Syllable.self-self> ['sɛlf], <Syllable.-> [sʌb], <Syllable.subs> ['stæn], <Syllable.tantial> [ʃʌl]]
[<Syllable.cr> ['kruː], <Syllable.uel> [əl]]
[<Syllable.wit> [wɪ], <Syllable.hin> ['ðɪn]]
[<Syllable.nig> ['nɪ], <Syllable.gar> [ɡʌ], <Syllable.ding> [dɪŋ]]
[<Syllable.glut> ['ɡlʌ], <Syllable.ton> [tʌn]]

quadrismegistus commented 3 months ago

I believe the latest prosodic fixes this; it's using a smarter orthographic syllabifier now.

In [1]: import prosodic as p

In [2]: from_first_sonnet = p.Text("thereby beauty's self-substantial cruel within niggarding glutton")
⎾ building text with 7 words @ 2024-08-01 09:29:45,150
￨ ⎾ tokenizing @ 2024-08-01 09:29:45,150
￨ ⎿ 0 seconds @ 2024-08-01 09:29:45,159
￨ ⎾ building stanzas @ 2024-08-01 09:29:45,159
￨ ￨ iterating stanzas: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.72s/it]
￨ ⎿ 1.7 seconds @ 2024-08-01 09:29:46,890
⎿ 1.7 seconds @ 2024-08-01 09:29:46,890

In [3]: for x in from_first_sonnet.syllables: print(x)
Syllable(ipa="'ðɛr", num=1, txt='the', is_stressed=True, is_heavy=True, is_weak=False)
Syllable(ipa="'baɪ", num=2, txt='reby', is_stressed=True, is_heavy=True, is_weak=False)
Syllable(ipa="'bjuː", num=1, txt='bea', is_stressed=True, is_heavy=True, is_strong=True, is_weak=False)
Syllable(ipa='tiz', num=2, txt="uty's", is_stressed=False, is_heavy=True, is_strong=False, is_weak=True)
Syllable(ipa="'sɛlf", num=1, txt='self', is_stressed=True, is_heavy=True)
Syllable(ipa='sʌb', num=1, txt='subs', is_stressed=False, is_heavy=True, is_strong=False, is_weak=True)
Syllable(ipa="'stæn", num=2, txt='tan', is_stressed=True, is_heavy=True, is_strong=True, is_weak=False)
Syllable(ipa='ʧəl', num=3, txt='tial', is_stressed=False, is_heavy=True, is_strong=False, is_weak=True)
Syllable(ipa="'kruː", num=1, txt='cr', is_stressed=True, is_heavy=False, is_strong=True, is_weak=False)
Syllable(ipa='əl', num=2, txt='uel', is_stressed=False, is_heavy=True, is_strong=False, is_weak=True)
Syllable(ipa='wɪ', num=1, txt='wit', is_stressed=False, is_heavy=True, is_strong=False, is_weak=True)
Syllable(ipa="'ðɪn", num=2, txt='hin', is_stressed=True, is_heavy=True, is_strong=True, is_weak=False)
Syllable(ipa="'nɪ", num=1, txt='nig', is_stressed=True, is_heavy=False, is_strong=True, is_weak=False)
Syllable(ipa='ɡə', num=2, txt='gar', is_stressed=False, is_heavy=False, is_strong=False, is_weak=True)
Syllable(ipa='dɪŋ', num=3, txt='ding', is_stressed=False, is_heavy=True, is_strong=False)
Syllable(ipa="'ɡlʌ", num=1, txt='glut', is_stressed=True, is_heavy=False, is_strong=True, is_weak=False)
Syllable(ipa='tən', num=2, txt='ton', is_stressed=False, is_heavy=True, is_strong=False, is_weak=True)

Let me know if you find any other issues, though! Closing for now but feel free to re-open or comment.

evgenykochetkov commented 2 months ago

The new syllabifier handled "self-substantial" way better, that's a huge improvement!

But some mismatches between ipa and txt still remain:

evgenykochetkov commented 2 months ago

Also, https://github.com/quadrismegistus/prosodic/issues/34#issuecomment-1048374098 describes the same problem

quadrismegistus commented 2 months ago

Thanks for pointing this out. The issue is that the phonetic/IPA syllabifier and the orthographic/text syllabifier are completely different systems. The phonetic one is based on the CMU Pronouncing Dictionary, plus espeak TTS for the IPA pronunciation of unknown words, plus syllabiphon for detecting syllable boundaries within the espeak IPA output. The orthographic one comes from nltk.tokenize.SyllableTokenizer. The former is much more accurate than the latter: many orthographic syllabifiers out there are just aimed at finding the right place to put a hyphen when a word needs to break between lines.

I wonder if we ought to try to use the former to guide the latter by massaging the orthographic syllable boundaries with the IPA output. For instance, the "s" of "subs" in self-_subs_tantial, in the orthographic output, could be moved to the next syllable (self-sub_stan_tial) given that in the IPA syllabification the "s" is in that latter syllable. Letters and phonemes don't exactly match, but (in English at least, and actually even more in Finnish I believe) they might match enough for this task.
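A minimal sketch of that idea, assuming the word already has the same number of orthographic and IPA syllables. This is not prosodic's implementation: `realign`, `ONSET_LETTERS` and `max_shift` are hypothetical names, and the phoneme-to-letter table is a small hand-picked guess rather than a real grapheme-to-phoneme model. Each boundary is nudged by up to `max_shift` letters until the next orthographic syllable starts with a letter that could plausibly spell the first phoneme of the matching IPA syllable. Vowel-initial IPA syllables are left untouched.

```python
from itertools import accumulate

# Stress marks to strip from the front of an IPA syllable.
STRESS_MARKS = "'ˈˌ"

# ASSUMPTION: rough, incomplete map from an onset phoneme to letters that
# may spell it at the start of an English syllable. Vowels are omitted on purpose.
ONSET_LETTERS = {
    "b": "b", "d": "d", "f": "fp", "ɡ": "g", "g": "g", "h": "hw",
    "k": "ckq", "l": "l", "m": "m", "n": "nk", "p": "p", "r": "rw",
    "s": "sc", "t": "t", "v": "v", "w": "w", "z": "zsx",
    "ð": "t", "θ": "t", "ʃ": "sct", "ʧ": "ct", "ʤ": "jg",
}


def realign(txt_syllables, ipa_syllables, max_shift=2):
    """Shift orthographic syllable boundaries toward the IPA syllabification."""
    if len(txt_syllables) != len(ipa_syllables):
        return list(txt_syllables)  # nothing to align against

    word = "".join(txt_syllables)
    # Original cut positions between syllables, e.g. subs|tan|tial -> [4, 7].
    cuts = list(accumulate(len(s) for s in txt_syllables))[:-1]
    # Try the original position first, then the nearest alternatives.
    shifts = sorted(range(-max_shift, max_shift + 1), key=lambda s: (abs(s), s))

    new_cuts = []
    for i, cut in enumerate(cuts):
        lo = new_cuts[-1] if new_cuts else 0                   # keep left syllable non-empty
        hi = cuts[i + 1] if i + 1 < len(cuts) else len(word)  # keep right syllable non-empty
        onset = ipa_syllables[i + 1].lstrip(STRESS_MARKS)[:1]
        letters = ONSET_LETTERS.get(onset, "")  # "" for vowels: boundary stays put
        best = cut
        for shift in shifts:
            p = cut + shift
            if lo < p < hi and word[p].lower() in letters:
                best = p
                break
        new_cuts.append(best)

    edges = [0, *new_cuts, len(word)]
    return [word[a:b] for a, b in zip(edges, edges[1:])]


if __name__ == "__main__":
    print(realign(["subs", "tan", "tial"], ["sʌb", "'stæn", "ʧəl"]))
    print(realign(["the", "reby"], ["'ðɛr", "'baɪ"]))
```

On the syllables printed earlier in this thread, this turns subs/tan/tial into sub/stan/tial, the/reby into there/by, bea/uty's into beau/ty's and wit/hin into wi/thin. It leaves cr/uel as is, because the second IPA syllable starts with a vowel, so a real fix would need some handling of vowel letters as well.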

Let me know what you think. PRs are welcome if you want to give this a try! Otherwise I'll see if I can come back to this in a week or so.

quadrismegistus / prosodic

Syllable `token` doesn't always match syllable phonemes #47