Open evgenykochetkov opened 1 year ago
I believe the latest prosodic fixes this; it's using a smarter orthographic syllabifier now.
In [1]: import prosodic as p
In [2]: from_first_sonnet = p.Text("thereby beauty's self-substantial cruel within niggarding glutton")
⎾ building text with 7 words @ 2024-08-01 09:29:45,150
│ ⎾ tokenizing @ 2024-08-01 09:29:45,150
│ ⎿ 0 seconds @ 2024-08-01 09:29:45,159
│ ⎾ building stanzas @ 2024-08-01 09:29:45,159
│ │ iterating stanzas: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.72s/it]
│ ⎿ 1.7 seconds @ 2024-08-01 09:29:46,890
⎿ 1.7 seconds @ 2024-08-01 09:29:46,890
In [3]: for x in from_first_sonnet.syllables: print(x)
Syllable(ipa="'ðɛr", num=1, txt='the', is_stressed=True, is_heavy=True, is_weak=False)
Syllable(ipa="'baɪ", num=2, txt='reby', is_stressed=True, is_heavy=True, is_weak=False)
Syllable(ipa="'bjuː", num=1, txt='bea', is_stressed=True, is_heavy=True, is_strong=True, is_weak=False)
Syllable(ipa='tiz', num=2, txt="uty's", is_stressed=False, is_heavy=True, is_strong=False, is_weak=True)
Syllable(ipa="'sɛlf", num=1, txt='self', is_stressed=True, is_heavy=True)
Syllable(ipa='sʌb', num=1, txt='subs', is_stressed=False, is_heavy=True, is_strong=False, is_weak=True)
Syllable(ipa="'stæn", num=2, txt='tan', is_stressed=True, is_heavy=True, is_strong=True, is_weak=False)
Syllable(ipa='ʧəl', num=3, txt='tial', is_stressed=False, is_heavy=True, is_strong=False, is_weak=True)
Syllable(ipa="'kruː", num=1, txt='cr', is_stressed=True, is_heavy=False, is_strong=True, is_weak=False)
Syllable(ipa='əl', num=2, txt='uel', is_stressed=False, is_heavy=True, is_strong=False, is_weak=True)
Syllable(ipa='wɪ', num=1, txt='wit', is_stressed=False, is_heavy=True, is_strong=False, is_weak=True)
Syllable(ipa="'ðɪn", num=2, txt='hin', is_stressed=True, is_heavy=True, is_strong=True, is_weak=False)
Syllable(ipa="'nɪ", num=1, txt='nig', is_stressed=True, is_heavy=False, is_strong=True, is_weak=False)
Syllable(ipa='ɡə', num=2, txt='gar', is_stressed=False, is_heavy=False, is_strong=False, is_weak=True)
Syllable(ipa='dɪŋ', num=3, txt='ding', is_stressed=False, is_heavy=True, is_strong=False)
Syllable(ipa="'ɡlʌ", num=1, txt='glut', is_stressed=True, is_heavy=False, is_strong=True, is_weak=False)
Syllable(ipa='tən', num=2, txt='ton', is_stressed=False, is_heavy=True, is_strong=False, is_weak=True)
Let me know if you find any other issues, though! Closing for now but feel free to re-open or comment.
The new syllabifier handled "self-substantial" way better, that's a huge improvement!
But some mismatches between ipa and txt still remain:
Also, https://github.com/quadrismegistus/prosodic/issues/34#issuecomment-1048374098 describes the same problem
Thanks for pointing this out. The issue is that the phonetic/IPA syllabifier and the orthographic/text syllabifier are completely different systems. The phonetic one combines the CMU pronunciation dictionary, espeak TTS (for the IPA pronunciation of unknown words), and syllabiphon (for detecting syllable boundaries within the espeak IPA output); the orthographic one comes from nltk.tokenize.SyllableTokenizer. The former is much more accurate than the latter: many orthographic syllabifiers out there are just aimed at finding the right place to put a hyphen when a word needs to break between lines.
I wonder if we ought to try to use the former to guide the latter by massaging the orthographic syllable boundaries with the IPA output. For instance, the "s" of "subs" in self-_subs_tantial, in the orthographic output, could be moved to the next syllable (self-sub_stan_tial) given that in the IPA syllabification the "s" is in that latter syllable. Letters and phonemes don't exactly match, but (in English at least, and actually even more in Finnish I believe) they might match enough for this task.
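To make the idea concrete, here is a rough sketch of what that "massaging" might look like. It is not prosodic's code: `nudge_boundaries` and `strip_stress` are hypothetical helpers, and the sketch assumes that a consonant letter and its IPA symbol are the same character (true for "s", "t", "n", not for "th"/"ð"), so it only handles the simplest cases.

```python
# Hypothetical sketch, not part of prosodic's API.
# Assumption: a boundary consonant letter and its IPA symbol are the same character.
VOWEL_LETTERS = set("aeiouy")


def strip_stress(ipa):
    """Drop leading stress marks such as ' or ˈ from an IPA syllable."""
    return ipa.lstrip("'ˈˌ")


def nudge_boundaries(txt_sylls, ipa_sylls):
    """Move a trailing consonant letter into the next orthographic syllable
    when the IPA syllabification puts that sound in the next syllable's onset."""
    if len(txt_sylls) != len(ipa_sylls):
        # Syllable counts disagree, so there is nothing to align against.
        return list(txt_sylls)
    txt = list(txt_sylls)
    ipa = [strip_stress(s) for s in ipa_sylls]
    for i in range(len(txt) - 1):
        left, right = txt[i], txt[i + 1]
        last = left[-1].lower()
        if (
            len(left) > 1
            and last not in VOWEL_LETTERS
            and ipa[i + 1].startswith(last)   # IPA: sound opens the next syllable
            and not ipa[i].endswith(last)     # IPA: sound does not close this one
        ):
            txt[i], txt[i + 1] = left[:-1], left[-1] + right
    return txt


print(nudge_boundaries(["subs", "tan", "tial"], ["sʌb", "'stæn", "ʧəl"]))
# ['sub', 'stan', 'tial']
```

This only moves one letter rightward per boundary and does nothing for cases like "wit|hin" vs. "wɪ|ðɪn", where the letters and phonemes differ; those would need a real letter-to-phoneme alignment.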
Let me know what you think. PRs are welcome if you want to give this a try! Otherwise I'll see if I can come back to this in a week or so.
Here are some examples from Shakespeare's sonnet 1: