quadrismegistus / prosodic

Prosodic: a metrical-phonological parser, written in Python. For English and Finnish, with flexible language support.
http://quadrismegistus.github.io/prosodic/
GNU General Public License v3.0

adding language: Esperanto #36

Open niru86 opened 2 years ago

niru86 commented 2 years ago

I'm trying to adapt prosodic to Esperanto. Its stress is always paroxytonic, e.g. abelo (en. bee) [a.'be.lo], but in poetry the final -o can be elided, and the word then becomes oxytonic: abel'.
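The stress rule described above can be sketched in a few lines. This is only an illustration and not part of prosodic: the function name `stressed_vowel_index` is hypothetical, and the sketch assumes that only a, e, i, o, u count as syllable nuclei (ŭ and j are treated as glides) and that a trailing apostrophe marks elision.

```python
from typing import Optional

# Assumption: only these five letters form syllable nuclei in Esperanto.
VOWELS = set("aeiou")


def stressed_vowel_index(word: str) -> Optional[int]:
    """Return the 0-based index, counted among the vowels, of the stressed one.

    Hypothetical helper: penultimate vowel by default, last vowel when the
    word is elided (ends in an apostrophe) or has a single vowel.
    Returns None when the word has no vowel at all (e.g. "l'").
    """
    w = word.strip().lower()
    elided = w.endswith(("'", "\u2019"))  # straight or typographic apostrophe
    n_vowels = sum(ch in VOWELS for ch in w)
    if n_vowels == 0:
        return None
    if elided or n_vowels == 1:
        return n_vowels - 1
    return n_vowels - 2


for w in ["abelo", "abel'", "hodiaŭ"]:
    print(w, stressed_vowel_index(w))
```

Treating every monosyllable as stressed is a simplification; function words such as la are usually unstressed in running verse.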

Esperanto is as phonemic as Finnish, so I decided to use the orth feature, but I'm stuck on LANG_stress.py because I don't understand its code :( Could you help me? I want to use prosodic for my MA research.

quadrismegistus commented 4 months ago

Hi there, are you still interested in working on this? If so, I'd be happy to collaborate.

We can model an Esperanto parser on the FinnishLanguage object in finnish.py:

class FinnishLanguage(Language):
    pronunciation_dictionary_filename = os.path.join(PATH_DICTS,'en','english.tsv')
    lang = 'fi'
    cache_fn = 'finnish_wordtypes'

    @cache
    def get(self, token):
        token=token.strip()
        Annotation = make_annotation(token)
        syllables=[]
        wordbroken=False
        for ij in range(len(Annotation.syllables)):
            try:
                sylldat=Annotation.split_sylls[ij]
            except IndexError:
                sylldat=["","",""]

            syllStr=""
            onsetStr=sylldat[0].strip().replace("'","").lower()
            nucleusStr=sylldat[1].strip().replace("'","").lower()
            codaStr=sylldat[2].strip().replace("'","").lower()

            # Map each syllable part to phonemes; fall back to letter-by-letter lookup
            for x in [onsetStr,nucleusStr,codaStr]:
                x=x.strip()
                if not x: continue
                if x not in orth2phon:
                    for y in x:
                        y=y.strip()
                        if not y: continue
                        if y not in orth2phon:
                            wordbroken=True  # unknown letter: no phoneme appended
                        else:
                            syllStr+="".join(orth2phon[y])
                else:
                    syllStr+="".join(orth2phon[x])
            syllables.append(syllStr)

        wordforms=[]
        sylls_text=[syll for syll in Annotation.syllables]
        for stress in Annotation.stresses:
            sylls_ipa = [stress2stroke[stress[i]]+syllables[i] for i in range(len(syllables))]
            wf=WordForm(
                token, 
                sylls_ipa=sylls_ipa, 
                sylls_text=sylls_text,
            )
            wordforms.append(wf)
        wordtype = WordType(token, children=wordforms, lang=self.lang)
        return wordtype

All we need is a .get(token) method that can take an arbitrary word string and return a WordType object composed of the syllabified data (phonemes + orthography).
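As a rough sketch of the data such a method would have to produce for Esperanto, the code below builds the two lists that the Finnish code passes to WordForm (`sylls_text` and `sylls_ipa`). Everything here is an assumption for illustration: the names `ORTH2IPA`, `syllabify` and `esperanto_syllables` are hypothetical, the letter-to-IPA table is simplified, and the syllabifier uses a naive heuristic (the last consonant before a vowel opens the next syllable). It does not import or call prosodic.

```python
import unicodedata

VOWELS = set("aeiou")  # assumption: ŭ and j never form a nucleus

# Simplified, hypothetical letter-to-IPA table (one letter, one phoneme).
ORTH2IPA = {
    "a": "a", "b": "b", "c": "ts", "ĉ": "tʃ", "d": "d", "e": "e", "f": "f",
    "g": "ɡ", "ĝ": "dʒ", "h": "h", "ĥ": "x", "i": "i", "j": "j", "ĵ": "ʒ",
    "k": "k", "l": "l", "m": "m", "n": "n", "o": "o", "p": "p", "r": "r",
    "s": "s", "ŝ": "ʃ", "t": "t", "u": "u", "ŭ": "w", "v": "v", "z": "z",
}


def syllabify(word):
    """Split a word into orthographic syllables with a naive heuristic."""
    word = unicodedata.normalize("NFC", word).lower()
    letters = [ch for ch in word if ch in ORTH2IPA]  # drops apostrophes etc.
    nuclei = [i for i, ch in enumerate(letters) if ch in VOWELS]
    if not nuclei:
        return ["".join(letters)] if letters else []
    bounds = [0]
    for prev, nxt in zip(nuclei, nuclei[1:]):
        # Adjacent vowels split between them; otherwise the last consonant
        # before the next vowel becomes that syllable's onset.
        bounds.append(nxt if nxt == prev + 1 else nxt - 1)
    bounds.append(len(letters))
    return ["".join(letters[a:b]) for a, b in zip(bounds, bounds[1:])]


def esperanto_syllables(token):
    """Return (sylls_text, sylls_ipa), marking primary stress with "'"."""
    sylls_text = syllabify(token)
    elided = token.strip().endswith(("'", "\u2019"))
    n = len(sylls_text)
    has_nucleus = any(ch in VOWELS for syll in sylls_text for ch in syll)
    if not has_nucleus:
        stressed = None
    elif elided or n == 1:
        stressed = n - 1  # elision leaves the stress on the last syllable
    else:
        stressed = n - 2  # regular penultimate stress
    sylls_ipa = []
    for i, syll in enumerate(sylls_text):
        ipa = "".join(ORTH2IPA[ch] for ch in syll)
        sylls_ipa.append(("'" if i == stressed else "") + ipa)
    return sylls_text, sylls_ipa


print(esperanto_syllables("abelo"))
print(esperanto_syllables("abel'"))
```

The heuristic splits clusters more coarsely than a linguist would (patrino comes out as pat.ri.no rather than pa.tri.no), so a real implementation would want a proper onset inventory.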

It then should work like this:

In [10]: from prosodic.langs.finnish import Finnish

In [11]: word = Finnish().get('kalevala')

In [12]: for syll in word.syllables:
    ...:     print(syll)
    ...: 
Syllable(ipa="'kɑ", num=1, txt='ka', is_stressed=True, is_heavy=False, is_strong=True, is_weak=False)
Syllable(ipa='le', num=2, txt='le', is_stressed=False, is_heavy=False, is_strong=False, is_weak=True)
Syllable(ipa='`vɑ', num=3, txt='va', is_stressed=True, is_heavy=False, is_strong=True, is_weak=False)
Syllable(ipa='lɑ', num=4, txt='la', is_stressed=False, is_heavy=False, is_strong=False, is_weak=True)

Let me know if you have thoughts. It's great that Esperanto is rule-based in its stress: seems doable to incorporate!