letuananh / bela

👸 BELA - A pathway for creating and analysing multi-lingual transcripts using BELA convention and ELAN software
MIT License

Retain linguistic information after tokenize() #4

Open vicchuayh opened 1 year ago

vicchuayh commented 1 year ago

Feature suggestion: retain the linguistic information carried by bound-morpheme markers (e.g., =s, =ing, =ed, =le) and ellipses after tokenize(). This would make it possible to distinguish truncated words/sounds from bound morphemes in the lexicon.

For example:

Input: le... [stuttering] let's see jiejie=s [plural morpheme] are playing with the s... [stuttering] silly bottle=le [Tamil =le morpheme added] game

Output: jiejie s are playing with the s silly bottle le game

In the above case, the output from bela.tokenize() does not let us tell whether a lexical entry such as "s" or "le" is a speech error (a truncated word) or a bound morpheme.
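To make the request concrete, here is a minimal sketch of the behaviour being asked for. It is not BELA's implementation or API: `Token`, `tokenize_keep_markers` and the regular expression are hypothetical names introduced only for illustration, and the sketch assumes that a trailing ellipsis marks a truncated fragment and a leading `=` marks a bound morpheme, as in the example above (with the bracketed notes left out).

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str  # surface form with the marker stripped
    kind: str  # "word", "bound_morpheme", or "truncated"


# Hypothetical pattern (an assumption, not taken from BELA):
#   truncated fragment: word characters followed by "..." or the single character "…"
#   bound morpheme:     "=" followed by word characters
#   plain word:         word characters, optionally with internal apostrophes
_TOKEN_RE = re.compile(
    r"(?P<truncated>\w+)(?:\.{3}|…)"
    r"|=(?P<bound>\w+)"
    r"|(?P<word>\w+(?:'\w+)*)"
)


def tokenize_keep_markers(utterance: str) -> List[Token]:
    """Split an utterance into tokens and record what each marker meant."""
    tokens: List[Token] = []
    for match in _TOKEN_RE.finditer(utterance):
        if match.group("truncated"):
            tokens.append(Token(match.group("truncated"), "truncated"))
        elif match.group("bound"):
            tokens.append(Token(match.group("bound"), "bound_morpheme"))
        else:
            tokens.append(Token(match.group("word"), "word"))
    return tokens


utterance = "le... let's see jiejie=s are playing with the s... silly bottle=le game"
tokens = tokenize_keep_markers(utterance)

# Lexicon entries can now exclude speech errors while keeping bound morphemes.
lexicon = [t.text for t in tokens if t.kind != "truncated"]
for token in tokens:
    print(f"{token.text}\t{token.kind}")
```

With this kind of output, the two "s" tokens are no longer ambiguous: the one from jiejie=s is tagged as a bound morpheme and the one from s... is tagged as truncated. Returning the kind alongside the text, rather than dropping the markers, keeps the plain token list recoverable for callers that do not need the distinction.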