Retain linguistic information after tokenize()

Feature request: retain the linguistic information carried by bound-morpheme markers (e.g., =s, =ing, =ed, =le) and by ellipses after tokenization. This would make it possible to tell truncated words/sounds apart from bound morphemes in the lexicon produced after tokenize().

For example (bracketed notes are annotations, not part of the transcript):

Input: le... [stuttering] let's see jiejie=s [plural morpheme] are playing with the s... [stuttering] silly bottle=le [Tamil =le morpheme added] game

Output: jiejie s are playing with the s silly bottle le game

In the above case, the output from bela.tokenize() does not let us tell whether a lexical entry is a speech error or a bound morpheme: the "s" from jiejie=s and the "s" from the truncated s... both end up as the same token "s".
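To illustrate the kind of output that would help, here is a minimal standalone sketch. It does not use or describe bela's actual API: `Token`, `tokenize_with_marks`, and the three kind labels are hypothetical names made up for this example. It assumes that truncations are written with three ASCII dots and that bound morphemes are attached with `=`.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str
    kind: str  # "truncated", "morpheme", or "word" (hypothetical labels)


# Order matters: try "xx..." first, then "=xx", then a plain word.
_TOKEN_RE = re.compile(
    r"(?P<truncated>\w+)\.\.\."      # truncated word/sound, e.g. "le..."
    r"|=(?P<morpheme>\w+)"           # bound morpheme, e.g. "=s", "=le"
    r"|(?P<word>\w+(?:'\w+)*)"       # ordinary word, apostrophes allowed
)


def tokenize_with_marks(utterance: str) -> List[Token]:
    """Split an utterance into tokens, keeping the kind of each token."""
    tokens = []
    for match in _TOKEN_RE.finditer(utterance):
        kind = match.lastgroup  # name of the alternative that matched
        tokens.append(Token(match.group(kind), kind))
    return tokens


if __name__ == "__main__":
    text = ("le... let's see jiejie=s are playing with the "
            "s... silly bottle=le game")
    for token in tokenize_with_marks(text):
        print(token.form, token.kind)
```

With output along these lines, the two "s" entries stay distinct in the lexicon (one tagged as a morpheme, one as a truncation), and the same goes for the two "le" entries. Whether the kind should be returned as a separate field or kept as a marker on the token string is a design choice left open here.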

letuananh / bela, issue #4