Here is the Tibetan input: "ཨ་ར།"
What seems to happen is that when updating with the new content, the first token gets deleted here, but when the new content is segmented here, the tokenizer is not given any new content; instead, it is given the content of the first remaining token to resegment.
That is how we end up with two punctuation tokens in the end.
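The faulty flow can be reproduced with a minimal sketch. This is not the library's actual code; the names (`segment`, `buggy_update`) and the naive segmenter are illustrative assumptions, but the update logic mirrors the bug described above: the new content is ignored, and the first remaining token (the punctuation "།") is resegmented and prepended, yielding a duplicate punctuation token.

```python
TSEK = "\u0f0b"   # ་  Tibetan tsek (syllable separator)
SHAD = "\u0f0d"   # །  Tibetan shad (sentence punctuation)

def segment(text):
    """Naive stand-in segmenter: split text into WORD and PUNCT tokens."""
    tokens, buf = [], ""
    for ch in text:
        if ch == SHAD:
            if buf:
                tokens.append(("WORD", buf))
                buf = ""
            tokens.append(("PUNCT", ch))
        else:
            buf += ch
    if buf:
        tokens.append(("WORD", buf))
    return tokens

def buggy_update(tokens, new_text):
    """Model of the bug: the first token is deleted, but instead of
    segmenting new_text, the content of the first *remaining* token
    is fed back to the segmenter and prepended."""
    remaining = tokens[1:]              # first token deleted
    reseg_input = remaining[0][1]       # wrong: old punct content, not new_text
    return segment(reseg_input) + remaining

tokens = segment("ཨ་ར" + SHAD)          # [("WORD", "ཨ་ར"), ("PUNCT", "།")]
print(buggy_update(tokens, "ཨ་ར"))      # [("PUNCT", "།"), ("PUNCT", "།")]
```

Because `new_text` never reaches the segmenter, the shad is tokenized a second time, which matches the duplicated punctuation token seen in the output.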