Here is the Tibetan input: "ཨ་ར།"
What seems to happen is that when updating with the new content, the first token gets deleted here, but when the new content is segmented here, the tokenizer is not given any new content; instead, it is given the content of the first remaining token to resegment.
That is how we end up with two punctuation tokens in the end.
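The faulty flow can be reproduced with a minimal sketch. This is not the library's actual code; the names (`segment`, `buggy_update`) and the naive segmenter are illustrative assumptions, but the update logic mirrors the bug described above: the new content is ignored, and the first remaining token (the punctuation "།") is resegmented and prepended, yielding a duplicate punctuation token.

```python
TSEK = "\u0f0b"   # ་  Tibetan tsek (syllable separator)
SHAD = "\u0f0d"   # །  Tibetan shad (sentence punctuation)

def segment(text):
    """Naive stand-in segmenter: split text into WORD and PUNCT tokens."""
    tokens, buf = [], ""
    for ch in text:
        if ch == SHAD:
            if buf:
                tokens.append(("WORD", buf))
                buf = ""
            tokens.append(("PUNCT", ch))
        else:
            buf += ch
    if buf:
        tokens.append(("WORD", buf))
    return tokens

def buggy_update(tokens, new_text):
    """Model of the bug: the first token is deleted, but instead of
    segmenting new_text, the content of the first *remaining* token
    is fed back to the segmenter and prepended."""
    remaining = tokens[1:]              # first token deleted
    reseg_input = remaining[0][1]       # wrong: old punct content, not new_text
    return segment(reseg_input) + remaining

tokens = segment("ཨ་ར" + SHAD)          # [("WORD", "ཨ་ར"), ("PUNCT", "།")]
print(buggy_update(tokens, "ཨ་ར"))      # [("PUNCT", "།"), ("PUNCT", "།")]
```

Because `new_text` never reaches the segmenter, the shad is tokenized a second time, which matches the duplicated punctuation token seen in the output.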