segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.

Simplified Chinese model does not detect sentence boundaries correctly #51

Closed: marlon-br closed this issue 1 year ago

marlon-br commented 3 years ago

Hi,

I have tried the Simplified Chinese model on the demo page, and it seems that sentence boundary and token detection are not correct.

I have two ideas as to why that could happen:

  1. The period in Chinese is 。 (not the ASCII .)
  2. There are no white spaces between words. It might be better to use something like https://github.com/voidism/pywordseg to split the text into words as a preprocessing step (see the sketch below).

It looks like issue 2 is what causes tokens to also be detected incorrectly. I compared the results with https://github.com/voidism/pywordseg and they do not match. But I am not sure here, because I compared spaCy, pywordseg, and the Stanford Word Segmenter, and all of them produce different results.
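For illustration, idea 2 amounts to running a Chinese word segmenter and re-joining the tokens with spaces before sentence splitting. Here is a minimal sketch of that preprocessing step, using jieba as a stand-in segmenter (pywordseg would play the same role; jieba is swapped in here only because its API is widely known):

```python
import jieba  # pip install jieba

text = "我来到北京清华大学。今天天气很好。"

# Segment into words; jieba.cut returns a generator of tokens.
tokens = list(jieba.cut(text))

# Re-join with spaces so downstream tools that expect
# whitespace-delimited words can see word boundaries.
preprocessed = " ".join(tokens)
print(preprocessed)
```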

bminixhofer commented 1 year ago

Hi! Sorry for being so quiet on this library. I have been working on a major revamp, expanding support to 85 languages, switching to a new training objective without labelled data, and switching the backbone to a BERT-style model.

Chinese should work well now; there is a quantitative evaluation in our paper. FYI, there is no online demo right now, but I'm working on one.
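For anyone finding this issue later, here is a minimal sketch of segmenting Chinese with the revamped library, assuming the WtP interface described in the project README (the checkpoint name wtp-bert-mini is an assumption; substitute whichever published checkpoint you use):

```python
from wtpsplit import WtP

# Load one of the revamped multilingual checkpoints
# (name assumed from the README; adjust as needed).
wtp = WtP("wtp-bert-mini")

# Chinese input: 。 as sentence-final punctuation, no spaces between words.
text = "我来到北京清华大学。今天天气很好。"

# split() returns a list of sentence strings.
print(wtp.split(text))
# e.g. ["我来到北京清华大学。", "今天天气很好。"]
```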