segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.

Simplified Chinese model does not detect sentence boundaries correctly #51

Closed: marlon-br closed this issue 1 year ago

marlon-br commented 3 years ago

Hi,

I have tried the Simplified Chinese model on the demo page, and it seems that sentence boundary and token detection are not correct.

I have two ideas as to why that could happen:

  1. The period in Chinese is 。 (not the ASCII .)
  2. There are no white spaces between words. It might be better to use something like https://github.com/voidism/pywordseg to split the text into words as a preprocessing step (see the sketch below).

It looks like issue 2 is what causes tokens to also be detected incorrectly. I compared the results with https://github.com/voidism/pywordseg and they do not match. But I am not sure here, because I compared spaCy, pywordseg, and the Stanford Word Segmenter, and all of them produce different results.
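For illustration, idea 2 amounts to running a Chinese word segmenter and re-joining the tokens with spaces before sentence splitting. Here is a minimal sketch of that preprocessing step, using jieba as a stand-in segmenter (pywordseg would play the same role; jieba is swapped in here only because its API is widely known):

```python
import jieba  # pip install jieba

text = "我来到北京清华大学。今天天气很好。"

# Segment into words; jieba.cut returns a generator of tokens.
tokens = list(jieba.cut(text))

# Re-join with spaces so downstream tools that expect
# whitespace-delimited words can see word boundaries.
preprocessed = " ".join(tokens)
print(preprocessed)
```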

bminixhofer commented 1 year ago

Hi! Sorry for being so quiet on this library. I have been working on a major revamp, expanding support to 85 languages, switching to a new training objective without labelled data, and switching the backbone to a BERT-style model.

Chinese should work well now; there is a quantitative evaluation in our paper. FYI, there is no online demo right now, but I'm working on one.
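For anyone finding this issue later, here is a minimal sketch of segmenting Chinese with the revamped library, assuming the WtP interface described in the project README (the checkpoint name wtp-bert-mini is an assumption; substitute whichever published checkpoint you use):

```python
from wtpsplit import WtP

# Load one of the revamped multilingual checkpoints
# (name assumed from the README; adjust as needed).
wtp = WtP("wtp-bert-mini")

# Chinese input: 。 as sentence-final punctuation, no spaces between words.
text = "我来到北京清华大学。今天天气很好。"

# split() returns a list of sentence strings.
print(wtp.split(text))
# e.g. ["我来到北京清华大学。", "今天天气很好。"]
```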