Open Kristinita opened 3 years ago
Use a regex to break the input into chunks separated by punctuation, then segment each chunk and rejoin the results with the original punctuation. The punctuation provides meaningful segmentation hints, so stripping it out reduces the quality. Segmentation works best on smaller phrases anyway.
The strategy also applies to capitalization.
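The approach described above could be sketched roughly as follows. This is only an illustration, not WordSegment's own code: `segment_keep_punctuation`, `toy_segment`, and `TOY_WORDS` are hypothetical names invented for this example, and the toy segmenter exists only so the sketch runs without extra packages. It assumes the segmenter returns lowercase words whose concatenation equals the lowercased chunk, which is what lets capitalization be restored by slicing the original text.

```python
import re
from typing import Callable, List

# Hypothetical toy vocabulary, used only so this sketch is self-contained.
TOY_WORDS = {"this", "is", "a", "test", "it", "works", "well"}


def toy_segment(text: str) -> List[str]:
    """Stand-in segmenter: greedy longest match against TOY_WORDS, lowercase output."""
    text = text.lower()
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_WORDS:
                break
        else:
            j = i + 1  # no known word starts here: emit a single character
        words.append(text[i:j])
        i = j
    return words


def segment_keep_punctuation(text: str, segment: Callable[[str], List[str]]) -> str:
    """Segment each alphanumeric chunk, keeping punctuation and capitalization."""
    # The capturing group makes re.split keep the separators: even indices
    # hold alphanumeric chunks, odd indices hold punctuation/whitespace runs.
    parts = re.split(r"([^A-Za-z0-9]+)", text)
    out = []
    for index, part in enumerate(parts):
        if index % 2 == 1 or not part:
            out.append(part)  # separator or empty chunk: copy through unchanged
            continue
        words = segment(part)
        if "".join(words) != part.lower():
            out.append(part)  # segmenter altered the characters: keep chunk as-is
            continue
        # Restore capitalization by slicing the original chunk by word lengths.
        pos, restored = 0, []
        for word in words:
            restored.append(part[pos:pos + len(word)])
            pos += len(word)
        out.append(" ".join(restored))
    return "".join(out)


print(segment_keep_punctuation("Thisisatest.Itworkswell!", toy_segment))
# → This is a test.It works well!
```

With the real library, the same wrapper should work by calling `wordsegment.load()` once and passing `wordsegment.segment` as the `segment` argument. Note that the sketch only preserves the punctuation that was already there; whether to insert a space after a period or comma is a separate decision.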
1. Summary
It would be nice if WordSegment, at least in CLI mode, had an option to preserve all punctuation marks:
.
,,
,’
and so on.
2. Problem
Try copying and pasting text from this article and this book.
The article:
The book:
Yes, ideally, of course, it would be nice to add a proper text layer to the PDF, but I’m not the one making these articles and books. In my experience, a text layer without spaces like this is a common problem. Separating the words by hand is routine work that can be time-consuming.
3. Behavior
3.1. Current
CLI usage:
Punctuation marks are stripped. Users have to do a lot of routine work to get them back.
3.2. Expected behavior
Ordinary English texts:
Thanks.