tc39 / proposal-intl-segmenter

Unicode text segmentation for ECMAScript
https://tc39.github.io/proposal-intl-segmenter/

Punctuation in the word segmenter #137

Open my2iu opened 3 years ago

my2iu commented 3 years ago

I’m trying to build a line-breaking algorithm on top of the word segmenter so that I can lay out text in paragraphs in SVG. The current Intl.Segmenter puts words and punctuation into separate segments, so “Who? Why?” becomes five segments: “Who”, “?”, “ ”, “Why”, and “?”.
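For reference, here is a minimal sketch that reproduces the segmentation described above. It assumes a runtime with Intl.Segmenter support; the "en" locale and the variable names are just choices for illustration.

```javascript
// Segment a string at word granularity and collect each segment
// together with its isWordLike flag.
const segmenter = new Intl.Segmenter("en", { granularity: "word" });

const segments = Array.from(segmenter.segment("Who? Why?"), (s) => ({
  segment: s.segment,
  isWordLike: s.isWordLike,
}));

// Words, punctuation, and whitespace each come back as separate entries;
// only the words have isWordLike set to true.
console.log(segments);
```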

When laying out text in paragraphs, I usually want the punctuation to stay glued to the nearest word, but the segmenter doesn’t return enough information to do this. It might be nice if the segmenter had an option to include punctuation with words during segmentation, or if the segment iterator returned additional information beyond “isWordLike”. Perhaps “isWhitespace” and/or “isPunctuation” would be enough, but I’m not sure.
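In the meantime, something like the following might serve as a workaround. It is only a rough sketch, not part of the proposal: glueChunks is a hypothetical helper, and the rule it uses (whitespace ends a chunk, a second word-like segment starts a new one, everything else sticks to the current chunk) is an assumed heuristic rather than a real line-breaking algorithm such as UAX #14.

```javascript
// Hypothetical helper: group word-segmenter output into chunks that a
// layout routine could treat as unbreakable units.
function glueChunks(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const chunks = [];
  let current = "";
  let hasWord = false; // true once `current` contains a word-like segment

  for (const { segment, isWordLike } of segmenter.segment(text)) {
    if (/^\s+$/u.test(segment)) {
      // Whitespace closes the current chunk and is itself dropped.
      if (current) chunks.push(current);
      current = "";
      hasWord = false;
    } else if (isWordLike && hasWord) {
      // A second word with no whitespace in between (e.g. CJK text):
      // close the current chunk and start a new one with this word.
      chunks.push(current);
      current = segment;
    } else {
      // Punctuation, or the first word of a chunk: glue it on.
      current += segment;
      if (isWordLike) hasWord = true;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

console.log(glueChunks("Who? Why?")); // [ 'Who?', 'Why?' ]
```

This discards the whitespace, so a caller would need to reinsert spaces between chunks on the same line. It also has to guess at whitespace with a regular expression, which is exactly the kind of information that an “isWhitespace” or “isPunctuation” flag would make unnecessary.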