Open cgilling opened 8 years ago
I have a segment package planned that would provide an API for defining any kind of segmentation. The advantage of a single API for grapheme, word, line, sentence, etc. breaking and segmentation is that it promotes reuse of sometimes complicated code.
It may be a while before this is done. In the meantime, however, you can already approximate grapheme cluster iteration using "golang.org/x/text/unicode/norm".Iter. Normalization segments are not exactly the same as grapheme clusters, but they are sufficiently close for many applications.
See also #17256
Because the normalization package didn't do the trick in many cases, I went ahead and implemented grapheme cluster segmentation in the following package:
https://github.com/rivo/uniseg
It passes the grapheme cluster break test cases so I'm fairly confident that it works as expected. But since it's a new project, I appreciate any bug reports.
I might add Word Boundaries and Sentence Boundaries, too, at some point. But for now, it's not my main focus.
I don't know if there's any interest in moving this to x/text at some point. I'm open to that, but I'd like to know the effort and responsibilities that would come with it. Get in touch if you want to push this forward.
@mpvl I've needed an implementation of this for a project recently and have been considering writing up a design document for it. However, it sounds like you already have a more general-purpose API in mind. Would you have the time to write that up and post it somewhere? If you aren't planning an implementation in the immediate future, it's possible that I'll be writing one anyway, and I'd much rather write something that stands a chance of eventually being upstreamed. Thanks!
Hi, I'm in the middle of implementing support for iterating over grapheme clusters in a project that I am working on, and it seems like something that would be a good fit for golang.org/x/text. I wanted to reach out and see how much interest there would be in this and whether I should work on making something that would fit into this project. I was thinking the interface could be somewhat like this (naming just a stand-in for now, not a big fan of the name decode):

I didn't want to go through the whole proposal process until I get an idea of whether there might be interest in this. I hope this is the right forum for this; if not, I'd appreciate being pointed to the right place.
Thanks