golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

x/text: add grapheme cluster iteration #14820

Open cgilling opened 8 years ago

cgilling commented 8 years ago

Hi, I'm in the middle of implementing support for iterating over grapheme clusters in a project I'm working on, and it seems like something that would be a good fit for golang.org/x/text. I wanted to reach out to see how much interest there would be in this and whether I should work on making something that would fit into that project. I was thinking the interface could look something like this (the naming is just a stand-in for now; I'm not a big fan of the name Decode):

package grapheme

// Decode reads the first grapheme cluster out of s and returns it. To get the length of the
// grapheme cluster in bytes, simply take the len() of the return value.
func Decode(s string) string
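
For illustration only, here is a minimal sketch of how a function with this signature could be used to iterate over a string. The body of Decode below is an assumption, not a proposed implementation: it uses only the standard library and treats a cluster as one rune plus any following combining marks, which is far simpler than the full Unicode segmentation rules (UAX #29). The helper name Clusters is hypothetical as well.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// Decode returns an approximation of the first grapheme cluster of s:
// one rune followed by any combining marks (Unicode category M).
// This is a deliberate simplification, NOT the full UAX #29 algorithm:
// it ignores Hangul syllables, emoji ZWJ sequences, regional indicator
// pairs, CRLF, and other rules.
func Decode(s string) string {
	if s == "" {
		return ""
	}
	// Always consume the first rune, even if it is itself a mark.
	_, n := utf8.DecodeRuneInString(s)
	for n < len(s) {
		r, size := utf8.DecodeRuneInString(s[n:])
		if !unicode.IsMark(r) {
			break
		}
		n += size
	}
	return s[:n]
}

// Clusters (hypothetical helper) splits s by calling Decode repeatedly,
// advancing by the byte length of each returned cluster.
func Clusters(s string) []string {
	var out []string
	for len(s) > 0 {
		c := Decode(s)
		out = append(out, c)
		s = s[len(c):]
	}
	return out
}

func main() {
	// "e" followed by U+0301 COMBINING ACUTE ACCENT is two runes
	// but should be treated as a single cluster.
	for _, c := range Clusters("e\u0301a") {
		fmt.Printf("%q (%d bytes)\n", c, len(c))
	}
}
```

Returning a string keeps the API small: the caller advances through the input by slicing off len() bytes of each result, as the loop in Clusters does.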

I didn't want to go through the whole proposal process until I get an idea of whether there might be interest for this. I hope this is the right forum for this, if not, I'd appreciate being pointed to the right place.

Thanks

mpvl commented 8 years ago

I have a segment package planned that would provide an API for defining any kind of segmentation. The advantage of a single API for grapheme, word, line, sentence, etc. breaking and segmentation is that it promotes reuse of sometimes complicated code.

It may be a while before this is done. In the meantime, however, you can already approximate grapheme cluster iteration using "golang.org/x/text/unicode/norm".Iter. Normalization segments are not entirely the same as grapheme clusters, but they are sufficiently close for many applications.

SamWhited commented 8 years ago

See also #17256

rivo commented 5 years ago

Because the normalization package didn't do the trick in many cases, I went ahead and implemented grapheme cluster segmentation in the following package:

https://github.com/rivo/uniseg

It passes the grapheme cluster break test cases, so I'm fairly confident that it works as expected. But since it's a new project, I'd appreciate any bug reports.

I might add Word Boundaries and Sentence Boundaries, too, at some point. But for now, it's not my main focus.

I don't know if there's any interest in moving this to x/text at some point. I'm open to that, but I'd like to know what effort and responsibilities would come with it. Get in touch if you want to push this forward.

SamWhited commented 3 years ago

@mpvl I've been needing an implementation of this for a project recently and have been considering writing up a design document for it. However, it sounds like you've got a more general purpose API in mind already. Would you have the time to write that up and post it somewhere? If you aren't planning an implementation in the immediate future it's possible that I'll be writing one anyways, and I'd much rather write something that stands a chance of eventually being upstreamed. Thanks!