rivo / uniseg

Unicode Text Segmentation, Word Wrapping, and String Width Calculation in Go
MIT License
585 stars 61 forks source link

FirstGraphemeCluster does not need to preserve state across grapheme clusters #58

Open delthas opened 4 months ago

delthas commented 4 months ago

Hi,

The FirstGraphemeCluster function can be used to iteratively extract grapheme clusters from a string (without additional allocations). The function mentions that a state should be passed (initially set to -1), is then returned and should be passed again on the next call, in order to preserve some state across calls of this function.

This state contains the current grapheme cluster parser state, and the property of the next codepoint.

It did not make sense to me that decoding grapheme cluster depended on earlier state: I'd expected that each grapheme cluster was fully independent.

To test this, I took the full test case for grapheme cluster boundary processing of Unicode 14.0 (the version supported by the library), and ran a simple test by calling FirstGraphemeClusterInString and comparing the results with the spec:

This would mean that even when not preserving any state, the actual grapheme clusters that are returned are always the same.

So, from my understanding, there shouldn't be the need for any state at all between calls of the library; and the state parameter can be fully deprecated.

Full test case (see the TODO line), try running in the Go playground (prints All tests passed): https://gist.github.com/delthas/0965a2c198b3a114fbb6706435786b73