rivo / uniseg

Unicode Text Segmentation, Word Wrapping, and String Width Calculation in Go
MIT License

Result difference between uniseg.GraphemeClusterCount and PHP grapheme_strlen #30

Closed jiwandono closed 1 year ago

jiwandono commented 1 year ago

Hello!

I observed the following difference for the sequences below. To be honest, I'm not sure which result is correct, but could you help confirm whether this result is expected from the uniseg library?

Thank you!

--

Golang with uniseg.GraphemeClusterCount

package main

import (
    "fmt"

    "github.com/rivo/uniseg" // v0.4.3
)

func main() {
    fmt.Println(uniseg.GraphemeClusterCount("\r\n\uFE0E"))
    fmt.Println(uniseg.GraphemeClusterCount("\n\uFE0E"))
}

Output:

1
2

https://goplay.tools/snippet/WBIJQfKZs7g

PHP 8.0.28 with grapheme_strlen

<?php
printf("%d\n", grapheme_strlen("\r\n\u{FE0E}"));
printf("%d\n", grapheme_strlen("\n\u{FE0E}"));

Output:

2
2

https://onlinephp.io/c/2cb86

jiwandono commented 1 year ago

In Python, with the grapheme package (https://pypi.org/project/grapheme/):

>>> grapheme.length('\r\n\U0000FE0E')
2
>>> grapheme.length('\n\U0000FE0E')
2

rivo commented 1 year ago

Good catch! One of the transitions in the state machine was incorrect, which caused rule GB9 (no break before Extend or ZWJ) to be preferred over GB4 (break after CR, LF, or Control). This should be fixed now.
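
To illustrate the rule ordering described above, here is a minimal sketch that uses only the Go standard library. It is not uniseg's state machine: `classify`, `breakBetween`, and `countClusters` are hypothetical helpers, and the character classes are rough approximations (categories Mn/Me plus ZWJ stand in for Extend and ZWJ, category Cc stands in for Control). It covers only GB3, GB4, GB5, GB9, and the default rule, and checks them in the order the standard lists them, so GB4 is tested before GB9.

```go
package main

import (
	"fmt"
	"unicode"
)

// class is a heavily reduced stand-in for the Grapheme_Cluster_Break property.
type class int

const (
	other class = iota
	cr
	lf
	control
	extend
)

// classify approximates the property lookup with stdlib tables only.
// Assumption: Mn/Me marks and ZWJ (U+200D) stand in for Extend/ZWJ, and
// category Cc stands in for Control. The real property tables are larger.
func classify(r rune) class {
	switch {
	case r == '\r':
		return cr
	case r == '\n':
		return lf
	case unicode.Is(unicode.Cc, r):
		return control
	case unicode.In(r, unicode.Mn, unicode.Me) || r == '\u200D':
		return extend
	}
	return other
}

// breakBetween checks the rules in order, so an earlier rule (GB4)
// takes precedence over a later one (GB9).
func breakBetween(prev, next class) bool {
	switch {
	case prev == cr && next == lf: // GB3: no break between CR and LF
		return false
	case prev == cr || prev == lf || prev == control: // GB4: break after controls
		return true
	case next == cr || next == lf || next == control: // GB5: break before controls
		return true
	case next == extend: // GB9: no break before Extend or ZWJ
		return false
	}
	return true // default: break everywhere else
}

// countClusters is a hypothetical helper, not part of uniseg.
func countClusters(s string) int {
	count := 0
	prev := other
	for i, r := range s {
		c := classify(r)
		if i == 0 || breakBetween(prev, c) {
			count++
		}
		prev = c
	}
	return count
}

func main() {
	fmt.Println(countClusters("\r\n\uFE0E"))
	fmt.Println(countClusters("\n\uFE0E"))
}
```

With this ordering, both inputs print 2, matching the PHP and Python results above: the variation selector U+FE0E follows a line break, so GB4 forces a boundary before GB9 is ever considered.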

It's interesting that the official Unicode test cases do not include this combination. (It's not a typical string found in the wild but they should still include this one.)