roc-lang / roc

A fast, friendly, functional language.
https://roc-lang.org
Universal Permissive License v1.0

Str.graphemes incorrectly groups grapheme clusters #4779

Open rtfeldman opened 1 year ago

rtfeldman commented 1 year ago

To reproduce:

» "👩‍👩‍👦‍👦" |> Str.toScalars

[128105, 8205, 128105, 8205, 128102, 8205, 128102] : List U32

» "👩‍👩‍👦‍👦" |> Str.graphemes
["👩‍", "👩‍", "👦‍", "👦"] : List Str

The toScalars part is correct, but the Str.graphemes result is incorrect. It should return ["👩‍👩‍👦‍👦"] : List Str, a list with just one element.

Swift does this correctly.
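
To illustrate the expected grouping, here is a minimal Python sketch that works on the scalar values from the reproduction above. It is not Roc's implementation and not the full Unicode segmentation algorithm: the hypothetical helper `group_zwj_sequences` only keeps a ZERO WIDTH JOINER (8205, U+200D) and the scalar after it in the current cluster, and it assumes every other scalar starts a new cluster.

```python
ZWJ = 0x200D  # ZERO WIDTH JOINER, shown as 8205 in the toScalars output


def group_zwj_sequences(scalars):
    """Group code points so that ZWJ-joined runs stay in one cluster.

    Simplified assumption: only the ZWJ rule is handled. Real grapheme
    segmentation has many more rules (combining marks, Hangul, flags, ...).
    """
    clusters = []
    prev_was_zwj = False
    for cp in scalars:
        # A ZWJ attaches to the previous cluster, and so does whatever follows it.
        if clusters and (cp == ZWJ or prev_was_zwj):
            clusters[-1].append(cp)
        else:
            clusters.append([cp])
        prev_was_zwj = cp == ZWJ
    return clusters


# Scalars reported by Str.toScalars in the reproduction.
scalars = [128105, 8205, 128105, 8205, 128102, 8205, 128102]
clusters = group_zwj_sequences(scalars)
graphemes = ["".join(chr(cp) for cp in cluster) for cluster in clusters]
print(len(graphemes))
```

Under this simplified rule all seven scalars land in a single cluster, which matches the one-element list the issue expects, whereas the reported output breaks after each joiner.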

yawaramin commented 11 months ago

OCaml REPL session using the uuseg (linked to source code) library:

# #require "uuseg.string";;
# List.rev (Uuseg_string.fold_utf_8 `Grapheme_cluster (fun list segment -> segment :: list) [] "👩‍👩‍👦‍👦");;
- : string list = ["👩‍👩‍👦‍👦"]

lukewilliamboswell commented 11 months ago

For additional context on this issue: the plan is to remove Unicode text segmentation from the builtins and move it to a library at roc-lang/unicode. That is a work in progress.