google / starlark-go

Starlark in Go: the Starlark configuration language, implemented in Go
BSD 3-Clause "New" or "Revised" License

Bug: incorrect `len()` for UTF-8 Chars #482

Closed: Starshipping closed this issue 1 year ago

Starshipping commented 1 year ago

Python:

Python 3.11.3 (main, May  3 2023, 23:19:07) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> a = "資源互換檔案格式"
>>> len(a)
8

Starlark Go:

Welcome to Starlark (go.starlark.net)
>>> a = "資源互換檔案格式"
>>> len(a)
24

adonovan commented 1 year ago

Python 3 strings are sequences of Unicode code points, of which "資源互換檔案格式" contains 8. Starlark strings, by contrast, are sequences of UTF-k code units, where k=8 in the Go implementation and k=16 in the Java implementation. In the Go implementation that string contains 24 code units (bytes), since each Hanzi has a 3-byte UTF-8 encoding. So this is working as intended.
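To make the three notions of length concrete, here is a minimal Python sketch that counts the same string as code points, as UTF-8 code units, and as UTF-16 code units. The helper name `count_units` is hypothetical and exists only for illustration; it is not part of Starlark or of either implementation.

```python
def count_units(s: str) -> dict:
    """Return the length of s under three definitions of 'length'.

    count_units is a hypothetical helper written for this illustration.
    """
    return {
        # Python 3 len(): number of Unicode code points.
        "code_points": len(s),
        # One UTF-8 code unit is one byte; this is the unit described
        # above for the Go implementation of Starlark.
        "utf8_code_units": len(s.encode("utf-8")),
        # One UTF-16 code unit is two bytes; "utf-16-le" is used so that
        # no byte-order mark is added to the encoded output.
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
    }


a = "資源互換檔案格式"
counts = count_units(a)
print(counts["code_points"])       # 8
print(counts["utf8_code_units"])   # 24
print(counts["utf16_code_units"])  # 8
```

All eight characters lie in the Basic Multilingual Plane, so each takes one UTF-16 code unit but three UTF-8 bytes, which accounts for the 8 versus 24 seen in the two REPL sessions. To count code points from within Starlark, the language spec documents a `codepoints()` string method, so something like `len(list(a.codepoints()))` should give 8, though that was not tried in this thread.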