google / starlark-go

Starlark in Go: the Starlark configuration language, implemented in Go

Clarification on testdata/bytes.star #358

Closed mahmoudimus closed 3 years ago

mahmoudimus commented 3 years ago

I'm trying to emulate the bytes type, and while going through the tests I'm trying to understand the behavior of one of the assert statements in the testdata/bytes.star file.

The line in question:

assert.eq(bytes("hello, 世界"[:-1]), b"hello, 世��")

Now, when I try to run this in Python:

Python 3.9.2 (default, Feb 19 2021, 17:09:53)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.20.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: bytes("hello, 世界", encoding='utf-8') == 'hello, 世界'.encode('utf-8')
Out[1]: True

In [2]: (bytes("hello, 世界", encoding='utf-8')[:-1]).decode('utf-8', errors='replace')
Out[2]: 'hello, 世�'

Obviously, I'm using a utf-8 encoding here, but I believe Go uses UTF-K strings with k=8 (as opposed to Java's UTF-16 😢). I'm trying to understand whether this behavior is accurate, and whether this example is indeed intended to deviate from Python.

Thanks!

EDIT:

I also wanted to ask whether the test is actually mistaken; please see below:

In [3]: (bytes("hello, 世界"[:-1], encoding='utf-8'))
Out[3]: b'hello, \xe4\xb8\x96'

vs

In [4]: (bytes("hello, 世界", encoding='utf-8'))[:-1]
Out[4]: b'hello, \xe4\xb8\x96\xe7\x95'

Did the test mean to slice the actual literal or the bytes object?

adonovan commented 3 years ago

In [2]: (bytes("hello, 世界", encoding='utf-8')[:-1]).decode('utf-8', errors='replace') Out[2]: 'hello, 世�'

Your question here, I think, is: why does Go return two U+FFFDs but Python only one, when they decode the bytes "\xe7\x95" as UTF-8? The answer is that Unicode doesn't require either behavior, but it's slightly easier to implement the variant that always advances the input stream by one byte each time decoding fails (as Go does). That's the behavior we decided to require in the Starlark spec.
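(For illustration only, not part of the original thread: a minimal Go sketch of the per-byte behavior described above, using the standard unicode/utf8 package. Decoding the truncated sequence "\xe7\x95" yields one U+FFFD per invalid byte.)

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	b := []byte("\xe7\x95") // the first two bytes of the UTF-8 encoding of 界 (e7 95 8c)
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b) // on invalid input this returns (utf8.RuneError, 1)
		fmt.Printf("U+%04X (advanced %d byte)\n", r, size)
		b = b[size:]
	}
	// Prints two U+FFFD lines: Go advances one byte per failed decode,
	// whereas Python's errors='replace' replaces the whole truncated sequence with a single U+FFFD.
}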

I also wanted to ask whether the test is actually mistaken; please see below:

In [3]: (bytes("hello, 世界"[:-1], encoding='utf-8'))
Out[3]: b'hello, \xe4\xb8\x96'

In [4]: (bytes("hello, 世界", encoding='utf-8'))[:-1]
Out[4]: b'hello, \xe4\xb8\x96\xe7\x95'

Did the test mean to slice the actual literal or the bytes object?

In Python, "hello, 世界"[:-1] means "hello, 世", because the [:-1] cuts off the last code point. But in Starlark, it cuts off the last UTF-K code unit (e.g. a byte in Go, a char in Java), yielding an invalid string ("hello, 世" plus two thirds of the encoding of 界). This is intended: the test ensures that each invalid byte in a text string is replaced by U+FFFD.
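(Again for illustration, a sketch of mine rather than the thread's: in Go, string indices are byte offsets, so the equivalent slice drops one byte of 界 and the resulting invalid bytes become U+FFFD on conversion, matching the b"hello, 世��" the test expects.)

package main

import "fmt"

func main() {
	s := "hello, 世界"
	t := s[:len(s)-1] // byte-offset slicing drops only the last byte of 界 (e7 95 8c)
	fmt.Printf("%q\n", t) // "hello, 世\xe7\x95" — no longer valid UTF-8

	// Converting through []rune replaces each invalid byte with U+FFFD,
	// so this prints "hello, 世" followed by two replacement characters.
	fmt.Printf("%q\n", string([]rune(t)))
}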

mahmoudimus commented 3 years ago

Hi @adonovan - thank you for the clarification, that makes a lot of sense!