wangkuiyi / recordio

Apache License 2.0

Data corruption when writing non-ascii text #30

Closed zou000 closed 5 years ago

zou000 commented 5 years ago
>>> w = recordio.Writer('/tmp/tt')
>>> w.write("蚂蚁")
>>> w.close()
>>> s = recordio.Scanner('/tmp/tt')
>>> s.record()
b'\xe8\x9a'

To avoid this category of problems, we should only support writing binary (bytes) records and rely on the user to do the correct encoding/decoding.

wangkuiyi commented 5 years ago

Is the crash due to the lack of the b prefix?

recordio.Write(b'世界你好')

zou000 commented 5 years ago

Is the crash due to the lack of the b prefix?

recordio.Write(b'世界你好')

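It is crashing because the length of the string differs from the length of its encoded bytes: "蚂蚁" is 2 characters but 6 bytes in UTF-8, which is consistent with only the first 2 bytes coming back from the scanner. Also, b'世界你好' is not a valid Python expression.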

>>> b'世界你好'
  File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.
>>>
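
The length mismatch described above can be reproduced without recordio at all. This is only a sketch: it assumes the writer used the character count of the str as the payload length, which matches the b'\xe8\x9a' seen in the report but is not confirmed against the library source. The variable names are illustrative.

```python
text = "蚂蚁"
encoded = text.encode("utf-8")

char_count = len(text)      # number of Unicode code points in the str
byte_count = len(encoded)   # number of bytes after UTF-8 encoding

# Assumption: the payload was cut to the character count instead of the
# byte count, so only the first char_count bytes survive.
truncated = encoded[:char_count]

print(char_count, byte_count)  # 2 6
print(truncated)               # b'\xe8\x9a'
```

The truncated value is the first two bytes of the three-byte UTF-8 sequence for "蚂", so it cannot be decoded back into text.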

wangkuiyi commented 5 years ago

I see. Then if I am going to write a UTF-8 string as a record, what am I supposed to do?

zou000 commented 5 years ago

I see. Then if I am going to write a UTF-8 string as a record, what am I supposed to do?

You need to explicitly encode when writing and decode when reading. I added examples to the test file.

w.write("蚂蚁".encode())
s.record().decode()
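
A self-contained sketch of the same round trip, for readers without recordio installed. The write_record and read_record helpers below are hypothetical stand-ins, not the recordio API. They use a simple length-prefixed layout that is assumed only for illustration, and they reject str input in the spirit of the bytes-only proposal earlier in this thread.

```python
import io
import struct


def write_record(stream, payload):
    """Hypothetical stand-in for a bytes-only writer (not recordio's API)."""
    if not isinstance(payload, (bytes, bytearray)):
        raise TypeError("records must be bytes; call .encode() on str first")
    # The prefix stores the length in bytes, never in characters.
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)


def read_record(stream):
    """Hypothetical stand-in for a scanner; returns the raw bytes."""
    (size,) = struct.unpack("<I", stream.read(4))
    return stream.read(size)


buf = io.BytesIO()
write_record(buf, "蚂蚁".encode("utf-8"))   # encode explicitly on write
buf.seek(0)
restored = read_record(buf).decode("utf-8")  # decode explicitly on read
print(restored)
```

Spelling out "utf-8" on both sides is a design choice made here for clarity; str.encode() and bytes.decode() already default to UTF-8 in Python 3.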