Up till now ogórek has worked relatively well for me, but experience gained
with ZODB/go and Wendelin.core highlighted two problems with strings:
1) loading, then re-saving string data is generally not an identity operation, and
2) there is no way to distinguish binary data saved via py2 str from
unicode strings saved via either py2 unicode or py3 str.
Let me explain those problems:
1) Loading, then re-saving string data is generally not an identity operation
On decoding ogórek currently loads both byte-string (STRING opcodes) and
unicode-string (UNICODE opcodes) data into the same Go type string. And
on encoding that string type is encoded as STRING for protocol <= 2 and
as UNICODE for protocol >= 3. It was me who implemented that encoding
logic in 2018 in e7d96969 (encoder: Fix string wrt protocol version),
where in particular I wrote:
- if protocol >= 3 we have to emit the string as unicode pickle object
the same way as Python3 does. If we don't do - Python3 won't be
generally able to load our pickle ...
with the idea that protocol=3 always means that the pickle is intended
to be for Python3.
But there I missed that zodbpickle can use and generate pickles with
protocol=3 even on py2, and that ZODB/py2 has actually been using this
protocol=3 mode since https://github.com/zopefoundation/ZODB/commit/12ee41c4
authored in that same year 2018.
So there can be pickles saved under protocol=3 that contain both STRING
and UNICODE opcodes, and when ogórek sees those it loads both the STRING
and the UNICODE data as just string, losing the information about which
variant each particular value was. Then on encoding the data is saved as
all UNICODE, or all STRING, thus breaking the decode/encode=identity
property.
This breakage is there even with plain pickle and protocol=2: when both
STRING and UNICODE are in the database, ogórek loads them both as the
same Go type string, and saving back under the same protocol=2 emits
all STRING, so unicode objects silently become str on resave without any
intended change.
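
To make the breakage concrete, below is a minimal sketch of such a round
trip. It is only an illustration: it assumes the existing NewDecoder,
NewEncoderWithConfig and EncoderConfig.Protocol API of
github.com/kisielk/og-rek, and the pickle bytes are hand-written to
represent the tuple ('abc', u'def') with one STRING and one UNICODE
object:

package main

import (
	"bytes"
	"fmt"

	pickle "github.com/kisielk/og-rek"
)

func main() {
	// Hand-written protocol-0 pickle of ('abc', u'def'):
	// 'abc' comes via the STRING opcode (py2 str), u'def' via UNICODE.
	raw := []byte("(S'abc'\nVdef\nt.")

	v, err := pickle.NewDecoder(bytes.NewReader(raw)).Decode()
	if err != nil {
		panic(err)
	}
	// Both items load as plain Go string - the STRING/UNICODE
	// distinction is already gone here.
	fmt.Printf("%#v\n", v) // ogórek.Tuple{"abc", "def"}

	// Re-saving under protocol=2 emits both items as STRING: the object
	// that was py2 unicode silently becomes py2 str.
	buf := &bytes.Buffer{}
	e := pickle.NewEncoderWithConfig(buf, &pickle.EncoderConfig{Protocol: 2})
	if err := e.Encode(v); err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", buf.Bytes())
}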
2) There is no way to distinguish binary data saved via py2 str from unicode strings saved via either py2 unicode or py3 str
Continuing the above example of a py2 database with both STRING and UNICODE
opcodes present, there is currently no way for the application to
distinguish those from each other. In other words, the application cannot
tell binary data coming from the py2 / protocol <= 2 era apart from
unicode text.
The latter problem I hit for real: with Wendelin.core we have lots of
data saved from under Python2 as just str. And the Go part of
Wendelin.core, upon loading a block of data, wants to accept only binary -
either bytes from py3 or a bytestring from py2, but not unicode, because
it indicates a mistake if e.g. a ZBlk object comes with unicode data:

https://lab.nexedi.com/nexedi/wendelin.core/-/blob/07087ec8/bigfile/file_zodb.py#L267-300
https://lab.nexedi.com/nexedi/wendelin.core/-/blob/07087ec8/wcfs/internal/zdata/zblk.go#L31-32
https://lab.nexedi.com/nexedi/wendelin.core/-/blob/07087ec8/wcfs/internal/zdata/zblk.go#L98-107

But there is currently no way to distinguish whether it was unicode or a
bytestring that was saved into the database, because both are represented
as the same Go string type after decoding.
So to solve those problems I thought it over and understood that the
issues start to appear because we let STRING and UNICODE become mixed
into the same entity on loading. This behaviour has been there from
ogórek's beginning and the intention, it seems, was to get the data from
the pickle stream in an easily-accepted form on the Go side. However that
convenience turned out to come at the cost of losing some correctness in
the general case, as explained above.
So if we are to fix the correctness we need to change that behaviour and
load STRING and UNICODE opcodes into distinct types, so that the
information about what was what is preserved, and it becomes possible to
distinguish bytestrings from unicode strings and to resave the data in
exactly the same form as it was loaded. Though we can do this only under
an opt-in option, with the default behaviour staying as it was before to
preserve backward compatibility.
-> Do it.
Below is an excerpt from the doc.go, DecoderConfig and EncoderConfig
changes that describes the new system:
For strings there are two modes. In the first, default, mode both py2/py3
str and py2 unicode are decoded into string, with py2 str being considered
as UTF-8 encoded. Correspondingly for protocol ≤ 2 Go string is encoded as
UTF-8 encoded py2 str, and for protocol ≥ 3 as py3 str / py2 unicode.
ogórek.ByteString can be used to produce bytestring objects after encoding
even for protocol ≥ 3. This mode tries to match Go string with the str type
of the target Python depending on protocol version, but loses information
after a decoding/encoding cycle:
    py2/py3 str     ↔  string                StrictUnicode=n mode, default
    py2 unicode     →  string
    py2 str         ←  ogórek.ByteString
However with StrictUnicode=y mode there is a 1-1 mapping between py2
unicode / py3 str and Go string, and between py2 str and ogórek.ByteString.
In this mode decoding/encoding and encoding/decoding operations are always
identity with respect to strings:
    py2 unicode / py3 str    ↔  string               StrictUnicode=y mode
    py2 str                  ↔  ogórek.ByteString
For bytes, independently of the string mode, there is a direct 1-1 mapping
between Python and Go types:
    bytes       ↔  ogórek.Bytes      (~)
    bytearray   ↔  []byte
--------
type DecoderConfig struct {
	// StrictUnicode, when true, requests to decode to Go string only
	// Python unicode objects. Python2 bytestrings (py2 str type) are
	// decoded into ByteString in this mode...
	StrictUnicode bool
}
type EncoderConfig struct {
	// StrictUnicode, when true, requests to always encode Go string
	// objects as Python unicode independently of used pickle protocol...
	StrictUnicode bool
}
Since strings are now split into two types, string and ByteString, and
ByteString can mean either text or binary data, new AsString and AsBytes
helpers are also added to handle string and binary data in a uniform way,
supporting both py2 and py3 databases. Corresponding excerpts from the
doc.go and typeconv.go changes with the description of those helpers
come below:
On Python3 strings are unicode strings and binary data is represented by
the bytes type. However on Python2 strings are bytestrings and can contain
both text and binary data. In the default mode py2 strings, the same way as
py2 unicode, are decoded into Go strings. However in StrictUnicode mode py2
strings are decoded into ByteString - the type specially dedicated to
represent them on the Go side. There are two utilities to help programs
handle all those bytes/string data in the pickle stream in a uniform way:
- the program should use AsString if it expects text data -
either unicode string, or byte string.
- the program should use AsBytes if it expects binary data -
either bytes, or byte string.
Using the helpers fits into the Python3 strings/bytes model, but also allows
handling data generated from under Python2.
--------
// AsString tries to represent unpickled value as string.
//
// It succeeds only if the value is either string, or ByteString.
// It does not succeed if the value is Bytes or any other type.
//
// ByteString is treated related to string because ByteString represents str
// type from py2 which can contain both string and binary data.
func AsString(x interface{}) (string, error) {
// AsBytes tries to represent unpickled value as Bytes.
//
// It succeeds only if the value is either Bytes, or ByteString.
// It does not succeed if the value is string or any other type.
//
// ByteString is treated related to Bytes because ByteString represents str
// type from py2 which can contain both string and binary data.
func AsBytes(x interface{}) (Bytes, error) {
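
A short sketch of how a program might use those helpers when loading data
that can originate from either py2 or py3. The package and function names
here are hypothetical and serve only the illustration; AsString and AsBytes
are the helpers added by the patch:

package example

import (
	pickle "github.com/kisielk/og-rek"
)

// loadBlk expects binary data, e.g. the payload of a ZBlk object: it
// accepts py3 bytes (ogórek.Bytes) as well as py2 str (ogórek.ByteString),
// but rejects unicode, which would indicate a mistake in the database.
func loadBlk(v interface{}) ([]byte, error) {
	b, err := pickle.AsBytes(v)
	if err != nil {
		return nil, err
	}
	return []byte(b), nil
}

// loadTitle expects text data: it accepts py3 str / py2 unicode (Go string)
// as well as py2 str (ogórek.ByteString), but rejects bytes.
func loadTitle(v interface{}) (string, error) {
	return pickle.AsString(v)
}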
ZODB/go and Wendelin.core intend to switch to using the StrictUnicode mode,
while ogórek remains 100% backward-compatible in its default mode for other
users.
Hello @kisielk.
Please find attached patches that add the new StrictUnicode mode. The main change comes in the last patch, whose description is given in full above.
Thanks beforehand, Kirill
/cc @perrinjerome, @vnmabus, @levinzimmermannn
P.S. The PR builds on top of https://github.com/kisielk/og-rek/pull/67 because the changes are interdependent.