Fix backwards incompatibility of unpickling in Python 3

simon-king-jena commented 5 years ago

EDIT: In the original ticket description, I stated: "I believe that a backwards incompatible change of pickling is a blocker for Python-3 support." In that (and ONLY in that) sense I believe this ticket is a blocker. I replaced the original ticket description by something that I wrote in a comment, because now I have a much smaller example, and moreover pickles of the same object created with Python-3 and with Python-2, so that one can compare.

The following examples require the optional meataxe package, but I am not sure yet if meataxe is to blame or Python-3 (I hope it is the former, because I guess it would be more easy to fix).

attachment: Py2.sobj and attachment: Py3.sobj result in the following behaviour in Python-3

sage: load('/home/king/Projekte/coho/tests/Py2.sobj')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-3-5705b555470a> in <module>()
----> 1 load('/home/king/Projekte/coho/tests/Py2.sobj')

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.load (build/cythonized/sage/misc/persist.c:2824)()
    149 
    150     ## Load file by absolute filename
--> 151     with open(filename, 'rb') as fobj:
    152         X = loads(fobj.read(), compress=compress)
    153     try:

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.load (build/cythonized/sage/misc/persist.c:2774)()
    150     ## Load file by absolute filename
    151     with open(filename, 'rb') as fobj:
--> 152         X = loads(fobj.read(), compress=compress)
    153     try:
    154         X._default_filename = os.path.abspath(filename)

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.loads (build/cythonized/sage/misc/persist.c:7270)()
    967 
    968     unpickler = SageUnpickler(io.BytesIO(s))
--> 969     return unpickler.load()
    970 
    971 

UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
sage: load('/home/king/Projekte/coho/tests/Py3.sobj')
[1 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1]

and in Python-2

sage: load('/home/king/Projekte/coho/tests/Py2.sobj')
[1 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1]
sage: load('/home/king/Projekte/coho/tests/Py3.sobj')
[1 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1]
sage: __ == _
True

So, the Python-3 pickle can be unpickled in Python-2, but not the other way around. What is the problem?

Component: python3

Keywords: unpickling UnicodeError backwards compatibility

Author: Simon King

Branch: d7f170f

Reviewer: Nils Bruin

Issue created by migration from https://trac.sagemath.org/ticket/28444

simon-king-jena commented 5 years ago

comment:45

PS: Another approach was mentioned, namely to patch the pickle module, so that it is guessed whether a pickled Py-2 str is unpickled as Py-3 str or bytes. But each heuristics to do so can fail on specific data, and in such cases we would still need to add appropriate conversion from str to bytes and from bytes to str to Sage library functions, to catch the cases in which the heuristics fails.

simon-king-jena commented 5 years ago

comment:46

Here is another aspect that we could consider:

sage: all_bytes = bytes(range(0x100))
sage: len(dumps(all_bytes))
410
sage: len(dumps(all_bytes.decode(encoding='latin')))
361

So, it might actually be a good idea to NOT use bytes for pickles, but use str instead, when Python 3 is involved (I am slightly surprised, because I thought bytes objects are simpler than str objects and thus their pickle should be smaller, not larger).

When we change pickle functions to switch from using bytes to using str, it would be the corresponding unpickling functions that should be changed to accept str instead of bytes to make Py-2 pickles useable in Py-3 with encoding='latin1'. That's another reason to choose the default encoding='latin1'.

nbruin commented 5 years ago

comment:47

Replying to @simon-king-jena:

encoding='latin1' will make it so that always the unpickler will be in a position to look up globals. However, each Sage function expecting a bytes needs to be changed so that it also accepts str and transforms it to bytes with encoding "latin1".

We only need to do so at the time of unpickling, for backwards compatibility reasons (i.e., if we move to a python-3 only protocol in the future, the workaround can be turned off when unpickling at that protocol level). So instead of actually changing the (hopefully few) objects that have a "bytes" object in them to accept strings and interpret them with latin1 encoding, can we just register little wrappers in "copyreg" to do that decoding for us?

However, we also need to worry about non-Sage library objects that might occur as attributes of Sage library objects. Such as numpy arrays/datetime. Is there a hack to make unpickling of numpy arrays work even when latin1 is used? Or is bytes the only way out for them?

Backport Issue 22005 from Python 3.9?

If the copyreg option is available, perhaps the following would be a workable, extendable solutions:

make sure encoding is available for sage's unpickler access
aim for "latin1" being the default, but let 'bytes' be an option for the (rare?) pickles that need it
solve a few cases needing 'bytes', using copyreg registered constructors, so that they also work with 'latin1'
document the process of writing these copyreg constructors so that it's relatively straightforward to do it when we discover other data types that need it.

I think that gets us at least to a state where the "blocker" status of this ticket can be lifted, because there's a clear path to resolving this pickle incompatibility.

Truth is: any data type that is not in a pickle jar that gets tested regularly will almost certainly not be supported for unpickling in the future. Backwards compatibility of pickle requires real and constant work as sage develops. Major overhauls particularly will be a huge problem, and in some cases (if we don't have a version attribute) there might already be pickles that are genuinely ambiguous in how they should be unpickled.

nbruin commented 5 years ago

comment:48

Can we do something like the following? Suppose we have an object as below. It's written without particular considerations for pickling and it uses both bytes and str, so on the Py3 side, we're hooped:

class MixedObject(object):
    def __init__(self,B,S):
        if not(isinstance(B,bytes)):
            raise TypeError("Parameter B must be of type bytes")
        if not(isinstance(S,str)):
            raise TypeError("Parameter S must be ot type str")
        self.B=B
        self.S=S
    def validate(self):
        if not(isinstance(self.B,bytes)):
            raise TypeError("Parameter B must be of type bytes")
        if not(isinstance(self.S,str)):
            raise TypeError("Parameter S must be ot type str")
        return True

X=MixedObject(b"hi","there")

The pickle we get with pickle.dumps(X) from Py2 unpacks into Py3 into an object that doesn't validate. If we use pickle.loads(X,encoding="bytes") we end up with an object that doesn't even have the required attributes (because the dictionary gets reconstructed with bytes keys).

However, if we equip the object on the Py3 side with a __setstate__:

    def __setstate__(self,D):
        if isinstance(D['B'],str):
            self.B=D['B'].encode('latin1')
        else:
            self.B=D['B']
        self.S=D['S']

we can use loads(...,encoding="latin1") and get an object that validates. So with a bit of luck, I think we can deal with most cases where we do need a bytes attribute with either use copyreg to register a custom reconstructor (when an object uses a __reduce__) that can preprocess construction parameters that are to be turned into bytes or by amending/adding __setstate__ to use encode('latin1') on values that are supposed to be bytes.

It's a fairly painful process, but it should be doable entirely on the py3 side.

simon-king-jena commented 5 years ago

comment:49

Replying to @nbruin:

Replying to @simon-king-jena:

encoding='latin1' will make it so that always the unpickler will be in a position to look up globals. However, each Sage function expecting a bytes needs to be changed so that it also accepts str and transforms it to bytes with encoding "latin1".

We only need to do so at the time of unpickling, for backwards compatibility reasons (i.e., if we move to a python-3 only protocol in the future, the workaround can be turned off when unpickling at that protocol level). So instead of actually changing the (hopefully few) objects that have a "bytes" object in them to accept strings and interpret them with latin1 encoding, can we just register little wrappers in "copyreg" to do that decoding for us?

I don't know about copyreg. What and how can it do?

That said: We currently know only one type (namely Matrix_gfpn_dense) in the Sage library whose unpickling would fail if stored Python-2 str are interpreted as Python-3 str. And since a pickle with bytes is larger than a pickle with str, I'm inclined to change the pickling of Matrix_gfpn_dense anyway.

Suggestion: In a new ticket that will be a dependency for here, I'll change Matrix_gfpn_dense so that it will in future use str for pickling and will be able to use both bytes and str for unpickling- Does that make sense to you?

Afterwards, we can focus here on encoding defaults, copyreg...

Backport Issue 22005 from Python 3.9?

... and the backport.

Sorry, I just notice that my above suggestions largely coincide with your suggestion in the rest of your post. So:

If the copyreg option is available, perhaps the following would be a workable, extendable solutions:

make sure encoding is available for sage's unpickler access

That's for the ticket here, and easy.

aim for "latin1" being the default, but let 'bytes' be an option for the (rare?) pickles that need it

We should give the user the option to use options, regardless whether such options are needed for pickles of Sage library objects or not. That's easy to do.

solve a few cases needing 'bytes', using copyreg registered constructors, so that they also work with 'latin1'

I would like to solve one case in a new ticket (namely Matrix_gfpn_dense) without copyreg. If further cases pop up (perhaps in the pickle jar?), we may tackle them here with copyreg. I don't know how difficult it is.

document the process of writing these copyreg constructors so that it's relatively straightforward to do it when we discover other data types that need it.

Good idea.

I think that gets us at least to a state where the "blocker" status of this ticket can be lifted, because there's a clear path to resolving this pickle incompatibility.

I think the example from the ticket description could be easily fixed.

Truth is: any data type that is not in a pickle jar that gets tested regularly will almost certainly not be supported for unpickling in the future.

If meataxe was a standard package (there are reasons to make it standard, btw.), Matrix_gfpn_dense should certainly become part of the pickle jar.

nbruin commented 5 years ago

comment:50

Replying to @simon-king-jena:

I don't know about copyreg. What and how can it do?

From what I understand, it manages the dictionary in which "constructors" are looked up. A pickle of an object with a __reduce__ would usually amount to instructions to call a certain named constructor with arguments (encoded in the pickle as well). The name of that constructor needs to be looked up somewhere. If I'm not mistaken, copyreg manages the dictionary for that. One can also instantiate Unpickler objects (or subclass them) to customize the constructor lookup. In cases where the arguments as pickled would need preprocessing before calling the normal constructor (just class names are generally used as constructors), then it makes sense to register a factory function in place of that class name to do the preprocessing (and then call the relevant class).

EDIT: I see now that register_unpickle_override seems to hook into the same place where copyreg for the normal python unpickler goes (sage subclasses the normal unpickler and exposes that via load and loads)

Suggestion: In a new ticket that will be a dependency for here, I'll change Matrix_gfpn_dense so that it will in future use str for pickling and will be able to use both bytes and str for unpickling- Does that make sense to you?

I'm rather strongly against misusing str for binary data. That's not what it's for! And what you can see in this ticket already, is that signalling intent properly in pickle files is of utmost importance for forward/backward compatibility option. The observation that it ends up shorter as a str could be something stupid as the sequence having a UTF-8 representation that happens to be surprisingly short. If you think the pickle format for binary can be improved, then patch pickle. I think Matrix_gfpn_dense should just have the interface that makes most sense for it. It can also have a custom __reduce__ that encodes its defining data in a sensible way (it probably already has); possible zipping or otherwise compressing the data if that is appropriate (I don't think it is -- compression can always be done in a separate round)

Matrix_gfpn_dense is very easy to fix: they are ALREADY unpickled using a helper function mtx_unpickle. If that would accept as Data both bytes and str (which it can then transform in bytes using the latin1 encoder), you'd be done. That doesn't affect the normal API of the class at all.

If meataxe was a standard package (there are reasons to make it standard, btw.), Matrix_gfpn_dense should certainly become part of the pickle jar.

Isn't the problem that the pickle jar isn't really maintained/tested anymore? With the corollary that pickling in sage is doomed to suffer from compatibility problems.

simon-king-jena commented 5 years ago

Branch: u/SimonKing/fix_backwards_incompatibility_of_unpickling_in_python_3

simon-king-jena commented 5 years ago

comment:52

With the new commit, one can unpickle a previously un-unpicklable meataxe matrix:

sage: X = load('/home/king/Projekte/coho/tests/Py2.sobj')
sage: X
[1 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1]

However, we need doctests to demonstrate that this ticket fixes what it was supposed to fix.

Also, I wonder if we should strive to make it not only "legal" to use encoding='bytes' but to actually make it possible: Currently, one gets

sage: X = load('/home/king/Projekte/coho/tests/Py2.sobj', encoding='bytes')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-1bf7b48d4b86> in <module>()
----> 1 X = load('/home/king/Projekte/coho/tests/Py2.sobj', encoding='bytes')

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.load (build/cythonized/sage/misc/persist.c:2863)()
    154 
    155     ## Load file by absolute filename
--> 156     with open(filename, 'rb') as fobj:
    157         X = loads(fobj.read(), compress=compress, **kwargs)
    158     try:

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.load (build/cythonized/sage/misc/persist.c:2813)()
    155     ## Load file by absolute filename
    156     with open(filename, 'rb') as fobj:
--> 157         X = loads(fobj.read(), compress=compress, **kwargs)
    158     try:
    159         X._default_filename = os.path.abspath(filename)

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.loads (build/cythonized/sage/misc/persist.c:7318)()
    974 
    975     unpickler = SageUnpickler(io.BytesIO(s), **kwargs)
--> 976     return unpickler.load()
    977 
    978 

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/structure/factory.pyx in sage.structure.factory.lookup_global (build/cythonized/sage/structure/factory.c:4602)()
    752         pass
    753 
--> 754     if '.' in name:
    755         module, name = name.rsplit('.', 1)
    756         all = __import__(module, fromlist=[name])

TypeError: a bytes-like object is required, not 'str'

We could of course make lookup_global to accept bytes (with encoding ascii, I guess) to look up functions given by names. It will, however, be very likely that that will not be the only change needed to make encoding='bytes' work.

New commits:

2e0d11e Pass unpickling options to pickle.load, default encoding 'latin1'. Accept both str and bytes in mtx_unpickle

simon-king-jena commented 5 years ago

Commit: 2e0d11e

simon-king-jena commented 5 years ago

comment:53

Warning: Trying to make bytes encoding work is likely to open a can of worms. Such as:

sage: X = load('/home/king/Projekte/coho/tests/Py2.sobj', encoding='bytes')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-1bf7b48d4b86> in <module>()
----> 1 X = load('/home/king/Projekte/coho/tests/Py2.sobj', encoding='bytes')

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.load (build/cythonized/sage/misc/persist.c:2863)()
    157 
    158     ## Load file by absolute filename
--> 159     with open(filename, 'rb') as fobj:
    160         X = loads(fobj.read(), compress=compress, **kwargs)
    161     try:

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.load (build/cythonized/sage/misc/persist.c:2813)()
    158     ## Load file by absolute filename
    159     with open(filename, 'rb') as fobj:
--> 160         X = loads(fobj.read(), compress=compress, **kwargs)
    161     try:
    162         X._default_filename = os.path.abspath(filename)

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.loads (build/cythonized/sage/misc/persist.c:7318)()
    977 
    978     unpickler = SageUnpickler(io.BytesIO(s), **kwargs)
--> 979     return unpickler.load()
    980 
    981 

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/rings/integer.pyx in sage.rings.integer.make_integer (build/cythonized/sage/rings/integer.c:43277)()
   7220     """
   7221     cdef Integer r = PY_NEW(Integer)
-> 7222     mpz_set_str(r.value, str_to_bytes(s), 32)
   7223     return r
   7224 

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/cpython/string.pxd in sage.cpython.string.str_to_bytes (build/cythonized/sage/rings/integer.c:47648)()
     90     # compile-time variable. We keep the Cython wrapper to deal with
     91     # the default arguments.
---> 92     return _str_to_bytes(s, encoding, errors)

TypeError: expected str, bytes found

So, it would mean to change further constructors to be flexible with regards to str and bytes.

Thus, question: Shall we try to do this? Shall we just keep the new commit, and add doctests? Further suggestion for the way to proceed?

simon-king-jena commented 5 years ago

comment:54

I just notice that the function str_to_bytes does not require that the input is a str. Hence, it may be reasonable to change it so that, if the input is a bytes, it is returned without change.

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Changed commit from 2e0d11e to 1984bc5

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

1984bc5 Make str_to_bytes accept str input, and bytes_to_str accept bytes input. Use it in finite field and integer constructor

simon-king-jena commented 5 years ago

comment:56

As a proof of concept, I provide a commit which makes it possible to unpickle my example with encoding='bytes'. It works by making bytes_to_str accept str input (returning it unchanged) and str_to_bytes accept bytes input (returning it unchanged. Also, I used these conversion functions in some integer constructor, finite field constructor, and in mtx_unpickle.

However, I guess it would be needed to use bytes_to_str and str_to_bytes in a lot more places, therefore it is not more than a proof of concept.

Anyway, with the new commit, the following things work in Python 3:

sage: load('/home/king/Projekte/coho/tests/Py2.sobj')
[1 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1]
sage: load('/home/king/Projekte/coho/tests/Py2.sobj', encoding='bytes')
[1 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1]
sage: load('/home/king/Projekte/coho/tests/Py3.sobj', encoding='bytes')
[1 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1]
sage: from sage.matrix.matrix_gfpn_dense import mtx_unpickle
sage: class unpickle_old_mtx:
....:     def __call__(self, *args, **kwds):
....:         return mtx_unpickle(*args, **kwds)
....:     
sage: register_unpickle_override('pGroupCohomology.mtx', 'MTX_unpickle_class', unpickle_old_mtx)
sage: X = load('/home/king/Projekte/coho/tests/State.sobj')

Of course, all these examples are using the optional meataxe package.

nbruin commented 5 years ago

comment:57

Why try to make encoding='bytes' work in so many places? The problem is that pickles travelling over the Py2/Py3 boundary have an issue that the str/bytes distinction gets squashed (because str in py2 wasn't proper unicode yet). It's pretty clear that str objects are WAY more common in python than bytes objects, so encoding with bytes is going to need a LOT of post-processing. It's much more reasonable to assume these objects are in fact (ascii) strings. However, that breaks hard if it obviously isn't. Hence the latin1 workaround. But then getting bytes objects where they should be of course requires domain-specific knowledge. So the unpickling routines must be made more tolerant and accept a str where it expects a bytes object. But preferably only the unpickling routines! Interpreting a str object as a bytes object via latin1 encoding remains a hack that could mask errors and typos, and make them harder to detect.

If you want to make things work with 'bytes' encoding as well as 'latin1' encoding, I think you're just making extra work for yourself, and if you're making the infrastructure more lax about input types, I think you can end up with masking more errors in the end. Note that the problem you're solving here is very limited: it's JUST to help pickles cross the Py2-to-Py3 boundary (the other direction is not supported at all).

With the experience we have here now, I think it's pretty clear that with enough work, you can make any pickle decode with latin1 in Py3. You'll just be registering a lot of pickle overrides ...

simon-king-jena commented 5 years ago

comment:58

Replying to @nbruin:

Why try to make encoding='bytes' work in so many places?

Because an important datatype, namely numpy arrays with datetime objects, apparently can only be unpickled using encoding='bytes'. And of course, a numpy array can be part of a larger structure that also comprises further Sage objects. Therefore it makes sense that these can be unpickled with encoding='bytes', too.

So the unpickling routines must be made more tolerant and accept a str where it expects a bytes object. But preferably only the unpickling routines!

Sure. That's what I do in the first commit (in the only place in the Sage library that really uses bytes that I am aware of).

Interpreting a str object as a bytes object via latin1 encoding remains a hack that could mask errors and typos, and make them harder to detect.

You mean the other way around.

If you want to make things work with 'bytes' encoding as well as 'latin1' encoding, I think you're just making extra work for yourself, and if you're making the infrastructure more lax about input types, I think you can end up with masking more errors in the end.

Sure. That's why I said it is a proof of concept. I show that making bytes_to_str and str_to_bytes more tolerant about input and using it more commonly in fact is a way to unpickle with encoding='bytes', showing that with some work there is a chance to unpickle complex data structures involving numpy arrays, without patching numpy.

But probably, it would be easier to just drop the new commit, and then provide some unpickle override for numpy arrays only.

Note that the problem you're solving here is very limited: it's JUST to help pickles cross the Py2-to-Py3 boundary (the other direction is not supported at all).

The other direction apparently is less critical, as I have shown that the Py-3 pickle corresponding to the ticket description can be unpickled both in Py-2 and Py-3.

With the experience we have here now, I think it's pretty clear that with enough work, you can make any pickle decode with latin1 in Py3. You'll just be registering a lot of pickle overrides ...

No. I am pretty sure that with latin1 it would be few pickle overrides. With bytes it would be lots of pickle overrides.

simon-king-jena commented 5 years ago

comment:59

Replying to @simon-king-jena:

With the experience we have here now, I think it's pretty clear that with enough work, you can make any pickle decode with latin1 in Py3. You'll just be registering a lot of pickle overrides ...

No. I am pretty sure that with latin1 it would be few pickle overrides. With bytes it would be lots of pickle overrides.

Perhaps I was confusing "pickle overrides" with "changes in unpickling functions".

Restating what I mean:

If 'latin1' is default encoding, it could very well be that the first commit already is enough to fix the Sage library. With 'bytes', a lot more changes to the Sage library would be needed.
If 'latin1' is default encoding, there will of course be places in which we should register a pickle override. So far, we know one case from hearsay, namely numpy.array. If 'bytes' is default encoding, I simply don't know how many pickle overrides we'd need to register for non-Sage data types. If I recall correctly, the links discussed above only mention numpy datetime, with the corollary of numpy arrays. But is there more?

From the above, I think we can conclude that 'latin1' is the most reasonable default encoding for unpickling. But we should decide to what extent we want to support the 'bytes' encoding for unpickling. From my perspective, it wouldn't really be needed, thus, I'm totally fine with discarding the proof-of-concept commit.

simon-king-jena commented 5 years ago

comment:60

Even if we decide to not support 'bytes' encoding, we might want to change str_to_bytes and bytes_to_str so that the former accepts bytes input and the latter accepts str input, returning the input unaltered in that case. I think it would be useful.

nbruin commented 5 years ago

comment:61

Replying to @simon-king-jena:

Even if we decide to not support 'bytes' encoding, we might want to change str_to_bytes and bytes_to_str so that the former accepts bytes input and the latter accepts str input, returning the input unaltered in that case. I think it would be useful.

Yes, that doesn't look too bad. Do we know for certain that datetime is still a problem with latin1? I don't have a problem with it when I do

pickle.dumps(datetime(1977,10,10))

and then load it in Py3 with pickle.loads(...,encoding="latin1"). With encoding="bytes" it also seems to work. When I look in Python3 Lib/datetime.py, I see reduce and __new__ methods that very explicitly have comments about "pickle support" and calls to latin1 encoding to deal with string input parameters. So I'm pretty sure this has been fixed in recent python version; not just 3.9.

So I don't think we have a convincing case where encoding=bytes is required, and we do have convincing evidence that encoding=bytes is very painful. Even in a py3 where latin1 does not work out of the box, one could provide a pickle override to do the preprocessing. Therefore, I don't think you should put extra effort in supporting encoding=bytes at the moment.

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Changed commit from 1984bc5 to 8c105cd

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

8c105cd Make str_to_bytes/bytes_to_str accept both str and bytes input.

simon-king-jena commented 5 years ago

comment:63

I replaced the old second commit by a commit that is less invasive: It changes the conversion functions so that they return the input unaltered if its type fits, it uses the conversion functions in matrix_gfpn_dense for unpickling (I think that's cleaner than doing a type check there), and it also uses them in lookup_global, because if we want to support encoding='bytes' at all, we have to make lookup work with that option.

So, from my perspective, the code itself is ready to be used. But we should add some tests. Ideas how they could look? I could imagine to use a string that was created by dumps in Python-2 and demonstrate that the string can be interpreted by loads, regardless whether it is Py-2 or Py-3.

simon-king-jena commented 5 years ago

comment:64

Currently I try to create a doctest as follows: Have a Cython class Foo such that its constructor expects a str. Create a Foo instance f in Python 2 with a non-ascii string, and create a pickle string s with dumps(f). Then, without this ticket, loads(s) would fail in Python 3, because the non-ascii string in the pickle cannot be unpickled as a string with the default encoding chosen by Python 3. But WITH this ticket, it can be unpickled, because our default encoding is latin1.

Problem: When I do

sage: cython("""
....: from sage.structure.richcmp import richcmp
....: cdef class Foo:
....:     cdef str bar
....:     def __init__(self, data):
....:         self.bar = data
....:     def __richcmp__(self, other, op):
....:         cdef Foo Other = other
....:         return richcmp(self.bar, Other.bar, op)
....:     def __reduce__(self):
....:         return Foo, (self.bar,)
....: """)

then the module name of Foo is a temporary name that is essentially random. How can I prescribe that the module of Foo is __main__?

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Changed commit from 8c105cd to 712e8dd

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

712e8dd Add tests for #28444

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Changed commit from 712e8dd to c8a0748

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

c8a0748 Add tests for #28444

simon-king-jena commented 5 years ago

comment:67

The new commit adds tests. There is also one test that should only be executed in Python-3, and therefore I marked it # optional: python3. The tests in sage.misc.persist do pass in Sage-with-py3 and Sage-with-py2. However, oddly enough, the Py3 version of Sage adds python2, not python3, to the doctester. Is that a bug?

Anyway, I now think the ticket can be reviewed.

simon-king-jena commented 5 years ago

Author: Simon King

simon-king-jena commented 5 years ago

comment:68

Replying to @simon-king-jena:

However, oddly enough, the Py3 version of Sage adds python2, not python3, to the doctester.

I just tested: With sage -t --optional=build,ccache,dochtml,frobby,gap_packages,gdb,gfortran,libsemigroups,meataxe,memlimit,mpir,p_group_cohomology,python2,python3,sage src/sage/misc/persist.pyx, the optional py3-test passes in Sage-with-py3, but fails in Sage-with-py2, which is of course expected, as the encoding keyword doesn't exist in Python-2.

fchapoton commented 5 years ago

comment:69

The python3-only test tag is #py3

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Changed commit from c8a0748 to 89ac77a

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

89ac77a Fix keyword for py3-only test

simon-king-jena commented 5 years ago

comment:71

Replying to @fchapoton:

The python3-only test tag is #py3

Thank you! The latest commit fixes it. I verified that the tests of persist.pyx pass both with python-2 and python-3.

simon-king-jena commented 5 years ago

comment:72

Oops. The doc-builder complains:

[dochtml] [misc     ] docstring of sage.misc.persist.SagePickler:21: WARNING: Inline literal start-string without end-string.
[dochtml] [misc     ] docstring of sage.misc.persist.loads:52: WARNING: Block quote ends without a blank line; unexpected unindent.
[dochtml] [misc     ] docstring of sage.misc.persist.loads:52: WARNING: Inline emphasis start-string without end-string.

simon-king-jena commented 5 years ago

Work Issues: Fix doc strings

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Changed commit from 89ac77a to 3693024

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

3693024 Fix doc strings in sage.misc.persist

simon-king-jena commented 5 years ago

Changed work issues from Fix doc strings to none

simon-king-jena commented 5 years ago

comment:74

Fixed, I hope.

jhpalmieri commented 5 years ago

comment:75

On OS X, all tests pass with both Python 2 and Python 3. I don't have any comments on the actual code.

nbruin commented 5 years ago

comment:76

Perhaps to improve maintainability for the future, document the uses in unpickling routines by explicitly mentioning something along the lines of

# Legacy pickles from Python 2 do not distinguish between str and bytes type objects.
# While we expect a "bytes" type object here, it could arrive here as a "str" object, for instance
# if unpickle was passed the argument "encoding='latin1'". In that case we convert the "str" to
# a "bytes" object using "encoding='latin1'". If the object is already a "bytes" object, then
# str_to_bytes will return it unchanged.

because the use of "str_to_bytes" would indicate for the uninitiated reader that a "str" object is expected, whereas the main code path is actually through "bytes". The "str" case only arises in legacy cases.

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

f0828ee Add a comment regarding the expected data type for an unpickle helper

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Changed commit from 3693024 to f0828ee

simon-king-jena commented 5 years ago

comment:78

Replying to @nbruin:

Perhaps to improve maintainability for the future, document the uses in unpickling routines...

Is the new commit ok?

nbruin commented 5 years ago

comment:79

Replying to @simon-king-jena:

Replying to @nbruin:

Perhaps to improve maintainability for the future, document the uses in unpickling routines...

Is the new commit ok?

There are some language errors in the text, but I think the gist of it is clear.

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

ba41ebe Fix two typos in a comment

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 5 years ago

Changed commit from f0828ee to ba41ebe

simon-king-jena commented 5 years ago

comment:81

Replying to @nbruin:

There are some language errors in the text, but I think the gist of it is clear.

Oops, you are right. I tried to fix it. Is it good now?

jhpalmieri commented 5 years ago

Changed branch from u/SimonKing/fix_backwards_incompatibility_of_unpickling_in_python_3 to u/jhpalmieri/fix_backwards_incompatibility_of_unpickling_in_python_3

sagemath / sage

Fix backwards incompatibility of unpickling in Python 3 #28444