sagemath / sage

Main repository of SageMath
https://www.sagemath.org

Fix backwards incompatibility of unpickling in Python 3 #28444

Closed simon-king-jena closed 5 years ago

simon-king-jena commented 5 years ago

EDIT: In the original ticket description, I stated: "I believe that a backwards incompatible change of pickling is a blocker for Python-3 support." In that (and ONLY in that) sense I believe this ticket is a blocker. I replaced the original ticket description by something that I wrote in a comment, because now I have a much smaller example, and moreover pickles of the same object created with Python-3 and with Python-2, so that one can compare.

The following examples require the optional meataxe package, but I am not sure yet whether meataxe or Python-3 is to blame (I hope it is the former, because I guess it would be easier to fix).

attachment: Py2.sobj​ and attachment: Py3.sobj​ result in the following behaviour in Python-3

sage: load('/home/king/Projekte/coho/tests/Py2.sobj')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-3-5705b555470a> in <module>()
----> 1 load('/home/king/Projekte/coho/tests/Py2.sobj')

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.load (build/cythonized/sage/misc/persist.c:2824)()
    149 
    150     ## Load file by absolute filename
--> 151     with open(filename, 'rb') as fobj:
    152         X = loads(fobj.read(), compress=compress)
    153     try:

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.load (build/cythonized/sage/misc/persist.c:2774)()
    150     ## Load file by absolute filename
    151     with open(filename, 'rb') as fobj:
--> 152         X = loads(fobj.read(), compress=compress)
    153     try:
    154         X._default_filename = os.path.abspath(filename)

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.loads (build/cythonized/sage/misc/persist.c:7270)()
    967 
    968     unpickler = SageUnpickler(io.BytesIO(s))
--> 969     return unpickler.load()
    970 
    971 

UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
sage: load('/home/king/Projekte/coho/tests/Py3.sobj')
[1 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1]

and in Python-2

sage: load('/home/king/Projekte/coho/tests/Py2.sobj')
[1 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1]
sage: load('/home/king/Projekte/coho/tests/Py3.sobj')
[1 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1]
sage: __ == _
True

So, the Python-3 pickle can be unpickled in Python-2, but not the other way around. What is the problem?

Component: python3

Keywords: unpickling UnicodeError backwards compatibility

Author: Simon King

Branch: d7f170f

Reviewer: Nils Bruin

Issue created by migration from https://trac.sagemath.org/ticket/28444

simon-king-jena commented 5 years ago

File that cannot be unpickled in Python-3

nbruin commented 5 years ago
comment:1

Attachment: State.sobj.gz

Can Sage/Py3 produce the pickle? In that case, you could compare the produced pickles to see how far apart they are. Of course, if an ASCII decoder encounters 0x80 it's justified to not decode it, so it might be interesting to see what py3 makes from it itself. My guess would be that the bytestring should NOT be decoded by ascii, but something else. Perhaps unpickle can be configured to use a different decoder. But it would be good to see what generates the non-ascii symbol and what its meaning is.

simon-king-jena commented 5 years ago
comment:2

Replying to @nbruin:

Of course, if an ASCII decoder encounters 0x80 it's justified to not decode it

Then the same should hold for Python-2. It doesn't. Hence, it shouldn't hold for Python-3 either.

simon-king-jena commented 5 years ago
comment:3

Replying to @nbruin:

Can Sage/Py3 produce the pickle?

The problem is that the pickle comes from an old version of an optional Sage package. That's why I use the "unpickle override". But I'll see what I can do.

simon-king-jena commented 5 years ago
comment:4

The new attachment was created with Python-2 and can be used without "unpickle override", but it requires the optional meataxe spkg.

simon-king-jena commented 5 years ago

Pickle of MeatAxe in Python-2

simon-king-jena commented 5 years ago

Attachment: Py2.sobj.gz

Attachment: Py3.sobj.gz

Pickle of MeatAxe matrix created with Python-3

simon-king-jena commented 5 years ago
comment:5

I think now I have a very small example. It uses the optional meataxe package. It would be good news if the meataxe wrapper was actually to blame for the unpickling problem, but I am not expert enough to tell (1) whether that is the case and (2) how it could be fixed if it were.

Anyway. I have a new version of attachment: Py2.sobj, and a new attachment: Py3.sobj. It results in the following behaviour in Python-3

sage: load('/home/king/Projekte/coho/tests/Py2.sobj')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-3-5705b555470a> in <module>()
----> 1 load('/home/king/Projekte/coho/tests/Py2.sobj')

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.load (build/cythonized/sage/misc/persist.c:2824)()
    149 
    150     ## Load file by absolute filename
--> 151     with open(filename, 'rb') as fobj:
    152         X = loads(fobj.read(), compress=compress)
    153     try:

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.load (build/cythonized/sage/misc/persist.c:2774)()
    150     ## Load file by absolute filename
    151     with open(filename, 'rb') as fobj:
--> 152         X = loads(fobj.read(), compress=compress)
    153     try:
    154         X._default_filename = os.path.abspath(filename)

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.loads (build/cythonized/sage/misc/persist.c:7270)()
    967 
    968     unpickler = SageUnpickler(io.BytesIO(s))
--> 969     return unpickler.load()
    970 
    971 

UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
sage: load('/home/king/Projekte/coho/tests/Py3.sobj')
[1 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1]

and in Python-2

sage: load('/home/king/Projekte/coho/tests/Py2.sobj')
[1 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1]
sage: load('/home/king/Projekte/coho/tests/Py3.sobj')
[1 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1]
sage: __ == _
True

So, the Python-3 pickle can be unpickled in Python-2, but not the other way around. What is the problem?

simon-king-jena commented 5 years ago
comment:6

Note that the Python-3 pickle is as much as 25% larger than the Python-2 pickle. Is that regression typical?

simon-king-jena commented 5 years ago

Description changed:

--- 
+++ 
@@ -1,16 +1,16 @@
-The following happens in Sage with Python-3, when trying to unpickle the attached file `State.sobj`:
+EDIT: I replaced the original ticket description by something that I wrote in a comment, because now I have a much smaller example, and moreover pickles of the same object created with Python-3 and with Python-2, so that one can compare.
+
+
+The following examples require the optional meataxe package, but I am not sure yet if meataxe is to blame or Python-3 (I hope it is the former, because I guess it would be more easy to fix).
+
+[attachment: Py2.sobj​](https://github.com/sagemath/sage/files/ticket28444/a2c0b4ca4b28d693a6c86ba0e2532b24.gz) and [attachment: Py3.sobj​](https://github.com/sagemath/sage/files/ticket28444/c19ee254b667c7a2b06c5025e4a8fcb1.gz) result in the following behaviour in Python-3

-sage: class unpickle_old_mtx: -....: def call(self, *args, **kwds): -....: return None -....:
-sage: register_unpickle_override('pGroupCohomology.mtx', 'MTX_unpickle_class', unpickle_old_mtx) -sage: X = load('/home/king/Projekte/coho/tests/State.sobj') +sage: load('/home/king/Projekte/coho/tests/Py2.sobj')

UnicodeDecodeError Traceback (most recent call last) - in () -----> 1 X = load('/home/king/Projekte/coho/tests/State.sobj') + in () +----> 1 load('/home/king/Projekte/coho/tests/Py2.sobj')

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.load (build/cythonized/sage/misc/persist.c:2824)() 149 @@ -34,95 +34,20 @@ 971

UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128) +sage: load('/home/king/Projekte/coho/tests/Py3.sobj') +[1 0 0 0 0 0 0 0] +[0 0 0 1 1 1 1 1]

-
-Doing the same in Sage-with-Python-2, one gets
+and in Python-2

-sage: class unpickle_old_mtx: -....: def call(self, *args, **kwds): -....: return None -....:
-sage: register_unpickle_override('pGroupCohomology.mtx', 'MTX_unpickle_class', unpickle_old_mtx) -sage: X = load('/home/king/Projekte/coho/tests/State.sobj') -sage: X -([[1,

simon-king-jena commented 5 years ago

Changed keywords from unpickling UnicodeError to unpickling UnicodeError meataxe

fchapoton commented 5 years ago
comment:8

This is in no way a blocker, IMHO.

simon-king-jena commented 5 years ago
comment:9

Replying to @fchapoton:

This is in no way a blocker, IMHO.

If it is due to meataxe (an optional package), then it is not a blocker. If it is due to the upcoming switch to Python-3, then IMHO it is a blocker to that switch (not to a Python-2 version of Sage, though). Since currently it isn't clear if the example reveals a problem in Python-3 or not, I'd say better safe than sorry.

fchapoton commented 5 years ago
comment:10

So we agree that this is not a blocker for the upcoming 8.9 release, still py2. This is the usual meaning of blocker. But in this time of transition, we must be clearer about what blocker means.

simon-king-jena commented 5 years ago

Description changed:

--- 
+++ 
@@ -1,4 +1,4 @@
-EDIT: I replaced the original ticket description by something that I wrote in a comment, because now I have a much smaller example, and moreover pickles of the same object created with Python-3 and with Python-2, so that one can compare.
+EDIT: In the original ticket description, I stated: "I believe that a backwards incompatible change of pickling is a blocker for Python-3 support." In that (and ONLY in that) sense I believe this ticket is a blocker. I replaced the original ticket description by something that I wrote in a comment, because now I have a much smaller example, and moreover pickles of the same object created with Python-3 and with Python-2, so that one can compare.

 The following examples require the optional meataxe package, but I am not sure yet if meataxe is to blame or Python-3 (I hope it is the former, because I guess it would be more easy to fix).
simon-king-jena commented 5 years ago
comment:11

Replying to @fchapoton:

So we agree that this is not a blocker for the upcoming 8.9 release, still py2. This is the usual meaning of blocker. But in this time of transition, we must be clearer about what blocker means.

In the original ticket description, I stated in what sense I believe it to be a blocker, but I somehow deleted that clarification when I changed the ticket description. Now the statement is back, at the top of the ticket description.

fchapoton commented 5 years ago
comment:12

Could you provide the needed register_unpickle_override call?

Trying to load with py3, I stumble on

ModuleNotFoundError: No module named 'sage.matrix.matrix_gfpn_dense'
simon-king-jena commented 5 years ago
comment:13

Replying to @fchapoton:

Could you provide the needed register_unpickle_override call?

Trying to load with py3, I stumble on

ModuleNotFoundError: No module named 'sage.matrix.matrix_gfpn_dense'

As stated in the ticket description, the pickle is supposed to be loadable with the optional meataxe package (followed by sage -b) installed. However, I believe (but cannot test it, as I do not have Sage without meataxe) that the following register unpickle override would work:

sage: def unpickler(*args, **kwds):
....:     return None
....: 
sage: register_unpickle_override('sage.matrix.matrix_gfpn_dense', 'mtx_unpickle', unpickler)
sage: load('/home/king/Projekte/coho/tests/Py2.sobj')

I suppose the "load()" command will return None in Python-2 but result in an error with Python-3 (even without meataxe installed).

fchapoton commented 5 years ago
comment:14

using your empty unpickler, I still get the same ModuleNotFoundError

simon-king-jena commented 5 years ago
comment:15

Replying to @fchapoton:

using your empty unpickler, I still get the same ModuleNotFoundError

I guess the matrix space is a matrix space with implementation=meataxe, and the parent of a matrix is also part of the pickle.

So, we'll override unpickling the matrix space as well:

sage: def MS_unpickler(*args, **kwds):
....:     return MatrixSpace(*(args[:4]),**kwds)
....: 
sage: def mtx_unpickler(*args, **kwds):
....:     return None
....: 
sage: register_unpickle_override('sage.matrix.matrix_space', 'MatrixSpace', MS_unpickler)
sage: register_unpickle_override('sage.matrix.matrix_gfpn_dense', 'mtx_unpickle', mtx_unpickler)
sage: load('/home/king/Projekte/coho/tests/Py2.sobj')
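
For readers without Sage at hand, the idea behind such an override can be sketched with the standard library alone. This is only an illustration of the mechanism, not Sage's `register_unpickle_override`: the class `OverridingUnpickler`, the module name `no_such_module_28444` and the hand-assembled pickle are all hypothetical.

```python
import io
import pickle

# Hand-assembled protocol-0 pickle that calls
# no_such_module_28444.mtx_unpickle() with no arguments:
#   c<module>\n<name>\n   GLOBAL
#   )                     EMPTY_TUPLE
#   R                     REDUCE
#   .                     STOP
PICKLE = b'cno_such_module_28444\nmtx_unpickle\n)R.'


def stub_unpickler(*args, **kwds):
    """Stand-in that accepts anything and reconstructs nothing."""
    return None


class OverridingUnpickler(pickle.Unpickler):
    """Resolve selected (module, name) pairs to replacement callables."""

    overrides = {('no_such_module_28444', 'mtx_unpickle'): stub_unpickler}

    def find_class(self, module, name):
        try:
            return self.overrides[(module, name)]
        except KeyError:
            # Everything else is looked up (and imported) as usual.
            return super().find_class(module, name)


def load_with_overrides(data):
    return OverridingUnpickler(io.BytesIO(data)).load()


def load_plain(data):
    """Return the value, or the exception class name if the import fails."""
    try:
        return pickle.loads(data)
    except ImportError as exc:  # ModuleNotFoundError is a subclass
        return type(exc).__name__
```

Each global named in a pickle is resolved on its own, so a pickle that mentions several names from a missing module would need one override per name.
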
fchapoton commented 5 years ago
comment:16

Same thing with the new unpicklers proposal. I tried something else:

sage: explain_pickle(open('Py2.sobj', 'rb').read())
pg_mtx_unpickle = unpickle_global('sage.matrix.matrix_gfpn_dense', 'mtx_unpickle')
pg_unreduce = unpickle_global('sage.structure.unique_representation', 'unreduce')
pg_MatrixSpace = unpickle_global('sage.matrix.matrix_space', 'MatrixSpace')
pg_generic_factory_unpickle = unpickle_global('sage.structure.factory', 'generic_factory_unpickle')
pg_lookup_global = unpickle_global('sage.structure.factory', 'lookup_global')
pg_make_integer = unpickle_global('sage.rings.integer', 'make_integer')
si = pg_make_integer('2')
pg_Matrix_gfpn_dense = unpickle_global('sage.matrix.matrix_gfpn_dense', 'Matrix_gfpn_dense')
pg_mtx_unpickle(pg_unreduce(pg_MatrixSpace, (pg_generic_factory_unpickle(pg_lookup_global('FiniteField'), (8r, 9r, 'beta8'), (si, ('x',), None, 'modn', si, pg_make_integer('1'), True, None, None, None), {}), 2r, 8r, False, pg_Matrix_gfpn_dense), {}), 2r, 8r, '\x80\x1f', True)

sage: explain_pickle(open('Py3.sobj', 'rb').read())
pg_mtx_unpickle = unpickle_global('sage.matrix.matrix_gfpn_dense', 'mtx_unpickle')
pg_unreduce = unpickle_global('sage.structure.unique_representation', 'unreduce')
pg_MatrixSpace = unpickle_global('sage.matrix.matrix_space', 'MatrixSpace')
pg_generic_factory_unpickle = unpickle_global('sage.structure.factory', 'generic_factory_unpickle')
pg_lookup_global = unpickle_global('sage.structure.factory', 'lookup_global')
pg_make_integer = unpickle_global('sage.rings.integer', 'make_integer')
si = pg_make_integer('2')
pg_Matrix_gfpn_dense = unpickle_global('sage.matrix.matrix_gfpn_dense', 'Matrix_gfpn_dense')
pg_encode = unpickle_global('_codecs', 'encode')
pg_mtx_unpickle(pg_unreduce(pg_MatrixSpace, (pg_generic_factory_unpickle(pg_lookup_global('FiniteField'), (8r, 9r, 'beta8'), (si, ('x',), None, 'modn', si, pg_make_integer('1'), True, None, None, None), {}), 2r, 8r, False, pg_Matrix_gfpn_dense), {}), 2r, 8r, pg_encode('\x80\x1f', 'latin1'), True)
simon-king-jena commented 5 years ago
comment:17

Replying to @fchapoton:

Same thing with the new unpicklers proposal.

Can you try to add another line:

sage: register_unpickle_override('sage.matrix.matrix_gfpn_dense', 'Matrix_gfpn_dense', type)
simon-king-jena commented 5 years ago
comment:18

Replying to @fchapoton:

I tried something else:

sage: explain_pickle(open('Py2.sobj', 'rb').read())
pg_mtx_unpickle = unpickle_global('sage.matrix.matrix_gfpn_dense', 'mtx_unpickle')
pg_unreduce = unpickle_global('sage.structure.unique_representation', 'unreduce')
pg_MatrixSpace = unpickle_global('sage.matrix.matrix_space', 'MatrixSpace')
pg_generic_factory_unpickle = unpickle_global('sage.structure.factory', 'generic_factory_unpickle')
pg_lookup_global = unpickle_global('sage.structure.factory', 'lookup_global')
pg_make_integer = unpickle_global('sage.rings.integer', 'make_integer')
si = pg_make_integer('2')
pg_Matrix_gfpn_dense = unpickle_global('sage.matrix.matrix_gfpn_dense', 'Matrix_gfpn_dense')
pg_mtx_unpickle(pg_unreduce(pg_MatrixSpace, (pg_generic_factory_unpickle(pg_lookup_global('FiniteField'), (8r, 9r, 'beta8'), (si, ('x',), None, 'modn', si, pg_make_integer('1'), True, None, None, None), {}), 2r, 8r, False, pg_Matrix_gfpn_dense), {}), 2r, 8r, '\x80\x1f', True)

sage: explain_pickle(open('Py3.sobj', 'rb').read())
pg_mtx_unpickle = unpickle_global('sage.matrix.matrix_gfpn_dense', 'mtx_unpickle')
pg_unreduce = unpickle_global('sage.structure.unique_representation', 'unreduce')
pg_MatrixSpace = unpickle_global('sage.matrix.matrix_space', 'MatrixSpace')
pg_generic_factory_unpickle = unpickle_global('sage.structure.factory', 'generic_factory_unpickle')
pg_lookup_global = unpickle_global('sage.structure.factory', 'lookup_global')
pg_make_integer = unpickle_global('sage.rings.integer', 'make_integer')
si = pg_make_integer('2')
pg_Matrix_gfpn_dense = unpickle_global('sage.matrix.matrix_gfpn_dense', 'Matrix_gfpn_dense')
pg_encode = unpickle_global('_codecs', 'encode')
pg_mtx_unpickle(pg_unreduce(pg_MatrixSpace, (pg_generic_factory_unpickle(pg_lookup_global('FiniteField'), (8r, 9r, 'beta8'), (si, ('x',), None, 'modn', si, pg_make_integer('1'), True, None, None, None), {}), 2r, 8r, False, pg_Matrix_gfpn_dense), {}), 2r, 8r, pg_encode('\x80\x1f', 'latin1'), True)

Do I see correctly that the only difference is the line

pg_encode = unpickle_global('_codecs', 'encode')

that is in Python-3 but not in Python-2? What are the implications?
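
One way to see where the extra `_codecs.encode` line may come from, sketched in plain Python 3 without Sage or meataxe: protocol 2 has no opcode for `bytes`, so Python 3 stores a `bytes` object as a text string together with a call that encodes it back. The variable names below are illustrative only.

```python
import pickle
import pickletools

payload = b'\x80\x1f'  # the same two bytes as in the explain_pickle output

# Protocol 2 is the highest protocol that Python 2 can read.
data = pickle.dumps(payload, protocol=2)

# The pickle refers to the global _codecs.encode and names the codec.
uses_codecs_encode = b'_codecs' in data and b'encode' in data
mentions_latin1 = b'latin1' in data

# Loading it again in Python 3 yields bytes, not str.
restored = pickle.loads(data)

# Print the opcode listing for inspection.
pickletools.dis(data)
```

The additional global and codec name would also account for at least part of the size difference between the two pickles.
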

fchapoton commented 5 years ago
comment:19

There is another difference at the end of the last line of the explain_pickle output, which contains 0x80.

With the third register_unpickle_override, I now get

sage: load('Py3.sobj')

works, i.e. returns None. And

sage: load('Py2.sobj')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-38-45fcfa858fc2> in <module>()
----> 1 load('Py2.sobj')

/home/chapoton/sage3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.load (build/cythonized/sage/misc/persist.c:2824)()
    149 
    150     ## Load file by absolute filename
--> 151     with open(filename, 'rb') as fobj:
    152         X = loads(fobj.read(), compress=compress)
    153     try:

/home/chapoton/sage3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.load (build/cythonized/sage/misc/persist.c:2774)()
    150     ## Load file by absolute filename
    151     with open(filename, 'rb') as fobj:
--> 152         X = loads(fobj.read(), compress=compress)
    153     try:
    154         X._default_filename = os.path.abspath(filename)

/home/chapoton/sage3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.loads (build/cythonized/sage/misc/persist.c:7270)()
    967 
    968     unpickler = SageUnpickler(io.BytesIO(s))
--> 969     return unpickler.load()
    970 
    971 

UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
simon-king-jena commented 5 years ago
comment:20

Replying to @fchapoton:

There is another difference at the end of the last line of the explain_pickle output, which contains 0x80.

Right.

Do I understand correctly, that the string with 0x80 is swallowed without problem by Python-2, both with and without explicit encoding, while Python-3 will swallow it ONLY with an explicit encoding?

Then: How to make it so that temporarily the unpickler assumes a default encoding?

With the third register_unpickle_override, I now get ...

Hooray! So, it is now possible to investigate the core problem.

simon-king-jena commented 5 years ago
comment:21

Aha! I guess the problem is indeed that the string passed to the matrix constructor is supposed to be interpreted as bytes. Namely:

sage: pg_mtx_unpickle = unpickle_global('sage.matrix.matrix_gfpn_dense', 'mtx_unpickle')
....: pg_unreduce = unpickle_global('sage.structure.unique_representation', 'unreduce')
....: pg_MatrixSpace = unpickle_global('sage.matrix.matrix_space', 'MatrixSpace')
....: pg_generic_factory_unpickle = unpickle_global('sage.structure.factory', 'generic_factory_unpickle')
....: pg_lookup_global = unpickle_global('sage.structure.factory', 'lookup_global')
....: pg_make_integer = unpickle_global('sage.rings.integer', 'make_integer')
....: si = pg_make_integer('2')
....: pg_Matrix_gfpn_dense = unpickle_global('sage.matrix.matrix_gfpn_dense', 'Matrix_gfpn_dense')
sage: pg_mtx_unpickle(pg_unreduce(pg_MatrixSpace, (pg_generic_factory_unpickle(pg_lookup_global('FiniteField
....: '), (8r, 9r, 'beta8'), (si, ('x',), None, 'modn', si, pg_make_integer('1'), True, None, None, None), {
....: }), 2r, 8r, False, pg_Matrix_gfpn_dense), {}), 2r, 8r, '\x80\x1f', True)
....: 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-5506f597954b> in <module>()
----> 1 pg_mtx_unpickle(pg_unreduce(pg_MatrixSpace, (pg_generic_factory_unpickle(pg_lookup_global('FiniteField'), (8, 9, 'beta8'), (si, ('x',), None, 'modn', si, pg_make_integer('1'), True, None, None, None), {}), 2, 8, False, pg_Matrix_gfpn_dense), {}), 2, 8, '\x80\x1f', True)

TypeError: Argument 'Data' has incorrect type (expected bytes, got str)
sage: pg_mtx_unpickle(pg_unreduce(pg_MatrixSpace, (pg_generic_factory_unpickle(pg_lookup_global('FiniteField
....: '), (8r, 9r, 'beta8'), (si, ('x',), None, 'modn', si, pg_make_integer('1'), True, None, None, None), {
....: }), 2r, 8r, False, pg_Matrix_gfpn_dense), {}), 2r, 8r, b'\x80\x1f', True)
....: 
[1 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1]

(of course, this is with meataxe installed).

So, is it possible to automatically load a Python-2 string as a Python-3 bytes, as if one simply puts a b in front of the string (that's what I did in the last line of the above example)?

fchapoton commented 5 years ago
comment:22

Happy to see progress. But sorry, I will now go offline.

simon-king-jena commented 5 years ago
comment:23

Everything boils down to the following:

So, it is really the question: How can I (temporarily) force Python-3 to interpret a pickled string as bytes?

nbruin commented 5 years ago
comment:24

Replying to @simon-king-jena:

So, it is really the question: How can I (temporarily) force Python-3 to interpret a pickled string as bytes?

It looks like other people have run into this problem:

https://stackoverflow.com/questions/28218466/unpickling-a-python-2-object-with-python-3

jhpalmieri commented 5 years ago
comment:25

Replying to @nbruin:

Replying to @simon-king-jena:

So, it is really the question: How can I (temporarily) force Python-3 to interpret a pickled string as bytes?

It looks like other people have run into this problem:

https://stackoverflow.com/questions/28218466/unpickling-a-python-2-object-with-python-3

I was wondering about one of the approaches mentioned there: within Python 2, unpickle the data, then save it in a format which can be read by Python 3, or ideally, read by both Python 2 and 3.

nbruin commented 5 years ago
comment:26

Replying to @jhpalmieri:

I was wondering about one of the approaches mentioned there: within Python 2, unpickle the data, then save it in a format which can be read by Python 3, or ideally, read by both Python 2 and 3.

I think the main point here is to be able to solve it on the Py3 side: have a BACKWARD compatible solution. If we can work on both sides, there are plenty of work-arounds, for instance, write the data as JSON and parse that on the other side.

That said, I think this is another example that shows that we should be serious about keeping VMs of a reasonable spread of Sage versions -- people might end up needing them for data archaeology. (Do we do this already? I know VMs get produced for a lot of releases; archiving them would just be a matter of resources.)

jhpalmieri commented 5 years ago
comment:27

Replying to @nbruin:

Replying to @jhpalmieri:

I was wondering about one of the approaches mentioned there: within Python 2, unpickle the data, then save it in a format which can be read by Python 3, or ideally, read by both Python 2 and 3.

I think the main point here is to be able to solve it on the Py3 side: have a BACKWARD compatible solution.

Right, sorry, I was thinking about the immediate problem of the data in the p-group cohomology package. Simon, maybe it's the right time to change the format of the saved data?

simon-king-jena commented 5 years ago
comment:28

Replying to @jhpalmieri:

Replying to @nbruin:

I think the main point here is to be able to solve it on the Py3 side: have a BACKWARD compatible solution.

Indeed, that's what I want to achieve here.

Right, sorry, I was thinking about the immediate problem of the data in the p-group cohomology package. Simon, maybe it's the right time to change the format of the saved data?

As I have demonstrated above, a pickle created with Python-3 can be read both with Python-2 and Python-3. So, that side of the problem isn't really urgent for the p_group_cohomology package, I think.

However, I'd like to understand

simon-king-jena commented 5 years ago
comment:29

Replying to @nbruin:

It looks like other people have run into this problem:

https://stackoverflow.com/questions/28218466/unpickling-a-python-2-object-with-python-3

Thank you! The recommended solution is

with open(mshelffile, 'rb') as f:
    d = pickle.load(f, encoding='bytes')

However, Sage's load apparently doesn't know about the encoding keyword. So, I suggest the solution to add an encoding keyword to appropriate places in sage.misc.persist; that keyword should just be ignored in Python-2 (as apparently Python-2 doesn't support it) and passed to the pickle module in Python-3.

If you agree that the approach makes sense, I'll change the ticket description accordingly.
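
A minimal sketch of that idea, assuming a stand-alone helper rather than the real code in sage.misc.persist (the name `loads_compat` is hypothetical):

```python
import pickle
import sys


def loads_compat(data, encoding=None):
    """Unpickle ``data``; forward ``encoding`` only where it is supported.

    Hypothetical helper for illustration, not the code in sage.misc.persist.
    """
    if sys.version_info[0] >= 3 and encoding is not None:
        # Python 3: let the pickle module decide how to decode
        # Python-2 str objects ('bytes', 'latin1', ...).
        return pickle.loads(data, encoding=encoding)
    # Python 2, or no encoding requested: the keyword is simply ignored.
    return pickle.loads(data)
```

For example, `loads_compat(b'\x80\x02U\x02\x80\x1f.', encoding='bytes')` loads a hand-assembled protocol-2 pickle of a Python-2 style string as bytes in Python 3.
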

simon-king-jena commented 5 years ago

Changed keywords from unpickling UnicodeError meataxe to unpickling UnicodeError backwards compatibility

simon-king-jena commented 5 years ago
comment:30

I just checked: When I pickle.dump('\x80\x1f', <file>) with either Python-2 or Python-3, then I can pickle.load(<file>, encoding='bytes') in Python-3 and can pickle.load(<file>) in Python-2. So, that looks like backwards and forwards compatibility to me.

Also we see in the stackoverflow thread that the problem really isn't related to our optional meataxe or p_group_cohomology modules. Therefore I remove the keyword from the ticket description.

fchapoton commented 5 years ago
comment:31

I would also suggest finding out where this non-ASCII string comes from in your example pickles, and fixing whatever is responsible so that it uses unicode strings instead. This '\x80\x1f' may be related to the empty-set symbol, but I am not sure.

simon-king-jena commented 5 years ago
comment:32

Replying to @fchapoton:

I would also suggest finding out where this non-ASCII string comes from in your example pickles, and fixing whatever is responsible so that it uses unicode strings instead.

This already is understood. '\x80\x1f' is the bytes data of some meataxe matrix. It really is bytes (see line 528 in src/sage/matrix/matrix_gfpn_dense.pyx)! It SHOULDN'T be a unicode string. And the problem is that Python-3 erroneously tries to read it as a unicode string when it encounters it in a Python-2 pickle.

Ostensibly, it is the case that Python-2 str corresponds to Python-3 bytes, and Python-2 unicode corresponds to Python-3 str. But apparently Python-3 tries to unpickle a Python-2 str as unicode, and that's a bug (in Python, not in Sage), IMHO. Example:

king@klap:~$ ~/Sage/git/sage/sage -python
Python 2.7.15 (default, Jul 26 2019, 11:49:43) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> with open('bla', 'wb') as f:
...     pickle.dump(('\x80\x1f', u'\x80\x1f'), f)
... 
>>> 
king@klap:~$ ~/Sage/git/py3/sage -python
Python 3.7.3 (default, Aug 27 2019, 23:22:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> with open('bla', 'rb') as f:
...     pickle.load(f)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
>>> with open('bla', 'rb') as f:
...     X = pickle.load(f, encoding='bytes')
... 
>>> X
(b'\x80\x1f', '\x80\x1f')

Note that according to the stackoverflow thread, the same problem would also arise when you pickle a Python-2 dictionary whose keys are Python-2 strings, and then unpickle in Python-3. It isn't an exotic problem that is in any way caused by a Sage (optional) package.
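
The same behaviour can be reproduced in Python 3 alone, without a Python-2 installation. The sketch below assumes that the hand-assembled byte string mirrors what Python 2 writes for a short str in protocol 2; `try_load` is a hypothetical helper.

```python
import pickle

# Hand-assembled protocol-2 pickle of the Python-2 str '\x80\x1f':
#   \x80\x02       PROTO 2
#   U\x02\x80\x1f  SHORT_BINSTRING of length 2
#   .              STOP
PY2_STR_PICKLE = b'\x80\x02U\x02\x80\x1f.'


def try_load(data, **kwds):
    """Return the unpickled value, or the exception class name on failure."""
    try:
        return pickle.loads(data, **kwds)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


# The default encoding is ASCII, which cannot decode byte 0x80.
default_result = try_load(PY2_STR_PICKLE)

# encoding='bytes' keeps the raw bytes of a Python-2 str.
bytes_result = try_load(PY2_STR_PICKLE, encoding='bytes')

# encoding='latin1' maps every byte to the code point of the same value.
latin1_result = try_load(PY2_STR_PICKLE, encoding='latin1')
```
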

simon-king-jena commented 5 years ago
comment:33

PS:

king@klap:~$ ~/Sage/git/py3/sage -python
Python 3.7.3 (default, Aug 27 2019, 23:22:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> with open('bla', 'wb') as f:
...     pickle.dump(('\x80\x1f', b'\x80\x1f'), f)
... 
>>> 
king@klap:~$ ~/Sage/git/py3/sage -python
Python 3.7.3 (default, Aug 27 2019, 23:22:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> with open('bla', 'rb') as f:
...     X = pickle.load(f)
... 
>>> X
('\x80\x1f', b'\x80\x1f')
>>> with open('bla', 'rb') as f:
...     X = pickle.load(f, encoding='bytes')
... 
>>> X
('\x80\x1f', b'\x80\x1f')

So, the types of a Python-3 pickle are still correct when unpickling them with the encoding='bytes' keyword.

simon-king-jena commented 5 years ago
comment:34

Argh. It is of course possible to pass the encoding keyword to the pickle module. But then, things fail because of things like

    if '.' in name:
        module, name = name.rsplit('.', 1)
        all = __import__(module, fromlist=[name])

in sage.structure.factory.unpickle_global: name should be a string, but is bytes when using encoding='bytes' (whereas '.' still is a string).
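
The failure mode can be shown in isolation. In this sketch both function names are hypothetical, and decoding with ASCII assumes that module and attribute names never contain other characters:

```python
def naive_contains_dot(name):
    """Mimic the failing test: a str needle searched in a bytes haystack."""
    try:
        return '.' in name
    except TypeError as exc:
        return type(exc).__name__


def split_qualified(name):
    """Split 'pkg.mod.attr' into ('pkg.mod', 'attr'), accepting bytes too."""
    if isinstance(name, bytes):
        # Assumption: qualified names are plain ASCII.
        name = name.decode('ascii')
    if '.' in name:
        module, attr = name.rsplit('.', 1)
        return module, attr
    return None, name
```

Normalising the argument first, as in `split_qualified`, avoids the error for this one call site, but every other consumer of pickled strings would need the same treatment.
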

jhpalmieri commented 5 years ago
comment:35

Does the function bytes_to_str (from sage.cpython.string) help? The docstring:

    Convert ``bytes`` to ``str``.

    On Python 2 this is a no-op since ``bytes is str``.  On Python 3
    this decodes the given ``bytes`` to a Python 3 unicode ``str`` using
    the specified encoding.

You could apply that to name.

simon-king-jena commented 5 years ago
comment:36

Replying to @jhpalmieri:

Does the function bytes_to_str (from sage.cpython.string) help?

Yes, but currently it seems that I am just opening a can of worms. The error that I'm getting next:

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/rings/integer.pyx in sage.rings.integer.make_integer (build/cythonized/sage/rings/integer.c:43277)()
   7220     """
   7221     cdef Integer r = PY_NEW(Integer)
-> 7222     mpz_set_str(r.value, str_to_bytes(s), 32)
   7223     return r
   7224 

/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/cpython/string.pxd in sage.cpython.string.str_to_bytes (build/cythonized/sage/rings/integer.c:47648)()
     90     # compile-time variable. We keep the Cython wrapper to deal with
     91     # the default arguments.
---> 92     return _str_to_bytes(s, encoding, errors)

TypeError: expected str, bytes found

So, summary: In many places, python-3 really uses str in the same way as python-2 was using str. Hence, it does make sense to unpickle a python-2 str as a python-3 string. However, in other situations, we really want and need that a python-2 str unpickles as a python-3 bytes, because in python-2 str is bytes.

And that really makes everything complicated.

nbruin commented 5 years ago
comment:37

Replying to @simon-king-jena:

So, summary: In many places, python-3 really uses str in the same way as python-2 was using str. Hence, it does make sense to unpickle a python-2 str as a python-3 string. However, in other situations, we really want and need that a python-2 str unpickles as a python-3 bytes, because in python-2 str is bytes.

That would indicate we need a heuristic. We might have to patch pickle._decode_string: perhaps have another pseudo-encoding "ascii_or_bytes" which decodes via ascii to a unicode string if possible and otherwise returns a bytes object.

Another option is to decode with 'latin-1' to always get a unicode string, but it's already documented that that breaks stuff as well. So it seems we would need to try and break things in as few cases as possible ...
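The "ascii_or_bytes" heuristic could be sketched roughly as follows. The function name is hypothetical, and hooking it into the unpickler is not shown here.

```python
def ascii_or_bytes(value):
    """Decode to str if the data is pure ASCII, otherwise keep the bytes."""
    try:
        return value.decode("ascii")
    except UnicodeDecodeError:
        # Not ASCII: most likely binary data, so leave it untouched.
        return value

print(repr(ascii_or_bytes(b"make_integer")))
print(repr(ascii_or_bytes(b"\x80\x1f")))
```

The obvious weakness is that binary data which happens to be pure ASCII would still come back as str.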

simon-king-jena commented 5 years ago
comment:38

Replying to @nbruin:

That would indicate we need a heuristic. We might have to patch pickle._decode_string: perhaps have another pseudo-encoding "ascii_or_bytes" which decodes via ascii to a unicode string if possible and otherwise returns a bytes object.

Another option is to decode with 'latin-1' to always get a unicode string, but it's already documented that that breaks stuff as well.

"Break" in what sense?

There are two types of problems we encounter when unpickling:

  1. Unpickling of objects of basic python types results in an error, thus, unpickling aborts before a call to code from the Sage library is involved.
  2. Unpickling of objects of basic python types is faulty but without raising an error, thus, code from the Sage library is called with incorrect input.

Problems of type 1. are very bad, because unpickling fails before our Sage library code even has a chance to get things right. Therefore we should try to make it so that problems of type 1. are strictly avoided.

Problems of type 2. can at least in principle be fixed by changes to the Sage library. So, they are less critical. Let us focus on the bytes versus str problem.

If we use encoding='bytes', then all python-2 str are unpickled as bytes. And that breaks lots and lots of things, such as looking up functions in the global namespace by their function name (function names must be str, not bytes). So, I believe this is no option.

What if we use encoding='latin1'? Is it guaranteed that if b is a bytes then b.decode('latin1') yields a str (without raising an error) and that b.decode('latin1').encode('latin1') == b? I tested that it is the case in at least one example, in which other encodings fail:

sage: b = b'\x80\x1f'
sage: b.decode('utf-16').encode('utf-16') == b
False
sage: b.decode('utf-8').encode('utf-8') == b
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-130-7e0e96195e1e> in <module>()
----> 1 b.decode('utf-8').encode('utf-8') == b

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
sage: b.decode('latin1').encode('latin1') == b
True

If the answer to the question generally is "yes, it is guaranteed", then the natural approach would be: Whenever a Sage function or method expects an input of type bytes but encounters str, then the str should be encoded to a bytes with encoding 'latin1'.

The price to pay: The automatic conversion may be fine during unpickling, but may be unwanted in all other situations.

I wonder how feasible this would be.
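To make the idea concrete, here is a minimal sketch of such a conversion. `as_bytes` is a hypothetical helper, not something that exists in the Sage library.

```python
def as_bytes(s):
    """Return bytes, re-encoding a str that was obtained by latin1-decoding."""
    if isinstance(s, bytes):
        return s
    if isinstance(s, str):
        # Raises UnicodeEncodeError if s contains code points above 255,
        # i.e. if it cannot have come from latin1-decoded bytes.
        return s.encode("latin1")
    raise TypeError("expected str or bytes, got %s" % type(s).__name__)

original = b"\x80\x1f"
unpickled = original.decode("latin1")  # what encoding='latin1' would hand us
print(as_bytes(unpickled) == original)
```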

simon-king-jena commented 5 years ago
comment:39

Replying to @simon-king-jena:

What if we use encoding='latin1'? Is it guaranteed that if b is a bytes then b.decode('latin1') yields a str (without raising an error) and that b.decode('latin1').encode('latin1') == b?

Yes!!

sage: all_bytes = bytearray(range(0x100)); all_bytes.decode('latin1').encode('latin1') == all_bytes
True

Good news.

Is there a way to temporarily (i.e., during unpickling) set the default encoding to latin1, so that cdef bytes x = s where s is a string will automatically use latin1-encoding? This may already fix some cases.

nbruin commented 5 years ago
comment:40

Replying to @simon-king-jena:

Another option is to decode with 'latin-1' to always get a unicode string, but it's already documented that that breaks stuff as well.

"Break" in what sense?

At the stackoverflow thread mentioned before, it's stated that encoding='latin1' breaks unpickling of numpy arrays, which is used as motivation that encoding='bytes' would be better. That's obviously not uniformly the case, as you're encountering.

Roundtrip invariance is definitely guaranteed by latin1: it's just a 1-1 mapping between byte values and characters (in fact, the first 256 unicode code points, if 0 can count as a code point).

Bad news: the cython documentation suggests that cython's string conversion encodings are determined (from cython 0.19) by compile-time directives.

simon-king-jena commented 5 years ago
comment:41

Replying to @nbruin:

At the stackoverflow thread mentioned before, it's stated that encoding='latin1' breaks unpickling of numpy arrays, which is used as motivation that encoding='bytes' would be better. That's obviously not uniformly the case, as you're encountering.

In the thread, I see mentioned that "Using encoding = 'latin1' causes some issues when your object contains numpy arrays in it." However, I don't see an explicit example or an explanation what these issues are.

nbruin commented 5 years ago
comment:42

Replying to @simon-king-jena:

In the thread, I see mentioned that "Using encoding = 'latin1' causes some issues when your object contains numpy arrays in it." However, I don't see an explicit example or an explanation what these issues are.

One quote:

Note that up to Python versions before 3.6.8, 3.7.2 and 3.8.0, unpickling of Python 2 datetime object data is broken unless you use encoding='bytes'.

(linking to Python issue 22005. The issue is an interesting read, displaying a similar spread of attitudes among core Python developers as we have seen on sage-devel. In the end a fix was accepted in Python, so some willingness to consider compatibility for pickles between python versions (especially 2/3) does exist.)

This could be what the advice was about (since numpy arrays in pandas applications are quite prone to containing datetimes)

The numpy.load command accepts an encoding (for passing to pickle) and enforces it to be ascii, latin1, or bytes to avoid corrupting numerical data.

There is a numpy issue about this as well, which seems to be resolved.

The main thing I seem to find here: datetime had a problem with pickle encoding=latin1 prior to 3.9. Otherwise, bytes seems to cause various problems. So from this search it seems to me going with latin1 might indeed be the better default.

simon-king-jena commented 5 years ago
comment:43

Replying to @nbruin:

One quote:

Note that up to Python versions before 3.6.8, 3.7.2 and 3.8.0, unpickling of Python 2 datetime object data is broken unless you use encoding='bytes'. ... This could be what the advice was about (since numpy arrays in pandas applications are quite prone to containing datetimes)

The numpy.load command accepts an encoding (for passing to pickle) and enforces it to be ascii, latin1, or bytes to avoid corrupting numerical data.

There is a numpy issue about this as well, which seems to be resolved.

In a version that is in Sage?

The main thing I seem to find here: datetime had a problem with pickle encoding=latin1 prior to 3.9.

We are prior to 3.9 (the current Py-3 version is 3.8 in Sage).

Otherwise, bytes seems to cause various problems. So from this search it seems to me going with latin1 might indeed be the better default.

Of course we should have a default, but most important is that we allow the user to specify all optional arguments to load/dump that are accepted by pickle.load/pickle.dump. Namely, in Python-2:

Python 2.7.15 (default, Jul 26 2019, 11:49:43) 
[GCC 5.4.0 20160609] on linux2
>>> import pickle, inspect
>>> inspect.getargspec(pickle.load)
ArgSpec(args=['file'], varargs=None, keywords=None, defaults=None)
>>> inspect.getargspec(pickle.dump)
ArgSpec(args=['obj', 'file', 'protocol'], varargs=None, keywords=None, defaults=(None,))

and in Python-3:

Python 3.7.3 (default, Aug 27 2019, 23:22:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle, inspect
>>> inspect.getfullargspec(pickle.load)
FullArgSpec(args=['file'], varargs=None, varkw=None, defaults=None, kwonlyargs=['fix_imports', 'encoding', 'errors'], kwonlydefaults={'fix_imports': True, 'encoding': 'ASCII', 'errors': 'strict'}, annotations={})
>>> inspect.getfullargspec(pickle.dump)
FullArgSpec(args=['obj', 'file', 'protocol'], varargs=None, varkw=None, defaults=(None,), kwonlyargs=['fix_imports'], kwonlydefaults={'fix_imports': True}, annotations={})

I wonder if we should be explicit (i.e., def load(file, fix_imports=True, encoding='ASCII', errors='strict')), which means that the code would depend on the Python language level, or implicit (i.e., def load(file, **kwargs))?
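The implicit variant might look roughly like this in Python-3. `load_pickle` is a hypothetical name, the latin1 default is only one of the options under discussion, and the test pickle is written by hand to mimic Python-2 protocol-2 output.

```python
import io
import pickle

def load_pickle(fobj, **kwargs):
    """Forward all keyword arguments to pickle.load (Python 3 only)."""
    # Assumed default for this sketch; the caller can override it.
    kwargs.setdefault("encoding", "latin1")
    return pickle.load(fobj, **kwargs)

# Hand-built Python-2 style pickle of the str '\x80\x1f'.
py2_pickle = b"\x80\x02U\x02\x80\x1fq\x00."

as_text = load_pickle(io.BytesIO(py2_pickle))
as_raw = load_pickle(io.BytesIO(py2_pickle), encoding="bytes")
print(repr(as_text), repr(as_raw))
```

In Python-2, pickle.load accepts no keywords, so such a wrapper would have to drop them there.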

simon-king-jena commented 5 years ago
comment:44

In this comment, I'm trying to summarise the different approaches. Please add if I omit or misrepresent something.

If we decide to have encoding with a default:

Which scenario would be the worse can of worms? I guess using strings to pickle objects is more common than using bytes (sage.matrix.matrix_gfpn_dense uses bytes, but what else?). So, it seems the approach to use latin1 will be easier in the Sage library.

However, we also need to worry about non-Sage library objects that might occur as attributes of Sage library objects. Such as numpy arrays/datetime. Is there a hack to make unpickling of numpy arrays work even when latin1 is used? Or is bytes the only way out for them?