Closed simon-king-jena closed 5 years ago
File that cannot be unpickled in Python-3
Attachment: State.sobj.gz
Can Sage/Py3 produce the pickle? In that case, you could compare the produced pickles to see how far apart they are. Of course, if an ASCII decoder encounters 0x80 it's justified to not decode it, so it might be interesting to see what py3 makes from it itself. My guess would be that the bytestring should NOT be decoded by ascii, but something else. Perhaps unpickle can be configured to use a different decoder. But it would be good to see what generates the non-ascii symbol and what its meaning is.
Replying to @nbruin:
Of course, if an ASCII decoder encounters 0x80 it's justified to not decode it
Then the same should hold for Python-2. It doesn't. Hence, it shouldn't hold for Python-3 either.
Replying to @nbruin:
Can Sage/Py3 produce the pickle?
The problem is that the pickle comes from an old version of an optional Sage package. That's why I use the "unpickle override". But I'll see what I can do.
The new attachment was created with Python-2 and can be used without "unpickle override", but it requires the optional meataxe spkg.
Pickle of MeatAxe in Python-2
I think now I have a very small example. It uses the optional meataxe package. It would be good news if the meataxe wrapper were actually to blame for the unpickling problem, but I am not expert enough to tell (1) whether that is the case and (2) how it could be fixed if so.
Anyway. I have a new version of attachment: Py2.sobj, and a new attachment: Py3.sobj. It results in the following behaviour in Python-3
sage: load('/home/king/Projekte/coho/tests/Py2.sobj')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-3-5705b555470a> in <module>()
----> 1 load('/home/king/Projekte/coho/tests/Py2.sobj')
/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.load (build/cythonized/sage/misc/persist.c:2824)()
149
150 ## Load file by absolute filename
--> 151 with open(filename, 'rb') as fobj:
152 X = loads(fobj.read(), compress=compress)
153 try:
/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.load (build/cythonized/sage/misc/persist.c:2774)()
150 ## Load file by absolute filename
151 with open(filename, 'rb') as fobj:
--> 152 X = loads(fobj.read(), compress=compress)
153 try:
154 X._default_filename = os.path.abspath(filename)
/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.loads (build/cythonized/sage/misc/persist.c:7270)()
967
968 unpickler = SageUnpickler(io.BytesIO(s))
--> 969 return unpickler.load()
970
971
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
sage: load('/home/king/Projekte/coho/tests/Py3.sobj')
[1 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1]
and in Python-2
sage: load('/home/king/Projekte/coho/tests/Py2.sobj')
[1 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1]
sage: load('/home/king/Projekte/coho/tests/Py3.sobj')
[1 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1]
sage: __ == _
True
So, the Python-3 pickle can be unpickled in Python-2, but not the other way around. What is the problem?
Note that the Python-3 pickle is as much as 25% larger than the Python-2 pickle. Is that regression typical?
Description changed:
---
+++
@@ -1,16 +1,16 @@
-The following happens in Sage with Python-3, when trying to unpickle the attached file `State.sobj`:
+EDIT: I replaced the original ticket description by something that I wrote in a comment, because now I have a much smaller example, and moreover pickles of the same object created with Python-3 and with Python-2, so that one can compare.
+
+
+The following examples require the optional meataxe package, but I am not sure yet if meataxe is to blame or Python-3 (I hope it is the former, because I guess it would be more easy to fix).
+
+[attachment: Py2.sobj](https://github.com/sagemath/sage/files/ticket28444/a2c0b4ca4b28d693a6c86ba0e2532b24.gz) and [attachment: Py3.sobj](https://github.com/sagemath/sage/files/ticket28444/c19ee254b667c7a2b06c5025e4a8fcb1.gz) result in the following behaviour in Python-3
UnicodeDecodeError Traceback (most recent call last)
-
/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.load (build/cythonized/sage/misc/persist.c:2824)() 149 @@ -34,95 +34,20 @@ 971
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128) +sage: load('/home/king/Projekte/coho/tests/Py3.sobj') +[1 0 0 0 0 0 0 0] +[0 0 0 1 1 1 1 1]
-
-Doing the same in Sage-with-Python-2, one gets
+and in Python-2
-sage: class unpickle_old_mtx:
-....: def call(self, *args, **kwds):
-....: return None
-....:
-sage: register_unpickle_override('pGroupCohomology.mtx', 'MTX_unpickle_class', unpickle_old_mtx)
-sage: X = load('/home/king/Projekte/coho/tests/State.sobj')
-sage: X
-([[1,
-I believe that a backwards incompatible change of pickling is a blocker for Python-3 support. +So, the Python-3 pickle can be unpickled in Python-2, but not the other way around. What is the problem?
Changed keywords from unpickling UnicodeError to unpickling UnicodeError meataxe
This is in no way a blocker, IMHO.
Replying to @fchapoton:
This is in no way a blocker, IMHO.
If it is due to meataxe (an optional package), then it is not a blocker. If it is due to the upcoming switch to Python-3, then IMHO it is a blocker to that switch (not to a Python-2 version of Sage, though). Since currently it isn't clear if the example reveals a problem in Python-3 or not, I'd say better safe than sorry.
So we agree that this is not a blocker for the upcoming 8.9 release, still py2. This is the usual meaning of blocker. But in this time of transition, we must be clearer about what blocker means.
Description changed:
---
+++
@@ -1,4 +1,4 @@
-EDIT: I replaced the original ticket description by something that I wrote in a comment, because now I have a much smaller example, and moreover pickles of the same object created with Python-3 and with Python-2, so that one can compare.
+EDIT: In the original ticket description, I stated: "I believe that a backwards incompatible change of pickling is a blocker for Python-3 support." In that (and ONLY in that) sense I believe this ticket is a blocker. I replaced the original ticket description by something that I wrote in a comment, because now I have a much smaller example, and moreover pickles of the same object created with Python-3 and with Python-2, so that one can compare.
The following examples require the optional meataxe package, but I am not sure yet if meataxe is to blame or Python-3 (I hope it is the former, because I guess it would be more easy to fix).
Replying to @fchapoton:
So we agree that this is not a blocker for the upcoming 8.9 release, still py2. This is the usual meaning of blocker. But in this time of transition, we must be clearer about what blocker means.
In the original ticket description, I explained in what sense I believe it is a blocker, but I accidentally deleted that clarification when I changed the ticket description. Now, the statement is back, at the top of the ticket description.
could you provide the needed register unpickle override ?
Trying to load with py3, I stumble on
ModuleNotFoundError: No module named 'sage.matrix.matrix_gfpn_dense'
Replying to @fchapoton:
could you provide the needed register unpickle override ?
Trying to load with py3, I stumble on
ModuleNotFoundError: No module named 'sage.matrix.matrix_gfpn_dense'
As stated in the ticket description, the pickle is supposed to be loadable with the optional meataxe package (followed by sage -b) installed. However, I believe (but cannot test it, as I do not have Sage without meataxe) that the following register unpickle override would work:
sage: def unpickler(*args, **kwds):
....:     return None
....:
sage: register_unpickle_override('sage.matrix.matrix_gfpn_dense', 'mtx_unpickle', unpickler)
sage: load('/home/king/Projekte/coho/tests/Py2.sobj')
I suppose the "load()" command will return None in Python-2 but result in an error with Python-3 (even without meataxe installed).
using your empty unpickler, I still get the same ModuleNotFoundError
Replying to @fchapoton:
using your empty unpickler, I still get the same ModuleNotFoundError
I guess the matrix space is a matrix space with implementation=meataxe, and the parent of a matrix is also part of the pickle.
So, we'll override unpickling the matrix space as well:
sage: def MS_unpickler(*args, **kwds):
....:     return MatrixSpace(*(args[:4]), **kwds)
....:
sage: def mtx_unpickler(*args, **kwds):
....:     return None
....:
sage: register_unpickle_override('sage.matrix.matrix_space', 'MatrixSpace', MS_unpickler)
sage: register_unpickle_override('sage.matrix.matrix_gfpn_dense', 'mtx_unpickle', mtx_unpickler)
sage: load('/home/king/Projekte/coho/tests/Py2.sobj')
Same thing with the new unpicklers proposal. I tried something else:
sage: explain_pickle(open('Py2.sobj', 'rb').read())
pg_mtx_unpickle = unpickle_global('sage.matrix.matrix_gfpn_dense', 'mtx_unpickle')
pg_unreduce = unpickle_global('sage.structure.unique_representation', 'unreduce')
pg_MatrixSpace = unpickle_global('sage.matrix.matrix_space', 'MatrixSpace')
pg_generic_factory_unpickle = unpickle_global('sage.structure.factory', 'generic_factory_unpickle')
pg_lookup_global = unpickle_global('sage.structure.factory', 'lookup_global')
pg_make_integer = unpickle_global('sage.rings.integer', 'make_integer')
si = pg_make_integer('2')
pg_Matrix_gfpn_dense = unpickle_global('sage.matrix.matrix_gfpn_dense', 'Matrix_gfpn_dense')
pg_mtx_unpickle(pg_unreduce(pg_MatrixSpace, (pg_generic_factory_unpickle(pg_lookup_global('FiniteField'), (8r, 9r, 'beta8'), (si, ('x',), None, 'modn', si, pg_make_integer('1'), True, None, None, None), {}), 2r, 8r, False, pg_Matrix_gfpn_dense), {}), 2r, 8r, '\x80\x1f', True)
sage: explain_pickle(open('Py3.sobj', 'rb').read())
pg_mtx_unpickle = unpickle_global('sage.matrix.matrix_gfpn_dense', 'mtx_unpickle')
pg_unreduce = unpickle_global('sage.structure.unique_representation', 'unreduce')
pg_MatrixSpace = unpickle_global('sage.matrix.matrix_space', 'MatrixSpace')
pg_generic_factory_unpickle = unpickle_global('sage.structure.factory', 'generic_factory_unpickle')
pg_lookup_global = unpickle_global('sage.structure.factory', 'lookup_global')
pg_make_integer = unpickle_global('sage.rings.integer', 'make_integer')
si = pg_make_integer('2')
pg_Matrix_gfpn_dense = unpickle_global('sage.matrix.matrix_gfpn_dense', 'Matrix_gfpn_dense')
pg_encode = unpickle_global('_codecs', 'encode')
pg_mtx_unpickle(pg_unreduce(pg_MatrixSpace, (pg_generic_factory_unpickle(pg_lookup_global('FiniteField'), (8r, 9r, 'beta8'), (si, ('x',), None, 'modn', si, pg_make_integer('1'), True, None, None, None), {}), 2r, 8r, False, pg_Matrix_gfpn_dense), {}), 2r, 8r, pg_encode('\x80\x1f', 'latin1'), True)
Replying to @fchapoton:
Same thing with the new unpicklers proposal.
Can you try to add another line:
sage: register_unpickle_override('sage.matrix.matrix_gfpn_dense', 'Matrix_gfpn_dense', type)
Replying to @fchapoton:
I tried something else:
sage: explain_pickle(open('Py2.sobj', 'rb').read())
pg_mtx_unpickle = unpickle_global('sage.matrix.matrix_gfpn_dense', 'mtx_unpickle')
pg_unreduce = unpickle_global('sage.structure.unique_representation', 'unreduce')
pg_MatrixSpace = unpickle_global('sage.matrix.matrix_space', 'MatrixSpace')
pg_generic_factory_unpickle = unpickle_global('sage.structure.factory', 'generic_factory_unpickle')
pg_lookup_global = unpickle_global('sage.structure.factory', 'lookup_global')
pg_make_integer = unpickle_global('sage.rings.integer', 'make_integer')
si = pg_make_integer('2')
pg_Matrix_gfpn_dense = unpickle_global('sage.matrix.matrix_gfpn_dense', 'Matrix_gfpn_dense')
pg_mtx_unpickle(pg_unreduce(pg_MatrixSpace, (pg_generic_factory_unpickle(pg_lookup_global('FiniteField'), (8r, 9r, 'beta8'), (si, ('x',), None, 'modn', si, pg_make_integer('1'), True, None, None, None), {}), 2r, 8r, False, pg_Matrix_gfpn_dense), {}), 2r, 8r, '\x80\x1f', True)
sage: explain_pickle(open('Py3.sobj', 'rb').read())
pg_mtx_unpickle = unpickle_global('sage.matrix.matrix_gfpn_dense', 'mtx_unpickle')
pg_unreduce = unpickle_global('sage.structure.unique_representation', 'unreduce')
pg_MatrixSpace = unpickle_global('sage.matrix.matrix_space', 'MatrixSpace')
pg_generic_factory_unpickle = unpickle_global('sage.structure.factory', 'generic_factory_unpickle')
pg_lookup_global = unpickle_global('sage.structure.factory', 'lookup_global')
pg_make_integer = unpickle_global('sage.rings.integer', 'make_integer')
si = pg_make_integer('2')
pg_Matrix_gfpn_dense = unpickle_global('sage.matrix.matrix_gfpn_dense', 'Matrix_gfpn_dense')
pg_encode = unpickle_global('_codecs', 'encode')
pg_mtx_unpickle(pg_unreduce(pg_MatrixSpace, (pg_generic_factory_unpickle(pg_lookup_global('FiniteField'), (8r, 9r, 'beta8'), (si, ('x',), None, 'modn', si, pg_make_integer('1'), True, None, None, None), {}), 2r, 8r, False, pg_Matrix_gfpn_dense), {}), 2r, 8r, pg_encode('\x80\x1f', 'latin1'), True)
Do I see correctly that the only difference is the line
pg_encode = unpickle_global('_codecs', 'encode')
that is in Python-3 but not in Python-2? What are the implications?
There is another difference at the end of the last line of the explain pickles, that contains 0x80..
With the third register_unpickle_override, I now get
sage: load('Py3.sobj')
works, i.e. returns None. And
sage: load('Py2.sobj')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-38-45fcfa858fc2> in <module>()
----> 1 load('Py2.sobj')
/home/chapoton/sage3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.load (build/cythonized/sage/misc/persist.c:2824)()
149
150 ## Load file by absolute filename
--> 151 with open(filename, 'rb') as fobj:
152 X = loads(fobj.read(), compress=compress)
153 try:
/home/chapoton/sage3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.load (build/cythonized/sage/misc/persist.c:2774)()
150 ## Load file by absolute filename
151 with open(filename, 'rb') as fobj:
--> 152 X = loads(fobj.read(), compress=compress)
153 try:
154 X._default_filename = os.path.abspath(filename)
/home/chapoton/sage3/local/lib/python3.7/site-packages/sage/misc/persist.pyx in sage.misc.persist.loads (build/cythonized/sage/misc/persist.c:7270)()
967
968 unpickler = SageUnpickler(io.BytesIO(s))
--> 969 return unpickler.load()
970
971
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
Replying to @fchapoton:
There is another difference at the end of the last line of the explain pickles, that contains 0x80..
Right.
Do I understand correctly, that the string with 0x80 is swallowed without problem by Python-2, both with and without explicit encoding, while Python-3 will swallow it ONLY with an explicit encoding?
Then: How to make it so that temporarily the unpickler assumes a default encoding?
With the third register_unpickle_override, I now get ...
Hooray! So, it is now possible to investigate the core problem.
Aha! I guess the problem is indeed that the string passed to the matrix constructor is supposed to be interpreted as bytes. Namely:
sage: pg_mtx_unpickle = unpickle_global('sage.matrix.matrix_gfpn_dense', 'mtx_unpickle')
....: pg_unreduce = unpickle_global('sage.structure.unique_representation', 'unreduce')
....: pg_MatrixSpace = unpickle_global('sage.matrix.matrix_space', 'MatrixSpace')
....: pg_generic_factory_unpickle = unpickle_global('sage.structure.factory', 'generic_factory_unpickle')
....: pg_lookup_global = unpickle_global('sage.structure.factory', 'lookup_global')
....: pg_make_integer = unpickle_global('sage.rings.integer', 'make_integer')
....: si = pg_make_integer('2')
....: pg_Matrix_gfpn_dense = unpickle_global('sage.matrix.matrix_gfpn_dense', 'Matrix_gfpn_dense')
sage: pg_mtx_unpickle(pg_unreduce(pg_MatrixSpace, (pg_generic_factory_unpickle(pg_lookup_global('FiniteField
....: '), (8r, 9r, 'beta8'), (si, ('x',), None, 'modn', si, pg_make_integer('1'), True, None, None, None), {
....: }), 2r, 8r, False, pg_Matrix_gfpn_dense), {}), 2r, 8r, '\x80\x1f', True)
....:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-12-5506f597954b> in <module>()
----> 1 pg_mtx_unpickle(pg_unreduce(pg_MatrixSpace, (pg_generic_factory_unpickle(pg_lookup_global('FiniteField'), (8, 9, 'beta8'), (si, ('x',), None, 'modn', si, pg_make_integer('1'), True, None, None, None), {}), 2, 8, False, pg_Matrix_gfpn_dense), {}), 2, 8, '\x80\x1f', True)
TypeError: Argument 'Data' has incorrect type (expected bytes, got str)
sage: pg_mtx_unpickle(pg_unreduce(pg_MatrixSpace, (pg_generic_factory_unpickle(pg_lookup_global('FiniteField
....: '), (8r, 9r, 'beta8'), (si, ('x',), None, 'modn', si, pg_make_integer('1'), True, None, None, None), {
....: }), 2r, 8r, False, pg_Matrix_gfpn_dense), {}), 2r, 8r, b'\x80\x1f', True)
....:
[1 0 0 0 0 0 0 0]
[0 0 0 1 1 1 1 1]
(of course, this is with meataxe installed).
So, is it possible to automatically load a Python-2 string as a Python-3 bytes, as if one simply puts a b in front of the string (that's what I did in the last line of the above example)?
Happy to see progress. But sorry, I will now turn offline..
Everything boils down to the following:
In Python-2, do
sage: X = '\x80\x1f'
sage: save(X, 'Py2_string.sobj')
In Python-3, you'll get
sage: load('/home/king/Projekte/coho/tests/Py2_string.sobj')
Traceback (most recent call last):
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
So, it is really the question: How can I (temporarily) force Python-3 to interpret a pickled string as bytes?
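The failure can also be reproduced without Sage. The following is only a sketch: the byte string is written by hand here (it is not taken from the attachments) and is assumed to match what Python 2 emits with protocol 2 for the str '\x80\x1f'. The helper name try_load is made up for this illustration.

```python
import pickle

# Hand-written protocol-2 pickle of the Python-2 str '\x80\x1f' (assumption:
# PROTO 2, SHORT_BINSTRING of length 2, BINPUT 0, STOP).
PY2_PICKLE = b"\x80\x02U\x02\x80\x1fq\x00."


def try_load(data, **kwds):
    """Return the unpickled object, or the name of the exception raised."""
    try:
        return pickle.loads(data, **kwds)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


# Python 3 decodes Python-2 str objects with ASCII unless told otherwise.
default_result = try_load(PY2_PICKLE)
# encoding='bytes' leaves the raw bytes untouched.
bytes_result = try_load(PY2_PICKLE, encoding="bytes")
# encoding='latin1' maps every byte to the code point of the same value.
latin1_result = try_load(PY2_PICKLE, encoding="latin1")

print(default_result, bytes_result, repr(latin1_result))
```

Only the default call fails; the other two differ in the type they return (bytes versus str).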
Replying to @simon-king-jena:
So, it is really the question: How can I (temporarily) force Python-3 to interpret a pickled string as bytes?
It looks like other people have run into this problem:
https://stackoverflow.com/questions/28218466/unpickling-a-python-2-object-with-python-3
Replying to @nbruin:
Replying to @simon-king-jena:
So, it is really the question: How can I (temporarily) force Python-3 to interpret a pickled string as bytes?
It looks like other people have run into this problem:
https://stackoverflow.com/questions/28218466/unpickling-a-python-2-object-with-python-3
I was wondering about one of the approaches mentioned there: within Python 2, unpickle the data, then save it in a format which can be read by Python 3, or ideally, read by both Python 2 and 3.
Replying to @jhpalmieri:
I was wondering about one of the approaches mentioned there: within Python 2, unpickle the data, then save it in a format which can be read by Python 3, or ideally, read by both Python 2 and 3.
I think the main point here is to be able to solve it on the Py3 side: have a BACKWARD compatible solution. If we can work on both sides, there are plenty of work-arounds, for instance, write the data as JSON and parse that on the other side.
That said, I think this is another example that shows that we should be serious about keeping VMs of a reasonable spread of sage versions -- people might end up needing them for data archaeology. (Do we do this already? I know VMs get produced for a lot of releases; archiving them would just be a matter of resources.)
Replying to @nbruin:
Replying to @jhpalmieri:
I was wondering about one of the approaches mentioned there: within Python 2, unpickle the data, then save it in a format which can be read by Python 3, or ideally, read by both Python 2 and 3.
I think the main point here is to be able to solve it on the Py3 side: have a BACKWARD compatible solution.
Right, sorry, I was thinking about the immediate problem of the data in the p-group cohomology package. Simon, maybe it's the right time to change the format of the saved data?
Replying to @jhpalmieri:
Replying to @nbruin:
I think the main point here is to be able to solve it on the Py3 side: have a BACKWARD compatible solution.
Indeed, that's what I want to achieve here.
Right, sorry, I was thinking about the immediate problem of the data in the p-group cohomology package. Simon, maybe it's the right time to change the format of the saved data?
As I have demonstrated above, a pickle created with Python-3 can be read both with Python-2 and Python-3. So, that side of the problem isn't really urgent for the p_group_cohomology package, I think.
However, I'd like to understand why pickling bytes in Python-3 is apparently more involved than pickling a string, even though to my understanding a string is something more complicated, as it needs to be interpreted by some encoding. That's to say: Why is b'\x80\x1f' not pickled as b'\x80\x1f' but as
pg_encode = unpickle_global('_codecs', 'encode')
pg_encode('\x80\x1f', 'latin1')
My expectation actually was that bytes is a more basic data type than str, as the latter needs to know how it is interpreted (utf-8? isolatin-1? etc.)
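A likely explanation, offered as a sketch rather than a diagnosis of the attached files: pickle protocols 0 to 2 date from Python 2 and have no opcode for a Python-3 bytes object, so Python 3 stores bytes as a call to _codecs.encode with latin1 whenever such a protocol is requested. Protocol 3 and later have dedicated bytes opcodes. That the attached pickles were written with protocol 2 is an assumption here, although it is consistent with the explain_pickle output above.

```python
import pickle
import pickletools

data = b"\x80\x1f"

# Protocols 0-2 have no bytes opcode, so Python 3 falls back to pickling
# a call _codecs.encode(<latin1-decoded str>, 'latin1').
proto2 = pickle.dumps(data, protocol=2)

# Protocol 3 introduced dedicated bytes opcodes; no helper call is needed.
proto3 = pickle.dumps(data, protocol=3)

uses_codecs_2 = b"_codecs" in proto2
uses_codecs_3 = b"_codecs" in proto3
print(uses_codecs_2, uses_codecs_3, len(proto2), len(proto3))

# Disassemble the protocol-2 pickle for comparison with explain_pickle.
pickletools.dis(proto2)
```

The extra global lookup and the two strings would also account for a protocol-2 pickle of bytes being larger when written by Python 3 than by Python 2.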
Replying to @nbruin:
It looks like other people have run into this problem:
https://stackoverflow.com/questions/28218466/unpickling-a-python-2-object-with-python-3
Thank you! The recommended solution is
with open(mshelffile, 'rb') as f:
    d = pickle.load(f, encoding='bytes')
However, Sage's load apparently doesn't know about the encoding keyword. So, I suggest adding an encoding keyword in the appropriate places in sage.misc.persist; that keyword should just be ignored in Python-2 (as apparently Python-2 doesn't support it) and passed to the pickle module in Python-3.
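The shape of that proposal might look as follows. This is a minimal sketch using only the standard pickle module; loads_compat is a hypothetical name, not an existing function in sage.misc.persist, and Sage's decompression and unpickle overrides are left out.

```python
import pickle
import sys


def loads_compat(data, encoding="ASCII"):
    """Unpickle ``data``; ``encoding`` only has an effect on Python 3.

    Hypothetical helper: on Python 2 the keyword is silently ignored,
    because ``pickle.loads`` does not accept it there.
    """
    if sys.version_info[0] >= 3:
        return pickle.loads(data, encoding=encoding)
    return pickle.loads(data)


# A pickle written by Python 3 is unaffected by the keyword, because the
# encoding only applies to the string opcodes used by Python 2.
roundtrip = loads_compat(pickle.dumps({"a": 1}, protocol=2), encoding="bytes")
print(roundtrip)
```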
If you agree that the approach makes sense, I'll change the ticket description accordingly.
Changed keywords from unpickling UnicodeError meataxe to unpickling UnicodeError backwards compatibility
I just checked: When I pickle.dump('\x80\x1f', <file>) with either Python-2 or Python-3, then I can pickle.load(<file>, encoding='bytes') in Python-3 and can pickle.load(<file>) in Python-2. So, that looks like backwards and forwards compatibility to me.
Also we see in the stackoverflow thread that the problem really isn't related to our optional meataxe or p_group_cohomology modules. Therefore I remove the keyword from the ticket description.
I would also suggest finding out where this non-ascii string comes from in your example pickles, and fixing whatever is responsible so that it uses unicode strings instead. This '\x80\x1f' may be related to the empty-set symbol, but I am not sure.
Replying to @fchapoton:
I would also suggest finding out where this non-ascii string comes from in your example pickles, and fixing whatever is responsible so that it uses unicode strings instead.
This already is understood. '\x80\x1f' is bytes corresponding to the data of some meataxe matrix. It really is bytes (see line 528 in src/sage/matrix/matrix_gfpn_dense.pyx)! It SHOULDN'T be a unicode string. And the problem is that Python-3 erroneously tries to read it as a unicode string when it encounters it in a Python-2 pickle.
Ostensibly, it is the case that Python-2 str corresponds to Python-3 bytes, and Python-2 unicode corresponds to Python-3 str. But apparently Python-3 tries to unpickle a Python-2 str as unicode, and that's a bug (in Python, not in Sage), IMHO. Example:
king@klap:~$ ~/Sage/git/sage/sage -python
Python 2.7.15 (default, Jul 26 2019, 11:49:43)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> with open('bla', 'wb') as f:
...     pickle.dump(('\x80\x1f', u'\x80\x1f'), f)
...
>>>
king@klap:~$ ~/Sage/git/py3/sage -python
Python 3.7.3 (default, Aug 27 2019, 23:22:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> with open('bla', 'rb') as f:
...     pickle.load(f)
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
>>> with open('bla', 'rb') as f:
...     X = pickle.load(f, encoding='bytes')
...
>>> X
(b'\x80\x1f', '\x80\x1f')
Note that according to the stackoverflow thread, the same problem would also arise when you pickle a Python-2 dictionary whose keys are Python-2 strings, and then unpickle in Python-3. It isn't an exotic problem that is in any way caused by a Sage (optional) package.
PS:
king@klap:~$ ~/Sage/git/py3/sage -python
Python 3.7.3 (default, Aug 27 2019, 23:22:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> with open('bla', 'wb') as f:
...     pickle.dump(('\x80\x1f', b'\x80\x1f'), f)
...
>>>
king@klap:~$ ~/Sage/git/py3/sage -python
Python 3.7.3 (default, Aug 27 2019, 23:22:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> with open('bla', 'rb') as f:
...     X = pickle.load(f)
...
>>> X
('\x80\x1f', b'\x80\x1f')
>>> with open('bla', 'rb') as f:
...     X = pickle.load(f, encoding='bytes')
...
>>> X
('\x80\x1f', b'\x80\x1f')
So, the types in a Python-3 pickle are still correct when unpickling it with the encoding='bytes' keyword.
Argh. It is of course possible to pass the encoding keyword to the pickle module. But then, things fail because of things like
if '.' in name:
    module, name = name.rsplit('.', 1)
    all = __import__(module, fromlist=[name])
in sage.structure.factory.unpickle_global: name should be a string, but is bytes when using encoding='bytes' (whereas '.' still is a string).
Does the function bytes_to_str (from sage.cpython.string) help? The docstring:
Convert ``bytes`` to ``str``.
On Python 2 this is a no-op since ``bytes is str``. On Python 3
this decodes the given ``bytes`` to a Python 3 unicode ``str`` using
the specified encoding.
You could apply that to name.
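To illustrate the suggestion without requiring Sage, here is a sketch with a pure-Python stand-in. Both ensure_str and split_qualified are hypothetical names made up for this example; ensure_str is not Sage's bytes_to_str, and the default encoding is an assumption.

```python
def ensure_str(name, encoding="utf-8"):
    """Return ``name`` as str, decoding it first if it is bytes.

    Hypothetical stand-in for a bytes-to-str conversion helper.
    """
    if isinstance(name, bytes):
        return name.decode(encoding)
    return name


def split_qualified(name):
    """Split 'pkg.mod.attr' into ('pkg.mod', 'attr'); accepts str or bytes."""
    name = ensure_str(name)
    if "." in name:
        module, attr = name.rsplit(".", 1)
        return module, attr
    return None, name


# Without the conversion, "'.' in name" raises TypeError on Python 3 when
# name is bytes, which is the failure described above.
result = split_qualified(b"sage.rings.integer.make_integer")
print(result)
```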
Replying to @jhpalmieri:
Does the function bytes_to_str (from sage.cpython.string) help?
Yes, but currently it seems that I am just opening a can of worms. The error that I'm getting next:
/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/rings/integer.pyx in sage.rings.integer.make_integer (build/cythonized/sage/rings/integer.c:43277)()
7220 """
7221 cdef Integer r = PY_NEW(Integer)
-> 7222 mpz_set_str(r.value, str_to_bytes(s), 32)
7223 return r
7224
/home/king/Sage/git/py3/local/lib/python3.7/site-packages/sage/cpython/string.pxd in sage.cpython.string.str_to_bytes (build/cythonized/sage/rings/integer.c:47648)()
90 # compile-time variable. We keep the Cython wrapper to deal with
91 # the default arguments.
---> 92 return _str_to_bytes(s, encoding, errors)
TypeError: expected str, bytes found
So, summary: In many places, python-3 really uses str in the same way as python-2 was using str. Hence, it does make sense to unpickle a python-2 str as a python-3 string. However, in other situations, we really want and need that a python-2 str unpickles as a python-3 bytes, because in python-2 str is bytes.
Replying to @simon-king-jena:
So, summary: In many places, python-3 really uses str in the same way as python-2 was using str. Hence, it does make sense to unpickle a python-2 str as a python-3 string. However, in other situations, we really want and need that a python-2 str unpickles as a python-3 bytes, because in python-2 str is bytes.
That would indicate we need a heuristic. We might have to patch pickle._decode_string: perhaps have another pseudo-encoding "ascii_or_bytes" which decodes via ascii to a unicode string if possible and otherwise returns a bytes object.
Another option is to decode with 'latin-1' to always get a unicode string, but it's already documented that that breaks stuff as well. So it seems we would need to try and break things in as few cases as possible ...
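The "ascii_or_bytes" idea could be sketched as a plain function like the one below. The name decode_ascii_or_bytes is hypothetical, and how such a function would be hooked into the unpickler is deliberately left open.

```python
def decode_ascii_or_bytes(raw):
    """Decode ``raw`` as ASCII if possible, otherwise return it unchanged."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError:
        return raw


# A Python-2 str holding an identifier becomes a Python-3 str ...
name = decode_ascii_or_bytes(b"make_integer")
# ... while binary data containing a byte >= 0x80 stays bytes.
blob = decode_ascii_or_bytes(b"\x80\x1f")
# Caveat: binary data that happens to be pure ASCII is misclassified as str.
ambiguous = decode_ascii_or_bytes(b"\x00\x1f")

print(repr(name), repr(blob), repr(ambiguous))
```

The last case shows why this can only be a heuristic: the type of the result depends on the data, not on what the pickled object meant.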
Replying to @nbruin:
That would indicate we need a heuristic. We might have to patch pickle._decode_string: perhaps have another pseudo-encoding "ascii_or_bytes" which decodes via ascii to a unicode string if possible and otherwise returns a bytes object.
Another option is to decode with 'latin-1' to always get a unicode string, but it's already documented that that breaks stuff as well.
"Break" in what sense?
There are two types of problems we encounter when pickling:
Problems of type 1. are very bad, because unpickling fails before our Sage library code even has a chance to get things right. Therefore we should try to make it so that problems of type 1. are strictly avoided.
Problems of type 2. can at least in principle be fixed by changes to the Sage library. So, they are less critical. Let us focus on the bytes versus str problem.
If we use encoding='bytes', then all python-2 str are unpickled as bytes. And that breaks lots and lots of things, such as looking up functions in the global namespace by their function name (function names must be str, not bytes). So, I believe this is not an option.
What if we use encoding='latin1'? Is it guaranteed that if b is a bytes then b.decode('latin1') yields a str (without raising an error) and that b.decode('latin1').encode('latin1') == b? I tested that it is the case at least in some example, in which other encodings fail:
sage: b = b'\x80\x1f'
sage: b.decode('utf-16').encode('utf-16') == b
False
sage: b.decode('utf-8').encode('utf-8') == b
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-130-7e0e96195e1e> in <module>()
----> 1 b.decode('utf-8').encode('utf-8') == b
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
sage: b.decode('latin1').encode('latin1') == b
True
If the answer to the question generally is "yes, it is guaranteed", then the natural approach would be: Whenever a Sage function or method expects an input of type bytes but encounters str, then the str should be encoded to a bytes with encoding 'latin1'.
The price to pay: The automatic conversion may be fine during unpickling, but may be unwanted in all other situations.
I wonder how feasible this would be.
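As a sketch of that approach, a function that expects bytes could pass its argument through a small coercion helper. The name as_bytes is hypothetical, and where such a call would go in the Sage library is not decided here.

```python
def as_bytes(data):
    """Return ``data`` as bytes, encoding a str with latin1.

    Hypothetical helper: intended for values that were unpickled from a
    Python-2 pickle with encoding='latin1'.
    """
    if isinstance(data, bytes):
        return data
    if isinstance(data, str):
        # Raises UnicodeEncodeError for code points above 255, i.e. for
        # strings that cannot have come from latin1-decoded bytes.
        return data.encode("latin1")
    raise TypeError("expected bytes or str, got %s" % type(data).__name__)


recovered = as_bytes("\x80\x1f")   # str as produced by encoding='latin1'
unchanged = as_bytes(b"\x80\x1f")  # bytes pass through untouched
print(recovered, unchanged)
```

The conversion is lossless for unpickled data, but it also silently accepts any other str with code points below 256, which is the price mentioned above.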
Replying to @simon-king-jena:
What if we use encoding='latin1'? Is it guaranteed that if b is a bytes then b.decode('latin1') yields a str (without raising an error) and that b.decode('latin1').encode('latin1') == b?
Yes!!
sage: all_bytes = bytearray(range(0x100)); all_bytes.decode('latin1').encode('latin1') == all_bytes
True
Good news.
Is there a way to temporarily (i.e., during unpickling) set the default encoding to latin1, so that cdef bytes x = s where s is a string will automatically use latin1-encoding? This may already fix some cases.
Replying to @simon-king-jena:
Another option is to decode with 'latin-1' to always get a unicode string, but it's already documented that that breaks stuff as well.
"Break" in what sense?
The stackoverflow thread mentioned before says that encoding='latin1' breaks unpickling of numpy arrays, which is used as motivation that encoding='bytes' would be better. That's obviously not uniformly the case, as you're encountering.
Roundtrip invariance is definitely guaranteed by latin1: it's just a 1-1 mapping between byte values and characters (in fact, the first 256 unicode code-points. If 0 can count as a code point).
Bad news: the cython documentation suggests that cython's string conversion encodings are determined (from cython 0.19) by compile-time directives.
Replying to @nbruin:
The stackoverflow thread mentioned before says that encoding='latin1' breaks unpickling of numpy arrays, which is used as motivation that encoding='bytes' would be better. That's obviously not uniformly the case, as you're encountering.
In the thread, I see mentioned that "Using encoding = 'latin1' causes some issues when your object contains numpy arrays in it." However, I don't see an explicit example or an explanation what these issues are.
Replying to @simon-king-jena:
In the thread, I see mentioned that "Using encoding = 'latin1' causes some issues when your object contains numpy arrays in it." However, I don't see an explicit example or an explanation what these issues are.
One quote:
Note that up to Python versions before 3.6.8, 3.7.2 and 3.8.0, unpickling of Python 2 datetime object data is broken unless you use encoding='bytes'.
(linking to Python issue 22005. The issue is an interesting read, displaying a similar spread of attitudes among core Python developers as we have seen on sage-devel. In the end a fix was accepted in Python, so some willingness to consider compatibility for pickles between python versions (especially 2/3) does exist.)
This could be what the advice was about (since numpy arrays in pandas applications are quite prone to containing datetimes)
The numpy.load command accepts an encoding (for passing to pickle) and enforces it to be ascii, latin1, or bytes to avoid corrupting numerical data.
There is a numpy issue about this as well, which seems to be resolved.
The main thing I seem to find here: datetime had a problem with pickle encoding=latin1 prior to 3.9. Otherwise, bytes seems to cause various problems. So from this search it seems to me going with latin1 might indeed be the better default.
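To see the three encodings side by side, here is a sketch using a hand-written protocol-2 pickle of a Python-2 str containing the byte 0x80. The byte string is an assumption made for illustration (it is not taken from the attached .sobj files), and try_load is a hypothetical helper.

```python
import pickle

# Hand-written protocol-2 pickle of the Python-2 str '\x80abc':
# PROTO 2, SHORT_BINSTRING of length 4, BINPUT 0, STOP.
PY2_PICKLE = b"\x80\x02U\x04\x80abcq\x00."


def try_load(encoding):
    """Unpickle PY2_PICKLE with the given encoding; return the error if decoding fails."""
    try:
        return pickle.loads(PY2_PICKLE, encoding=encoding)
    except UnicodeDecodeError as err:
        return err


loaded_ascii = try_load("ASCII")    # the default: cannot decode 0x80
loaded_latin1 = try_load("latin1")  # a str with code point 0x80
loaded_bytes = try_load("bytes")    # left undecoded as bytes
```

With latin1 the original bytes can still be recovered from the resulting str by encoding it again, which is what the roundtrip argument above relies on.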
Replying to @nbruin:
One quote:
Note that up to Python versions before 3.6.8, 3.7.2 and 3.8.0, unpickling of Python 2 datetime object data is broken unless you use encoding='bytes'. ... This could be what the advice was about (since numpy arrays in pandas applications are quite prone to containing datetimes)
The numpy.load command accepts an encoding (for passing to pickle) and enforces it to be ascii, latin1, or bytes to avoid corrupting numerical data. There is a numpy issue about this as well, which seems to be resolved.
In a version that is in Sage?
The main thing I seem to find here: datetime had a problem with pickle encoding=latin1 prior to 3.9.
We are prior to 3.9 (the current Py-3 version is 3.8 in Sage).
Otherwise, bytes seems to cause various problems. So from this search it seems to me going with latin1 might indeed be the better default.
Of course we should have a default, but most important is that we allow the user to specify all optional arguments to load/dump that are accepted by pickle.load/pickle.dump. Namely, in Python-2:
Python 2.7.15 (default, Jul 26 2019, 11:49:43)
[GCC 5.4.0 20160609] on linux2
>>> import pickle, inspect
>>> inspect.getargspec(pickle.load)
ArgSpec(args=['file'], varargs=None, keywords=None, defaults=None)
>>> inspect.getargspec(pickle.dump)
ArgSpec(args=['obj', 'file', 'protocol'], varargs=None, keywords=None, defaults=(None,))
and in Python-3:
Python 3.7.3 (default, Aug 27 2019, 23:22:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle, inspect
>>> inspect.getfullargspec(pickle.load)
FullArgSpec(args=['file'], varargs=None, varkw=None, defaults=None, kwonlyargs=['fix_imports', 'encoding', 'errors'], kwonlydefaults={'fix_imports': True, 'encoding': 'ASCII', 'errors': 'strict'}, annotations={})
>>> inspect.getfullargspec(pickle.dump)
FullArgSpec(args=['obj', 'file', 'protocol'], varargs=None, varkw=None, defaults=(None,), kwonlyargs=['fix_imports'], kwonlydefaults={'fix_imports': True}, annotations={})
I wonder if we should be explicit (i.e., def load(file, fix_imports=True, encoding='ASCII', errors='strict')), which means that the code would depend on the Python language level, or implicit (i.e., def load(file, **kwargs))?
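For comparison, the two variants could look roughly as follows in Python 3. The names load_explicit and load_implicit are hypothetical, and both sketches wrap pickle.load on a file object rather than reproducing what Sage's load actually does.

```python
import io
import pickle


def load_explicit(file, *, fix_imports=True, encoding="ASCII", errors="strict"):
    """Explicit variant: spells out the Python-3 keyword arguments of pickle.load."""
    return pickle.load(file, fix_imports=fix_imports, encoding=encoding, errors=errors)


def load_implicit(file, **kwargs):
    """Implicit variant: forwards whatever the running Python's pickle.load accepts."""
    return pickle.load(file, **kwargs)


# Both variants read the same pickle; only the signature differs.
data = pickle.dumps({"answer": 42})
result_explicit = load_explicit(io.BytesIO(data), encoding="latin1")
result_implicit = load_implicit(io.BytesIO(data), encoding="latin1")
```

The explicit form documents the options but ties the signature to one Python version. The implicit form stays version-agnostic, at the cost of reporting unknown keywords only when pickle.load rejects them.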
In this comment, I'm trying to summarise the different approaches. Please add if I omit or misrepresent something.
Sage's pickling and unpickling functions (save, load, dumps, loads) should accept optional arguments. These should be passed to SageUnpickler.__init__ and SagePickler.__init__; I think this is uncontroversial. Only question: explicit or implicit, as asked in comment:43.
Should the encoding keyword get a default in Sage? Actually I am not sure if it should! But probably it should, and we should consider it a bug if a Sage object pickled with Python-2 cannot be unpickled with Python-3.
If we decide to have encoding with a default:
encoding='bytes' means that unpickling will fail as soon as a function is looked up in some module, because the look-up is by function name, and the name is a str, not bytes. So, bytes is a very bad default, unless we modify sage.structure.factory.lookup_global so that it also accepts bytes in lieu of str. When this is fixed, each Sage function expecting a str needs to be changed so that it also accepts bytes and transforms it to str with encoding "latin1" (?), case-by-case.
encoding='latin1' will make it so that the unpickler will always be in a position to look up globals. However, each Sage function expecting a bytes needs to be changed so that it also accepts str and transforms it to bytes with encoding "latin1".
Which scenario would be the worse can of worms? I guess using strings to pickle objects is more common than using bytes (sage.matrix.matrix_gfpn_dense uses bytes, but what else?). So, it seems the approach to use latin1 will be easier in the Sage library.
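To illustrate what "also accepts bytes in lieu of str" would mean for a global look-up, here is a minimal sketch. The function lookup_global_sketch is hypothetical and is not Sage's sage.structure.factory.lookup_global; it only assumes that the name is a dotted path of the form module.attribute.

```python
import importlib


def lookup_global_sketch(name):
    """Hypothetical stand-in for a global look-up that tolerates bytes names."""
    if isinstance(name, bytes):
        # Names arriving from encoding='bytes' unpickling are decoded first.
        name = name.decode("latin1")
    module_name, _, attr = name.rpartition(".")
    if not module_name:
        raise ValueError("expected a dotted name like 'module.attribute'")
    module = importlib.import_module(module_name)
    return getattr(module, attr)
```

This covers only the look-up itself. With encoding='bytes', every other place that receives a former Python-2 str would still need similar treatment, which is the case-by-case cost described above.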
However, we also need to worry about non-Sage library objects that might occur as attributes of Sage library objects. Such as numpy arrays/datetime. Is there a hack to make unpickling of numpy arrays work even when latin1 is used? Or is bytes the only way out for them?
EDIT: In the original ticket description, I stated: "I believe that a backwards incompatible change of pickling is a blocker for Python-3 support." In that (and ONLY in that) sense I believe this ticket is a blocker. I replaced the original ticket description by something that I wrote in a comment, because now I have a much smaller example, and moreover pickles of the same object created with Python-3 and with Python-2, so that one can compare.
The following examples require the optional meataxe package, but I am not sure yet if meataxe is to blame or Python-3 (I hope it is the former, because I guess it would be easier to fix).
attachment: Py2.sobj and attachment: Py3.sobj result in the following behaviour in Python-3
and in Python-2
So, the Python-3 pickle can be unpickled in Python-2, but not the other way around. What is the problem?
Component: python3
Keywords: unpickling UnicodeError backwards compatibility
Author: Simon King
Branch:
d7f170f
Reviewer: Nils Bruin
Issue created by migration from https://trac.sagemath.org/ticket/28444