python / cpython

The Python programming language
https://www.python.org

Increase pickle compatibility #57775

Closed e26428b1-70cf-4e9f-ae3c-9ef0478633fb closed 1 month ago

e26428b1-70cf-4e9f-ae3c-9ef0478633fb commented 12 years ago
BPO 13566
Nosy @terryjreedy, @pitrou, @vstinner, @avassalotti, @serhiy-storchaka, @vajrasky, @MojoVampire
Files
  • pickle_old_strings.patch
  • pickle-old-strings-2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    ```python
    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = None
    created_at =
    labels = ['3.7', 'type-feature', 'library']
    title = 'Increase pickle compatibility'
    updated_at =
    user = 'https://bugs.python.org/sbt'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'josh.r'
    assignee = 'serhiy.storchaka'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation =
    creator = 'sbt'
    dependencies = []
    files = ['39348', '46703']
    hgrepos = []
    issue_num = 13566
    keywords = ['patch']
    message_count = 16.0
    messages = ['149092', '149101', '149104', '205445', '206504', '206532', '206536', '242949', '245048', '245049', '245050', '245051', '245056', '245057', '288158', '289119']
    nosy_count = 9.0
    nosy_names = ['terry.reedy', 'pitrou', 'vstinner', 'alexandre.vassalotti', 'sbt', 'Ramchandra Apte', 'serhiy.storchaka', 'vajrasky', 'josh.r']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue13566'
    versions = ['Python 3.7']
    ```

    e26428b1-70cf-4e9f-ae3c-9ef0478633fb commented 12 years ago

    If you pickle an array object on python 3 the typecode is encoded as a unicode string rather than as a byte string. This makes python 2 reject the pickle.

    #########################################

    Python 3.3.0a0 (default, Dec  8 2011, 17:56:13) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pickle, array
    >>> pickle.dumps(array.array('i', [1,2,3]), 2)
    b'\x80\x02carray\narray\nq\x00X\x01\x00\x00\x00iq\x01]q\x02(K\x01K\x02K\x03e\x86q\x03Rq\x04.'

    #########################################

    Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pickle
    >>> pickle.loads(b'\x80\x02carray\narray\nq\x00X\x01\x00\x00\x00iq\x01]q\x02(K\x01K\x02K\x03e\x86q\x03Rq\x04.')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "c:\Python27\lib\pickle.py", line 1382, in loads
        return Unpickler(file).load()
      File "c:\Python27\lib\pickle.py", line 858, in load
        dispatch[key](self)
      File "c:\Python27\lib\pickle.py", line 1133, in load_reduce
        value = func(*args)
    TypeError: must be char, not unicode

    918f67d7-4fec-4a8d-93e3-6530aeb1e57e commented 12 years ago

    The problem is that pickle calls array.array(u'i', [1,2,3]), and array.array in Python 2 does not accept a unicode string as the typecode (the first argument).
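For reference, the call the unpickler makes can be read off the pickle itself. The following is a minimal sketch for Python 3 using the standard `pickletools` module; the byte string is copied from the report above, and the variable names are only illustrative.

```python
import array
import pickle
import pickletools

# Protocol-2 pickle of array.array('i', [1, 2, 3]), copied from the report.
data = (b'\x80\x02carray\narray\nq\x00X\x01\x00\x00\x00iq\x01'
        b']q\x02(K\x01K\x02K\x03e\x86q\x03Rq\x04.')

# Collect (opcode name, argument) pairs, skipping memo bookkeeping (BINPUT).
ops = [(op.name, arg) for op, arg, _pos in pickletools.genops(data)
       if op.name != 'BINPUT']
for name, arg in ops:
    print(name, '' if arg is None else repr(arg))

# On Python 3 the same bytes load fine, because a str typecode is accepted.
restored = pickle.loads(data)
print(restored)
```

The listing shows GLOBAL for array.array, then BINUNICODE carrying the typecode, then the list items, and finally REDUCE, which performs the call. The BINUNICODE argument is what arrives as a unicode object on Python 2.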

    The docs for Python 2 and Py3k don't specify the type of the typecode argument of array.array. In Python 2 it seems that the typecode has to be a byte string. In Python 3 it seems that the typecode has to be a unicode string.

    I suggest that array.array be changed in Python 2 to allow unicode strings as a typecode or that pickle detects array.array being called and fixes the call.

    e26428b1-70cf-4e9f-ae3c-9ef0478633fb commented 12 years ago

    > I suggest that array.array be changed in Python 2 to allow unicode strings as a typecode or that pickle detects array.array being called and fixes the call.

    Interestingly, py3 does understand arrays pickled by py2. This appears to be because py2 pickles str using BINSTRING or SHORT_BINSTRING which will unpickle as str on py2 and py3. py3 pickles str using BINUNICODE which will unpickle as unicode on py2 and str on py3.
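The opcode difference described above can be checked on Python 3 with a short sketch. The Python 2 pickle below is hand-written on the assumption that it matches what Python 2's pure-Python `pickle.dumps('i', 2)` produces; `opcode_names` is a helper made up for this example.

```python
import pickle
import pickletools

def opcode_names(data):
    """Return the pickle opcode names found in data, in order."""
    return [op.name for op, _arg, _pos in pickletools.genops(data)]

# Assumed output of pickle.dumps('i', 2) on Python 2 (a byte str there).
py2_str_pickle = b'\x80\x02U\x01iq\x00.'
# Output of the same call on Python 3 (a text str here).
py3_str_pickle = pickle.dumps('i', protocol=2)

print(opcode_names(py2_str_pickle))  # contains SHORT_BINSTRING
print(opcode_names(py3_str_pickle))  # contains BINUNICODE

# Python 3 decodes SHORT_BINSTRING to str by default (encoding='ASCII') ...
as_text = pickle.loads(py2_str_pickle)
# ... or keeps the raw bytes when asked to.
as_bytes = pickle.loads(py2_str_pickle, encoding='bytes')
print(type(as_text), type(as_bytes))
```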

    I think it would be better to fix this in py3 if possible, but that does not look easy: modifying array.__reduce_ex__ alone would not be enough.

    The only thing I can think of is for py3 to grow a "_binstr" type which only supports ascii strings and is special-cased by pickle to be pickled using BINSTRING. Then array.__reduce_ex__ could be something like:

      def __reduce_ex__(self, protocol):
          # _binstr is the proposed ASCII-only type; it does not exist yet.
          if protocol <= 2:
              return array.array, (_binstr(self.typecode), list(self))
          else:
              ...

    avassalotti commented 10 years ago

    Adding a special type is not a bad idea. We have to keep the code for loading BINSTRING opcodes anyway, so we might as well use it. It could be helpful for unit-testing our Python 2 compatibility support for pickle.

    We should still fix array in 2.7 to accept a unicode object for the typecode though.

    ce7d5904-2ae0-4bc9-8085-2279bf8114aa commented 10 years ago

    Alexandre Vassalotti said: "We should still fix array in 2.7 to accept unicode object for the typecode though."

    I created issue bpo-20014 (with the patch) for this feature.

    serhiy-storchaka commented 10 years ago

    See bpo-20015 for a more general approach.

    vstinner commented 10 years ago

    > If you pickle an array object on python 3 the typecode is encoded as a unicode string rather than as a byte string. This makes python 2 reject the pickle.

    Are pickle files produced by Python 3 supposed to be compatible with Python 2?

    It looks very tricky to produce pickle files compatible with both versions.

    serhiy-storchaka commented 9 years ago

    The proposed patch pickles all ASCII strings with protocols < 3 and fix_imports=True using Python 2-compatible opcodes (STRING, BINSTRING and SHORT_BINSTRING). Strings pickled this way are unpickled as str in both Python 2 and Python 3 (unless encoding="bytes").

    As a side effect, short ASCII strings (length < 256) are pickled more compactly with protocols < 3.
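The size effect can be estimated without the patch. This is only a sketch: the SHORT_BINSTRING pickle is assembled by hand rather than produced by the patched pickler, and `pickletools.optimize` is used to drop the unused memo entry so that only the string encoding is compared.

```python
import pickle
import pickletools

text = 'abc'

# Python 3 at protocol 2: BINUNICODE is 1 opcode byte plus a 4-byte length.
unicode_pickle = pickletools.optimize(pickle.dumps(text, protocol=2))

# Hand-built equivalent: SHORT_BINSTRING is 1 opcode byte plus a 1-byte length.
payload = text.encode('ascii')
binstring_pickle = b'\x80\x02' + b'U' + bytes([len(payload)]) + payload + b'.'

# Both forms load to the same str on Python 3.
assert pickle.loads(unicode_pickle) == pickle.loads(binstring_pickle) == text

saved = len(unicode_pickle) - len(binstring_pickle)
print(saved)
```

The saving comes from the shorter length field, so it is a fixed number of bytes per short ASCII string.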

    serhiy-storchaka commented 9 years ago

    Alexandre, Antoine, what are your thoughts?

    pitrou commented 9 years ago

    Won't that fail if a Python 2 API accepts only unicode strings?

    serhiy-storchaka commented 9 years ago

    Does such an API even exist?

    pitrou commented 9 years ago

    I wouldn't be very surprised if third-party libraries enforce such typing, yes. If your library has a clear text/bytes separation, it makes sense to enforce it at the API level, to avoid mistakes by users.

    serhiy-storchaka commented 9 years ago

    Such libraries already have a problem. Both str and unicode pickled in Python 2 are unpickled as str in Python 3.

    pitrou commented 9 years ago

    It's not a problem, since str *is* unicode in Python 3.

    serhiy-storchaka commented 7 years ago

    It is a problem when pickling data in Python 3 for unpickling in Python 2.

    99ffcaa5-b43b-4e8e-a35e-9c890007b9cd commented 7 years ago

    Right, but Antoine's objection is that suddenly strs pickled in Py3 can end up as strs in Py2, rather than unicode. If the library enforces a Py3-like type separation on Py2 (text arguments are unicode only, binary data is str only), then you have the problem where pickling on Py3 produces a pickle that will unpickle as str on Py2, and suddenly the library explodes because the argument, that should be unicode on Py2 and str on Py3, is suddenly str on both.

    This means that, to fix a problem with non-forward compatible libraries (that accept text only as Py2 str), a Py2 library that's (very) forward thinking would have problems.

    Admittedly, I wouldn't expect there to be very many such libraries, and many of them would have their own custom pickle formats, but stuff like numpy is quite sensitive to argument type; numpy.array(u'123') and numpy.array(b'123') are different. In numpy's case, each of those produces a derived datatype that is explicitly pickled and (I believe) would prevent the error, but some other more heuristic library might not do so.

    serhiy-storchaka commented 1 month ago

    Since Python 2 has reached its EOL, this issue is no longer relevant.