python / cpython

The Python programming language
https://www.python.org

`pickle.loads` will crash with self-references inside a custom hash function #124937

Open charles-cooper opened 2 weeks ago

charles-cooper commented 2 weeks ago

Bug report

Bug description:

here is a reproduction of the issue:

import pickle

class Foo:
    def __init__(self):
        self.x: object = {self}

    def __hash__(self):
        return hash(self.x)

foo = Foo()

print(pickle.loads(pickle.dumps(foo)))

running this will result in the following exception:

Traceback (most recent call last):
  File "/home/charles/vyper/foo.py", line 10, in <module>
    foo = Foo()
          ^^^^^
  File "/home/charles/vyper/foo.py", line 5, in __init__
    self.x: object = {self}
                     ^^^^^^
  File "/home/charles/vyper/foo.py", line 8, in __hash__
    return hash(self.x)
                ^^^^^^
AttributeError: 'Foo' object has no attribute 'x'

a workaround to the issue has been described at https://stackoverflow.com/a/44888113. however, i consider this a bug in the cpython implementation, because pickle is designed to handle object cycles (e.g., replacing line 5 with self.x = [self] poses no problem to the unpickler).
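For comparison, here is a minimal self-contained demo of the list case (same shape as the reproducer above, with the set swapped for a list), which round-trips fine:

```python
import pickle

class Foo:
    def __init__(self):
        self.x = [self]  # self-reference through a list instead of a set

foo = Foo()
restored = pickle.loads(pickle.dumps(foo))

# the cycle is rebuilt correctly: the restored list points back
# at the restored object itself
assert restored.x[0] is restored
```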

i suspect that cpython rehashes all items when reconstructing a dict or set, which makes the issue even more problematic: if the hash function has any side effects, they will be executed by the unpickler.
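That suspicion is easy to confirm. A hedged sketch (the class name is illustrative, not from the report) that records every `__hash__` call shows the unpickler invoking it while rebuilding the set:

```python
import pickle

hash_calls = []

class Tracker:
    def __hash__(self):
        hash_calls.append(id(self))  # visible side effect
        return 0

data = pickle.dumps({Tracker()})
hash_calls.clear()  # ignore the calls made while building/pickling the set
pickle.loads(data)

# the unpickler re-hashed the element when reconstructing the set
assert len(hash_calls) >= 1
```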

build info:

$ python
Python 3.11.10 (main, Sep  7 2024, 18:35:41) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

CPython versions tested on:

3.11

Operating systems tested on:

Linux

JelleZijlstra commented 2 weeks ago

Your stack trace shows the code throws an exception before it ever gets to pickle. Do you have a different example that crashes in pickle code?

charles-cooper commented 2 weeks ago

ah yea, sorry

import pickle

class Foo:
    def __init__(self):
        self.x = None

    def __hash__(self):
        return hash(self.x)

foo = Foo()
foo.x = {foo}

print(pickle.loads(pickle.dumps(foo)))

=>

Traceback (most recent call last):
  File "/home/charles/vyper/foo.py", line 13, in <module>
    print(pickle.loads(pickle.dumps(foo)))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/charles/vyper/foo.py", line 8, in __hash__
    return hash(self.x)
                ^^^^^^
AttributeError: 'Foo' object has no attribute 'x'

ZeroIntensity commented 2 weeks ago

Hmm, this might be a wontfix. In your reproducer, foo itself is not hashable:

>>> hash(foo)
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    hash(foo)
    ~~~~^^^^^
  File "<python-input-0>", line 8, in __hash__
    return hash(self.x)
TypeError: unhashable type: 'set'

This is pretty much expected, because your hash function is recursive. The calls are foo.__hash__ -> set.__hash__ -> foo.__hash__ again, and so on!

There's not much we can do to prevent you from writing recursive hash functions. But, maybe there's a better solution here. How are you doing this in practice?

ZeroIntensity commented 2 weeks ago

Nevermind, it seems we don't support hashing sets at all: hash({1}) results in a TypeError. I wouldn't expect pickle to be able to handle something unsupported any better.

charles-cooper commented 2 weeks ago

Nevermind, it seems we don't support hashing sets at all: hash({1}) results in a TypeError. I wouldn't expect pickle to be able to handle something unsupported any better.

i don't think it really matters if the hash function works or not. for example if i replace the cycle using a set with a frozenset, which is hashable, i can get the same error:

import pickle

class Foo:
    def __init__(self):
        self.x = None#{1}

    def __hash__(self):
        return hash(self.x)

foo = Foo()
foo.x = frozenset([foo])

print(hash(foo))  # valid

print(pickle.loads(pickle.dumps(foo)).x)

charles-cooper commented 2 weeks ago

This is pretty much expected, because your hash function is recursive. The calls are foo.__hash__ -> set.__hash__ -> foo.__hash__ again, and so on!

the issue is still there even if the hash function is not recursive. here is another variant to demonstrate:

import pickle

class Foo:
    def __init__(self):
        self.x = set()
        self.y = 5  # pretend it is immutable

    def __hash__(self):
        return hash(self.y)

foo = Foo()
foo.x.add(foo)

print(hash(foo))  # valid

print(pickle.loads(pickle.dumps(foo)).x)

=>

  File "/home/charles/vyper/foo.py", line 11, in __hash__
    return hash(self.y)
                ^^^^^^
AttributeError: 'Foo' object has no attribute 'y'

ZeroIntensity commented 2 weeks ago

It's very difficult to serialize circular references. For example, json has protection against this:

>>> import json
>>> x = {1: 2}
>>> x["x"] = x
>>> json.dumps(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.12/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 200, in encode
    chunks = self.iterencode(o, _one_shot=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 258, in iterencode
    return _iterencode(o, 0)
           ^^^^^^^^^^^^^^^^^
ValueError: Circular reference detected

The hashing is still technically recursive in your reproducer. While the __hash__ itself is not recursive, pickle has to serialize every attribute, and the hash function for foo.x is still recursive. (To be fair, I'm not sure that the hash function itself matters, but the point is that pickle doesn't have a good way to serialize it.)

But maybe this is a bug. I'm not 100% certain that we don't support circular references with pickle, so if we're supposed to, then it is an issue. It seems like you aren't the first person to encounter this, see this SO question.

Interestingly, this does seem to work when the hash function isn't custom:

>>> class Foo:
...     def __init__(self):
...         self.x = set()
... 
>>> foo = Foo()
>>> foo.x.add(foo)
>>> pickle.loads(pickle.dumps(foo)).x
{<__main__.Foo object at 0x7a504f893a40>}

In any event, it's probably worth making the error nicer if we really want to not support this.

charles-cooper commented 2 weeks ago

It's very difficult to serialize circular references. For example, json has protection against this:

pickle absolutely serializes circular references. as i pointed out in the issue description, a list works fine here:

replacing line 5 with self.x = [self] poses no problem to the unpickler.

pickle's ability to handle cycles is documented, part of the pickle format design, and clear from the implementation, e.g. https://github.com/python/cpython/blob/1f9025a4e7819bb4f7799784710f0f3750a9ca31/Lib/pickle.py#L987-L999

i've spent some time today triaging the issue, and it seems that during deserialization of dicts, sets and frozensets, the hash function is called before the child items are fully reconstructed. so the technique that works for lists and tuples (i.e., putting not-fully-reconstructed references into the container) does not work here: https://github.com/python/cpython/blob/1f9025a4e7819bb4f7799784710f0f3750a9ca31/Lib/pickle.py#L1820-L1844
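This ordering is visible in the pickle stream itself. A small sketch (using a constant `__hash__` so the demo itself doesn't raise) disassembles a reproducer with `pickletools`: the set's ADDITEMS opcode, which hashes its items, appears before the BUILD opcode that restores the instance's attributes:

```python
import io
import pickle
import pickletools

class Foo:
    def __hash__(self):
        return 0  # constant hash, just so this demo never raises

foo = Foo()
foo.x = {foo}

buf = io.StringIO()
pickletools.dis(pickle.dumps(foo), out=buf)
listing = buf.getvalue()

# items are added to the set (and therefore hashed) before BUILD
# applies the instance's attribute state
assert listing.index("ADDITEMS") < listing.index("BUILD")
```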

It seems like you aren't the first person to encounter this, see this SO question.

yes, i am aware; i included it in the issue description.

ZeroIntensity commented 2 weeks ago

Oh, my bad. I'm guilty of only reading the reproducer from time to time :laughing: I guess this is a bug, thank you for the report!

A quick search shows that this issue was actually reported at bpo-1761028 over 15 years ago, but it was decided there that it was too complicated to fix, so I doubt anyone else will want to work on this -- would you like to author the PR? A fair warning though: this might be impossible to fix without breaking backwards compatibility (because we might need to move where __hash__ is called).

picnixz commented 2 weeks ago

cc @serhiy-storchaka

charles-cooper commented 2 weeks ago

would you like to author the PR? A fair warning though: this might be impossible to fix without breaking backwards compatibility (because we might need to move where __hash__ is called).

i have a partial fix here (against v3.11.10 tag 0c47759eee3e170e04a5dae82f12f6b375ae78f7)

diff --git a/Lib/pickle.py b/Lib/pickle.py
index f760bcdcba9..062a0e0ff6c 100644
--- a/Lib/pickle.py
+++ b/Lib/pickle.py
@@ -1204,6 +1204,9 @@ def load(self):
         self.proto = 0
         read = self.read
         dispatch = self.dispatch
+
+        self.deferred = []
+
         try:
             while True:
                 key = read(1)
@@ -1212,6 +1215,8 @@ def load(self):
                 assert isinstance(key, bytes_types)
                 dispatch[key[0]](self)
         except _Stop as stopinst:
+            for f in self.deferred:
+                f()
             return stopinst.value

     # Return a list of items pushed in the stack after last MARK instruction.
@@ -1701,12 +1706,15 @@ def load_setitems(self):
     def load_additems(self):
         items = self.pop_mark()
         set_obj = self.stack[-1]
-        if isinstance(set_obj, set):
-            set_obj.update(items)
-        else:
-            add = set_obj.add
-            for item in items:
-                add(item)
+
+        def finalize():
+            if isinstance(set_obj, set):
+                set_obj.update(items)
+            else:
+                add = set_obj.add
+                for item in items:
+                    add(item)
+        self.deferred.append(finalize)
     dispatch[ADDITEMS[0]] = load_additems

     def load_build(self):
@@ -1774,6 +1782,7 @@ def _loads(s, /, *, fix_imports=True, encoding="ASCII", errors="strict",

 # Use the faster _pickle if possible
 try:
+    raise ImportError()  # testing
     from _pickle import (
         PickleError,
         PicklingError,

but it only works for sets, and doesn't appear to scale to other data structures. for example, when i extend the technique to load_setitems, the cycle occurs during execution of the deferred functions. i suspect setitems is getting called for object serialization somewhere in my example, but i'm not sure how much time i have to continue looking into this.

ZeroIntensity commented 2 weeks ago

FWIW, we can't fix this on 3.11 -- that's security-only. I would write the fix against the main branch, and then depending on the complexity of the fix, we can backport it to 3.12 at the lowest.

JelleZijlstra commented 2 weeks ago

I don't really see room for a fix here. When pickle deserializes an object, it first creates it with no attributes (without calling __init__), then deserializes all the attributes and sets them on the instance. But here, to create the attributes, we need to put the object itself in a set. Initializing a set means hashing its members, and this object's __hash__() doesn't work while the attributes haven't been set yet.

You can fix that for sets specifically with a fix like the one above, but it will still fail for other kinds of references.
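For instance (a hedged variant of the reproducers above, not from the thread), the same failure appears when the cycle runs through a dict key, which a set-only deferral would not cover:

```python
import pickle

class Foo:
    def __init__(self):
        self.y = 5

    def __hash__(self):
        return hash(self.y)

foo = Foo()
foo.x = {foo: "self"}  # cycle through a dict *key* rather than a set

try:
    pickle.loads(pickle.dumps(foo))
except AttributeError as exc:
    # rebuilding the inner dict hashes the half-initialized Foo,
    # which has no attributes yet
    print(exc)
```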

In any case, your code puts a mutable object in a set, and the hash changes later. That's questionable anyway; it may mean that you can't find the object again later.

I would recommend not making any changes here.

serhiy-storchaka commented 2 weeks ago

Pickle has nothing to do with it. The exception is raised earlier.

ZeroIntensity commented 2 weeks ago

The exception occurs during the deserialization process, but yeah it would be a complicated (or possibly impossible) fix.

serhiy-storchaka commented 2 weeks ago

I see, there are other examples. Anyway, the bug is in the user code.

charles-cooper commented 2 weeks ago

In any case, your code puts a mutable object in a set, and the hash changes later. That's questionable anyway; it may mean that you can't find back the object later.

The variant I posted above demonstrates that the issue does not depend on whether the hash changes. It only has to do with the fact that hash is called before the object is fully instantiated.

https://github.com/python/cpython/issues/124937#issuecomment-2391811319

ZeroIntensity commented 2 weeks ago

It only has to do with the fact that hash is called before the object is fully instantiated.

And moving that would probably break things :(

charles-cooper commented 2 weeks ago

I see, there are other examples. Anyway, the bug is in the user code.

I disagree. Because of how dict deserialization works, hash is called on the objects before they are properly instantiated.

serhiy-storchaka commented 2 weeks ago

I see, there are more examples. And even without a recursive hash.

You can solve this problem by implementing either __new__(), so you would have a functioning object before adding it to a set,

def __new__(cls):
    self = super().__new__(cls)
    self.y = 5
    return self

or __reduce__() that returns a function which creates the object and the set in one step, without creating a loop in pickle.

def __reduce__(self):
    return Foo, ()
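The `__reduce__` snippet above is abbreviated (it drops the object's state). A fuller hedged sketch for the earlier non-recursive Foo variant, assuming the self-referential set is restored through the third element of the reduce tuple, round-trips the cycle:

```python
import pickle

class Foo:
    def __init__(self):
        self.x = set()
        self.y = 5

    def __hash__(self):
        return hash(self.y)

    def __reduce__(self):
        # Recreate the object via __init__ (so y exists before anything
        # hashes it), then restore the self-referential set as state.
        return (Foo, (), {"x": self.x})

foo = Foo()
foo.x.add(foo)

restored = pickle.loads(pickle.dumps(foo))
assert restored in restored.x  # the cycle survives the round trip
```

The key point is that the constructor arguments carry everything `__hash__` needs, so by the time the unpickler inserts the memoized object into the rebuilt set, hashing it is safe.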

charles-cooper commented 2 weeks ago

I see, there are more examples. And even without a recursive hash.

You can solve this problem by implementing either __new__(), so you would have a functioning object before adding it to a set,

def __new__(cls):
    self = super().__new__(cls)
    self.y = 5
    return self

or __reduce__() that returns a function which creates the object and the set in one step, without creating a loop in pickle.

def __reduce__(self):
    return Foo, ()

yea, these are interesting. I shrank the examples for this bug report, so I'm not sure if the __new__ approach will work, since the real example does not set self.y to a constant; it depends on input. the __reduce__ approach might work, let me check. shouldn't the pickle implementation do something like this to break the loop in the deserializer then?

btw, here is the full example which is causing the issue for me: https://github.com/charles-cooper/vyper/commit/2a5facf43ffb08b66f76d3d2b16b8f72f22d0625

the cycle is being raised when reconstructing this object: https://github.com/charles-cooper/vyper/blob/2a5facf43ffb08b66f76d3d2b16b8f72f22d0625/vyper/semantics/analysis/base.py#L229-L232

charles-cooper commented 1 week ago

@serhiy-storchaka the __reduce__ approach seemed to work. can't pickle do something similar internally to break the cycle during serialization?

in the meantime, i see no reason this cannot work for any dataclass, so i'll submit a PR for that.

serhiy-storchaka commented 1 week ago

Indeed, looking at your example I now wonder why dataclass(frozen=True) does not do this by default. This is not a general pickle issue, but it may be a dataclasses issue.

cc @ericvsmith

charles-cooper commented 1 week ago

Indeed, looking at your example I now wonder why dataclass(frozen=True) does not do this by default. This is not a general pickle issue, but it may be a dataclasses issue.

cc @ericvsmith

i would say it's both :). none of the snippets i posted in this thread use dataclasses, but apparently dataclasses have enough information to work around the issue anyway.
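To make the dataclasses angle concrete, here is a hedged sketch (the class name is illustrative, not from the thread) showing that a frozen dataclass with a generated `__hash__` hits the same failure even though hashing itself works fine:

```python
import pickle
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Info:
    y: int
    # excluded from eq/hash, so the generated __hash__ uses only y
    x: set = field(default_factory=set, compare=False)

info = Info(5)
info.x.add(info)
hash(info)  # fine: the generated __hash__ only reads y

try:
    pickle.loads(pickle.dumps(info))
except AttributeError as exc:
    # the inner set is rebuilt (and its items hashed) before y is restored
    print(exc)
```

Since the field list gives the dataclass machinery the full constructor signature, a generated `__reduce__` that reconstructs the instance from its init fields would presumably avoid this, along the lines serhiy-storchaka suggested.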