Open vbraun opened 11 years ago
Description changed:
AttributeError: ("'Foo' object has no attribute 'i'", <function unreduce at 0x2742140>, (<class '__main__.Bar'>, (<__main__.Foo object at 0x756ed90>,), {}))

Note that it's not really a problem about computing the hash; one could use a different hash function, but then the same problem would arise when comparing incomplete objects: you need to know self.i to compare instances, that is, to decide whether they should be taken from the UniqueRepresentation cache.

The two solutions that I see are:

- either give up on pickling, or on uniqueness
- do a two-step unpickling, where the first step unpickles without trying to enforce uniqueness, and a second step replaces objects with their identical counterparts from the UniqueRepresentation cache. The latter is quite tricky since objects can be entries in tuples, for example, and without CPython hacks it is not possible to replace them.
Is it possible to "give up on pickling" automatically, i.e. so that there is no need to explicitly clear the cache, as was done with toric varieties?
The problem lies in the way in which Python pickles by default (namely: create a new instance and fill its __dict__). This problem will always haunt you, and I experienced similar problems with homsets, if the following issues come together:

- an object O whose pickling works via __new__ and filling __dict__
- O has an attribute which uses O as key (in your example, this attribute is the UniqueRepresentation cache, which uses self as key).

I understand from comment:2 that you used a drastic solution in #15050: empty the cache. I think it would be better to provide O with a __reduce__ method.
Example

Here is an example that shows that UniqueRepresentation is not to blame:
sage: class Foo(object):
....: def __init__(self, i):
....: self.i = i
....: self.bar = {1:2, self:3}
....: def __hash__(self):
....: return int(self.i)
....:
sage: f = Foo(2)
sage: loads(dumps(f))
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-3-7cad277c698c> in <module>()
----> 1 loads(dumps(f))
/mnt/local/king/SAGE/GIT/sage/local/lib/python2.7/site-packages/sage/structure/sage_object.so in sage.structure.sage_object.loads (sage/structure/sage_object.c:11044)()
<ipython-input-1-4e7295304e31> in __hash__(self)
4 self.bar = {Integer(1):Integer(2), self:Integer(3)}
5 def __hash__(self):
----> 6 return int(self.i)
7
AttributeError: 'Foo' object has no attribute 'i'
sage: hash(f)
2
And here is how to fix it:
sage: class Foo(object):
....: def __init__(self, i):
....: self.i = i
....: self.bar = {1:2, self:3}
....: def __hash__(self):
....: return int(self.i)
....: def __reduce__(self):
....: return Foo, (self.i,)
....:
sage: f = Foo(2)
sage: loads(dumps(f))
<__main__.Foo at 0x57f6c10>
So basically you are saying to never pickle UniqueRepresentation with cyclic references. That certainly fixes the error, but it basically means that we give up on pickling in general (i.e. option number one in the ticket description).

It is not a good option for the toric varieties code, as it would mean manually recreating the input for the toric variety to reduce the CohomologyRing class. It could be automated by rewriting UniqueRepresentation.__reduce__ using sage_input(). Of course, then every class that derives from UniqueRepresentation must implement _sage_input_ to be serializable.
The problem is not in how Python unpickles, I would claim that it is a fundamental disconnect between serialization (in the presence of cyclic references) and making objects unique. You just can't achieve both aims in a single pass, no matter how you implement it.
Isn't it that pickling circular references in general needs a little care? I think the issue may be that UniqueRepresentation by itself is basically already a (global) cache, so combined with caches stored on the objects, circularity will be increased.
I may be wrong, but isn't it the case on #15050 that circularity arises because a toric variety caches its cohomology ring and the cohomology ring (being derived from the toric variety) refers back to the toric variety?
UniqueRepresentation means that a toric variety wouldn't HAVE to refer to its cohomology ring, since a freshly created cohomology ring will be looked up to be the same one anyway.
It may be the case that caching the cohomology ring on the toric variety explicitly is still desirable to avoid the penalty of using UniqueRepresentation, but that cache is then of a very different type than normal and it would be the kind of cache that doesn't need to be pickled.
It may also be that computing the cohomology ring from scratch is a very expensive operation, so that it is desirable to pickle the cohomology ring. However, in that case, perhaps someone should do that explicitly. The corollary would be that the toric variety gets pickled too.
It may seem a bit of a pain at first to have to think about avoiding circularities (i.e., in this case, trying to ensure that the toric variety doesn't refer explicitly to its cohomology ring), and I don't have a proof that you can always avoid circularity. But the fact that mathematics tries to avoid circularity in its definitions suggests we could go a long way in that direction.
It's not just pickling that has problems with circularity: The garbage collector also has trouble with it (EDIT: I mean a mild kind of trouble here. Nothing fatal). I think it's worth it to try and avoid circularity when possible and perhaps, when it's really necessary, take some care breaking it in select places before pickling.
Note that we also don't pickle discovered coercion graphs. Trying to do so would certainly spell disaster.
The reason why Jan tripped over this is that @parallel (and certainly any clustering implementation) will have to pickle relatively complicated objects that don't already appear in Sage. It's true that toric varieties don't need to refer to the cohomology ring directly, but we definitely want to refer to (cache) certain elements of the cohomology ring that can be the result of expensive computations. I would find it very alarming if pickling can't be relied on, and I recall William also saying that serialization is one of the strengths of Sage.
There is no problem with pickling arbitrarily complicated objects involving cyclic references, as far as I know cPickle can handle that in all cases. This ticket is about the inability (at least right now) to do so if you also want to enforce uniqueness, and what can be done about it.
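As a quick sanity check of this claim, plain pickle round-trips a self-referencing structure without any help (a minimal sketch, independent of Sage):

```python
import pickle

# A self-referencing list: the third entry is the list itself.
cycle = [1, 2]
cycle.append(cycle)

copy = pickle.loads(pickle.dumps(cycle))
assert copy[2] is copy  # the cycle is faithfully restored
```

This works because the pickler memoizes each object on first encounter and emits a back-reference instead of recursing; the difficulty only starts once uniqueness has to be enforced on top of this.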
With this approach, OOP things like X.Mori_cone() must be just wrappers for a procedural Mori_cone(X), which will perhaps call X._compute_Mori_cone() because efficient computation requires access to private parts of X. This does avoid circularity, but sticking @cached_method on an X.Mori_cone() which contains the actual computation code is much more convenient in my opinion.
Determining uniqueness of a given toric variety should not rely on cache status in any way, so theoretically it seems that it should be possible to construct non-cached part of the toric variety and decide whether it is new or already existing, then use it to construct its cached objects, and then set the cache to contain these cached objects.
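The two-phase construction described above could be sketched roughly as follows (the names `_registry`, `_lookup` and `UniqueThing` are hypothetical; this is not Sage's actual UniqueRepresentation implementation): uniqueness is decided from the constructor arguments alone, and the cached data travels separately as state.

```python
import pickle

_registry = {}  # hypothetical global uniqueness cache, keyed by (class, args)

def _lookup(cls, args):
    # Phase 1: decide uniqueness from the non-cached part (the args) only.
    key = (cls, args)
    if key not in _registry:
        _registry[key] = cls(*args)
    return _registry[key]

class UniqueThing(object):
    def __init__(self, n):
        self.n = n
        self._cache = {}

    def __reduce__(self):
        # Cached data travels as state (phase 2), applied only after the
        # unique instance has been looked up or reconstructed.
        return (_lookup, (UniqueThing, (self.n,)), self._cache)

    def __setstate__(self, cache):
        self._cache.update(cache)

a = _lookup(UniqueThing, (5,))
a._cache['square'] = 25
b = pickle.loads(pickle.dumps(a))
assert b is a                     # uniqueness preserved in-process
assert b._cache['square'] == 25   # cache restored in phase 2
```

Within one process the round trip returns the identical object; in a fresh process, `_lookup` reconstructs and registers it before the cache is filled in.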
OK, fair enough. That's a bug in Python (http://bugs.python.org/issue1761028), though not one that is going to be fixed. It fails for the same reason: the two dictionary keys may have to be identified, but to decide that you need the full object.
Replying to @novoselt:
> With this approach, OOP things like X.Mori_cone() must be just wrappers for a procedural Mori_cone(X), which will perhaps call X._compute_Mori_cone() because efficient computation requires access to private parts of X. This does avoid circularity, but sticking @cached_method on an X.Mori_cone() which contains the actual computation code is much more convenient in my opinion.
Correct. My initial inclination would be to say that it's better to avoid circularity when possible, so the design where Mori_cone(X) is the gateway would be preferable (I expect that Mori_cone has to refer to X, so avoiding references from X to Mori_cone is the appropriate route to avoid circularity).

It may well be that, for efficiency reasons, we would want X to actually keep a reference to its Mori_cone (for instance, if the computation is expensive and necessary internally, we might want to keep the cone alive for the lifetime of X; something UniqueRepresentation by itself won't do. The "right" solution would be to center the computations around the cone instead, because that'll keep X alive anyway, but I imagine doing so may be too onerous).
In that case, the question becomes how to implement this while staying largely in line with our original design. One place we have just identified where this behaviour is important is pickling. So we could just have the caches in question marked as "wipe on pickle" (or perhaps that should be the default, with "pickle this cache" as the option).
A consequence is that dumps(X) and dumps([X, X.Mori_cone()]) lead to pickles with a very different effect if X.Mori_cone() is expensive to compute, but that would be something we would have to live with. There will always be a difference between the effort required to get something to work and to get something to work efficiently.
Of course, there may be other designs that we can make work. I would think the purpose of this ticket is to evaluate our options.
Replying to @vbraun:
> So basically you are saying to never pickle UniqueRepresentation with cyclic references. That certainly fixes the error, but it basically means that we give up on pickling in general (i.e. option number one in the ticket description). It is not a good option for the toric varieties code, as it would mean manually recreating the input for the toric variety to reduce the CohomologyRing class. It could be automated by rewriting UniqueRepresentation.__reduce__ using sage_input(). Of course, then every class that derives from UniqueRepresentation must implement _sage_input_ to be serializable.
>
> The problem is not in how Python unpickles; I would claim that it is a fundamental disconnect between serialization (in the presence of cyclic references) and making objects unique. You just can't achieve both aims in a single pass, no matter how you implement it.
I suspect the key might lie in the fact that the pickle machinery DOES allow for a second pass: __reduce__ allows returning a tuple of length at most 5, (constructor, args, state, append_iterator, setitem_iterator), which instructs the unpickling machinery to execute roughly the following:
obj = constructor(*args)
obj.__setstate__(state)
for a in append_iterator:
    obj.append(a)
for key, value in setitem_iterator:
    obj[key] = value
(the last two entries, append_iterator and setitem_iterator, are just there for convenience: their functionality could be obtained from an appropriate __setstate__)
The key here is that circularity and recursive loops are only a problem for things included in args. In the later phases, obj already exists as a functional Python object; its state might not be the final, desired one yet. Things like caches are originally created by modifying the object after it already exists, so if we pickle caches, we should do so in a way that they get recreated during setstate, not as part of args.

So I don't think there is a fundamental obstruction to pickling arbitrarily circular data structures. You just have to make sure that the circularity gets inserted at the setstate phase. In reality, this just reflects how the circularity was obtained in the data structure to begin with!
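A minimal sketch of that recipe, applied to a Foo-like class (the class name `Node` is made up; nothing here depends on Sage internals): the constructor arguments contain only the hash-relevant `i`, and the cyclic dictionary is shipped in the state phase, when the object already exists and is hashable.

```python
import pickle

class Node(object):
    def __init__(self, i):
        self.i = i
        self.bar = {1: 2, self: 3}   # cyclic: the dict uses self as a key

    def __hash__(self):
        return int(self.i)

    def __reduce__(self):
        # args hold only the hash-relevant data; the cyclic dict goes
        # into state and is applied by __setstate__ after construction.
        return (Node, (self.i,), {'bar': self.bar})

    def __setstate__(self, state):
        self.__dict__.update(state)

n = Node(2)
m = pickle.loads(pickle.dumps(n))
assert m.bar[m] == 3   # the self-keyed entry survived the round trip
```

During unpickling, Node(2) is executed and memoized first, so when the state dictionary (which refers back to the new object) is built, hash(m) already works.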
I haven't found the smoking gun yet, but it seems to me that if presently pickling can get ruined due to cached information, it's because __cached_methods is being populated in construction, not in setstate.
In the example in the description of the ticket, one solution would be to relegate the initialization of self.bar to __setstate__. That, or provide a custom __reduce__ that reproduces the exact Foo(...) call, because that ensures that i is present before bar is used, so that __hash__ works.
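The first option, staying within default pickling (no custom __reduce__), could look like the following sketch (the helper state keys `bar_self_value` and `bar_rest` are made-up names): __getstate__ splits off the self-keyed entry so the pickled state contains no dictionary keyed by the half-built object, and __setstate__ restores self.i before rebuilding self.bar.

```python
import pickle

class Foo(object):
    def __init__(self, i):
        self.i = i
        self.bar = {1: 2, self: 3}

    def __hash__(self):
        return int(self.i)

    def __getstate__(self):
        # Split off the self-keyed entry: the pickled state then contains
        # no dictionary that needs hash(self) during unpickling.
        state = self.__dict__.copy()
        bar = state.pop('bar')
        state['bar_self_value'] = bar[self]
        state['bar_rest'] = dict((k, v) for k, v in bar.items()
                                 if k is not self)
        return state

    def __setstate__(self, state):
        self.i = state['i']                  # hash(self) works from here on
        self.bar = dict(state['bar_rest'])
        self.bar[self] = state['bar_self_value']

f = Foo(2)
g = pickle.loads(pickle.dumps(f))
assert g.bar[g] == 3   # unpickles fine, self-keyed entry restored
```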
Replying to @nbruin:
> I suspect the key might lie in the fact that the pickle machinery DOES allow for a second pass: __reduce__ allows returning a tuple of length at most 5, (constructor, args, state, append_iterator, setitem_iterator) ... The key here is that circularity and recursive loops are only a problem for things included in args.
I disagree.

We are talking here about unpickling a Python object O that has an attribute A whose construction relies on the hash of O. If I understand correctly how Python's default pickling works, args is empty (since there is no __getinitargs__ method). Instead, a new instance of O.__class__ is created without calling __init__, and A is passed in state (not in args) and is used to fill the new instance's __dict__.
> In the later phases, obj already exists as a functional Python object; its state might not be the final, desired one yet.

Not the way Python's default pickling works! This is what I tried to say in comment:3: you need to implement pickling for O.__class__, by a __reduce__ method (or perhaps a __getinitargs__ would work as well), because then (as you say) the object would exist as a functional Python object before filling its __dict__.
> So I don't think there is a fundamental obstruction to pickling arbitrarily circular data structures. You just have to make sure that the circularity gets inserted at the setstate phase.

Sure. "You have to make sure...", which means "you must implement pickling, as you can't rely on Python's default way of serialisation".
> I haven't found the smoking gun yet, but it seems to me that if presently pickling can get ruined due to cached information, it's because __cached_methods is being populated in construction, not in setstate.

The example in comment:3 has already shown that __cached_methods is not relevant.
> In the example in the description of the ticket, one solution would be to relegate the initialization of self.bar to __setstate__. That, or provide a custom __reduce__ that reproduces the exact Foo(...) call, because that ensures that i is present before bar is used, so that __hash__ works.
Exactly.
Replying to @simon-king-jena:
> > The key here is that circularity and recursive loops are only a problem for things included in args.
>
> I disagree.
OK, my remark is not quite relevant for this ticket. I was more thinking of problems a la #15156.
What we see on both tickets is that doing too much/too little at construction/setstate time leads to pickling problems.
It also seems that some of the code presently in Sage does not take very much care in handling this.

So the lesson still stands: in the face of pickling problems, pay attention to your construction/setstate balance. I think that's a useful lesson for people having to provide their own reduction methods, and for the near future people should probably watch out that the problems may well lie in existing code, not just their own additions.
The unpickling of UniqueRepresentation is fragile, and we keep tripping over that at #15050, #15149.

I claim that the issue is a fundamental problem: to decide uniqueness, you need the complete object. It is in principle impossible to decide whether to use an already-constructed isomorphic object without reconstructing the complete object first. But, as soon as cyclic references are used, you cannot avoid incompletely reconstructed objects during the unpickling process.

As a minimal example, consider a class whose __hash__ depends on an attribute set in __init__ (the Foo class above). Now instances cannot be unpickled, because the hash cannot be evaluated without filling in self.i first. Note that it's not really a problem about computing the hash; one could use a different hash function, but then the same problem would arise when comparing incomplete objects: you need to know self.i to compare instances, that is, to decide whether they should be taken from the UniqueRepresentation cache.

The two solutions that I see are:

- either give up on pickling, or on uniqueness
- do a two-step unpickling, where the first step unpickles without trying to enforce uniqueness, and a second step replaces objects with their identical counterparts from the UniqueRepresentation cache. The latter is quite tricky since objects can be entries in tuples, for example, and without CPython hacks it is not possible to replace them.

CC: @simon-king-jena @sagetrac-jkeitel @novoselt
Component: misc
Issue created by migration from https://trac.sagemath.org/ticket/15155