sagemath / sage

Main repository of SageMath
https://www.sagemath.org

Use weak references to cache homsets #11521

Closed jpflori closed 11 years ago

jpflori commented 13 years ago

Originally, this ticket was about the following memory leak when computing with elliptic curves:

sage: K = GF(1<<55,'t')
sage: a = K.random_element()
sage: while 1:
....:     E = EllipticCurve(j=a); P = E.random_point(); 2*P; 

This example is in fact solved by #715. However, while working on that ticket, another leak has been found, namely

sage: for p in prime_range(10^5):
....:     K = GF(p)
....:     a = K(0)
....:     
sage: import gc
sage: gc.collect()
0

So, I suggest starting with #715 and solving the second memory leak on top of it. It seems that a strong cache for homsets is to blame. I suggest using the weak TripleDict instead, which was introduced in #715.
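The principle can be sketched in plain Python. TripleDict is a Sage-specific Cython class keyed by triples, so the single-object keys and the `FiniteFieldStandIn` class below are simplifications, not Sage code:

```python
import gc
import weakref

class FiniteFieldStandIn:
    """Hypothetical stand-in for a Sage parent such as GF(p)."""
    def __init__(self, p):
        self.p = p

strong_cache = {}                         # a strong cache pins its keys forever
weak_cache = weakref.WeakKeyDictionary()  # weak keys let unused parents be collected

for p in (2, 3, 5):
    K = FiniteFieldStandIn(p)
    strong_cache[FiniteFieldStandIn(p)] = 'homset data'
    weak_cache[K] = 'homset data'

del K          # drop the last strong reference to the weakly cached parents
gc.collect()
# on CPython the weakly keyed entries disappear with their keys
print(len(strong_cache), len(weak_cache))  # 3 0
```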

To be merged with #715. Apply

Depends on #12969 Depends on #715

Dependencies: #12969; to be merged with #715

CC: @jpflori @nthiery

Component: coercion

Keywords: sd35

Author: Simon King, Nils Bruin

Reviewer: Jean-Pierre Flori, Nils Bruin, Simon King

Merged: sage-5.5.beta0

Issue created by migration from https://trac.sagemath.org/ticket/11521

kiwifb commented 12 years ago
comment:49

It sometimes happens that the Sage session itself crashes on exit. This is probably one of those cases. Last time I got one, it was related to Singular, I think. These are quite difficult to corner with gdb. The best you can hope for is to start a Sage session under gdb, run the last doctest sequence, and quit Sage; that may trigger the crash, in which case you may have some luck with gdb. But this is one of those cases where gdb itself may interfere. I don't think I have time to look into this right now, but I'll put it on my "to do" list in case it isn't solved when I have time to spare.

simon-king-jena commented 12 years ago

Use weak references for the keys of the homset cache. If weak references are not supported, then raise an error, pointing out that category objects should be CategoryObjects.

simon-king-jena commented 12 years ago

Work Issues: Understand why a weak key dictionary is not enough

simon-king-jena commented 12 years ago
comment:50

Attachment: trac11521_new_homset_cache.patch.gz

I have attached a new patch version. It fixes the segfault I mentioned. However, it also does not fix the memory leak.

The difference between the two versions is: The new patch still uses weak references to the key of the cache, but a strong reference to the value (i.e., the homset).

The homset has references to its domain and codomain, which constitute the cache key. Thus, I expected that it makes no difference whether one has a strong or a weak reference to the homset. But I stand corrected. That needs to be investigated more deeply.
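The puzzling behaviour can be reproduced in plain Python (`Domain` and `Homset` here are hypothetical stand-ins): with weak keys but strongly held values, a value that refers back to its key keeps the whole entry alive, which would explain why weak keys alone do not fix the leak.

```python
import gc
import weakref

class Domain:
    """Hypothetical stand-in for a Sage parent."""

class Homset:
    """Hypothetical stand-in; like Sage homsets, it keeps its domain alive."""
    def __init__(self, domain):
        self.domain = domain  # strong back-reference to the cache key

cache = weakref.WeakKeyDictionary()

D = Domain()
cache[D] = Homset(D)   # weak key, but the strongly held value refers to the key
r = weakref.ref(D)
del D
gc.collect()
alive_with_entry = r() is not None   # cache -> Homset -> Domain keeps D alive

cache.clear()          # dropping the value finally frees the key
gc.collect()
alive_after_clear = r() is not None
print(alive_with_entry, alive_after_clear)  # True False
```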

jpflori commented 12 years ago
comment:51

Dear Simon,

Thanks a lot for taking care of all of this!

I'm just back from vacation and will have a look at all your patches in the following days.

I must point out that even if the memory leak was small, it still mattered, because I used a LOT of these objects, and after several hours of computation it ate all the available memory (the piece of code in the ticket description is just a minimal example; in my actual code I used different curves and similar simple computations on them)...

And to make things clear, I must say I put this ticket as "needs review" in order to get it closed as wontfix/duplicate, because I thought it could be seen as a concrete example of ticket [ticket:715], and all the work could be done there.

Of course you're the one currently doing all the work, so do as you want :)

Cheers,

JP

simon-king-jena commented 12 years ago
comment:52

Hi Jean-Pierre,

Replying to @jpflori:

I must point out that even if the memory leak was small,

It isn't small.

And to make things clear, I must say I put this ticket as "needs review" in order to get it closed as wontfix/duplicate, because I thought it could be seen as a concrete example of ticket [ticket:715], and all the work could be done there.

I am not sure whether it would be good to do everything on one ticket, as the topics are related but clearly distinct: #715 is about the weak "TripleDict" for coercion, #12215 is about a weak version of cached_function, and this ticket is about the cache of homsets.

On the other hand: I am about to post a new patch here, with #715 as a dependency. It will use the new version of TripleDict from #715. So, one could argue that there is a common tool for both tickets, and they belong together.

Anyway. The new patch will fix the leak, but it will not suffer from the segfaults.

Cheers,

Simon

simon-king-jena commented 12 years ago
comment:53

I have attached another patch under a new name, using a new approach: The weak TripleDict that I introduced at #715 is an appropriate tool for the cache of homsets. The key is the triple (domain, codomain, category), and the value is a weak reference to the corresponding homset.

There is a new test (the same as in the other patch), showing that the leak is fixed. And all tests in sage/schemes, sage/rings, sage/categories and sage/structure pass.

Hence: Needs review!

Apply trac11521_triple_homset.patch

simon-king-jena commented 12 years ago

Changed dependencies from #11900 to #11900 #715

simon-king-jena commented 12 years ago

Changed work issues from Understand why a weak key dictionary is not enough to none

simon-king-jena commented 12 years ago

Description changed:

--- 
+++ 
@@ -12,3 +12,7 @@
 - one does not change the curve in the loop

 - does P+P instead of a multiplication
+
+**Apply**
+
+[attachment: trac11521_triple_homset.patch](https://github.com/sagemath/sage-prod/files/10653173/trac11521_triple_homset.patch.gz)
simon-king-jena commented 12 years ago
comment:54

In fact all tests pass for me, with #9138, #11900, #11115, #715 and attachment: trac11521_triple_homset.patch applied on top of sage-4.8.alpha3.

jpflori commented 12 years ago
comment:55

Applied all patches mentioned in the previous comment without problems on sage-4.8.alpha5, and "make ptestlong" passed all tests. I'll try today to check that the memory leaks actually disappeared :) and have a look at your patches to give them positive reviews as well.

simon-king-jena commented 12 years ago
comment:56

It depends on what you call "the" leak.

At least, the patch fixes one leak, namely the one that is doctested against. I am not sure whether it is enough to fix the leak exposed in the ticket description:

sage: K = GF(1<<55,'t')
sage: a = K.random_element()
sage: get_memory_usage()
872.26171875
sage: while 1:
....:     E = EllipticCurve(j=a)
....:     P = E.random_point()
....:     Q = 2*P;
....:     print get_memory_usage()

The memory usage does climb, but it climbs a lot slower than, for example, in sage-4.6.2.

simon-king-jena commented 12 years ago
comment:57

PS: I just tested that even #12215 (which is a lot more aggressive in using weak references and fixes a gaping leak) is not enough to totally fix the example from the ticket description. Hence, I think that, for now, we should be happy with a partial fix and investigate the remaining problems on different tickets.

simon-king-jena commented 12 years ago
comment:58

After running the example for a couple of minutes, interrupting and doing garbage collection, I find:

sage: from sage.schemes.elliptic_curves.ell_finite_field import EllipticCurve_finite_field
sage: LE = [x for x in gc.get_objects() if isinstance(x,EllipticCurve_finite_field)]
sage: len(LE)
632
sage: import objgraph
sage: from sage.rings.finite_rings.finite_field_prime_modn import FiniteField_prime_modn as FF
sage: LF = [x for x in gc.get_objects() if isinstance(x, FF)]
sage: len(LF)
1

LF shows that one leak is fixed. However, the curves in LE, which are all defined over the same finite field, can not be collected.

simon-king-jena commented 12 years ago
comment:59

Using objgraph, I found that the remaining leak seems to be related to sage.schemes.generic.homset.SchemeHomsetModule_abelian_variety_coordinates_field_with_category. Since this is another homset, it would make sense to try to fix it here.

simon-king-jena commented 12 years ago
comment:60

Aha!!

I found that we have many equal instances of these scheme homsets:

sage: LE = [x for x in gc.get_objects() if isinstance(x,SchemeHomsetModule_abelian_variety_coordinates_field)]
sage: LE[100] == LE[150]
True

So, I guess we should fix the memory leak by using UniqueRepresentation or UniqueFactory!

jpflori commented 12 years ago
comment:61

My 2 cents: Isn't it somehow related to the fact that elliptic curves are not unique parents? In some of my original tests, I used different curves, and depending on the cache where they were stored, "equality" testing was done either with "is" or with "==". In the former case, the "same" elliptic curve would be stored several times, making the "leak" even bigger.

simon-king-jena commented 12 years ago
comment:62

OK. Then why not reduce the non-uniqueness? EllipticCurve could easily be turned into a cached function, which would mean that two elliptic curves become identical if they are defined by equal data. That is not enough to be a unique parent (there could be equal elliptic curves defined by different data), but it would be a step forward. And with #12215, it could then actually be turned into a weakly cached function.
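A cached constructor along these lines can be sketched in plain Python; `CurveStandIn`, `make_curve`, and the ring/coefficient arguments are hypothetical, and `functools.lru_cache` merely stands in for Sage's cached_function (note that it holds strong references, which is exactly what #12215 wants to relax):

```python
from functools import lru_cache

class CurveStandIn:
    """Hypothetical stand-in for an elliptic curve."""
    def __init__(self, ring, ainvs):
        self.ring = ring
        self.ainvs = ainvs

@lru_cache(maxsize=None)
def make_curve(ring, ainvs):
    # the ring must be part of the key: equal coefficient tuples over
    # different rings must not yield the same curve (see comment:65)
    return CurveStandIn(ring, ainvs)

E1 = make_curve('GF(101)', (1, 2))
E2 = make_curve('GF(101)', (1, 2))
E3 = make_curve('GF(103)', (1, 2))
print(E1 is E2, E1 is E3)  # True False
```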

simon-king-jena commented 12 years ago
comment:63

Yes, using a cached function would indeed fix the leak that is caused by uselessly creating an elliptic curve with the same basic data over and over again. In particular, it would fix the leak from the ticket description (#12215 is not needed for that).

I am preparing a new patch (it requires doctests and stuff), so it is "needs work" for now.

simon-king-jena commented 12 years ago
comment:64

It is definitely not easy. It seems that the elliptic curve code relies on the non-uniqueness of elliptic curves: I get loads of errors.

simon-king-jena commented 12 years ago
comment:65

Ah! Tuples of elements of different rings can be equal, but the corresponding elliptic curves would be different. So, the ring needs to be part of the description.

simon-king-jena commented 12 years ago
comment:66

I am undecided what we should do:

We could argue that my patch does fix some memory leak, and leave it like that (modulo comments of the reviewer, of course). In order to fix the memory leak exposed by the example from the ticket description, we have no choice but to introduce some kind of uniqueness for elliptic curves. But that is a different topic and should thus be dealt with on a different ticket (perhaps such a ticket already exists?).

Or we could argue that this ticket is about fixing the memory leak that is exposed in the description. Hence, we should do all necessary steps here.

And then, there is still the question whether the number theorists really want the elliptic curves to be "weakly unique" (i.e., identical if the given data are equal). In addition, we might want the elliptic curve cache to be weak - which might imply that we have to wait for #12215.

What do you think?

I guess I'll also ask on sage-nt.

jpflori commented 12 years ago
comment:67

I think we can go your way and stop working on this ticket now (or rather once I've taken the time to go through your patches to properly review them). Thus we can already get some non-negligible improvements merged.

Anyway, the problem of uniqueness of elliptic curves is highly non-trivial and deserves its own ticket. I guess #11474 would be the right place to treat that problem. An alternative would be to make this ticket depend on #11474.

simon-king-jena commented 12 years ago
comment:68

Hi Jean-Pierre,

indeed, you found the correct ticket.

And there is no need to ask on sage-nt, since I already did before opening #11474: Sage-nt first seemed to agree that uniqueness is good, so I opened the ticket. But then, they became convinced that much of the elliptic curve stuff depends on choices (generators), so that we should consider elliptic curves to be mutable objects, for which we wouldn't like to have uniqueness.

Considering the discussion on sage-nt, #11474 probably is "wontfix".

simon-king-jena commented 12 years ago
comment:69

I am not sure whether we should really stop work right there. After all, it is still not 100% clear to me why the elliptic curve E, which is created in the loop in an increasing number of copies, cannot be garbage collected.

First, E is created, and some coercion into it is created. The coercion is cached. By #715, some key of the cache provides a weak reference to E. In addition, the coerce map refers to a homset, and the homset refers to its codomain, which is E. I wonder whether there is a chain of strong references from E to the homset (one could try to find out using objgraph, I guess). If there is, then we would have a reference cycle. If that were the case, we would need to find the __del__ method that prevents the cycle from being garbage collected.
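Simon's reasoning can be checked in plain Python (`Curve` and `Homset` are hypothetical stand-ins for E and its homset): the cycle collector handles such cycles fine as long as no __del__ method is involved. Under Python 2, which Sage used at the time, a __del__ anywhere in the cycle would have left it uncollected in gc.garbage; PEP 442 removed that restriction in Python 3.4.

```python
import gc
import weakref

class Curve:
    """Hypothetical stand-in for the elliptic curve E."""

class Homset:
    """Hypothetical stand-in for the homset referring back to E."""

E = Curve()
H = Homset()
E.homset = H      # strong reference E -> homset
H.codomain = E    # strong back-reference homset -> E: a cycle

r = weakref.ref(E)
del E, H
gc.collect()
print(r() is None)  # True: a cycle with no __del__ is collectable
```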

jpflori commented 12 years ago
comment:70

Objgraph only shows a reference left as _codomain in a dict in a SchemeHomsetModule_a_v_c_f (defined in sage.schemes.generic.homset).

jpflori commented 12 years ago
comment:71

The homset of points of the ab. group of the curve is itself referenced by an IntegerMulAction, the point at infinity on the curve (no idea when it gets created), and a dict with 11 elements. I guess the problem might be that, in addition to the _cache in sage.categories.homset, the homset of points is directly linked within the Action object? That could also be nonsense.

simon-king-jena commented 12 years ago
comment:72

The IntegerMulAction should be fine, as it is stored in a TripleDict (hence, weakly, by #715). Where does the dict occur?

simon-king-jena commented 12 years ago
comment:73

I am a bit puzzled.

I see that there are a couple of cycles, involving an elliptic curve, a SchemeHomsetModule..., an IntegerMulAction, and an EllipticCurvePoint_finite_field. However, none of them has a __del__ method; thus, the garbage collector should be able to collect the cycles.

But the backref graph also shows something that I don't understand: Apparently there is a top level dict with three items that points to the elliptic curve. Where is it?

simon-king-jena commented 12 years ago
comment:74

I got it!!

By #715, both the keys and the value of sage.structure.coerce_dict.TripleDict are by weak reference -- if possible. If the value does not allow for a weak reference, then (to simplify code) a constant function is used instead.

Actions are stored in such TripleDicts. However, an IntegerMulAction can not be weakly referenced. Hence, there is a strong reference, and no garbage collection can occur (the value points back to the keys, hence, they can't be collected either).

Solution: Make IntegerMulAction weakly referenceable! That should suffice.
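Weak referenceability itself can be demonstrated in plain Python with `__slots__` (the classes below are illustrative; IntegerMulAction is a Cython cdef class, where the analogue is declaring a `cdef object __weakref__` attribute):

```python
import weakref

class NotWeakrefable:
    """Without a __weakref__ slot, instances cannot be weakly referenced."""
    __slots__ = ('n',)

class Weakrefable:
    """Adding '__weakref__' to __slots__ enables weak references."""
    __slots__ = ('n', '__weakref__')

try:
    weakref.ref(NotWeakrefable())
    refable_without_slot = True
except TypeError:
    refable_without_slot = False

obj = Weakrefable()
w = weakref.ref(obj)
print(refable_without_slot, w() is obj)  # False True
```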

simon-king-jena commented 12 years ago
comment:75

It does not suffice. But I think it is a step to the right direction.

simon-king-jena commented 12 years ago
comment:76

Apparently the weak reference to the value is not used in the TripleDict! So, I have to look at #715, perhaps I forgot to implement something there.

simon-king-jena commented 12 years ago
comment:77

Aha, I was mistaken: At #715 I only took care of weak references to the keys of TripleDict. The big question is whether we additionally want weak references to the values. I am somewhat against doing this at #715.

jpflori commented 12 years ago
comment:78

My point was that it seemed to me that there was some (not weak!) reference in the dictionary ZZ._action_hash pointing to the IntegerMulAction, itself pointing to the curve, and I thought it could prevent garbage collection.

So I added a cpdef function to be able to clear that dictionary and see if garbage collection occurs once it is emptied; however, it is at the C level, so more or less the whole Sage library had to be rebuilt, and I had to go back home before rebuilding was finished...

Anyway, I'll have a look at it tomorrow :)

By the way, have you any idea whether dictionaries declared at the C level, such as ZZ._action_hash, are detected by objgraph?

simon-king-jena commented 12 years ago
comment:79

Hi Jean-Pierre,

The thing with the _action_hash is a good finding. I thought it would be a TripleDict, but it is just a usual dict. And this could indeed be a problem. I don't know if this is visible to objgraph.

But I also think that, in addition, we have the problem of strong references to the values of TripleDict. In principle, one could use weak references not only for the keys but also for the values -- perhaps this could be done in #715.

Or one could leave TripleDict as it currently is in #715, but explicitly use weak references to functors for coercion (which needs to be enabled first). _action_hash would then have to use weak references as well.

There might be a speed problem, though.

jpflori commented 12 years ago
comment:80

For info, resetting ZZ._action_hash to {} is not sufficient to let the IntegerMulAction be garbage collected.

Objgraph shows 3 objects pointing to the abelian group of points of the curve (itself pointing to the curve):

I'd be happy to test a weakref-for-values version of your patch, regardless of the speed impact.

jpflori commented 12 years ago
comment:81

I guess your solution should be the right one, because resetting the coercion cache solves the problem, so dealing with it should be enough for the example in the ticket description.

Only one copy gets cached in _action_hash because the curves are equal (==) but not identical (is). However, if one uses (really) different curves, one copy of each also gets cached in _action_hash.

Here is a small piece of code to test the effect of clearing the different caches (the code to clear the ZZ cache is left as an exercise; anyway, we should use weakrefs there):

sage: import gc, inspect
sage: load /usr/share/shared/objgraph.py # or whatever you should type to use objgraph
sage: from sage.schemes.elliptic_curves.ell_finite_field import EllipticCurve_finite_field
sage: K = GF(1<<60, 't')
sage: j = K.random_element()
sage: for i in xrange(100):
....:     E = EllipticCurve(j=j); P = E.random_point(); 2*P; del P; del E;
....:
sage: gc.collect()
68
sage: L = [x for x in gc.get_objects() if isinstance(x, EllipticCurve_finite_field)]
sage: len(L)
100
sage: del L
sage: get_coercion_model().reset_cache()
sage: gc.collect()
6172
sage: L = [x for x in gc.get_objects() if isinstance(x, EllipticCurve_finite_field)]
sage: len(L)
1
sage: del L
sage: ZZ.del_hash()
sage: gc.collect()
56
sage: L = [x for x in gc.get_objects() if isinstance(x, EllipticCurve_finite_field)]
sage: len(L)
0
sage: del L
sage: for i in xrange(100):
....:     E = EllipticCurve(j=K.random_element()); P = E.random_point(); 2*P; del P; del E;
....:
sage: gc.collect()
174
sage: L = [x for x in gc.get_objects() if isinstance(x, EllipticCurve_finite_field)]
sage: len(L)
100
sage: del L
sage: get_coercion_model().reset_cache()
sage: gc.collect()
738
sage: L = [x for x in gc.get_objects() if isinstance(x, EllipticCurve_finite_field)]
sage: len(L) # _action_hash
100
sage: del L
sage: ZZ.del_hash()
sage: gc.collect()
5742
sage: L = [x for x in gc.get_objects() if isinstance(x, EllipticCurve_finite_field)]
sage: len(L) # mmm got one left!!! not sure where it comes from yet...
1
simon-king-jena commented 12 years ago
comment:82

Hi Jean-Pierre,

Replying to @jpflori:

I guess your solution should be the right one, because resetting the coercion cache solves the problem, so dealing with it should be enough for the example in the ticket description.

But it is not so easy because...

Only one copy gets cached in _action_hash because the curves are equal (==) but not identical (is).

... elliptic curves are not unique parents.

I think it would be a mistake to use a weak reference to the values of TripleDict. I tried and got many errors - and I think this was because some important data (homsets, actions, ...) were garbage collected even though there were still strong references to domain and codomain. In that situation, the value must not be garbage collected.

sage: len(L) # mmm got one left!!! not sure where it comes from yet...

Don't forget the last copy of E that was defined in the loop!

Since I think that a weak reference to the values of TripleDict won't work: What else could one do? Or perhaps I should try again: It could be that the errors came from the wrong choice of a callback function.
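The callback question can be illustrated in plain Python (`Homset` and `cache_homset` are hypothetical stand-ins): a weakref callback that closes over nothing but the key cleanly purges the dead entry, whereas a callback that accidentally held a strong reference to the value would keep it alive forever.

```python
import gc
import weakref

class Homset:
    """Hypothetical stand-in for a Sage homset."""

cache = {}

def cache_homset(key):
    """Cache a weak reference whose callback removes the stale entry."""
    H = Homset()
    # the lambda must not close over H itself: that strong reference
    # would prevent H from ever being collected
    cache[key] = weakref.ref(H, lambda wr, key=key: cache.pop(key, None))
    return H

H = cache_homset(('domain', 'codomain', 'category'))
entries_before = len(cache)
del H
gc.collect()
print(entries_before, len(cache))  # 1 0
```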

jpflori commented 12 years ago
comment:83

Replying to @simon-king-jena:

Hi Jean-Pierre, Replying to @jpflori:

I guess your solution should be the right one, because resetting the coercion cache solves the problem, so dealing with it should be enough for the example in the ticket description.

But it is not so easy because...

Only one copy gets cached in _action_hash because the curves are equal (==) but not identical (is).

... elliptic curves are not unique parents.

Yep :)

I think it would be a mistake to use a weak reference to the value of TripleDict. I tried and got many errors - and I think this was because some important data (homsets, actions, ...) was garbage collected even though there was still a strong reference to domain and codomain. In that situation, the value must not be garbage collected.

I see...

sage: len(L) # mmm got one left!!! not sure where it comes from yet...

Don't forget the last copy of E that was defined in the loop!

In the new code I posted, I explicitly added "del P; del E;" to the loop, so no copy should be left. What's even stranger is that this remaining copy appears after the second loop (with random j-invariants, so different curves) but not after the first one (with constant j-invariant, so only one curve)!

Since I think that a weak reference to the values of TripleDict won't work: What else could one do? Or perhaps I should try again: It could be that the errors came from the wrong choice of a callback function.

Mmm, have to think more about all of that...

simon-king-jena commented 12 years ago
comment:84

I think I should describe the problem in more detail.

Currently, the action and coercion caches hold strong references to their values, and each cached action holds strong references to its domain and codomain.

Hence, if an action is stored in the action cache, then there will always be a chain of strong references from __main__ via the cache to the action, and further to domain and codomain, so that it can not be collected.

On the other hand, if weak references to the values of the action and coercion caches are used, then an action or a coercion could die even while domain and codomain are still alive. That's probably not good. At the very least, it would mean that actions would need to be re-created over and over again.

How could that be solved?
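The trade-off Simon describes can be reproduced in plain Python (`Action`, `get_action`, and the key string are hypothetical stand-ins): with weak references to the cached values, an action that nobody else holds on to dies immediately and must be rebuilt on every use.

```python
import gc
import weakref

class Action:
    """Hypothetical stand-in for an IntegerMulAction."""

creations = 0
weak_actions = weakref.WeakValueDictionary()

def get_action(key):
    """Look up an action, rebuilding it if the weakly held copy died."""
    global creations
    action = weak_actions.get(key)
    if action is None:
        creations += 1
        action = Action()
        weak_actions[key] = action
    return action

for _ in range(3):
    get_action('ZZ acts on E')  # result dropped at once, so it is collected
gc.collect()
print(creations)  # 3: with weak values, the action is rebuilt on every call
```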

jpflori commented 12 years ago
comment:85

If I understand correctly, the keys to all these dictionaries are triples, and what you've done is: if elements of such a triple are weakrefs to objects that are no longer strongly referenced, the key-value pair gets deleted, so that garbage collection can occur for these only-weakly-referenced objects?

Therefore, if ZZ._action_hash gets deleted, only weakrefs to the abelian group of points of the curve should be left in the coercion model (in the second TripleDict, _action_maps), and it should not prevent garbage collection...

jpflori commented 12 years ago
comment:86

Maybe the problem is that the value in that last dict corresponding to the triple with a(n abelian group of points of a) curve is a (weakref to, unless not weakly referenceable) IntegerMulAction, which itself has a strong ref to the curve, which would prevent the curve from being collected?

jpflori commented 12 years ago
comment:87

This is basically what you concluded in comment:74, so a solution could be to allow Functors to be weakly referenceable, or to make them use weak refs for their (co)domains? You posted that making IntegerMulAction weakly referenceable is not enough; could you post a patch applying these ideas so that I can play with it?

simon-king-jena commented 12 years ago

Experimental patch using weak references for actions

simon-king-jena commented 12 years ago
comment:88

Attachment: weak_ref_to_functor.patch.gz

I was mistaken: Using weak references to the actions (on top of the other patch) does fix the leak - see the experimental patch that I have just posted.

I did not run any doctests on it, yet. But I tested that only one SchemeHomsetModule... survives the garbage collection (and I did not delete E).

jpflori commented 12 years ago
comment:89

I confirm the leak is fixed with your last patch; I just launched a ptestlong.

simon-king-jena commented 12 years ago
comment:90

My first impression from some tests is that the additional patch causes a massive slow-down.

jpflori commented 12 years ago
comment:91

I ran the Sage test suite on the same PC, on a 4.8.alpha5 with patches and a 4.7.2 without patches, just on the schemes directory, and got 1512 vs 1060 seconds... some files required between 2 and 3 times more time (e.g. hyperelliptic_padic_field, heegner and ell_number_field, which are already quite long; I'd say mostly the files that were already long), while others did not change at all.

simon-king-jena commented 12 years ago
comment:92

Replying to @jpflori:

I ran the Sage test suite on the same PC, on a 4.8.alpha5 with patches and a 4.7.2 without patches, just on the schemes directory, and got 1512 vs 1060 seconds...

That is not half as bad as I thought! However, it is far from good.

Do you also have the timings with only the first patch, resp. with #715 only? After all, #715 uses weak references and may thus be responsible for some of the slow-down.