sagemath / sage

Main repository of SageMath
https://www.sagemath.org

Parents probably not reclaimed due to too much caching #715

Closed robertwb closed 11 years ago

robertwb commented 17 years ago

Here is a small example illustrating the issue.

The memory footprint of the following piece of code grows indefinitely.

sage: K = GF(1<<55,'t') 
sage: a = K.random_element() 
sage: while 1: 
....:     E = EllipticCurve(j=a); P = E.random_point(); 2*P; del E, P;

E and P get deleted, but when 2*P is computed, the action of integers on A, the abelian group of rational points of the elliptic curve, gets cached in the coercion model.

A key-value pair is left in the coercion_model._action_maps dict:

(ZZ,A,*) : IntegerMulAction

Moreover, A is also referenced in at least two other places: the IntegerMulAction and ZZ._action_hash.

So weak references should be used in all these places, provided this does not slow things down too much.
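The mechanism can be illustrated in plain Python, outside Sage. In the sketch below, the Parent class and the cache_action helper are hypothetical stand-ins, and strong_cache only loosely models coercion_model._action_maps: an entry keyed by (ZZ, A, op) whose value also refers to A keeps A alive after `del A`, until the cache itself is emptied.

```python
import gc
import weakref


class Parent:
    """Hypothetical stand-in for a Sage parent such as ZZ or the point group A."""

    def __init__(self, name):
        self.name = name


def cache_action(cache, left, right, op):
    # Like an IntegerMulAction, the cached value refers to both parents.
    cache[left, right, op] = ("action", left, right)


strong_cache = {}        # ordinary dict: strong references to keys and values
ZZ = Parent("ZZ")
A = Parent("A")
probe = weakref.ref(A)   # observes A without keeping it alive

cache_action(strong_cache, ZZ, A, "*")
del A
gc.collect()
leaked = probe() is not None   # the cache entry still holds A

strong_cache.clear()           # emptying the cache finally lets A go
gc.collect()
released = probe() is None
```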

To be merged with #11521. Apply:

and then the patches from #11521.

Depends on #13145, #13741, #13746, #11521

Dependencies: #13145, #13741, #13746, to be merged with #11521

CC: @jpflori @zimmermann6 @vbraun @robertwb @nbruin @malb @orlitzky

Component: coercion

Keywords: weak cache coercion Cernay2012

Author: Simon King, Jean-Pierre Flori

Reviewer: Jean-Pierre Flori, Simon King, Nils Bruin

Merged: sage-5.5.beta0

Issue created by migration from https://trac.sagemath.org/ticket/715

mwhansen commented 15 years ago
comment:3

I think this is a bit too vague for a ticket. Robert, could you be more specific or close this?

robertwb commented 15 years ago
comment:4

The coercion model needs to use weakrefs so that parents aren't needlessly referenced when they're discarded. It is nontrivial to see where the weakrefs need to go, and how to do so without slowing the code down.

The ticket is still valid.

11d1fc49-71a1-44e1-869f-76be013245a0 commented 13 years ago

Description changed:

--- 
+++ 
@@ -1 +1 @@
-
+(Moving this to "coercion", which is clearly where it belongs.)
jpflori commented 13 years ago

Description changed:

--- 
+++ 
@@ -1 +1,20 @@
-(Moving this to "coercion", which is clearly where it belongs.)
+Here is a small example illustrating the issue.
+
+The memory footprint of the following piece of code grows indefinitely.
+
+```
+sage: K = GF(1<<55,'t') 
+sage: a = K.random_element() 
+sage: while 1: 
+....:     E = EllipticCurve(j=a); P = E.random_point(); 2*P; del E, P;
+
+```
+E and P get deleted, but when 2*P is computed, the action of integers on A, the abelian group of rational points of the elliptic curve, gets cached in the coercion model.
+
+A key-value pair is left in the coercion_model._action_maps dict:
+
+(ZZ,A,*) : IntegerMulAction
+
+Moreover, A is also referenced in at least two other places: the IntegerMulAction and ZZ._action_hash.
+
+So weak refs should be used in all these places if it does not slow things too much.
jpflori commented 13 years ago
comment:7

With the piece of code in the description, there is only one reference to these objects in the ZZ._action_hash dictionary. This is because, to build it, we test A1 == A2 rather than A1 is A2 (as coercion_model._action_maps does), and because of the current implementation of elliptic curves (see http://groups.google.com/group/sage-nt/browse_thread/thread/ec8d0ad14a819082 and #11474). Since the above code uses only one j-invariant, only one of them ends up stored.

However, with random curves, I guess all of them would be stored.

As for weak references, the idea would rather be to build something like a WeakKeyDictionary, if that does not slow down coercion too much...
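For illustration, here is how Python's standard weakref.WeakKeyDictionary behaves; the Parent class is a hypothetical stand-in. An entry vanishes as soon as its key is garbage collected. Two caveats seem relevant here: tuples cannot be weakly referenced, so a key like (ZZ, A, op) would need a custom structure, and an entry whose value strongly refers back to its key is never removed.

```python
import gc
import weakref


class Parent:
    """Hypothetical stand-in; instances of plain classes support weak references."""

    def __init__(self, name):
        self.name = name


cache = weakref.WeakKeyDictionary()   # keys held weakly, values strongly

A = Parent("A")
cache[A] = "action on A"              # value does not refer back to A
size_before = len(cache)
del A
gc.collect()
size_after = len(cache)               # the entry left together with its key

B = Parent("B")
cache[B] = ("action on", B)           # value refers back to its key
del B
gc.collect()
stuck = len(cache)                    # still 1: the value keeps B alive

try:
    weakref.ref((1, 2, 3))
    tuple_weakrefable = True
except TypeError:
    tuple_weakrefable = False         # tuples do not support weak references
```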

nbruin commented 13 years ago
comment:8

The following example also exhibits a suspicious, steady growth in memory use. The only reason I can think of why that would happen is that references to the created finite field remain lying around somewhere, preventing deallocation:

sage: L=prime_range(10^8)
sage: for p in L: k=GF(p)

If you change it to the creation of a polynomial ring the memory use rises much faster:

sage: L=prime_range(10^8)
sage: for p in L: k=GF(p)['t']

Are "unique" parents simply never deallocated?

jpflori commented 13 years ago
comment:9

Be aware that polynomial rings are also cached because of the uniqueness of parents, which partly explains the memory consumption in your second example; see #5970, for example.

For finite fields I did not check.
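One way unique parents could be made collectable is to keep the "unique instance" cache in a weakref.WeakValueDictionary rather than an ordinary dict. The sketch below is hypothetical (FakeRing and make_ring are not Sage names, and this is not how Sage implements uniqueness here): both caches return the identical object for equal arguments, but only the strong one keeps every ring alive.

```python
import gc
import weakref


class FakeRing:
    """Hypothetical stand-in for a unique parent such as GF(p)['t']."""

    def __init__(self, p, var):
        self.p, self.var = p, var


def make_ring(cache, p, var):
    # Return the cached instance if there is one, so that equal arguments
    # always give the identical object ("uniqueness of parents").
    key = (p, var)
    ring = cache.get(key)
    if ring is None:
        ring = FakeRing(p, var)
        cache[key] = ring
    return ring


strong_cache = {}
weak_cache = weakref.WeakValueDictionary()

for p in (2, 3, 5, 7, 11):
    make_ring(strong_cache, p, "t")   # the caller drops each ring at once
    make_ring(weak_cache, p, "t")
gc.collect()

R = make_ring(weak_cache, 13, "t")
unique = make_ring(weak_cache, 13, "t") is R   # uniqueness holds while R is alive
```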

jpflori commented 12 years ago
comment:11

See #11521 for some concrete instances of this problem and some advice to investigate it.

simon-king-jena commented 12 years ago
comment:12

In my code for the computation of Ext algebras of basic algebras, I use letterplace algebras (see #7797), and they involve the creation of many polynomial rings. Only one of them is used at a time, so the others could be garbage collected. But they aren't, and I suspect this is because of the strong references in the coercion cache.

See the following example (using #7797):

sage: F.<a,b,c> = FreeAlgebra(GF(4,'z'), implementation='letterplace')
sage: import gc
sage: len(gc.get_objects())
170947
sage: a*b*c*b*c*a*b*c
a*b*c*b*c*a*b*c
sage: len(gc.get_objects())
171556
sage: del F,a,b,c
sage: gc.collect()
81
sage: len(gc.get_objects())
171448
sage: cm = sage.structure.element.get_coercion_model()
sage: cm.reset_cache()
sage: gc.collect()
273
sage: len(gc.get_objects())
171108

That is certainly not a proof of my claim, but it indicates that it might be worthwhile to investigate.

To facilitate the work, here are some other tickets that may be related to this:

I guess one should use a cache model similar to what I did in #11521: the key for the cache should not just be (domain, codomain), because we want the cache item to become collectable as soon as either the domain or the codomain is collectable.

simon-king-jena commented 12 years ago
comment:13

I am trying to wrap my mind around weak references. I found that when creating a weak reference, one can also provide a callback that is called when the weak reference becomes invalid. I propose to use such a callback to erase the deleted object from the cache, regardless of whether it appears as domain or codomain.

Here is a proof of concept:

sage: ref = weakref.ref
sage: D = {}
sage: def remove(x):
....:     for a,b,c in D.keys():
....:         if a is x or b is x or c is x:
....:             D.__delitem__((a,b,c))
....:             
sage: class A:
....:     def __init__(self,x):
....:         self.x = x
....:     def __repr__(self):
....:         return str(self.x)
....:     def __del__(self):
....:         print "deleting",self.x
....:         
sage: a = A(5)
sage: b = A(6)
sage: r = ref(a,remove)
sage: s = ref(b,remove)
sage: D[r,r,s] = 1
sage: D[s,r,s] = 2
sage: D[s,s,s] = 3
sage: D[s,s,1] = 4
sage: D[r,s,1] = 5
sage: D.values()
[5, 3, 1, 4, 2]
sage: del a
deleting 5
sage: D.values()
[4, 3]
sage: del b
deleting 6
sage: D.values()
[]
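The session above is Python 2: print is a statement there, and D.keys() returns a list, so deleting entries while looping over it is safe. In Python 3, keys() is a view and deleting during the loop raises RuntimeError. A minimal Python 3 adaptation of the same proof of concept (the class name Obj is arbitrary) iterates over a snapshot of the keys:

```python
import gc
import weakref

D = {}


def remove(dead_ref):
    # Weakref callback: drop every entry whose key contains the dead reference.
    # list(D) takes a snapshot, so deleting inside the loop is safe.
    for key in list(D):
        if any(k is dead_ref for k in key):
            del D[key]


class Obj:
    def __init__(self, x):
        self.x = x


a = Obj(5)
b = Obj(6)
r = weakref.ref(a, remove)
s = weakref.ref(b, remove)
D[r, r, s] = 1
D[s, r, s] = 2
D[s, s, s] = 3
D[s, s, 1] = 4
D[r, s, 1] = 5

del a
gc.collect()
after_a = sorted(D.values())   # [3, 4]
del b
gc.collect()
after_b = sorted(D.values())   # []
```

Deleting works even after a referent has died because a weak reference keeps the hash it computed while the object was alive.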
simon-king-jena commented 12 years ago
comment:14

It turns out that using weak references in the coercion cache will not be enough. Apparently there are other direct references that have to be dealt with.

simon-king-jena commented 12 years ago
comment:15

I wonder whether the problem has already been solved. I just tested the example from the ticket description and got the following (at least with #11900, #11521 and #11115):

sage: K = GF(1<<55,'t')
sage: a = K.random_element()
sage: m0 = get_memory_usage()
sage: for i in range(1000):
....:     E = EllipticCurve(j=a); P = E.random_point(); PP = 2*P
....:     
sage: get_memory_usage() - m0
15.22265625

I think that this is not particularly scary. I'll repeat the test with vanilla sage-4.8.alpha3, but this will take a while to rebuild.

simon-king-jena commented 12 years ago
comment:16

No, even in vanilla sage-4.8.alpha3 I don't find a scary memory leak in this example.

Do we have a better example? One could, of course, argue that one should use weak references for caching even if we do not find an apparent memory leak. I am preparing a patch for it now.

simon-king-jena commented 12 years ago
comment:17

Here is an experimental patch.

A new test shows that the weak caching actually works.

Note that the patch also introduces a weak cache for polynomial rings, which might be better to put into #5970. Well, we can sort things out later...

simon-king-jena commented 12 years ago
comment:18

It needs work, though. Some tests in sage/structure fail, partly because of pickling, partly because some tests do not follow the new specification of TripleDict (namely, that the first two parts of each key triple and the associated value must be weak referenceable).

simon-king-jena commented 12 years ago
comment:19

Now I wonder: should I use weak references where possible, but also accept keys that do not allow weak references?

In the intended applications, weak references are possible. But in some tests and in the pickle jar, the "wrong" type of keys (namely strings and ints) are used.

simon-king-jena commented 12 years ago
comment:20

The only place where the weak references are created is the set(...) method of TripleDict. I suggest simply catching the error that may occur when creating a weak reference, and then storing the key in a different way. I am now running tests, and I hope that this ticket will be "needs review" in a few hours.
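A minimal sketch of that fallback, assuming a dict-based cache rather than Sage's bucket-based TripleDict (the class WeakTripleCache is hypothetical): the first two key components are wrapped in weak references when their type allows it, and kept as strong references when weakref.ref raises TypeError, as it does for int and str.

```python
import gc
import weakref


class WeakTripleCache:
    """Hypothetical sketch, not Sage's TripleDict."""

    def __init__(self):
        self._data = {}

    def _wrap(self, obj, callback=None):
        try:
            return weakref.ref(obj, callback)
        except TypeError:
            return obj  # e.g. int or str: fall back to a strong reference

    def _purge(self, dead_ref):
        # Called when a weakly referenced key component is garbage collected.
        for key in list(self._data):
            if key[0] is dead_ref or key[1] is dead_ref:
                del self._data[key]

    def set(self, k1, k2, k3, value):
        key = (self._wrap(k1, self._purge), self._wrap(k2, self._purge), k3)
        self._data[key] = value

    def get(self, k1, k2, k3):
        # A live weakref hashes and compares like its referent, so a fresh
        # reference finds the stored entry.
        return self._data[self._wrap(k1), self._wrap(k2), k3]

    def __len__(self):
        return len(self._data)


class Parent:
    """Hypothetical weak-referenceable stand-in for a parent."""


P, Q = Parent(), Parent()
cache = WeakTripleCache()
cache.set(P, Q, "*", "action")
cache.set("x", 3, "+", "strongly keyed")   # str and int cannot be weakly referenced
found = cache.get(P, Q, "*")
del P
gc.collect()
remaining = len(cache)   # 1: only the strongly keyed entry is left
```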

simon-king-jena commented 12 years ago
comment:21

With the attached patch, all tests pass for me, and the new features are doctested. Needs review!

simon-king-jena commented 12 years ago

Author: Simon King

simon-king-jena commented 12 years ago

Changed keywords from none to weak cache coercion

simon-king-jena commented 12 years ago

Dependencies: #11900

simon-king-jena commented 12 years ago
comment:22

It turns out that this patch only applies cleanly after #11900, so I am adding #11900 as a dependency. My statement about "doctests passing" was with #11900 applied anyway.

zimmermann6 commented 12 years ago
comment:23

I was able to apply this patch to vanilla 4.7.2. Should I continue reviewing it like this?

Paul

zimmermann6 commented 12 years ago
comment:24

On top of vanilla 4.7.2, several doctests fail:


        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/calculus/interpolators.pyx # 0 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/databases/database.py # 15 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/finance/time_series.pyx # 0 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/graphs/graph_list.py # 4 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/graphs/graph_database.py # 28 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/graphs/graph.py # 6 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/graphs/generic_graph.py # 4 doctests failed

        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/matrix/matrix2.pyx # 3 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/hecke/hecke_operator.py # 1 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/hecke/ambient_module.py # 2 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/modsym/subspace.py # 6 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/modsym/boundary.py # 3 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/modsym/space.py # 3 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/modsym/modsym.py # 1 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/modsym/ambient.py # 11 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/modular/abvar/abvar.py # 0 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/schemes/elliptic_curves/heegner.py # 9 doctests failed
        sage -t  4.7-linux-64bit-ubuntu_10.04.1_lts-x86_64-Linux/devel/sage-715/sage/sandpiles/sandpile.py # Time out

Paul

simon-king-jena commented 12 years ago
comment:25

I'll try again on top of vanilla sage-4.8.alpha3. You are right, the patch does apply (almost) cleanly even without #11900. That surprises me, because at some point there was an inconsistency.

Hopefully I can see later today whether I get the same errors as you.

simon-king-jena commented 12 years ago

Changed dependencies from #11900 to none

simon-king-jena commented 12 years ago
comment:26

It turns out that #11900 is indeed not needed.

I cannot reproduce any of the errors you mention.

Moreover, the file "sage/devel/sage/databases/database.py", for which you reported an error, does not exist in vanilla sage (not in 4.7.2 and not in 4.8.alpha3).

Did you test other patches before returning to vanilla 4.7.2? When a patch changes a module from Python to Cython and one later wants to remove the patch, it is often necessary to also remove any trace of the Cython module in build/sage/... and in build/*/sage/.... For example, when I had #11115 applied and wanted to remove it again, I would do rm build/sage/misc/cachefunc.* and rm build/*/sage/misc/cachefunc.*.

zimmermann6 commented 12 years ago
comment:27

Yes, I tried other patches (#10983, #8720, #10596) before #715, but each one in a different clone.

Paul

simon-king-jena commented 12 years ago
comment:28

Replying to @zimmermann6:

yes I tried other patches (#10983, #8720, #10596) before #715, but each one with a different clone.

But where does the databases/database.py file come from?

And could you post one or two examples for the errors you are getting (i.e. not just which files are problematic, but what commands exactly fail)?

simon-king-jena commented 12 years ago
comment:29

FWIW: I started with sage-4.8.alpha3, have #9138, #11900 and #715 applied, and all doctests pass. I don't know why the patchbot isn't even trying (although it says "retry: True"), but from my point of view, everything is alright.

simon-king-jena commented 12 years ago
comment:30

I have simplified the routine that removes cache items when a weak reference becomes invalid. The tests all pass for me.

Apply trac715_weak_coercion_cache.patch

vbraun commented 12 years ago

Dependencies: #9138, #11900

simon-king-jena commented 12 years ago
comment:32

One question: Currently, my patch uses weak references only for the first two parts of the key. Should it also use weak references to the value, when possible?

By "when possible", I mean that not all values allow weak references - if it is possible then a weak reference is used, otherwise a strong reference is used. This might contribute to fixing the memory leak in #11521, but it might have a speed penalty.

Concerning #11521: The point is that an action (which currently does not allow weak references, but that might change) has a strong reference to the objects that are used for storing it in the cache. Hence, an action is not collectable with the current patch.

Thoughts?
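The trade-off asked about above can be sketched in plain Python. The helpers wrap_value and unwrap_value and the Action and Parent classes are hypothetical, not Sage API. A strongly held action that refers to its domain keeps that domain alive; a weakly held one does not, but it is reclaimed as soon as nothing else references it, so the next lookup misses and the action would have to be rebuilt, which is one plausible source of the speed penalty.

```python
import gc
import weakref


def wrap_value(value):
    """Hold value weakly if its type allows it, strongly otherwise."""
    try:
        return weakref.ref(value)
    except TypeError:
        return lambda: value   # strong fallback with the same call interface


def unwrap_value(stored):
    # A dead weakref returns None, which a caller would treat as a cache miss.
    return stored()


class Parent:
    """Hypothetical stand-in for a parent."""


class Action:
    """Hypothetical action; like a Sage action it refers to its domain."""

    def __init__(self, domain):
        self.domain = domain


A = Parent()
probe = weakref.ref(A)
strong_cache = {"k": Action(A)}
weak_cache = {"k": wrap_value(Action(A)), "n": wrap_value(42)}
del A
gc.collect()

domain_alive = probe() is not None                 # strong_cache's action still holds A
weak_miss = unwrap_value(weak_cache["k"]) is None  # the weakly held action is gone
int_kept = unwrap_value(weak_cache["n"]) == 42     # int fell back to a strong reference
```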

simon-king-jena commented 12 years ago

Attachment: trac715_weak_coercion_cache.patch.gz

Use weak references in the coercion cache

simon-king-jena commented 12 years ago
comment:33

I have slightly updated some of the new examples: in the old patch version, I had created TripleDict(10), but I have since learnt that the parameter should preferably be odd (ideally a prime). So, in the new patch version, it is TripleDict(11).
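A simplified model suggests why an odd (ideally prime) bucket count helps when buckets are chosen from object addresses. It assumes, hypothetically, that addresses are multiples of 16, which is typical for allocator alignment on 64-bit platforms but not guaranteed, and that the bucket is simply address % nbuckets (the real TripleDict combines three pointers).

```python
def used_buckets(addresses, nbuckets):
    """Bucket indices actually hit when bucketing by address % nbuckets."""
    return {addr % nbuckets for addr in addresses}


# Hypothetical, 16-byte aligned object addresses.
addresses = [0x7F0000000000 + 16 * i for i in range(1000)]

with_10 = used_buckets(addresses, 10)   # only the even buckets are used
with_11 = used_buckets(addresses, 11)   # all 11 buckets are used
```

With 10 buckets every aligned address is even, so half the buckets stay empty and the others get longer linear searches; since gcd(16, 11) = 1, all 11 buckets are used.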

simon-king-jena commented 12 years ago

Work Issues: Comparison of the third key items

simon-king-jena commented 12 years ago
comment:34

I think I need to modify one detail:

For efficiency, and since the domain/codomain of a map must be identical to (and not just equal to) the given keys, my patch compares them by "is" rather than "==". But I think one should still compare the third item of a key via "==" and not "is". I need to do some tests first...

simon-king-jena commented 12 years ago
comment:35

It really is not an easy question whether or not we should have "is" or "==".

On the one hand, we have the lines

            if y_mor is not None:
                all.append("Coercion on right operand via")
                all.append(y_mor)
                if res is not None and res is not y_mor.codomain():
                    raise RuntimeError, ("BUG in coercion model: codomains not equal!", x_mor, y_mor)

in sage/structure/coerce.pyx. They seem to imply that comparison via "is" is the right thing to do.

But in the same file, the coercion model copes with the fact that some parents are not unique:

        # Make sure the domains are correct
        if R_map.domain() is not R:
            if fix:
                connecting = R_map.domain().coerce_map_from(R)
                if connecting is not None:
                    R_map = R_map * connecting
            if R_map.domain() is not R:
                raise RuntimeError, ("BUG in coercion model, left domain must be original parent", R, R_map)
        if S_map is not None and S_map.domain() is not S:
            if fix:
                connecting = S_map.domain().coerce_map_from(S)
                if connecting is not None:
                    S_map = S_map * connecting
            if S_map.domain() is not S:
                raise RuntimeError, ("BUG in coercion model, right domain must be original parent", S, S_map)

That would suggest that comparison by "==" (the old behaviour of TripleDict) is fine.

Perhaps we should actually have two variants of TripleDict, one using "is" and one using "==".

Note another detail of sage/structure/coerce.pyx: We have

    cpdef verify_action(self, action, R, S, op, bint fix=True):

but

    cpdef verify_coercion_maps(self, R, S, homs, bint fix=False):

Note the different default value for "fix". If "fix" is True then the coercion model tries to cope with non-unique parents by prepending a conversion between the two equal copies of a parent.

Since the default is to fix non-unique parents for actions, but not for coercion maps, I suggest that a "=="-TripleDict should be used for actions and an "is"-TripleDict for coercions.

jpflori commented 12 years ago
comment:36

I guess a choice has to be made, and it should at least be as consistent as possible. What you propose makes sense to me: it is not too far from the current model and gives a little more consistency. Moreover, once both TripleDicts are implemented, changing our mind later will be trivial.

simon-king-jena commented 12 years ago
comment:37

There is another detail. Even in the old version of TripleDict, we have

    It is implemented as a list of lists (hereafter called buckets). The bucket 
    is chosen according to a very simple hash based on the object pointer.
    and each bucket is of the form [k1, k2, k3, value, k1, k2, k3, value, ...]
    on which a linear search is performed. 

So the choice of a bucket is based on the object pointer, but then it is not consistent to compare by "==".

simon-king-jena commented 12 years ago
comment:38

To be precise: the old behaviour was not consistent. The bucket depended on id(k1), id(k2), id(k3), but the comparison was by "==" rather than by "is".

Experimentally, I will provide two versions of TripleDict, one using "hash" for determining the bucket and doing comparison by "==", the other using "id" for determining the bucket and doing comparison by "is".
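The two variants can be sketched as follows. This is a hypothetical pure-Python model, not the Cython code in sage/structure/coerce_dict.pyx: it follows the bucket-plus-linear-search layout described in the docstring, but stores (k1, k2, k3, value) tuples instead of a flat list. EqTripleDict picks the bucket from hash() and compares with ==; IdTripleDict picks it from id() and compares with is. A pair of equal but distinct NonUniqueParent objects (also hypothetical) shows the difference.

```python
class EqTripleDict:
    """Buckets chosen via hash(); key components compared with ==."""

    def __init__(self, nbuckets=11):
        self._buckets = [[] for _ in range(nbuckets)]

    def _index(self, k1, k2, k3):
        return hash((k1, k2, k3)) % len(self._buckets)

    @staticmethod
    def _match(a, b):
        return a == b

    def _find(self, key):
        # Linear search inside the selected bucket.
        bucket = self._buckets[self._index(*key)]
        for pos, entry in enumerate(bucket):
            if all(self._match(a, b) for a, b in zip(entry[:3], key)):
                return bucket, pos
        return bucket, None

    def __setitem__(self, key, value):
        bucket, pos = self._find(key)
        if pos is None:
            bucket.append((*key, value))
        else:
            bucket[pos] = (*key, value)

    def __getitem__(self, key):
        bucket, pos = self._find(key)
        if pos is None:
            raise KeyError(key)
        return bucket[pos][3]


class IdTripleDict(EqTripleDict):
    """Buckets chosen via id(); key components compared with is."""

    def _index(self, k1, k2, k3):
        return hash((id(k1), id(k2), id(k3))) % len(self._buckets)

    @staticmethod
    def _match(a, b):
        return a is b


class NonUniqueParent:
    """Hypothetical parent whose equal copies are distinct objects."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, NonUniqueParent) and self.name == other.name

    def __hash__(self):
        return hash(self.name)


OP = "*"
P1, P2 = NonUniqueParent("A"), NonUniqueParent("A")   # P1 == P2, P1 is not P2
eq_dict, id_dict = EqTripleDict(), IdTripleDict()
eq_dict[P1, P1, OP] = "coercion"
id_dict[P1, P1, OP] = "coercion"

eq_found = eq_dict[P2, P2, OP]    # an equal copy is enough
try:
    id_dict[P2, P2, OP]
    id_found = True
except KeyError:
    id_found = False              # only the identical objects match
```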

simon-king-jena commented 12 years ago
comment:39

As announced, I have attached an experimental patch. It provides two variants of TripleDict, using "==" and "is" for comparison, respectively. Both are used: for caching coerce maps and actions, respectively.

It could be that a last-minute change was interfering, but I am confident that all but the following three tests pass:

        sage -t  devel/sage-main/doc/en/bordeaux_2008/nf_introduction.rst # 1 doctests failed
        sage -t  devel/sage-main/sage/modular/modsym/space.py # Killed/crashed
        sage -t  devel/sage-main/sage/structure/coerce_dict.pyx # 3 doctests failed

The memory leak exposed in the ticket description is fixed (more or less):

sage: K = GF(1<<55,'t')
sage: a = K.random_element()
sage: for i in range(500):
....:     E = EllipticCurve(j=a)
....:     P = E.random_point()
....:     Q = 2*P
....:     
sage: import gc
sage: gc.collect()
862
sage: from sage.schemes.generic.homset import SchemeHomsetModule_abelian_variety_coordinates_field
sage: LE = [x for x in gc.get_objects() if  isinstance(x,SchemeHomsetModule_abelian_variety_coordinates_field)]
sage: len(LE)
2

I am not sure whether this makes #11521 redundant.

For now, it is "needs work" because of the doctests. But you can already play with the patch.

simon-king-jena commented 12 years ago

Changed work issues from Comparison of the third key items to fix doctests

simon-king-jena commented 12 years ago
comment:40

Sorry, only TWO doctests should fail: The tests of sage/structure/coerce_dict.pyx are, of course, fixed.

simon-king-jena commented 12 years ago
comment:41

The segfault in sage -t devel/sage-main/sage/modular/modsym/space.py seems difficult to debug.

Inspecting a core dump with gdb did not help at all:

(gdb) bt
#0  0x00007f61d12ca097 in kill () from /lib64/libc.so.6
#1  0x00007f61d0044a40 in sigdie () from /home/simon/SAGE/sage-4.8.alpha3/local/lib/libcsage.so
#2  0x00007f61d0044646 in sage_signal_handler () from /home/simon/SAGE/sage-4.8.alpha3/local/lib/libcsage.so
#3  <signal handler called>
#4  0x00007f61cf080520 in mpn_submul_1 () from /home/simon/SAGE/sage-4.8.alpha3/local/lib/libgmp.so.8
#5  0x00007f61cf0b4f0f in __gmpn_sb_bdiv_q () from /home/simon/SAGE/sage-4.8.alpha3/local/lib/libgmp.so.8
#6  0x00007f61cf0b6428 in __gmpn_divexact () from /home/simon/SAGE/sage-4.8.alpha3/local/lib/libgmp.so.8
#7  0x00007f61ccbf4d64 in ?? ()
...
#191 0x55c0ade81d9aeecf in ?? ()
#192 0xffffe4b8b6920b7b in ?? ()
#193 0x000000000ac854cf in ?? ()
#194 0x0000000000000000 in ?? ()

How could one proceed? What other debugging techniques can you recommend?

vbraun commented 12 years ago
comment:42

Looks like you did not tell gdb about the executable you were running. You should run

gdb --core=<corefile> $SAGE_LOCAL/bin/python
simon-king-jena commented 12 years ago
comment:43

Replying to @vbraun:

Looks like you did not tell gdb about the executable you were running.

No, I did tell it. I ran

gdb --core=715doublecore ~/SAGE/sage-4.8.alpha3/local/bin/python

Should I do it inside a Sage shell?

simon-king-jena commented 12 years ago
comment:44

No, doing the same inside a sage shell did not help either.

simon-king-jena commented 12 years ago
comment:45

I am now printing some debugging information into a file, which hopefully means that I am getting closer to the source of the problem. The segfault arises in line 2165 of sage/modular/modsym/space.py.

simon-king-jena commented 12 years ago
comment:46

Sorry, it was the wrong line number.