fix more leaks found in #18897

dimpase commented 9 years ago

In #18897 one leak is fixed, but there are more left, see comments 27 and later:

sage: L = [1,1,0,1,0,1,2,1,1,0,0,1,2,0,0,1,1,0,2,0,0,0,0,0,1]
sage: def test(L, dim):
....:     import gc
....:     from collections import Counter
....:     gc.collect()
....:     pre={id(c) for c in gc.get_objects()}
....:     m = matrix(dim, L)
....:     for p in range(2,102):
....:         m.change_ring(GF(nth_prime(p))).eigenvalues()
....:     gc.collect()
....:     post=Counter(type(o) for o in gc.get_objects() if id(o) not in pre)
....:     return [(k,v) for (k,v) in post.iteritems() if v>10]
....: 
sage: test(L, 5)
[(<class 'sage.rings.algebraic_closure_finite_field.AlgebraicClosureFiniteField_pseudo_conway_with_category'>,
  100),
...
 (<class 'sage.rings.homset.RingHomset_quo_ring_with_category'>, 100),
...
 (<type 'sage.rings.finite_rings.integer_mod.NativeIntStruct'>, 100),
 (<type 'sage.rings.finite_rings.integer_mod.Int_to_IntegerMod'>, 200),
...
 (<class 'sage.rings.homset.RingHomset_generic_with_category'>, 100),
...]

CC: @jm58660

Component: memleak

Author: Simon King

Branch/Commit: public/18905 @ e952388

Issue created by migration from https://trac.sagemath.org/ticket/18905

simon-king-jena commented 9 years ago

Description changed:

--- 
+++ 
@@ -1 +1,28 @@
-In #18897 one leak is fixed, but there are more left, see comments 27 and later.
+In #18897 one leak is fixed, but there are more left, see comments 27 and later:
+
+```
+sage: L = [1,1,0,1,0,1,2,1,1,0,0,1,2,0,0,1,1,0,2,0,0,0,0,0,1]
+sage: def test(L, dim):
+....:     import gc
+....:     from collections import Counter
+....:     gc.collect()
+....:     pre={id(c) for c in gc.get_objects()}
+....:     m = matrix(dim, L)
+....:     for p in range(2,102):
+....:         m.change_ring(GF(nth_prime(p))).eigenvalues()
+....:     gc.collect()
+....:     post=Counter(type(o) for o in gc.get_objects() if id(o) not in pre)
+....:     return [(k,v) for (k,v) in post.iteritems() if v>10]
+....: 
+sage: test(L, 5)
+[(<class 'sage.rings.algebraic_closure_finite_field.AlgebraicClosureFiniteField_pseudo_conway_with_category'>,
+  100),
+...
+ (<class 'sage.rings.homset.RingHomset_quo_ring_with_category'>, 100),
+...
+ (<type 'sage.rings.finite_rings.integer_mod.NativeIntStruct'>, 100),
+ (<type 'sage.rings.finite_rings.integer_mod.Int_to_IntegerMod'>, 200),
+...
+ (<class 'sage.rings.homset.RingHomset_generic_with_category'>, 100),
+...]
+```

simon-king-jena commented 9 years ago

Dependencies: #18897

pjbruin commented 9 years ago

comment:2

This is probably because the algebraic_closure() method of finite fields is cached (see #14990), creating circular references between finite fields and their algebraic closures. The following simpler code exhibits a similar memory leak:

import gc                                                                                   
from collections import Counter
gc.collect()
pre = {id(c) for c in gc.get_objects()}
for p in prime_range(100):
    GF(p).algebraic_closure()
gc.collect()
post = Counter(type(o) for o in gc.get_objects() if id(o) not in pre)
print([(k,v) for (k,v) in post.iteritems() if v>10])

simon-king-jena commented 9 years ago

comment:3

Replying to @pjbruin:

This is probably because the algebraic_closure() method of finite fields is cached (see #14990), creating circular references between finite fields and their algebraic closures.

Circular references shouldn't be problematic, unless one of the objects involved has a __del__ method. However:

sage: K = GF(5)
sage: hasattr(K, '__del__')
False
sage: hasattr(K.algebraic_closure(), '__del__')
False

And indeed:

sage: while 1:
....:     print get_memory_usage()
....:     for p in range(2, 102):
....:         A = GF(nth_prime(p)).algebraic_closure()
....:         
<constant amount of memory>

simon-king-jena commented 9 years ago

comment:4

Oops, that's a bad test. Of course it had a constant memory consumption even if memory could not be freed. Sorry.

sage: import gc
sage: _ = gc.collect()
sage: while 1:                                  
....:     print gc.collect()
....:     for p in range(2, 102):
....:         A = GF(nth_prime(p)).algebraic_closure()
....:         
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0

That's a leak.

seblabbe commented 9 years ago

comment:5

Replying to @pjbruin:

This is probably because the algebraic_closure() method of finite fields is cached (see #14990), creating circular references between finite fields and their algebraic closures. The following simpler code exhibits a similar memory leak:
import gc                                                                                   
from collections import Counter
gc.collect()
pre = {id(c) for c in gc.get_objects()}
for p in prime_range(100):
    GF(p).algebraic_closure()
gc.collect()
post = Counter(type(o) for o in gc.get_objects() if id(o) not in pre)
print([(k,v) for (k,v) in post.iteritems() if v>10])

Like noticed in #18897, when this code is runned the first time, there are many stuff left. But the second time, the list is empty.

simon-king-jena commented 9 years ago

comment:6

And the leak indeed only occurs if we store the algebraic closure. In a freshly started session (the previous result has also been in a freshly started session):

sage: _ = gc.collect()
sage: while 1:                                  
....:     print gc.collect()
....:     for p in range(2, 102):
....:         A = GF(nth_prime(p))
....:         
0
8396
7969
7969
7969
7969
7969
7969
7969
7969
7969
7969
7969
7969
7969
7969
7969
7969
7969
7969
7969
7969
7969
7969

seblabbe commented 9 years ago

comment:7

Like noticed in #18897, when this code is runned the first time, there are many stuff left. But the second time, the list is empty.

...ok, but that's the same thing for the bug presented in the description of the ticket. Sorry.

simon-king-jena commented 9 years ago

comment:8

Replying to @seblabbe:

Like noticed in #18897, when this code is runned the first time, there are many stuff left. But the second time, the list is empty.

That's just because there is only one base ring. Do the same while iterating over GF(p), and you'll see how things accumulate.

At #18897, a binary tree was duely deallocated, however it was forgotten to dereference the root node. But here, we have objects that could be deallocated, but aren't. It is a totally different kind of leak.

simon-king-jena commented 9 years ago

Attachment: test.png

Reference chain to finite field after creating its algebraic closure

simon-king-jena commented 9 years ago

comment:9

attachment: test.png results from the following code:

sage: import objgraph, gc
sage: K = GF(31)
sage: n = id(K)
sage: A = K.algebraic_closure()
sage: del A,K
sage: gc.collect()
0
sage: L = [c for c in gc.get_objects() if id(c)==n]
sage: objgraph.show_backrefs(L[0],filename="/home/king/Sage/work/memleak/test.png")
Graph written to /tmp/objgraph-r5RhSM.dot (21 nodes)
Image generated as /home/king/Sage/work/memleak/test.png

Apparently (since all other references are weak or circular), the references that prevent deallocation come from sage.rings.finite_rings.integer_mod.NativeIntStruct.

simon-king-jena commented 9 years ago

comment:10

It seems that there is a NativeIntStruct created for each finite field. If the NativeIntStruct is small enough, then it has a (multiplication?) table, which holds references to all elements of the finite field. And they have, of course, references to their parent. But what is referencing the NativeIntStruct?

simon-king-jena commented 9 years ago

comment:11

It seems that the reference somehow comes from constructing the algebraic closure, as we have

sage: K = GF(31)
sage: del K
sage: L = [c for c in gc.get_objects() if isinstance(c, sage.rings.finite_rings.integer_mod.NativeIntStruct) and len(c._get_table())==31]
sage: len(L)
0

So, the NativeIntStruct can be garbage collected when we do not construct the algebraic closure.

How can one find a reference chain from the algebraic closure to the NativeIntStruct?

simon-king-jena commented 9 years ago

Attachment: test2.png

Another reference chain, towards NativeIntStruct

simon-king-jena commented 9 years ago

comment:12

attachment: test2.png shows the result of

sage: K = GF(31)
sage: A = K.algebraic_closure()
sage: del A,K
sage: L = [c for c in gc.get_objects() if isinstance(c, sage.rings.finite_rings.integer_mod.NativeIntStruct) and len(c._get_table())==31]
sage: filter = lambda x: (x is not L) and (not isinstance(x, sage.rings.finite_rings.integer_mod.IntegerMod_int))
sage: objgraph.show_backrefs(L[0],filter=filter,filename="/home/king/Sage/work/memleak/test2.png")
Graph written to /tmp/objgraph-4qP1R3.dot (16 nodes)
Image generated as /home/king/Sage/work/memleak/test2.png

The picture somehow looks familiar: There is a coerce map involved, namely !Int_to_IntegerMod. Coerce maps are supposed to have a weak reference to the domain and a strong reference to the codomain. Since the domain of !Int_to_IntegerMod presumably is the ring of integers and can't be deallocated anyway, a strong reference to the codomain means trouble...

seblabbe commented 9 years ago

comment:13

I was also playing with objgraph...

def test(L, dim):
    import objgraph
    import gc
    from collections import Counter
    gc.collect()
    pre={id(c) for c in gc.get_objects()}
    m = matrix(dim, L)
    for p in range(2,102):
        m.change_ring(GF(nth_prime(p))).eigenvalues()
    gc.collect()
    O = gc.get_objects()
    post=Counter(type(o) for o in O if id(o) not in pre)
    T = [k for (k,v) in post.iteritems() if v==100]
    D = dict((type(o),o) for o in O if type(o) in T)
    for i,v in enumerate(sorted(D.values())):
        print v
        objgraph.show_backrefs(v,filename="test_{}.png".format(i))
        print "--"

I get :

sage: L = [1,1,0,1,0,1,2,1,1,0,0,1,2,0,0,1,1,0,2,0,0,0,0,0,1]
sage: test(L, 5)
Principal ideal (547) of Integer Ring
Graph written to /tmp/objgraph-e803N8.dot (20 nodes)
Image generated as test_0.png
--
Set of Homomorphisms from Integer Ring to Finite Field of size 181
Graph written to /tmp/objgraph-I_uL5V.dot (12 nodes)
Image generated as test_1.png
--
<class 'sage.rings.finite_rings.conway_polynomials.PseudoConwayLattice'>
Graph written to /tmp/objgraph-oMuwMY.dot (13 nodes)
Image generated as test_2.png
--
Algebraic closure of Finite Field of size 181
Graph written to /tmp/objgraph-M_9gmt.dot (14 nodes)
Image generated as test_3.png
--
Finite Field of size 547
Graph written to /tmp/objgraph-ZostIO.dot (25 nodes)
Image generated as test_4.png
--
Set of Homomorphisms from Finite Field of size 181 to Univariate Polynomial Ring in x over Finite Field of size 181
Graph written to /tmp/objgraph-diDGKn.dot (12 nodes)
Image generated as test_5.png
--
<sage.rings.finite_rings.integer_mod.NativeIntStruct object at 0x7f366d1e3590>
Graph written to /tmp/objgraph-JxmEbJ.dot (30 nodes)
Image generated as test_6.png
--

I will attach test_3.png for the algebraic closure right now.

seblabbe commented 9 years ago

Ref graph for Algebraic closure of Finite Field of size 181

simon-king-jena commented 9 years ago

comment:14

Attachment: test_3.png

Let me try to recall why the current "weak-referencing coerce maps" are made as they are, to understand why we have a strong reference chain to Int_to_IntegerMod.

We have a backtracking algorithm to find coerce maps by transitivity of "registered" coerce maps. Hence, registered coerce maps need to be stored in the codomain.
Since the maps are stored in a container that is stored as an attribute of the codomain, there is a strong reference chain from the codomain to the coerce map.
A strong reference to the codomain should not prevent the domain from garbage collection. Hence, IN COERCION, we use a weak reference from the map to the domain, and we cut the reference from the map to its parent.
A strong reference to the domain should not prevent the codomain from garbage collection. That's what seems to fail in the current example.

What I do not understand: Why is that more than a strong reference CYCLE (which would not prevent garbage collection) from the codomain to the map and back?

simon-king-jena commented 9 years ago

comment:15

Aha. The coercion model caches the coerce maps, too. So, it isn't the codomain only that holds a cache.

The coercion model is of course a permanent object. It references a TripleDict to store the coercion maps. The coercion map references the codomain. Hence, if the domain is strongly referenced from somewhere, then the coerce map and thus the codomain can not be garbage collected.

Can we perhaps make it so that the cache in the coercion model only keeps a WEAK reference to the coerce map? I worry about performance, though, since getting the referenced object from a weak reference is a bit costly.

Do we need to worry about premature collection of coerce map? If I understand correctly, the cache in the coercion model is mainly for performance, as the map is cached as an attribute of the codomain anyway. And the codomain of any map is a parent. Hence, the cache of the coercion model actually is redundant.

So, perhaps a better idea is to completely get rid of the coercion model cache, as the coercion model can use the codomain's cache. I'll try that.

simon-king-jena commented 9 years ago

comment:16

No, the coercion model's cache is needed. It is relevant for pushouts. There, we have no codomain, as it first needs to be constructed.

simon-king-jena commented 9 years ago

comment:17

Perhaps my diagnosis was wrong: It is not relevant for pushouts, but we want to cache the ABSENCE of a coercion from a parent to, say, int. So, if the codomain happens to be a parent then we can use its cache; otherwise there is no problem to use the coercion model's cache since int and friends will never be garbage collected anyway.

So, better not use weak references to coercion maps...

simon-king-jena commented 9 years ago

comment:18

It seems that #14058 is relevant -- and perhaps it fixes our problem (except that it has no branch).

simon-king-jena commented 9 years ago

comment:19

I am afraid that #14058 (which now has a branch) does not suffice to fix the problem: GF(31) still can't be collected after creating its algebraic closure. However, the coercion model is not mentioned any longer. Instead, the reference goes via a weak value dictionary, which seems to be the polynomial ring cache.

My guess:

There is some polynomial ring P over the base ring GF(31) that was constructed when creating the algebraic closure of GF(31).
There is only a weak reference to P from the polynomial ring cache (it is weak value dictionary). However, for a different reason there is a strong reference to P.
Hence, the entry for P in the weak value dictionary can not be freed. By consequence, the reference to the KEY of this entry, which is a STRONG reference to GF(31), can not be freed.
Thus, GF(31) can not be deallocated.

The question is: How is there a strong reference to P? Perhaps there is a strong reference from GF(31) to P? Maybe via caching the algebraic closure, which references an element of P? This would be enough to keep P alive.

simon-king-jena commented 9 years ago

comment:21

I tried to find a backref chain as follows:

sage: import objgraph, gc, __main__
sage: K = GF(31)
sage: A = K.algebraic_closure()
sage: nK = id(K)
sage: del A,K
sage: L = [c for c in gc.get_objects() if id(c) == nK]
sage: objgraph.find_backref_chain(L[0], lambda x: x in __main__.__dict__.values(), extra_ignore=(id(L),))

but it crashes with a segfault. Any idea why it fails?

simon-king-jena commented 9 years ago

comment:22

And this one

sage: import objgraph, gc
sage: cm = sage.structure.element.get_coercion_model()
sage: K = GF(31)
sage: A = K.algebraic_closure()
sage: nK = id(K)
sage: del A,K
sage: gc.collect()
0
sage: L = [c for c in gc.get_objects() if id(c) == nK]
sage: objgraph.find_backref_chain(L[0], lambda x: x is cm, extra_ignore=(id(L),))

doesn't finish after several minutes.

nbruin commented 9 years ago

comment:23

I get pretty good plots from

objgraph.show_backrefs(L,filename='plot.png',max_depth=5)

It seems a global link exists here:

sage: [a for a in sage.rings.polynomial.polynomial_ring_constructor.__dict__['_cache'].keys() if id(a[0]) == id(L[0])]
[(Finite Field of size 31, ('x',), False, None),
 (Finite Field of size 31, ('x',), False, 'FLINT')]

simon-king-jena commented 9 years ago

comment:24

Replying to @nbruin:

I get pretty good plots from
objgraph.show_backrefs(L,filename='plot.png',max_depth=5)

Ahm, L?? That's a list.

It seems a global link exists here:

sage: [a for a in sage.rings.polynomial.polynomial_ring_constructor.__dict__['_cache'].keys() if id(a[0]) == id(L[0])]
[(Finite Field of size 31, ('x',), False, None),
 (Finite Field of size 31, ('x',), False, 'FLINT')]

Of course. As long as P=GF(31)['x'] lives, its base ring will live, too. The question is why the polynomial ring can't be collected. So, perhaps it would be better to try and find a chain for the polynomial ring instead.

simon-king-jena commented 9 years ago

Attachment: test_P_14058.png

Backref graph for a polynomial ring over finite field

simon-king-jena commented 9 years ago

comment:25

attachment: test_P_14058.png is interesting. It seems that the reference chain to the polynomial ring is via TRACEBACKS! The question then arises: Why are the tracebacks not garbage collected?

nbruin commented 9 years ago

comment:26

Replying to @simon-king-jena:

Of course. As long as P=GF(31)['x'] lives, its base ring will live, too. The question is why the polynomial ring can't be collected. So, perhaps it would be better to try and find a chain for the polynomial ring instead.

OK, the finite field caches under algebraic_closure in its dict a CachedMethodCaller that references a PseudoConwayLattice object that in its __dict__ has a ring that is the polynomial ring. That's a reference chain from the finite field to the polynomial ring, preventing the polynomial ring from being deallocated.

simon-king-jena commented 9 years ago

comment:27

Oops. I see that the chain goes via objgraph. So, I guess the new attachment is worthless.

simon-king-jena commented 9 years ago

comment:28

Replying to @nbruin:

OK, the finite field caches under algebraic_closure in its dict a CachedMethodCaller that references a PseudoConwayLattice object that in its __dict__ has a ring that is the polynomial ring. That's a reference chain from the finite field to the polynomial ring, preventing the polynomial ring from being deallocated.

And, by comment:19, it prevents the base ring from collection, because it is not simply a cyclic reference, but a strong reference from a weak value dictionary to one of its keys.

simon-king-jena commented 9 years ago

comment:29

Why does the CachedMethodCaller reference a PseudoConwayLattice? Shouldn't it "only" reference the return value?

simon-king-jena commented 9 years ago

comment:30

Replying to @simon-king-jena:

Why does the CachedMethodCaller reference a PseudoConwayLattice? Shouldn't it "only" reference the return value?

I see. The algebraic closure itself references the PseudoConwayLattice.

Hm. At some point we should use a weak reference. Maybe right here.

simon-king-jena commented 9 years ago

comment:31

The lattice that is stored in the algebraic closure is either passed as an argument to the init method, or is constructed during init:

    def __init__(self, base_ring, name, category=None, lattice=None, use_database=True):
        if not (is_FiniteField(base_ring) and base_ring.is_prime_field()):
            raise NotImplementedError('algebraic closures of finite fields are only implemented for prime fields')
        from sage.rings.finite_rings.conway_polynomials import PseudoConwayLattice
        p = base_ring.characteristic()
        if lattice is None:
            lattice = PseudoConwayLattice(p, use_database)
        elif not isinstance(lattice, PseudoConwayLattice) or lattice.p != p:
            raise TypeError('lattice must be a pseudo-Conway lattice with characteristic %s' % p)
        self._pseudo_conway_lattice = lattice
        AlgebraicClosureFiniteField_generic.__init__(self, base_ring, name, category)

That's a dilemma. If "lattice" is not passed as an argument, it is no problem to weakly reference it, as it can be reconstructed, should it be garbage collected. But otherwise? Hm.

simon-king-jena commented 9 years ago

comment:32

I think I know a potential solution.

The problem, by the above analysis: The polynomial ring cache is a weak value dictionary. Generally, a strong reference chain from key to value will prevent garbage collection of the key-value pair.

I suggest to remove the "global" polynomial ring cache. Instead, I suggest that the polynomial ring constructor uses a weak value dictionary that is stored as an attribute of the base ring (e.g., in self.__cached_methods, which is available for all parents).

The weakly referenced values are polynomial rings. The keys are the list of variable names and information on term order and implementation---so, strong references to them shouldn't be problematic.

In that model, a strong reference chain from the base ring to the polynomial ring would NOT prevent garbage collection, since in the worst case it is a reference cycle (base ring <-> polynomial ring).

simon-king-jena commented 9 years ago

comment:33

Hooray! With that change, I get

sage: import objgraph, gc
sage: K = GF(31)
sage: A = K.algebraic_closure()
sage: n = id(K)
sage: del A,K
sage: gc.collect()
186
sage: L = [c for c in gc.get_objects() if id(c) == n]
sage: L
[]

simon-king-jena commented 9 years ago

Branch: u/SimonKing/memleak_for_integer_mod_rel_14058

simon-king-jena commented 9 years ago

Last 10 new commits:

`6d34077`	`Simplify code for deallocation of binary trees`
`80ce876`	`Further simplification`
`ac68d9b`	`Remove strong references to parents used in binary operations in the coercion model.`
`e153d08`	`#14058: Add doctest`
`b9141ee`	`Merge branch 'ticket/14058' into develop`
`90ed181`	`Trivial fix for a coercion doctest`
`64b572c`	`refcount libsingular rings used in plural`
`9793dbc`	`Make one test more stable`
`219fbf4`	`Merge branch 't/14058/weakly_reference_binary_operation_codomains' into t/18905/memleak_for_integer_mod_rel_14058`
`3ada19d`	`Replace the global polynomial ring cache by a cache in the base ring`

simon-king-jena commented 9 years ago

Changed dependencies from #18897 to #18897, #14058

simon-king-jena commented 9 years ago

Author: Simon King

simon-king-jena commented 9 years ago

Commit: 3ada19d

simon-king-jena commented 9 years ago

comment:36

To my slight surprise, replacing the global polynomial ring cache by a local cache did not only solve the issue tracked here, but it did not introduce a new problem: With the attached branch, all tests should pass.

I have reviewed most part of #14058, but I think someone should have a look at my additions (review patch) there and finish the review.

videlec commented 9 years ago

comment:37

From the patchbot

sage -t --long src/sage/structure/coerce.pyx
**********************************************************************
File "src/sage/structure/coerce.pyx", line 1307, in sage.structure.coerce.CoercionModel_cache_maps.coercion_maps
Failed example:
    print N2-N0
Expected:
    0
Got:
    -1

Is it what we should get?

simon-king-jena commented 9 years ago

comment:38

Replying to @videlec:

From the patchbot ... Is it what we should get?

Note that the same error appears at #14058, even though the commits from here are not part of #14058, if I see that correctly.

Could someone verify if it is really the case that tests pass with "develop", but fail with the branch from here (merged in "develop" of course)? I currently do not have the bandwidth.

videlec commented 9 years ago

comment:39

Replying to @simon-king-jena:

Replying to @videlec: Could someone verify if it is really the case that tests pass with "develop", but fail with the branch from here (merged in "develop" of course)? I currently do not have the bandwidth.

It does fail, see #14058 comment 62.

simon-king-jena commented 9 years ago

comment:40

Replying to @videlec:

Replying to @simon-king-jena:

Replying to @videlec: Could someone verify if it is really the case that tests pass with "develop", but fail with the branch from here (merged in "develop" of course)? I currently do not have the bandwidth.

It does fail, see #14058 comment 62.

Please be clearer. It fails in what setting? Does it only fail with #14058? Does it also fail with develop? Does it also fail with the branch from here?

If it fails both here and at #14058, but not with develop, then I reckon both branches trigger a memory leak that was introduced elsewhere.

simon-king-jena commented 9 years ago

comment:41

Replying to @simon-king-jena:

Please be clearer. It fails in what setting? Does it only fail with #14058? Does it also fail with develop? Does it also fail with the branch from here?

Sorry for the noise. I just notice that #14058 is a dependency for the ticket here. So, we should focus on #14058.

simon-king-jena commented 9 years ago

comment:42

bump...

seblabbe commented 9 years ago

comment:43

I think the description of the ticket should be updated because I don't see those instances with multiple of one hundred anymore on sage-6.9.rc0. (Exactly what leaks are fixed in this ticket?)

Is the following line needed?

+from sage.structure.parent import Parent

Other than that, I verified that the branches indeed fixes the problem mentioned at comment 33. When the two thing above are fixed, to me, it will be a positive review.

sagemath / sage

fix more leaks found in #18897 #18905