sagemath / sage

Main repository of SageMath
https://www.sagemath.org
Other
1.44k stars 480 forks source link

Parents probably not reclaimed due to too much caching #715

Closed robertwb closed 12 years ago

robertwb commented 17 years ago

Here is a small example illustrating the issue.

The memory footprint of the following piece of code grows indefinitely.

sage: K = GF(1<<55,'t') 
sage: a = K.random_element() 
sage: while 1: 
....:     E = EllipticCurve(j=a); P = E.random_point(); 2*P; del E, P;

E and P get deleted, but when 2*P is computed, the action of integers on A, the abelian group of rational points of the ellitpic curve, gets cached in the corecion model.

A key-value pair is left in coercion_model._action_maps dict:

(ZZ,A,*) : IntegerMulAction

Moreover there is at least also references to A in the IntegerMulAction and one in ZZ._action_hash.

So weak refs should be used in all these places if it does not slow things too much.

To be merged with #11521. Apply:

and then the patches from #11521.

Depends on #13145 Depends on #13741 Depends on #13746 Depends on #11521

Dependencies: #13145, #13741, #13746, to be merged with #11521

CC: @jpflori @zimmermann6 @vbraun @robertwb @nbruin @malb @orlitzky

Component: coercion

Keywords: weak cache coercion Cernay2012

Author: Simon King, Jean-Pierre Flori

Reviewer: Jean-Pierre Flori, Simon King, Nils Bruin

Merged: sage-5.5.beta0

Issue created by migration from https://trac.sagemath.org/ticket/715

simon-king-jena commented 12 years ago
comment:47

Meanwhile I am rather desperate: I have not the faintest idea how the segfault occurs.

Therefore I used some debugging function that I registered using sys.settrace(...), so that all Python commands in the critical example are written into a file.

I posted logs for the unpatched and the patched version.

There is one obvious difference of the two logs: The hash is called more often in the patched version. Calling the hash is rather inefficient for matrix spaces: Each time when the hash of a matrix space is called, the matrix space's string representation is created, which is slow. I suggest to cache the hash value (like what I did for polynomial rings in #9944), but this should be on a different ticket.

Apart from that, I can't spot an obvious difference. Do you have any clue?

simon-king-jena commented 12 years ago
comment:48

It turns out that using TripleDictById for the _action_maps cache makes the segfault disappear.

If one uses TripleDict for _coercion_maps then

sage -t  devel/sage-main/sage/modular/modsym/space.py

takes 30 seconds, but if one also uses TripleDictById then it only takes 23 seconds.

My conclusion:

However, the fact that using TripleDictById fixes the segfault makes me wonder: Perhaps the segfault occurs when calling hash(...) on a parent? Namely, in some cases, and action will already be constructed during initialisation of a parent. But if the hash is determined based on cdef data that aren't initialised, a segfault can easily occur.

I'll investigate that further. In any case, we need to keep an eye on the potential slow-down.

simon-king-jena commented 12 years ago
comment:49

The segfault does not occur while computing a hash. It occurs in line 468 of sage/matrix/matrix_rational_dense.pyx, namely

                mpq_mul(y, w._entries[j], self._matrix[j][i])

I also tested, just before that line, that w[j] and self.get_unsafe(j,i) (which accesses w._entries[j] and self._matrix[j],[i]) works.

At this point, I am at my wits' end. To me, it looks like a change in the way of comparing dictionary keys modifies internals of mpir (IIRC, this is where mpq_mul is defined). gdb can not decipher the core file, and I don't know how valgrind can be used.

What else?

vbraun commented 12 years ago
comment:50

Which patches did you apply? With only trac715_two_tripledicts.patch applied sage doesn't start.

simon-king-jena commented 12 years ago
comment:51

Replying to @vbraun:

Which patches did you apply? With only trac715_two_tripledicts.patch applied sage doesn't start.

What???

According to hg qapplied, I have

trac_12057_fix_doctests.patch
9138_flat.patch
trac_11319_prime_field_coercion.patch
trac_11319_number_field_example.patch
trac11900_category_speedup_combined.patch
11115_flat.patch
trac_11115_docfix.patch
trac715_two_tripledicts.patch

Remark: I work on openSUSE, hence, I had to apply #12131 and thus also its dependency #12057. I doubt that the absence of #11115 is responsible for Sage not starting. And all other patches are dependencies.

What error occurs when you start Sage with my patch? If we are lucky, it gives some clue why the segfault in the one doctest occurs.

Best regards,

Simon

simon-king-jena commented 12 years ago
comment:52

PS: I started on top of sage-4.8.alpha3.

simon-king-jena commented 12 years ago
comment:53

Meanwhile I built sage-5.0.prealpha0 and applied #11780 and attachment: trac715_two_tripledicts.patch. Sage starts fine.

So, Volker, what had you have applied when Sage didn't start?

simon-king-jena commented 12 years ago
comment:54

I think I made a progress: I found that the vector space that is part of the crash is not unique! So, the VectorMatrixAction is defined for a vector space that is equal to but not identical with the vector space it is acting on!

The natural solution is to try and find out why the vector space is not unique. Vector spaces should be created using the VectorSpace constructor, that relies on a UniqueFactory. But apparently some very old code is constructing a vector space directly - it wouldn't be the first time that this is causing trouble.

simon-king-jena commented 12 years ago
comment:55

PS: Note that vector spaces with different inner product are considered equal.

sage: V = QQ^5
sage: M = random_matrix(QQ,5,5)
sage: M.set_immutable()
sage: W = VectorSpace(QQ,5,inner_product_matrix=M)
sage: V
Vector space of dimension 5 over Rational Field
sage: W
Ambient quadratic space of dimension 5 over Rational Field
Inner product matrix:
[   0  1/2    1   -1   -1]
[   0    0    0    1 -1/2]
[  -2    0    0    0    0]
[   1    0    2    0    0]
[   0   -2    0    1    0]
sage: V==W
True
sage: type(V)==type(W)
False

But this is not the problem here: The two equal vector spaces involved in the crash have default inner product.

The non-uniqueness makes me think of another potential solution: The coercion model has a method "verify_action". This is only called when a new action is found, but not when an action is taken from the cache.

So, in addition to fixing the non-unique vector space in the modular symbols code, one could always verify the action. Probably this would be too slow, though.

simon-king-jena commented 12 years ago
comment:56

Aha! We have a sparse versus a dense vector space! Here is our problem!

vbraun commented 12 years ago
comment:57

I did manage to install it and reproduce the crash. The core dump shows that the stack is completely corrupted before we called into gmp code.

simon-king-jena commented 12 years ago
comment:58

Hi Volker,

good that you managed to install it. Meanwhile I think I can debug it without the core dump - I think mistaking a sparse with a dense vector space is a pretty convincing reason for a segfault.

However, I hate that old code!!

I tried verify_action, but then hundreds of tests fail in sage/modular/modsym/space.py. So, apparently it is very common to have non-unique parents in such a way that the action can not be fixed!

For example, I see errors like

    TypeError: Coercion of [Infinity] - [0] (of type <class 'sage.modular.modsym.boundary.BoundarySpaceElement'>) into Space of Boundary Modular Symbols for Congruence Subgroup Gamma0(43) of weight 2 and over Rational Field not (yet) defined.

Anyway, verify_action is no solution.

jpflori commented 12 years ago
comment:59

Hi all,

Just wanted to say I had no problem installing the new patch on top of sage.4.8.alpha5 with tickets #9138 #11900 #1115 #715 and #11521 in that order and Sage launches. I'll start a make ptestlong now.

simon-king-jena commented 12 years ago
comment:60

Hi Jean-Pierre,

don't start ptestlong - I am about to update the new patch such that the segfault does not occur and the time for executing the test is fine and the memleak is gone!

simon-king-jena commented 12 years ago

Use weak references to the keys of TripleDict. Compare by "==" or by "is", depending on the application. Use weak references for storing actions.

simon-king-jena commented 12 years ago

Changed work issues from fix doctests to none

simon-king-jena commented 12 years ago

Description changed:

--- 
+++ 
@@ -18,3 +18,7 @@
 Moreover there is at least also references to A in the IntegerMulAction and one in ZZ._action_hash.

 So weak refs should be used in all these places if it does not slow things too much.
+
+Apply:
+
+[attachment: trac715_two_tripledicts.patch](https://github.com/sagemath/sage-prod/files/10637817/trac715_two_tripledicts.patch.gz)
simon-king-jena commented 12 years ago
comment:61

Attachment: trac715_two_tripledicts.patch.gz

See the updated patch:

Apply trac715_two_tripledicts.patch

simon-king-jena commented 12 years ago
comment:62

Here some remarks on the new patch:

I use TripleDictById for storing actions, since otherwise we have trouble with non-unique parents and get segfaults.

In addition, I do not directly store the action but only a weak reference to it, since otherwise I couldn't fix the memory leak.

Sometimes, the stored action is in fact None, for which we can't use a weak references. Instead, I use a constant function. For technical reasons it returns False and not None (namely, this is to avoid confusion with a weak reference that has become invalid).

Features

Thus, needs, review.

jpflori commented 12 years ago
comment:63

Ok, I'll give the new patch a go and report after make ptestlong and checking for the memleaks.

simon-king-jena commented 12 years ago
comment:64

Replying to @jpflori:

Ok, I'll give the new patch a go and report after make ptestlong

So do I.

I guess at least one thing is needed: Provide a doc test that demonstrates the fix of the memory leak. This should be similar to the example for the patch that I have posted at #11521. Note that #11521 is in fact a duplicate: The examples from the two ticket descriptions are almost identical.

vbraun commented 12 years ago
comment:65

Actions have strong references to domain and codomain, so its no surprise that they keep their coercion cache entry alive. But I don't understand how storing a weak reference to the action can work; Nothing else keeps the action alive unless it happens to be used while the garbage collector is running. So actions are essentially not cached any more. It seem that either actions should only store weak references to domain/codomain or we implement some ring buffer that keeps the last N coerce maps unconditionally alive.

In fact, the action's reference to domain and codomain seem to be for convenience only. After all you know domain and codomain when you constuct the action and when you pick it from the cache, so there shouldn't be much incentive to look it up. Perhaps it would be easy to make them weak refs, did you look into that?

jpflori commented 12 years ago
comment:66

I agree with Volker and would like to test putting weak refs to domain and codomain in Functor as I suggested in #11521 and letting an option to use strong ref by default so that a user building an action but not storing its domain elsewhere won't see it disappear magically.

Unfortunately I do not have much time to do anything more than testing till the end of the week.

simon-king-jena commented 12 years ago
comment:67

I wouldn't use weak references for anything but caching. In particular, having a weak reference from a functor to its domain or codomain seems a no-go to me.

In one point I agree: There should be a mechanism to keep an action alive as long as domain and codomain exist. But perhaps this is already the case? Isn't there an action cache as an attribute of any parent? And isn't the action stored there (and not only in the cache of the coercion model) when an action is discovered?

So, before thinking of a weak reference from the functor to domain and codomain, I would first test whether the problem you describe actually occurs.

simon-king-jena commented 12 years ago
comment:68

Just two mental notes:

One test in sage/structure/coerce.pyx fails, because it explicitly uses the action cache (ignoring the fact that it now contains weak references and not actions).

And: The long tests of these two files

devel/sage/sage/graphs/graph_generators.py
devel/sage/sage/graphs/generic_graph.py

take 10 minutes each. Is my patch to blame, or has it been like that before?

jpflori commented 12 years ago
comment:69

With sage4.8.alpha5 plus #9138 #11900 #1115 #715 and #11521 applied some tests fail, namely in the files:

jpflori commented 12 years ago
comment:70

Oops, this should be more readable:

With sage4.8.alpha5 plus #9138 #11900 #1115 #715 and #11521 applied some tests fail, namely in the files:

The random behavior of some of the above tests fails with:

jpflori commented 12 years ago
comment:71

For info, the number_field test also fails with an "IndexError: list out of range".

simon-king-jena commented 12 years ago
comment:72

The flaky behaviour probably means that sometimes something gets garbage collected when it shouldn't.

But why do you have the patch from #11521 applied?

Note that with sage-5.0.prealpha0 + #11780 + the new patch from here, I get two tests with errors, namely

        sage -t  --long -force_lib devel/sage/sage/structure/coerce_dict.pyx # 1 doctests failed
        sage -t  --long -force_lib devel/sage/sage/structure/coerce.pyx # 5 doctests failed

However, the tests took rather long in total: 12100 seconds with the new patch versus 4569 seconds unpatched.

I think the regression is not acceptable.

Well, perhaps you are right and we should experiment with weak references on domain and codomain.

simon-king-jena commented 12 years ago

Work Issues: avoid regression

jpflori commented 12 years ago
comment:73

Good point about #11521, I'd say because that what I was firstly interested in.

Without it applied, the flaky behavior seem to disappear.

I'll post timings with all patches, with all patches except for #11521, and with no patches in a few hours.

Anyway I guess Volker is right and even with just #715 applied we should check that actions do not get garbage collected continuously as your timings suggest.

simon-king-jena commented 12 years ago
comment:74

Replying to @jpflori:

Good point about #11521, I'd say because that what I was firstly interested in.

Without it applied, the flaky behavior seem to disappear.

Good!

I'll post timings with all patches, with all patches except for #11521, and with no patches in a few hours.

OK, but I guess the timings I provided should be enough to show that the patch can not remain as it is now.

Anyway I guess Volker is right and even with just #715 applied we should check that actions do not get garbage collected continuously as your timings suggest.

Yep. Two potential solutions:

  1. Find out why apparently not all actions are registered in the parent (because then we would have a strong reference as long as at least the domain is alive).
  2. Play with the idea to have a strong reference on the action but a weak reference from a functor to its domain and codomain.

I'm trying the latter now.

jpflori commented 12 years ago
comment:75

Replying to @simon-king-jena:

Replying to @jpflori:

Good point about #11521, I'd say because that what I was firstly interested in. Without it applied, the flaky behavior seem to disappear.

Good!

I'll post timings with all patches, with all patches except for #11521, and with no patches in a few hours.

OK, but I guess the timings I provided should be enough to show that the patch can not remain as it is now.

Anyway I guess Volker is right and even with just #715 applied we should check that actions do not get garbage collected continuously as your timings suggest.

Yep. Two potential solutions: 1. Find out why apparently not all actions are registered in the parent (because then we would have a strong reference as long as at least the domain is alive). 2. Play with the idea to have a strong reference on the action but a weak reference from a functor to its domain and codomain. I'm trying the latter now.

Just to summarize, here is the current problem, please correct me if some of the following is wrong: we want to let a codomain (resp. domain) get garbage collected when its only weak reffed outside of the coercion model.

Before the current patch the situation is as follows for actions:

The current patch let weakrefs be used for the keys to the above dictionaries and a weak ref to the corresponding value (which is the action).

The problem is that as the action is only weak reffed everywhere now, it gets garbage collected all the time (to be confirmed).

If it is not, then the codomain (resp. domain) will in turn not get garbage collected, because it will be strongly reffed in the action strongly reffed in the domain (resp. codomain) (to be confirmed).

The problem for the homset patch is slightly different and is being discussed in #11521.

simon-king-jena commented 12 years ago
comment:76

Replying to @simon-king-jena:

  1. Find out why apparently not all actions are registered in the parent (because then we would have a strong reference as long as at least the domain is alive).

That's why:

sage: search_src("register_action")
structure/parent.pyx:1698:            self.register_action(action)
structure/parent.pyx:1791:    cpdef register_action(self, action):
structure/parent.pyx:1841:            sage: R.register_action(act)
structure/parent.pxd:29:    cpdef register_action(self, action)

So, simply register action isn't used at all - which makes me think why some actions are stored in the parent.

simon-king-jena commented 12 years ago
comment:77

I see. register_action is not to be used after any coercion was established.

vbraun commented 12 years ago
comment:78

Just as a remark from the side lines, it seems that consistently storing a reference in the parent would be the cleanest solution. Perhaps the testsuite stuff can be used to verify that all parents do that?

simon-king-jena commented 12 years ago
comment:79

Replying to @vbraun:

Just as a remark from the side lines, it seems that consistently storing a reference in the parent would be the cleanest solution.

But perhaps a difficult one. The condition that register_action must not be used after defining any coercion is probably there for a reason.

Perhaps the testsuite stuff can be used to verify that all parents do that?

How could it? By hooking into the coercion model, look up any action there and verify that all are registered?

simon-king-jena commented 12 years ago

Attachment: test_orphan_functor.gz

Experimental patch using weak references on domain and codomain of functors

simon-king-jena commented 12 years ago
comment:80

I have posted an experimental patch, that has to be applied on top of attachment: trac715_two_tripledicts.patch.

With the experimental patch, the coercion model stores strong references to the actions (hence, it restores the original behaviour), but functors will only store weak references to their domains and codomains.

Unfortunately, this does not fix the memory leak. But perhaps you want to play with it...

Ah! And I just see that "sage.categories.functor" was the wrong location to do the change.

simon-king-jena commented 12 years ago
comment:81

Or I should say: Action.__domain is not what the action acts on, but it is a groupoid, and is not used. So, forget the experimental patch.

simon-king-jena commented 12 years ago
comment:82

An action of G on S stores direct references to G and to S.

The action is a functor, and as a functor, it additionally stores a reference to Groupoid(G), which stores another reference to G, and to the category of S.

In some cases, the category of S will store references to the base ring of S (for example, if S is an algebra), which might have a pointer back to S (for example if the action of S.base_ring() on S was registered during initialisation). In this case, we are lost, since categories are unique parents and thus strongly cached (unless we apply #12215, which poses some problems).

For the same reason, creating the groupoid of G will result in an eternal reference on G (Groupoid(G) is strongly cached and it points to G). So, the best that we can hope for is that we can free S at some point, but we will never be able to free G.

It starts to be complicated. Time to call it a day...

Perhaps the idea to register actions in the parents (in addition to a weak cache in the coercion model) is better?

jpflori commented 12 years ago
comment:83

Replying to @simon-king-jena:

An action of G on S stores direct references to G and to S. The action is a functor, and as a functor, it additionally stores a reference to Groupoid(G), which stores another reference to G, and to the category of S. In some cases, the category of S will store references to the base ring of S (for example, if S is an algebra), which might have a pointer back to S (for example if the action of S.base_ring() on S was registered during initialisation). In this case, we are lost, since categories are unique parents and thus strongly cached (unless we apply #12215, which poses some problems). For the same reason, creating the groupoid of G will result in an eternal reference on G (Groupoid(G) is strongly cached and it points to G). So, the best that we can hope for is that we can free S at some point, but we will never be able to free G. It starts to be complicated. Time to call it a day... Perhaps the idea to register actions in the parents (in addition to a weak cache in the coercion model) is better?

But if you store the actions in both parents (with strong references), you will never be able to free any of the two domain and codomain.

In the ticket example for example you would get a strong reference to the action in the ZZ cache (which will hopefully never get deleted) (in fact that is what is happening with the current Sage version anyway, isn't that strange according to what you posted, because I guess is already initialized once the for loop is executed?) so the elliptic curves (in the ticket example you only get one stored in that cache because comarison was made with "==", if you let the j invariant change within the for loop you would get a growing number of curves in that cache) will stay strongly refed forever as well...

jpflori commented 12 years ago
comment:84

My timings

So, unless I did something wrong, I do not get any significant slow down...

jpflori commented 12 years ago
comment:85

I got about 3350 sec on top of vanilla sage-4.8.alpha5.

simon-king-jena commented 12 years ago
comment:86

Hi Jean-Pierre,

Replying to @jpflori:

But if you store the actions in both parents (with strong references), you will never be able to free any of the two domain and codomain.

This is not necessarily the case. You would merely get circular references, and they would not obstruct garbage collection, unless one item in the cycle has a __del__ method.

One problem, however, is that many actions start with ZZ. And if ZZ is contained in the cycle, then it can not be collected, since ZZ will live forever -- but you know that.

In the ticket example for example you would get a strong reference to the action in the ZZ cache (which will hopefully never get deleted) (in fact that is what is happening with the current Sage version anyway, isn't that strange according to what you posted, because I guess is already initialized once the for loop is executed?)

Yes. And is it really sure that the actions are stored in ZZ?

Anyway. They are stored by ==, and thus only one copy remains alive.

so the elliptic curves (in the ticket example you only get one stored in that cache because comarison was made with "==", if you let the j invariant change within the for loop you would get a growing number of curves in that cache) will stay strongly refed forever as well...

Yes. And that is a problem that, again, might be solved using weak references.

Namely:

Consider an action A of G on S. Typically, G is immortal (like ZZ), but we are willing to let A and S die if we do not have any "external" strong reference to S. In particular, the existence of A should not be enough to keep S alive.

I think this can be accomplished as follows:

Let us analyse what happens with G, S and A:

  1. G will remain alive forever, even without an external reference. Namely, the coercion cache has a strong reference to A; as a functor, A points to Groupoid(G); Groupoid(G) is strongly cached (unless we use weak caching for UniqueRepresentation) and must have a reference to G. If we decide to use weak caching for UniqueRepresentation, then we would only have a strong reference from G to A and a weak or strong reference from A to G. That would be fine for garbage collection. Anyway, I think keeping G alive will not hurt.
  2. Assume that there is no external reference to S. There is a weak reference to S from the cache in the coercion model, namely as key of the cache. Moreover, there is another weak reference from A to S. Hence, S could be garbage collected.
  3. Assume that there is no external reference to A. If S is garbage collected (see the previous point), then it will remove itself from the coercion cache, and thus the strong reference to A would vanish - it could be collected. But if S is alive, then A will remain alive as well.

However, this is how the experimental patch should work - and it does not fix the leak. Perhaps this is, again, due to caching the homsets? So, we would need the patch from #12215 as well. Difficult topic.

simon-king-jena commented 12 years ago
comment:87

Sorry, I meant "we would need the patch from #11521 as well".

simon-king-jena commented 12 years ago

Description changed:

--- 
+++ 
@@ -21,4 +21,5 @@

 Apply:

-[attachment: trac715_two_tripledicts.patch](https://github.com/sagemath/sage-prod/files/10637817/trac715_two_tripledicts.patch.gz)
+* [attachment: trac715_two_tripledicts.patch](https://github.com/sagemath/sage-prod/files/10637817/trac715_two_tripledicts.patch.gz)
+* [attachment: trac715_weak_action.patch](https://github.com/sagemath/sage-prod/files/10637819/trac715_weak_action.patch.gz)
simon-king-jena commented 12 years ago
comment:88

I have attached another patch, which implements the ideas sketched above. I think it corresponds to what you suggested ("use a weak reference from the action to the domain").

One detail: We have to distinguish between the underlying set, the domain and the codomain of an action. In fact, the new patch only uses a weak reference to the underlying set, and introduces a cdef function (hence, hopefully with little overhead) returning it.

I consider sage-5.0.prealpha0 plus trac11780_unique_auxiliar_polyring.patch (probably not needed) plus attachment: trac715_two_tripledicts.patch plus attachment: trac715_weak_action.patch.

At least sage -t sage/modular/modsym/space.py passes, but I need to run the whole test suite.

The example from the ticket description does not leak. However, if the j-invariant varies, it seems that for each elliptic curve one copy is preserved:

sage: K = GF(1<<55,'t')
sage: for i in range(500):
....:     a = K.random_element()
....:     E = EllipticCurve(j=a)
....:     P = E.random_point()
....:     Q = 2*P
....:     
sage: import gc
sage: gc.collect()
2124
sage: from sage.schemes.generic.homset import SchemeHomsetModule_abelian_variety_coordinates_field
sage: LE = [x for x in gc.get_objects() if  isinstance(x,SchemeHomsetModule_abelian_variety_coordinates_field)]
sage: len(LE)
500

In any case, the original leak is fixed with the two patches. The question is whether the second patch suffices to keep actions alive, whether it avoids a regression, and whether all tests pass.

If everything is alright, we may still try to find out where the remaining strong reference to an elliptic curve comes from.

simon-king-jena commented 12 years ago
comment:89

PS: The additional application of #11521 does not suffice to avoid the remaining strong reference to an elliptic curve.

simon-king-jena commented 12 years ago
comment:90

With the two patches applied, I get some doctest errors that seem trivial to fix, and it takes 10905 seconds in total. Now, I am not sure: Originally, I had much less time with unpatched Sage.

But perhaps my computer was in a different state at that time? Jean-Pierre, if I understood correctly, you did not find any significant slowdown, right?

The first (i.e., the "official") patch is enough to fix the leak for the original example. According to Jean-Pierre, the timings are fine, it does not matter whether we have no patch, the official patch only, or the first experimental patch. And according to my own test, it does not matter whether we have the first or the second experimental patch.

So, the further proceeding depends on the following questions:

What is your answer to the questions?