Closed robertwb closed 12 years ago
Meanwhile I am rather desperate: I have not the faintest idea how the segfault occurs.
Therefore I used some debugging function that I registered using sys.settrace(...)
, so that all Python commands in the critical example are written into a file.
I posted logs for the unpatched and the patched version.
There is one obvious difference of the two logs: The hash is called more often in the patched version. Calling the hash is rather inefficient for matrix spaces: Each time when the hash of a matrix space is called, the matrix space's string representation is created, which is slow. I suggest to cache the hash value (like what I did for polynomial rings in #9944), but this should be on a different ticket.
Apart from that, I can't spot an obvious difference. Do you have any clue?
It turns out that using TripleDictById
for the _action_maps cache makes the segfault disappear.
If one uses TripleDict
for _coercion_maps then
sage -t devel/sage-main/sage/modular/modsym/space.py
takes 30 seconds, but if one also uses TripleDictById
then it only takes 23 seconds.
My conclusion:
TripleDict
was buggy: It uses id(...)
for the hash table, but ==
for comparison. I think that had to be fixed.TripleDict
uses hash(...)
for the hash table and ==
for comparison. That should be fine, but (1) it leads to a segfault and (2) it leads to a slowdown. After all, calling hash(...)
is a lot slower than determining the address.TripleDictById
uses id(...)
for the hash table and ... is ...
for comparison. Problem: It would probably not fix the memory leak.However, the fact that using TripleDictById
fixes the segfault makes me wonder: Perhaps the segfault occurs when calling hash(...)
on a parent? Namely, in some cases, and action will already be constructed during initialisation of a parent. But if the hash is determined based on cdef data that aren't initialised, a segfault can easily occur.
I'll investigate that further. In any case, we need to keep an eye on the potential slow-down.
The segfault does not occur while computing a hash. It occurs in line 468 of sage/matrix/matrix_rational_dense.pyx, namely
mpq_mul(y, w._entries[j], self._matrix[j][i])
I also tested, just before that line, that w[j]
and self.get_unsafe(j,i)
(which accesses w._entries[j]
and self._matrix[j],[i]
) works.
At this point, I am at my wits' end. To me, it looks like a change in the way of comparing dictionary keys modifies internals of mpir (IIRC, this is where mpq_mul is defined). gdb can not decipher the core file, and I don't know how valgrind can be used.
What else?
Which patches did you apply? With only trac715_two_tripledicts.patch
applied sage doesn't start.
Replying to @vbraun:
Which patches did you apply? With only
trac715_two_tripledicts.patch
applied sage doesn't start.
What???
According to hg qapplied
, I have
trac_12057_fix_doctests.patch
9138_flat.patch
trac_11319_prime_field_coercion.patch
trac_11319_number_field_example.patch
trac11900_category_speedup_combined.patch
11115_flat.patch
trac_11115_docfix.patch
trac715_two_tripledicts.patch
Remark: I work on openSUSE
, hence, I had to apply #12131 and thus also its dependency #12057. I doubt that the absence of #11115 is responsible for Sage not starting. And all other patches are dependencies.
What error occurs when you start Sage with my patch? If we are lucky, it gives some clue why the segfault in the one doctest occurs.
Best regards,
Simon
PS: I started on top of sage-4.8.alpha3.
Meanwhile I built sage-5.0.prealpha0 and applied #11780 and attachment: trac715_two_tripledicts.patch. Sage starts fine.
So, Volker, what had you have applied when Sage didn't start?
I think I made a progress: I found that the vector space that is part of the crash is not unique! So, the VectorMatrixAction
is defined for a vector space that is equal to but not identical with the vector space it is acting on!
The natural solution is to try and find out why the vector space is not unique. Vector spaces should be created using the VectorSpace
constructor, that relies on a UniqueFactory
. But apparently some very old code is constructing a vector space directly - it wouldn't be the first time that this is causing trouble.
PS: Note that vector spaces with different inner product are considered equal.
sage: V = QQ^5
sage: M = random_matrix(QQ,5,5)
sage: M.set_immutable()
sage: W = VectorSpace(QQ,5,inner_product_matrix=M)
sage: V
Vector space of dimension 5 over Rational Field
sage: W
Ambient quadratic space of dimension 5 over Rational Field
Inner product matrix:
[ 0 1/2 1 -1 -1]
[ 0 0 0 1 -1/2]
[ -2 0 0 0 0]
[ 1 0 2 0 0]
[ 0 -2 0 1 0]
sage: V==W
True
sage: type(V)==type(W)
False
But this is not the problem here: The two equal vector spaces involved in the crash have default inner product.
The non-uniqueness makes me think of another potential solution: The coercion model has a method "verify_action". This is only called when a new action is found, but not when an action is taken from the cache.
So, in addition to fixing the non-unique vector space in the modular symbols code, one could always verify the action. Probably this would be too slow, though.
Aha! We have a sparse versus a dense vector space! Here is our problem!
I did manage to install it and reproduce the crash. The core dump shows that the stack is completely corrupted before we called into gmp code.
Hi Volker,
good that you managed to install it. Meanwhile I think I can debug it without the core dump - I think mistaking a sparse with a dense vector space is a pretty convincing reason for a segfault.
However, I hate that old code!!
I tried verify_action
, but then hundreds of tests fail in sage/modular/modsym/space.py. So, apparently it is very common to have non-unique parents in such a way that the action can not be fixed!
For example, I see errors like
TypeError: Coercion of [Infinity] - [0] (of type <class 'sage.modular.modsym.boundary.BoundarySpaceElement'>) into Space of Boundary Modular Symbols for Congruence Subgroup Gamma0(43) of weight 2 and over Rational Field not (yet) defined.
Anyway, verify_action
is no solution.
Hi all,
Just wanted to say I had no problem installing the new patch on top of sage.4.8.alpha5 with tickets #9138 #11900 #1115 #715 and #11521 in that order and Sage launches. I'll start a make ptestlong now.
Hi Jean-Pierre,
don't start ptestlong - I am about to update the new patch such that the segfault does not occur and the time for executing the test is fine and the memleak is gone!
Use weak references to the keys of TripleDict
. Compare by "==" or by "is", depending on the application. Use weak references for storing actions.
Changed work issues from fix doctests to none
Description changed:
---
+++
@@ -18,3 +18,7 @@
Moreover there is at least also references to A in the IntegerMulAction and one in ZZ._action_hash.
So weak refs should be used in all these places if it does not slow things too much.
+
+Apply:
+
+[attachment: trac715_two_tripledicts.patch](https://github.com/sagemath/sage-prod/files/10637817/trac715_two_tripledicts.patch.gz)
Attachment: trac715_two_tripledicts.patch.gz
See the updated patch:
Apply trac715_two_tripledicts.patch
Here some remarks on the new patch:
I use TripleDictById
for storing actions, since otherwise we have trouble with non-unique parents and get segfaults.
In addition, I do not directly store the action but only a weak reference to it, since otherwise I couldn't fix the memory leak.
Sometimes, the stored action is in fact None
, for which we can't use a weak references. Instead, I use a constant function. For technical reasons it returns False and not None (namely, this is to avoid confusion with a weak reference that has become invalid).
Features
sage -t sage/modular/modsym/space.py
is gone.Thus, needs, review.
Ok, I'll give the new patch a go and report after make ptestlong and checking for the memleaks.
Replying to @jpflori:
Ok, I'll give the new patch a go and report after make ptestlong
So do I.
I guess at least one thing is needed: Provide a doc test that demonstrates the fix of the memory leak. This should be similar to the example for the patch that I have posted at #11521. Note that #11521 is in fact a duplicate: The examples from the two ticket descriptions are almost identical.
Actions have strong references to domain and codomain, so its no surprise that they keep their coercion cache entry alive. But I don't understand how storing a weak reference to the action can work; Nothing else keeps the action alive unless it happens to be used while the garbage collector is running. So actions are essentially not cached any more. It seem that either actions should only store weak references to domain/codomain or we implement some ring buffer that keeps the last N coerce maps unconditionally alive.
In fact, the action's reference to domain and codomain seem to be for convenience only. After all you know domain and codomain when you constuct the action and when you pick it from the cache, so there shouldn't be much incentive to look it up. Perhaps it would be easy to make them weak refs, did you look into that?
I agree with Volker and would like to test putting weak refs to domain and codomain in Functor as I suggested in #11521 and letting an option to use strong ref by default so that a user building an action but not storing its domain elsewhere won't see it disappear magically.
Unfortunately I do not have much time to do anything more than testing till the end of the week.
I wouldn't use weak references for anything but caching. In particular, having a weak reference from a functor to its domain or codomain seems a no-go to me.
In one point I agree: There should be a mechanism to keep an action alive as long as domain and codomain exist. But perhaps this is already the case? Isn't there an action cache as an attribute of any parent? And isn't the action stored there (and not only in the cache of the coercion model) when an action is discovered?
So, before thinking of a weak reference from the functor to domain and codomain, I would first test whether the problem you describe actually occurs.
Just two mental notes:
One test in sage/structure/coerce.pyx fails, because it explicitly uses the action cache (ignoring the fact that it now contains weak references and not actions).
And: The long tests of these two files
devel/sage/sage/graphs/graph_generators.py
devel/sage/sage/graphs/generic_graph.py
take 10 minutes each. Is my patch to blame, or has it been like that before?
With sage4.8.alpha5 plus #9138 #11900 #1115 #715 and #11521 applied some tests fail, namely in the files:
Oops, this should be more readable:
With sage4.8.alpha5 plus #9138 #11900 #1115 #715 and #11521 applied some tests fail, namely in the files:
sage.rings.padic.padic_base_generic_element.pyx (1 test failed) but did NOT when rerun and did again the next time,
sage.rings.number_field.number_field.py (1) and did NOT again,
sage.structure.coerce.pyx (5) and did again,
sage.algebras.quatalg.quaternion_algebra.py (1) and did again once and then did NOT again,
lots of them in sage.homology.* (20+25+50+93+1) and did again.
The random behavior of some of the above tests fails with:
IndexError?: list index out of range (padic)
Attribute Error: QuaternionAlgebra?_abstract_with_category object has no attribute _a (quatalg) and at some point in the stack TripleDicts? of the coercion model are present.
For info, the number_field test also fails with an "IndexError: list out of range".
The flaky behaviour probably means that sometimes something gets garbage collected when it shouldn't.
But why do you have the patch from #11521 applied?
Note that with sage-5.0.prealpha0 + #11780 + the new patch from here, I get two tests with errors, namely
sage -t --long -force_lib devel/sage/sage/structure/coerce_dict.pyx # 1 doctests failed
sage -t --long -force_lib devel/sage/sage/structure/coerce.pyx # 5 doctests failed
However, the tests took rather long in total: 12100 seconds with the new patch versus 4569 seconds unpatched.
I think the regression is not acceptable.
Well, perhaps you are right and we should experiment with weak references on domain and codomain.
Work Issues: avoid regression
Good point about #11521, I'd say because that what I was firstly interested in.
Without it applied, the flaky behavior seem to disappear.
I'll post timings with all patches, with all patches except for #11521, and with no patches in a few hours.
Anyway I guess Volker is right and even with just #715 applied we should check that actions do not get garbage collected continuously as your timings suggest.
Replying to @jpflori:
Good point about #11521, I'd say because that what I was firstly interested in.
Without it applied, the flaky behavior seem to disappear.
Good!
I'll post timings with all patches, with all patches except for #11521, and with no patches in a few hours.
OK, but I guess the timings I provided should be enough to show that the patch can not remain as it is now.
Anyway I guess Volker is right and even with just #715 applied we should check that actions do not get garbage collected continuously as your timings suggest.
Yep. Two potential solutions:
I'm trying the latter now.
Replying to @simon-king-jena:
Replying to @jpflori:
Good point about #11521, I'd say because that what I was firstly interested in. Without it applied, the flaky behavior seem to disappear.
Good!
I'll post timings with all patches, with all patches except for #11521, and with no patches in a few hours.
OK, but I guess the timings I provided should be enough to show that the patch can not remain as it is now.
Anyway I guess Volker is right and even with just #715 applied we should check that actions do not get garbage collected continuously as your timings suggest.
Yep. Two potential solutions: 1. Find out why apparently not all actions are registered in the parent (because then we would have a strong reference as long as at least the domain is alive). 2. Play with the idea to have a strong reference on the action but a weak reference from a functor to its domain and codomain. I'm trying the latter now.
Just to summarize, here is the current problem, please correct me if some of the following is wrong: we want to let a codomain (resp. domain) get garbage collected when its only weak reffed outside of the coercion model.
Before the current patch the situation is as follows for actions:
when an action is resolved, it is cached in a triple dict in the coercion model with the domain and codomains as keys
the action is also cached in the dictionnaries in the domain and the codomain much in the same way
there is also a similar cache for homsets
The current patch let weakrefs be used for the keys to the above dictionaries and a weak ref to the corresponding value (which is the action).
The problem is that as the action is only weak reffed everywhere now, it gets garbage collected all the time (to be confirmed).
If it is not, then the codomain (resp. domain) will in turn not get garbage collected, because it will be strongly reffed in the action strongly reffed in the domain (resp. codomain) (to be confirmed).
The problem for the homset patch is slightly different and is being discussed in #11521.
Replying to @simon-king-jena:
- Find out why apparently not all actions are registered in the parent (because then we would have a strong reference as long as at least the domain is alive).
That's why:
sage: search_src("register_action")
structure/parent.pyx:1698: self.register_action(action)
structure/parent.pyx:1791: cpdef register_action(self, action):
structure/parent.pyx:1841: sage: R.register_action(act)
structure/parent.pxd:29: cpdef register_action(self, action)
So, simply register action isn't used at all - which makes me think why some actions are stored in the parent.
I see. register_action is not to be used after any coercion was established.
Just as a remark from the side lines, it seems that consistently storing a reference in the parent would be the cleanest solution. Perhaps the testsuite stuff can be used to verify that all parents do that?
Replying to @vbraun:
Just as a remark from the side lines, it seems that consistently storing a reference in the parent would be the cleanest solution.
But perhaps a difficult one. The condition that register_action
must not be used after defining any coercion is probably there for a reason.
Perhaps the testsuite stuff can be used to verify that all parents do that?
How could it? By hooking into the coercion model, look up any action there and verify that all are registered?
Attachment: test_orphan_functor.gz
Experimental patch using weak references on domain and codomain of functors
I have posted an experimental patch, that has to be applied on top of attachment: trac715_two_tripledicts.patch.
With the experimental patch, the coercion model stores strong references to the actions (hence, it restores the original behaviour), but functors will only store weak references to their domains and codomains.
Unfortunately, this does not fix the memory leak. But perhaps you want to play with it...
Ah! And I just see that "sage.categories.functor" was the wrong location to do the change.
Or I should say: Action.__domain
is not what the action acts on, but it is a groupoid, and is not used. So, forget the experimental patch.
An action of G on S stores direct references to G and to S.
The action is a functor, and as a functor, it additionally stores a reference to Groupoid(G)
, which stores another reference to G, and to the category of S.
In some cases, the category of S will store references to the base ring of S (for example, if S is an algebra), which might have a pointer back to S (for example if the action of S.base_ring()
on S was registered during initialisation). In this case, we are lost, since categories are unique parents and thus strongly cached (unless we apply #12215, which poses some problems).
For the same reason, creating the groupoid of G will result in an eternal reference on G (Groupoid(G)
is strongly cached and it points to G). So, the best that we can hope for is that we can free S at some point, but we will never be able to free G.
It starts to be complicated. Time to call it a day...
Perhaps the idea to register actions in the parents (in addition to a weak cache in the coercion model) is better?
Replying to @simon-king-jena:
An action of G on S stores direct references to G and to S. The action is a functor, and as a functor, it additionally stores a reference to
Groupoid(G)
, which stores another reference to G, and to the category of S. In some cases, the category of S will store references to the base ring of S (for example, if S is an algebra), which might have a pointer back to S (for example if the action ofS.base_ring()
on S was registered during initialisation). In this case, we are lost, since categories are unique parents and thus strongly cached (unless we apply #12215, which poses some problems). For the same reason, creating the groupoid of G will result in an eternal reference on G (Groupoid(G)
is strongly cached and it points to G). So, the best that we can hope for is that we can free S at some point, but we will never be able to free G. It starts to be complicated. Time to call it a day... Perhaps the idea to register actions in the parents (in addition to a weak cache in the coercion model) is better?
But if you store the actions in both parents (with strong references), you will never be able to free any of the two domain and codomain.
In the ticket example for example you would get a strong reference to the action in the ZZ cache (which will hopefully never get deleted) (in fact that is what is happening with the current Sage version anyway, isn't that strange according to what you posted, because I guess is already initialized once the for loop is executed?) so the elliptic curves (in the ticket example you only get one stored in that cache because comarison was made with "==", if you let the j invariant change within the for loop you would get a growing number of curves in that cache) will stay strongly refed forever as well...
I got about 3350 sec on top of vanilla sage-4.8.alpha5.
Hi Jean-Pierre,
Replying to @jpflori:
But if you store the actions in both parents (with strong references), you will never be able to free any of the two domain and codomain.
This is not necessarily the case. You would merely get circular references, and they would not obstruct garbage collection, unless one item in the cycle has a __del__
method.
One problem, however, is that many actions start with ZZ
. And if ZZ
is contained in the cycle, then it can not be collected, since ZZ
will live forever -- but you know that.
In the ticket example for example you would get a strong reference to the action in the ZZ cache (which will hopefully never get deleted) (in fact that is what is happening with the current Sage version anyway, isn't that strange according to what you posted, because I guess is already initialized once the for loop is executed?)
Yes. And is it really sure that the actions are stored in ZZ
?
Anyway. They are stored by ==
, and thus only one copy remains alive.
so the elliptic curves (in the ticket example you only get one stored in that cache because comarison was made with "==", if you let the j invariant change within the for loop you would get a growing number of curves in that cache) will stay strongly refed forever as well...
Yes. And that is a problem that, again, might be solved using weak references.
Namely:
Consider an action A of G on S. Typically, G is immortal (like ZZ
), but we are willing to let A and S die if we do not have any "external" strong reference to S. In particular, the existence of A should not be enough to keep S alive.
I think this can be accomplished as follows:
Let us analyse what happens with G, S and A:
Groupoid(G)
; Groupoid(G)
is strongly cached (unless we use weak caching for UniqueRepresentation
) and must have a reference to G. If we decide to use weak caching for UniqueRepresentation
, then we would only have a strong reference from G to A and a weak or strong reference from A to G. That would be fine for garbage collection. Anyway, I think keeping G alive will not hurt.However, this is how the experimental patch should work - and it does not fix the leak. Perhaps this is, again, due to caching the homsets? So, we would need the patch from #12215 as well. Difficult topic.
Sorry, I meant "we would need the patch from #11521 as well".
Description changed:
---
+++
@@ -21,4 +21,5 @@
Apply:
-[attachment: trac715_two_tripledicts.patch](https://github.com/sagemath/sage-prod/files/10637817/trac715_two_tripledicts.patch.gz)
+* [attachment: trac715_two_tripledicts.patch](https://github.com/sagemath/sage-prod/files/10637817/trac715_two_tripledicts.patch.gz)
+* [attachment: trac715_weak_action.patch](https://github.com/sagemath/sage-prod/files/10637819/trac715_weak_action.patch.gz)
I have attached another patch, which implements the ideas sketched above. I think it corresponds to what you suggested ("use a weak reference from the action to the domain").
One detail: We have to distinguish between the underlying set, the domain and the codomain of an action. In fact, the new patch only uses a weak reference to the underlying set, and introduces a cdef function (hence, hopefully with little overhead) returning it.
I consider sage-5.0.prealpha0 plus trac11780_unique_auxiliar_polyring.patch (probably not needed) plus attachment: trac715_two_tripledicts.patch plus attachment: trac715_weak_action.patch.
At least sage -t sage/modular/modsym/space.py
passes, but I need to run the whole test suite.
The example from the ticket description does not leak. However, if the j-invariant varies, it seems that for each elliptic curve one copy is preserved:
sage: K = GF(1<<55,'t')
sage: for i in range(500):
....: a = K.random_element()
....: E = EllipticCurve(j=a)
....: P = E.random_point()
....: Q = 2*P
....:
sage: import gc
sage: gc.collect()
2124
sage: from sage.schemes.generic.homset import SchemeHomsetModule_abelian_variety_coordinates_field
sage: LE = [x for x in gc.get_objects() if isinstance(x,SchemeHomsetModule_abelian_variety_coordinates_field)]
sage: len(LE)
500
In any case, the original leak is fixed with the two patches. The question is whether the second patch suffices to keep actions alive, whether it avoids a regression, and whether all tests pass.
If everything is alright, we may still try to find out where the remaining strong reference to an elliptic curve comes from.
PS: The additional application of #11521 does not suffice to avoid the remaining strong reference to an elliptic curve.
With the two patches applied, I get some doctest errors that seem trivial to fix, and it takes 10905 seconds in total. Now, I am not sure: Originally, I had much less time with unpatched Sage.
But perhaps my computer was in a different state at that time? Jean-Pierre, if I understood correctly, you did not find any significant slowdown, right?
The first (i.e., the "official") patch is enough to fix the leak for the original example. According to Jean-Pierre, the timings are fine, it does not matter whether we have no patch, the official patch only, or the first experimental patch. And according to my own test, it does not matter whether we have the first or the second experimental patch.
So, the further proceeding depends on the following questions:
The experimental patches provide two different approaches to fix a potential problem, namely actions being deallocated when they are still needed. However, is this potential problem a real problem? Only then would it make sense to consider the experimental patches!
Do we also want to fix the leak in the more difficult example, namely when the j-invariant varies? In this case, we need to find out why the actions are registered in ZZ. It is not clear yet whether one really needs one of the experimental patches to get rid of it.
What is your answer to the questions?
Here is a small example illustrating the issue.
The memory footprint of the following piece of code grows indefinitely.
E and P get deleted, but when 2*P is computed, the action of integers on A, the abelian group of rational points of the ellitpic curve, gets cached in the corecion model.
A key-value pair is left in coercion_model._action_maps dict:
(ZZ,A,*) : IntegerMulAction
Moreover there is at least also references to A in the IntegerMulAction and one in ZZ._action_hash.
So weak refs should be used in all these places if it does not slow things too much.
To be merged with #11521. Apply:
and then the patches from #11521.
Depends on #13145 Depends on #13741 Depends on #13746 Depends on #11521
Dependencies: #13145, #13741, #13746, to be merged with #11521
CC: @jpflori @zimmermann6 @vbraun @robertwb @nbruin @malb @orlitzky
Component: coercion
Keywords: weak cache coercion Cernay2012
Author: Simon King, Jean-Pierre Flori
Reviewer: Jean-Pierre Flori, Simon King, Nils Bruin
Merged: sage-5.5.beta0
Issue created by migration from https://trac.sagemath.org/ticket/715