sagemath / sage

Main repository of SageMath
https://www.sagemath.org
Other
1.47k stars 486 forks source link

Make libsingular multivariate polynomial rings collectable #13447

Open nbruin opened 12 years ago

nbruin commented 12 years ago

Presently, #715 + #11521 help not permanently keeping parent in memory. In the process we uncovered a hard-but-consistently triggerable problem with the collection of MPolynomialRing_libsingular. We have only observed the problem on bsd.math.washington.edu, MacOSX 10.6 on x86_64.

The present work-around is to permanently store references to these upon creation, thus preventing collection. It would be nice if we could properly solve the problem (or at least establish that the problem is specific to bsd.math)

Depends on #11521

CC: @simon-king-jena @malb @vbraun @gagern @robertwb @sagetrac-ylchapuy @jpflori @burcin

Component: memleak

Author: Simon King

Branch/Commit: public/make_libsingular_multivariate_polynomial_rings_collectable @ b4df239

Reviewer: Travis Scrimshaw

Issue created by migration from https://trac.sagemath.org/ticket/13447

simon-king-jena commented 12 years ago

Reviewer: Simon King

simon-king-jena commented 12 years ago
comment:42

Hooray!

All tests passed!
Total time for all tests: 1128.9 seconds

And I did not notice any "ignored" errors.

I think I can give Nils' part of the patches a positive review, and suggest a cross-review.

nbruin commented 12 years ago

Attachment: trac_13447-refcount_cleanup.patch.gz

clean up final bits in refcounting of singular rings

nbruin commented 12 years ago

Description changed:

--- 
+++ 
@@ -9,5 +9,6 @@
 * [attachment: trac_13447-consolidated_refcount.patch](https://github.com/sagemath/sage-prod/files/10656328/trac_13447-consolidated_refcount.patch.gz)
 * [attachment: trac_13447-modulus_fix.patch](https://github.com/sagemath/sage-prod/files/10656329/trac_13447-modulus_fix.patch.gz)
 * [attachment: trac_13447-rely_on_singular_refcount.patch](https://github.com/sagemath/sage-prod/files/10656330/trac_13447-rely_on_singular_refcount.patch.gz)
+* [attachment: trac_13447-refcount_cleanup.patch](https://github.com/sagemath/sage-prod/files/10656331/trac_13447-refcount_cleanup.patch.gz)

 **Merge together with** #715, #11521
nbruin commented 12 years ago
comment:43

Patches look good to me, so a positive review from me for Simon's patches.

I've attached a further patch to clean up some more details. Doctests pass for me, but Simon should look at it and decide if he's happy with it. Please go ahead and amend as you see necessary. I also have no problem with the patches being unified (we have some very small patches here now)

Apply trac_13447-consolidated_refcount.patch trac_13447-modulus_fix.patch trac_13447-rely_on_singular_refcount.patch trac_13447-refcount_cleanup.patch

simon-king-jena commented 12 years ago
comment:44

Concerning the last patch: I think one should rather use (as it is "officially" recommended in the docs) singular_ring_reference and singular_ring_delete, and not

    if currRingHdl.data.uring!= currRing:  
        currRingHdl.data.uring.ref -= 1  
        if currRingHdl.data.uring.ref == 0:  
            rDelete(currRingHdl.data.uring)  
        currRingHdl.data.uring = currRing # ref counting?  
        currRingHdl.data.uring.ref += 1  

Apart from that, I agree that the ring wrappers can be binned.

A bit later today, I will produce a combined patch, and provided that all tests pass I will set it to positive review.

simon-king-jena commented 12 years ago
comment:45

For the record: Apparently it is impossible to use the "proper" singular_ring_delete function in the code snipped mentioned in comment:44. I guess the problem is that singular_ring_delete will set currRing=None in certain situations - hence, the next line currRingHdl.data.uring = singular_ring_reference(currRing) will fail with a segfault.

So, I'll leave this code snipped as it is.

simon-king-jena commented 12 years ago
comment:46

On the other hand: I'd really like to understand why it doesn't work with singular_ring_delete in this case.

I found that the segfault occurs inside singular_ring_delete(currRingHdl.data.uring) when attempting to change to currRingHdl.data.uring. So, apparently, currRingHdl.data.uring is invalid at that point. And this is hardly surprising, because it is created by the following lines, which look like a hack to me:

        if currRingHdl == NULL:
            import sys
            currRingHdl = enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 1)
            currRingHdl.data.uring.ref += 1
simon-king-jena commented 12 years ago
comment:47

Now I think I understand:

currRingHdl = enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 1) has the purpose to create currRingHdl (it is only called if currRingHdl==NULL), but the value assigned to currRingHdl.data.uring is invalid. In particular, the attempt to change to currRingHdl.data.uring results in a segfault. Hence, singular_ring_delete(currRingHdl.data.uring) would not work.

Solution: We create currRingHdl, but we immediately delete the invalid currRingHdl.data.uring and replace it by a new reference to currRing. If there is no currRing then I print a warning, but I hope that we will never see that warning...

simon-king-jena commented 12 years ago
comment:48

I have attached a combined patch. Nils, are you happy with what I did with currRingHdl in sage/libs/singular/function.pyx? If you are, and if the tests pass, then please set it to "positive review"

For the record: I have

$ hg qa
trac_715_combined.patch
trac_715_local_refcache.patch
trac_715_safer.patch
trac_715_specification.patch
trac_11521_homset_weakcache_combined.patch
trac_11521_callback.patch
13145.patch
trac_13447-sanitise_ring_refcount.patch

on top of sage-5.4.beta0 Edit and the tests both in sage/rings/polynomial and sage/libs/singular pass. Let's hope for the full test suite!

Apply trac_13447-sanitise_ring_refcount.patch

simon-king-jena commented 12 years ago

Description changed:

--- 
+++ 
@@ -4,11 +4,6 @@

 **Apply**

-#13145, #715, #11521, and then
-
-* [attachment: trac_13447-consolidated_refcount.patch](https://github.com/sagemath/sage-prod/files/10656328/trac_13447-consolidated_refcount.patch.gz)
-* [attachment: trac_13447-modulus_fix.patch](https://github.com/sagemath/sage-prod/files/10656329/trac_13447-modulus_fix.patch.gz)
-* [attachment: trac_13447-rely_on_singular_refcount.patch](https://github.com/sagemath/sage-prod/files/10656330/trac_13447-rely_on_singular_refcount.patch.gz)
-* [attachment: trac_13447-refcount_cleanup.patch](https://github.com/sagemath/sage-prod/files/10656331/trac_13447-refcount_cleanup.patch.gz)
+#13145, #715, #11521, and then [attachment:trac_13447-sanitise_ring_refcount.patch

 **Merge together with** #715, #11521
simon-king-jena commented 12 years ago
comment:49

make ptest succeeded on bsd.math! To be on the safe side, I am now testing make ptestlong, but if you (Nils) is happy with my changes to currRingHdl then please finish the review!

simon-king-jena commented 12 years ago
comment:50

And make ptestlong succeeded as well. Looks good, I think.

nbruin commented 12 years ago
comment:51

I'm not so sure that

    currRingHdl = enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 1) 
    currRingHdl.data.uring.ref += 1

is entirely safe. In 'libsing-test2.cc' (in the Singular package, the following similar code is used:

    // prepare the arguments
    // create a ring Q[x,y]
      // the variable names
      char **n=(char**)omAlloc(2*sizeof(char*));
      n[0]=(char*)"x";
      n[1]=(char*)"y";
      // create the ring
      ring R=rDefault(0,2,n);
      // n is not needed any more:
      omFreeSize(n,2*sizeof(char*));
    // make it the default ring, also for the interpeter
    idhdl newRingHdl=enterid("R" /* ring name*/,
                           0, /*nesting level, 0=global*/
                           RING_CMD,
                           &IDROOT,
                           FALSE);
    IDRING(newRingHdl)=R;
    rSetHdl(newRingHdl);

I'd assume IDRING(newRingHdl) is newRingHdl.data.uring, so the assignment above does not assume anything about that field being initialized. So who knows where newRingHdl.data.uring.ref points!

I get the impression that indeed, currRing is set to something valid nearly always, so your assumption that it is during SingularFunction.__init__ (that's earlier than calling!) might be OK. The singular code does contain an awful lot of currRing!=NULL checks though, so I'm not so sure that's really guaranteed.

OK, a little searching of the Singular source shows that enterid (which takes 6 arguments but we call it with 5 and so does the above example) apparently calls idrec::set, in our case with init=1 and the example above with init=0. The init=1 causes a call idrecDataInit and for a ring command that does omAlloc0Bin(sip_sring_bin). The resulting pointer gets indeed assigned to what I think is ...data.uring (although not in a very typerespecting way, but I guess that's par for C code).

So my guess is that by calling with init=1 we indeed get a block that we can delete. Since that's all we do with it I think the cleaner way for us would be

            currRingHdl = enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 0)
            currRingHdl.data.uring = NULL

(or whatever is the appropriate value for uring), as per the libsing example above)

    if currRingHdl.data.uring!= currRing: 
        if currRingHdl.data.uring != NULL:
            singular_ring_delete(currRingHdl.data.uring)
        currRingHdl.data.uring = currRing
        singular_ring_reference(currRing) 

(or indeed just leave init=1 in place so that currRingHdl.data.uring is never NULL. In any case, this happens at most once in any sage session, so the leaked memory (if any) would be minuscule.

Concerning refcounts: I found the following in dyn_modules/python/ring_wrap.h

#include <boost/intrusive_ptr.hpp>
using namespace boost;
//typedef intrusive_ptr<ip_sring> Ring;
// inline void intrusive_ptr_add_ref(ring r){
//     r->ref++;
// }
// inline void intrusive_ptr_release(ring r){
//     r->ref--;
//     if (r->ref<=0) rDelete(r);
// }

which is in step with our use of the ref field. But the code is commented out! There is such a thing as an intrusive_ptr in Boost and I think it needs exactly those two functions declared to function. If this is what is intended then we're OK. If this is commented out because now a different scheme is used, we may be in trouble.

EDIT: kernel/Number.h contains uncommented definitions:

using namespace boost;
inline void intrusive_ptr_add_ref(ring r){
    r->ref++;
    //Print("ref count after add: %d", r->ref);
}
inline void intrusive_ptr_release(ring r){
    if (r->ref<=0) rDelete(r);
    else {
    r->ref--;

    }
    //Print("ref count after release: %d", r->ref);
}

I find the ptr_release a little worrisome: Do the singular people believe that r->ref ==0 means that there is still a reference? Do they really want to count references "-1"-based? I guess one can count it as "additional" references. I'm afraid they might. From Singular/ipshell.cc:

void rKill(ring r)
{
  if ((r->ref<=0)&&(r->order!=NULL))
  {
...
    rDelete(r);
    return;
  }
  r->ref--;
}

void rKill(idhdl h)
{ 
  ring r = IDRING(h);
  int ref=0;
  if (r!=NULL)
  {
    ref=r->ref;
    rKill(r);
  }
  if (h==currRingHdl)
  { 
    if (ref<=0) { currRing=NULL; currRingHdl=NULL;}
    else
    {
      currRingHdl=rFindHdl(r,currRingHdl,NULL);
    }
  }
}

this code all indicates that special action is required only if ref<=0 'before' decreasing. Can you check with singular people if that's the correct usage of the ref field? If we use it in a different way I image nasty bugs could arise later (our use would prevent singular from ever deleting a ring we've had our hands on (if it would ever do that) and if we get a ring that singular initially constructed, we could delete it prematurely, due to an off-by-one.

In that case we should probably ensure that singular_ring_ref and singular_delete_ring are somehow aliased to boost::intrusive_ptr_add_ref and intrusive_ptr_release from kernel/Number.h.

nbruin commented 12 years ago
comment:52

libSingular experts: Does ring->ref == 0 mean there is exactly one reference to the ring still active?

simon-king-jena commented 12 years ago
comment:53

Replying to @nbruin:

I'm not so sure that

    currRingHdl = enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 1) 
    currRingHdl.data.uring.ref += 1

is entirely safe.

I am sure that it is not safe: One can't even call singular_ring_delete on it. That's why I removed it, unless currRing==NULL.

EDIT: kernel/Number.h contains uncommented definitions:

using namespace boost;
inline void intrusive_ptr_add_ref(ring r){
    r->ref++;
    //Print("ref count after add: %d", r->ref);
}
inline void intrusive_ptr_release(ring r){
    if (r->ref<=0) rDelete(r);
    else {
    r->ref--;

    }
    //Print("ref count after release: %d", r->ref);
}

I find the ptr_release a little worrisome: Do the singular people believe that r->ref ==0 means that there is still a reference?

According to the code snipped, the ring is deleted if r->ref==0. So, apparently r->ref==0 means that there is no reference.

this code all indicates that special action is required only if ref<=0 'before' decreasing.

If ref<=0 then no reference is left, hence, the ring is deleted. Otherwise, the reference counter is decremented. I really don't understand the problem.

In that case we should probably ensure that singular_ring_ref and singular_delete_ring are somehow aliased to boost::intrusive_ptr_add_ref and intrusive_ptr_release from kernel/Number.h.

Singular tests whether ref<=0; if it is, then the ring is deleted, but if it isn't then the counter is decremented.

We first decrement the counter, and then test whether ref<0; if it is, then the ring is deleted, but if it isn't then no further action is taken.

Isn't that logically the same?

nbruin commented 12 years ago
comment:54

Replying to @simon-king-jena:

According to the code snipped, the ring is deleted if r->ref==0. So, apparently r->ref==0 means that there is no reference.

Indeed, but this is tested before decrement. So for this to work, you'd need

   R=create_ring_with_appropriate_refcount
   // here R.ref should be 0
   //for the following to cause deletion.
   intrusive_ptr_release(R)

We initialize R.ref=1 in singular_ring_new.

If ref<=0 then no reference is left, hence, the ring is deleted. Otherwise, the reference counter is decremented. I really don't understand the problem.

If ref<=0 BEFORE the reference count is decreased. The difference is that intrusive_ptr_release happily leaves a ring in memory that AFTER the call has R.ref==0. Our singular_ring_delete will only abstain from calling rDelete(doomed) if after the call we have doomed.ref > 0.

    401     doomed.ref = doomed.ref-1 
    402     if doomed.ref > 0: 
    403         return 

The two behaviours are off-by-one.

Singular tests whether ref<=0; if it is, then the ring is deleted, but if it isn't then the counter is decremented.

We first decrement the counter, and then test whether ref<0; if it is, then the ring is deleted, but if it isn't then no further action is taken.

We don't. We test whether ref<=0 (or rather whether ref>0 to see if we should do an early exit).

As our doctests show, apparently we're never calling singular_ring_delete on rings we haven't initialized the ref field on ourselves and that singular expects to to survive after we do, since that would probably lead to an error (Singular would try to access a deallocated ring).

A mismatch in the other direction would be milder: Singular takes a reference on a ring we initialized the ref field on, we lose our last reference via a singular_ring_delete (no rDelete gets called because Singular correctly increased the refcount). However, Singular will never delete the ring because we originally initialized the ref field to 1 whereas Singular apparently expects it to be initialized to 0.

nbruin commented 12 years ago
comment:55

Replying to @simon-king-jena:

I am sure that it is not safe: One can't even call singular_ring_delete on it. That's why I removed it, unless currRing==NULL.

I think that's because singular_ring_delete does a changeCurrentRing on it. Calling rDelete seems to work just fine. I think that thanks to the init=1 the memory for the ring is allocated, but initialized to 0 thanks to the omAlloc0. My failure to locate a ref = 1 anywhere in Singular's code also makes me believe that they're happy giving back a ring where ref = 0.

I'm not so sure the rChangeCurrRing dance is necessary in singular_ring_delete. The code in intrusive_ptr_release doesn't need it. I suspect Volker put it in as an extra precaution while he was trying to debug other issues. A cursory reading of the code of rDelete doesn't seem to indicate its operation is affected in any way by whether the current ring is equal to the one being deleted.

simon-king-jena commented 12 years ago
comment:56

OK, I asked Hans Schönemann. I hope he'll answer soon.

If r->ref==0 really means in Singular that there is exactly one reference left, then we should act accordingly.

simon-king-jena commented 12 years ago

Changed work issues from Input from a libsingular expert to none

simon-king-jena commented 12 years ago
comment:57

Hans Schönemann gave me a couple of answers - many thanks! I hope I am translating and summarising correctly:

r->ref counts the number of interpreter variables referencing a ring minus one. Hence:

ring r=.....; // ref is 0
def rr=r;     // ref is 1
kill r;       // ref is 0
kill rr;      // ref is 0 when calling rKill -> delete it.

There are locally generated rings (in std, for example), unbeknownst to the interpreter - they have ref zero.

Singular uses r->ref only in rKill - it makes me wonder why we use rDelete and not rKill, by the way. In fact, rKill deletes local data, but rDelete doesn't.

Generally, one should have currRingHdl.ring == currRing, or currRing==NULL and currRingHdl==NULL. I just notice that he wrote currRingHdl.ring, not currRingHdl.data.uring - is that a difference?

Concerning debugging: If one builds Singular so that OM_NDEBUG is not defined, then a debug version of omalloc is used. In that way, we would have more easily detected the original problem with omStrDup.

Conclusions

I am not sure if you agree with my conclusions, but here we go:

Since r->ref does not play a rôle in libsingular, we are free to use r->ref to count the number of pointers to r (not "number of pointers minus one"). However, when calling singular interpreter functions, we must make sure that r->ref>0. With our patch, we already do so - hence, that's fine.

Since we apply singular_ring_delete to non-commutative (quotient) rings, and we do not call rKill but only rDelete, we currently have a memory leak for non-commutative rings. rKill will first test that r->ref==0, then kills local data, then kills the ring, and sets currRing and friends to NULL if the to-be-deleted ring is currRing. Hence, we should use rKill in singular_ring_delete, but probably without the rChangeCurrRing dance.

If currRing==NULL, we should not create currRingHdl. If currRing!=NULL then we should let uring point to it. the latter is already done in my patch, the former should be done.

So far I forgot to ask Hans about enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 1).

simon-king-jena commented 12 years ago

Work Issues: Replace rDelete by rKill. Only create currRingHdl if currRing is not null

simon-king-jena commented 12 years ago
comment:59

Replying to @simon-king-jena:

Since we apply singular_ring_delete to non-commutative (quotient) rings, and we do not call rKill but only rDelete, we currently have a memory leak for non-commutative rings.

Or perhaps we do not have a leak. Namely, the internal data are also referenced via python, and thus will be deleted when the ring is deleted. Well, let's see if rKill results in segfaults (due to the attempt to deallocate the internal data twice)...

simon-king-jena commented 12 years ago
comment:60

I have updated attachment: trac_13447-sanitise_ring_refcount.patch.

Changes:

  1. currRingHdl is only created if currRing!=NULL, and currRingHdl.data.uring is subsequently set to currRing.
  2. singular_ring_delete is now using rKill in lieu of rDelete, so that local data are guaranteed to be deallocated.

In my previous comment, I was a bit sceptical about the second point. However, all doctests pass on bsd.math, hence, using rKill does not result in a segfault. In other words, the local data are not double-freed, und thus rKill is better than rDelete.

Apply trac_13447-sanitise_ring_refcount.patch

simon-king-jena commented 12 years ago

Changed work issues from Replace rDelete by rKill. Only create currRingHdl if currRing is not null to none

simon-king-jena commented 12 years ago
comment:61

Here is one comment of Hans on the following lines of the patch:

          global currRingHdl
          if currRingHdl == NULL and currRing!=NULL:
              # Create an invalid mock ring - it would not be possible
              # to make that ring currRing!
              # The only aim is to create currRingHdl
              currRingHdl = enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 1)
              # Now we assign proper data to currRingHdl
              rDelete(currRingHdl.data.uring)
              currRingHdl.data.uring = singular_ring_reference(currRing)

He says that it is ok like that. However, it would be better to do

          global currRingHdl
          if currRingHdl == NULL and currRing!=NULL:
              # Create an invalid mock ring - it would not be possible
              # to make that ring currRing!
              # The only aim is to create currRingHdl
              currRingHdl = enterid("my_awesome_sage_ring", 0,  RING_CMD, &IDROOT, 0)
              # Now we assign proper data to currRingHdl
              currRingHdl.data.uring = singular_ring_reference(currRing)

Note the difference in the last argument of enterid.

I'll try it...

simon-king-jena commented 12 years ago
comment:62

I have updated the patch again. Tests pass on bsd.math.

The difference between enterid(...,0) and enterid(...,1) is that the former does not allocate memory for currRingHdl.data.uring, while the latter does. Hence, using enterid(..., 0), one avoids the useless call to rDelete.

Apply trac_13447-sanitise_ring_refcount.patch

nbruin commented 12 years ago
comment:63

I'm afraid you (at least theoretically) have recreated the possibility for our original error to occur. We essentially have:

SingularFunction.__init__:
    if currRingHdl == NULL and currRing!=NULL: 
            currRingHdl = <create ring handle>
            currRingHdl.data.uring = currRing

SingularFunction.__call__:
    if currRingHdl.data.uring!= currRing: 
        singular_ring_delete(currRingHdl.data.uring) 
        currRingHdl.data.uring = singular_ring_reference(currRing)

If we init with currRing == NULL we may be leaving currRingHdl == NULL. If a subsequent invocation of call happens after currRing has a new value and currRingHdl is not updated (if that would never happen we wouldn't need this code) we'd be calling

    if NULL.data.uring!= currRing: 
        singular_ring_delete(NULL.data.uring) 
        NULL.data.uring = singular_ring_reference(currRing)

which will segfault (OK, this error is much better than the original).

From what you describe, rKill can set currRingHdl=NULL as well, in which case we definitely get an error in the above code. Thus it seems to me it's theoretically possible that __call__ happens when currRingHdl=NULL and currRing!=NULL. In that case you should recreate a handle:

    if currRingHdl != NULL:
        if currRing != NULL:
            if currRingHdl.data.uring != currRing:
                if currRingHdl.data.uring != NULL:
                    singular_ring_delete(currRingHdl.data.uring)
                currRingHdl.data.uring = currRing #I don't think this should up refcounts
        else: #currRing == NULL and currRingHdl != NULL
            rKill(currRingHdl) #delete the handle as well. You could also save it for reuse
                               #I think this looks at the refcount
            currRingHdl = NULL            
    else:
        if currRing != NULL:
            currRingHdl = enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 0) #note the 0
            currRingHdl.data.uring = currRing
# NOW the following invariant holds:
    assert (currRingHdl == NULL and currRing == NUL) or currRingHdl.data.uring == currRing

I think that without extra assumptions you must be prepared to deal with all those conditions. Also note that you might be creating a new ringhandle an arbitrary number of times, so knowing whether you leak becomes important. The sequence I use above is what i think the libsingular examples that are shipped with singular do.

You can already see here that the use of rKill (which does look at reference counts) forces us to follow Singular's conventions of what the ref field. At this point we're assuming that if currRingHdl holds the last reference to a ring, we should probably delete it.

I think the convention is that currRingHdl (if it's relevant) is pointing to the same ring as currRing, so they are treated as one reference: currRingHdl.data.uring = currRing does not increase a refcount. That obviously leads to problems when currRing diverges from currRingHdl.

Either you attach memory management ONLY to currRingHdl and trust that when currRing is changed, someone else will still have a reference to the original value, who can throw away the ring (so, currRing is only changed by straight assignments -- no increfs or decrefs. It's always a borrowed reference) or you always ensure that currRingHdl gets updated (or at least cleared) when currRing changes. My impression is that refcounting in singular is only an interpreter thing, so that the former is the model used by Singular itself.

Double free implies segfault: Have you verified this is true with omAlloc?

Refcounting base: I agree that as long as our created rings and rings created internally live separate lives, it doesn't matter what semantics we attach to the ref field. I just think it's dangerous to implicitly assume these rings do live separate lives. They do now, but who's to guarantee that they do in the future? And even then it may take a long time for a visible error to occur, which by then might be difficult to track down. I think we should adjust to the semantics Singular uses in other places (and document the slightly unusual refcounting semantics of Singular in our code!) to make the libsingular interface as robust as possible. You know how extensions/modifications are going to happen: people will just copy/paste what's already there.

nbruin commented 12 years ago
comment:64

Replying to @simon-king-jena:

There are locally generated rings (in std, for example), unbeknownst to the interpreter - they have ref zero.

Singular uses r->ref only in rKill - it makes me wonder why we use rDelete and not rKill, by the way. In fact, rKill deletes local data, but rDelete doesn't.

If we do that, we should ensure that rKill has an accurate view of the number of references. That means adjusting to Singular's conventions. If we rKill with our current code, we'd leak because we'd call with a positive refcount. I guess you could manually decref before calling rKill but that's silly and misleading to future maintainers of the code.

Generally, one should have currRingHdl.ring == currRing, or currRing==NULL and currRingHdl==NULL. I just notice that he wrote currRingHdl.ring, not currRingHdl.data.uring - is that a difference?

I don't think the former actually exists. A macro is used in the singular code base, so Hans would not usually spell it out in full.

Concerning debugging: If one builds Singular so that OM_NDEBUG is not defined, then a debug version of omalloc is used. In that way, we would have more easily detected the original problem with omStrDup.

Didn't work for me but I might have just failed to properly (de)activate those flags. Anyway, I doubt an access-after-free would be immediately detected. That really involves unmapping or protecting memory in order to force a segfault. I didn't see evidence that omAlloc reaches that deep into the system. It may be that omAlloc in debugging mode does consistency checks on every use. In that case we might have found the corruption a little earlier. I doubt we would have had gdb point exactly at the offending instruction. I think it would be nice for singular to update the omAlloc headers with my changes so that (at least on linux and OSX) they can build it with the malloc underneath. I found it very useful for this particular issue.

Conclusions Since r->ref does not play a rôle in libsingular, we are free to use r->ref to count the number of pointers to r (not "number of pointers minus one"). However, when calling singular interpreter functions, we must make sure that r->ref>0. With our patch, we already do so - hence, that's fine.

In principle yes, but not if we use rKill and I think it's a bad idea in general because it might cause unforeseen problems in the future. Better to stick with one interpretation of what the ref field means.

If currRing==NULL, we should not create currRingHdl. If currRing!=NULL then we should let uring point to it. the latter is already done in my patch, the former should be done.

As you indicate, currRingHdl can be NULLed again by rKill. We need to correct that if we need currRingHdl later.

In fact, without extra infrastructure to clean up after it, it seems that rKill leaks ringhandles. Is it the case that ring handles are registered in some global data structure that allows them to be deleted? This loop might be the candidate:

    while (r->idroot!=NULL)
    {
      killhdl2(r->idroot,&(r->idroot),r);
    }

that might do this. In any case, rKill just does a straight currRingHdl=NULL if (r==currRing), so currRingHdl is definitely gone.

Is it really necessary to use rKill? It seems it is more part of the interpreter layer and that the libsingular interface itself is largely one level lower. Have we initialized enough of the interpreter to let rKill function as advertised?

It is by now clear that up to now, the Sage-libsingular interface was largely based on empirically verified guesses. Now that Simon has good contact with Hans, perhaps we can ensure the interface follows official protocol?

simon-king-jena commented 12 years ago
comment:65

Replying to @nbruin:

Replying to @simon-king-jena:

Singular uses r->ref only in rKill - it makes me wonder why we use rDelete and not rKill, by the way. In fact, rKill deletes local data, but rDelete doesn't.

If we do that, we should ensure that rKill has an accurate view of the number of references. That means adjusting to Singular's conventions. If we rKill with our current code, we'd leak because we'd call with a positive refcount.

No, we don't. With the current patch, rKill is only called if ref==0.

Conclusions Since r->ref does not play a rôle in libsingular, we are free to use r->ref to count the number of pointers to r (not "number of pointers minus one"). However, when calling singular interpreter functions, we must make sure that r->ref>0. With our patch, we already do so - hence, that's fine.

In principle yes, but not if we use rKill

I disagree, since we only call rKill with r->ref==0.

and I think it's a bad idea in general because it might cause unforeseen problems in the future. Better to stick with one interpretation of what the ref field means.

In a way, we do. Namely, one could argue that a Sage polynomial ring MPolynomialRing_libsingular is an interpreter variable (roughly, it is an object that does not live in the Singular kernel but points to a ring in the Singular kernel). Hence, when creating a polynomial ring, it agrees with Singular's conventions to increase the refcount.

As you indicate, currRingHdl can be NULLed again by rKill. We need to correct that if we need currRingHdl later.

currRingHdl is only NULLed if currRing is deleted. So, that should be fine.

In any case, rKill just does a straight currRingHdl=NULL if (r==currRing), so currRingHdl is definitely gone.

You mean, it doesn't free it before assigning NULL? Yes, that would seem to be a leak.

Is it really necessary to use rKill? It seems it is more part of the interpreter layer and that the libsingular interface itself is largely one level lower.

I'll try rKill versus rDelete in singular_ring_delete and will try to see if there is a leak when creating and deleting super commutative rings (SCA).

simon-king-jena commented 12 years ago
comment:66

Too bad. Apperently an SCA can not be garbage collected, in spite of all the weak caches from #715 and #11521. So, singular_ring_delete will never be called on an SCA.

simon-king-jena commented 12 years ago
comment:67

With #12313 and its dependencies, SCA can be collected, but its underlying ring is not deleted. Wrong refcount, I guess.

simon-king-jena commented 12 years ago
comment:68

The wrong refcount appears to be in sage/libs/singular/function.pyx, where in the current patch we have

            # We need to incref the to-be-converted data,
            # since apparently removing to_convert.data would
            # decref it.
            return new_RingWrap( singular_ring_reference(<ring*> to_convert.data) )

There was some example where the refcount was wrong without calling singular_ring_reference, but for SCA it is wrong with that call.

So, that is something that we need to understand, and needs work.

simon-king-jena commented 12 years ago

Work Issues: Understand why sometimes new_RingWrap needs an incref and sometimes not

simon-king-jena commented 12 years ago
comment:69

Some suggestions:

The singular function rKill seems to do what we want in singular_ring_delete. Hence, to keep it simple, singular_ring_delete should call rKill, and nothing more (unless we want a sanity test, such as printing a warning if singular_ring_delete is called on NULL).

rKill only takes action if r->ref is zero when it is called. Indeed, according to Hans, r->ref is the number of interpreter variables pointing to a ring, minus one. In order to avoid confusion, we should follow the same idea.

So, what means "interpreter variables" for us? I think this should be "any Sage object whose deallocation needs the libsingular ring being alive". This is every polynomial ring, every Gröbner strategy, and any element of a polynomial ring.

Since ref is -1 based, when a Sage ring R is created then ref shall be zero, so that the underlying libsingular ring can be properly deleted if R is deallocated. And if any element or Gröbner strategy on that ring is created, ref needs to be incremented. The deallocation of an element or Gröbner strategy must involve a call to singular_ring_delete.

A complication arises when calling a singular_function. When we call a Singular interpreter function, then we should have ref>0, again according to Hans. Hence, at the begin of a call, ref must be incremented, and decremented when leaving it.

Special care needs to be taken on currRingHdl. According to Hans, rKill can make it NULL, and it should be NULL if and only if currRing is NULL. Currently, we create currRingHdl when we create a Singular function, but we only use it when we call a Singular function. That's asking for trouble. Hence, we should move the creation of currRingHdl to SingularFunction.__call__, where we also set the correct currRing. SingularFunction.__init__ is not the right place.

simon-king-jena commented 12 years ago
comment:70

I think one potential problem with our current use of currRingHdl in combination with rKill is that we would redefine my_awesome_sage_ring, hence, we would see the following message on stderr:

// ** redefining my_awesome_sage_ring **

I'll ask Hans how one can create currRingHdl without creating a mock ring.

simon-king-jena commented 12 years ago
comment:71

Here is one complication for singular_function: If you do singular_function("std"), then a SingularKernelFunction is created. In its __init__, it creates a call handler, and for that purpose it needs to find out the arity of the Singular kernel function.

But apparently the arity of std can not be correctly determined, if currRing==0: One obtains arity 0, not 1, for std.

Therefore, when I move the creation of currRingHdl from SingularFunction.__init__ to where the function is called, an error occurs.

A question arises: Is there a singular function that changes its arity depending on currRing? If it is the case, then we must determine SingularKernelFunction.call_handler each time the function is called, rather than during initialisation.

nbruin commented 12 years ago
comment:72

Replying to @simon-king-jena:

I think one potential problem with our current use of currRingHdl in combination with rKill is that we would redefine my_awesome_sage_ring, hence, we would see the following message on stderr:

// ** redefining my_awesome_sage_ring **

I suspect the records do get cleared, because otherwise Singular would leave a dangling pointer around and one would have what is essentially a singular ring variable with NULL pointer to a ring or a pointer to unallocated memory . If those variables do get properly removed, by the time we have to redefine the thing there should be no trace of the original and hence it would not be a redefinition. If the skeleton somehow does remain (with a nulled out data.uring field then, I suppose?), we could keep it around:

    saved_skeleton=enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 0)
    ....
    if currRingHdl == NULL:
        currRingHdl = saved_skeleton
        currRingHdl.data.uring = singular_ring_reference(currRing)

Incidentally, doesn't currRingHdl count as a ring reference? The code in SingularFunction that started this whole thing seems to indicate this. In that case the only singular_ring_delete(R) that can trigger nulling of currRingHdl and currRing is when the explicit intention is singular_ring_delete(currRingHdl.data.uring). When decreffing any other pointer that just happens to be pointing to the same ring structure, there would still be a refcount left, so no actual deletion would take place.

Note that after singular_ring_delete(currRingHdl.data.uring) we should absolutely NULL currRingHdl.data.uring (probably safer/better to do currRingHdl=NULL) even if deallocation didn't occur, because we should remove the reference. Therefore, I find it surprising that rKill does do a currRingHdl=NULL conditionally internally. With correct usage, rKill should never execute that code usefully, because it needs to happen unconditionally outside of rKill if it ever happens. Is it there as a safety net against sloppy programming?

Note that doing singular_ring_delete(currRing) is nearly always wrong, because currRing is a borrowed reference as far as I understand.

Are we the only ones ever setting currRingHdl or can one write/call singular functions that internally will change currRingHdl? If the former holds, the code becomes a little easier to reason about and then we can make our own provisions and ensure they are followed properly (such as saving a skeleton).

In the latter case we really have to play by Singular's rules, because Singular might be doing the deallocating outside of our control.

simon-king-jena commented 12 years ago
comment:73

Replying to @nbruin:

Replying to @simon-king-jena:

I think one potential problem with our current use of currRingHdl in combination with rKill is that we would redefine my_awesome_sage_ring, hence, we would see the following message on stderr:

// ** redefining my_awesome_sage_ring **

I suspect the records do get cleared, because otherwise Singular would leave a dangling pointer around and one would have what is essentially a singular ring variable with NULL pointer to a ring or a pointer to unallocated memory . If those variables do get properly removed, by the time we have to redefine the thing there should be no trace of the original and hence it would not be a redefinition.

I am afraid Singular knows that the mock ring has been defined. OK, that's relative to #12313. But it happened when I did

sage: from sage.rings.polynomial.plural import SCA
sage: E = SCA(QQ, ['x', 'y', 'z'], [0, 1], order = 'degrevlex')
sage: import gc
sage: del E
sage: _ = gc.collect()
sage: E = SCA(QQ, ['x', 'y', 'z'], [0, 1], order = 'degrevlex')

12313 is needed to make E collectable.

Hans didn't answer, but I found out myself: currRingHdl=rFindHdl(currRing,NULL,NULL), which could still be NULL. If it is NULL, then we may create the mock ring.

Are we the only ones ever setting currRingHdl or can one write/call singular functions that internally will change currRingHdl?

No idea.

simon-king-jena commented 12 years ago

Combined patch

simon-king-jena commented 12 years ago
comment:74

Attachment: trac_13447-sanitise_ring_refcount.patch.gz

I have attached a new version of the patch. Here are a few comments:

For easier debugging, I created methods returning the refcount for RingWrap and for polynomial rings, and for g-algebras.

__Concerning singular_function__

currRingHdl is now taken care of when the function is called, and not when it is created. In that way, one prevents nastinesses like currRingHdl being NULL because of a previous deallocation. When creating it currRing is guaranteed to exist. I call rFindHdl(currRing,NULL,NULL), which may find a previously defined thingy. If it can not be found, a mock ring is created by

currRingHdl = enterid("my_awesome_sage_ring", 0, RING_CMD, &IDROOT, 0)

Since the last argument is 0, we can then do

currRingHdl.data.uring = currRing

without the need to call rDelete(currRingHdl.data.uring). Problem: See below.

Since currRing is now not necessarily available when initialising SingularKernelFunction, it is impossible to determine its arity during initialisation, but fortunately it is still possible to determine whether the function exists in the Singular kernel. Therefore, during __init__, I merely test for existence, while the creation of self.call_handler is postponed to a point were currRing is guaranteed to be around.

Since Hans said that we need to make sure that currRing->ref is positive when calling an interpreter function, I temporarily increment the counter during the call.

Concerning ref counting

I think we should simply rely on Singular's rKill for deletion of rings, so that internal data are taken care of.

currRingHdl.data.uring and currRing do not constitute a reference in the sense of "interpreter variable". Therefor I am not incrementing or decrementing the refcount on them (unless I missed something).

Because of ref counting based at -1, a naked MPolynomialRing_libsingular and NCPolynomialRing_plural should have self._ring.ref == 0, so that self._ring can be properly deleted. All other ring-related data, such as a Gröbner strategy or elements of the ring, constitute a new reference.

The same should hold for a RingWrap: If one has a naked RingWrap then the underlying ring should be deletable when deleting RingWrap. A complication arises if both a RingWrap and a polynomial ring refer to the same underlying libsingular ring.

This occurs in NCPolynomialRing_plural, which is created from a ring wrap. I do

        cdef RingWrap rw = ncalgebra(self._c, self._d, ring = P)

        #       rw._output()
        self._ring = singular_ring_reference(rw._ring)

hence incref, because at the end of the initialisation of self, rw will be deleted, so that the refcount for self._ring will then be fine. The same holds for new_CRing and new_NRing in sage.rings.polynomial.plural.

Some experiments with the new patch

Worst of all: I did not run the tests yet.

Here is a test whether there is a memleak when creating a super-commutative algebra repeatedly. Note that one needs

$ hg qa
trac_715_combined.patch
trac_715_local_refcache.patch
trac_715_safer.patch
trac_715_specification.patch
trac_11521_homset_weakcache_combined.patch
trac_11521_callback.patch
13145.patch
trac12215_weak_cached_function-sk.patch
trac12215_segfault_fixes.patch
trac_12313-mono_dict-combined-random-sk.patch
trac_12313_quit_sage.patch
trac13370_deprecate_is_field.patch
trac_13447-sanitise_ring_refcount.patch

because otherwise super commutative algebras could not be garbage collected.

sage: from sage.rings.polynomial.plural import SCA
sage: import gc
sage: while 1:
....:     E = SCA(QQ, ['x', 'y', 'z'], [0, 1], order = 'degrevlex')
....:     print get_memory_usage(),id(E)
....:     del E
....:     _ = gc.collect()
....:     
223.98828125 4520110400
223.98828125 4520102992
223.98828125 4520145584
223.98828125 4520100800
223.98828125 4520190656
223.98828125 4520116128
223.98828125 4516130448
223.98828125 4520183424
223.98828125 4338291168
223.98828125 4520148816
223.98828125 4520173408
223.98828125 4510376384
223.98828125 4520106672
...

Conclusion: There is no leak, even though it really is a new ring each time.

However, not all is good. Namely, we interrupt the computation above, and continue as follows:

sage: from sage.libs.singular.function import singular_function
sage: P.<x,y,z> = PolynomialRing(QQ)
sage: ringlist = singular_function("ringlist")
sage: l = ringlist(P)
// ** redefining my_awesome_sage_ring **

So, here you see the problem with creating the mock ring. I can only hope that Hans will tell us how one can create a currRingHdl for a given currRing, rather than only finding a previously created currRingHdl.

Because of that last problem, I think it doesn't make much sense to test. However, comments are welcome.

simon-king-jena commented 12 years ago
comment:75

Unfortunately one gets a segfault in the tests of sage/rings/polynomial/multi_polynomial_ideal.py, apparently related with a singular_function call (namely in the method variety).

Too bad. I really thought that the new approach towards refcounting is safe.

nbruin commented 12 years ago
comment:76

However, not all is good. Namely, we interrupt the computation above, and continue as follows:

sage: from sage.libs.singular.function import singular_function
sage: P.<x,y,z> = PolynomialRing(QQ)
sage: ringlist = singular_function("ringlist")
sage: l = ringlist(P)
// ** redefining my_awesome_sage_ring **

I think what your code does, is look up if there is a Singular variable that points to currRing. If so, you use that one and you let the Singular variable my_awesome_sage_ring dangling in the system. If you don't find a Singular variable pointing to currRing then you (re)create the variable my_awesome_ring to refer to currRing. But it could be that that variable is alive and well, happily pointing at some other ring (that once was currRing). I think the system probably has good reason to complain.

What you should probably do is create my_awesome_sage_ring, keep a pointer to it, and refcount rings that you let it point to. Now you're simply holding a singular variable reference to a ring. When you need currRingHdl to point somewhere, you can do that via my_awesome_sage_ring. If you need to NULL currRingHdl, you can do so without consequence. You still keep your own pointer to my_awesome_sage_ring.

It means that the ring pointed to by my_awesome_sage_ring is protected against garbage collection, but that's correct: We're holding a reference to it! Such rings will become eligible for deletion once my_awesome_sage_ring is made to point elsewhere. If you don't like rings that were currRing to survive until currRingHdl is actually needed to point elsewhere, you could NULL the data.uring field of my_awesome_sage_ring as soon as you get a chance, but I wouldn't bother.

I haven't traced through your precise usage, but grabbing and dropping Singular variables without adapting reference counts sounds fishy to me and liable to lead to segfault.

I'm pretty sure currRingHdl is just a Singular Interpreter level entry to currRing and probably in the singular interpreter, the current ring is always only a ring that actually has a name in the interpreter, so currRingHdl is simply a borrowed reference to that variable. That's why currRingHdl manipulations in singular do not involve refcounting (I haven't checked, but it seems to me you claim they don't): The rings themselves are refcounted, but that has already happened when the handle struct that currRingHdl is pointing at was instantiated. That struct is surving independent of currRingHdl because it represents a variable binding in the singular interpreter and therefore is probably indexed in some tree, linked list or hash table (i.e., the index that rFindHdl goes digging in).

simon-king-jena commented 12 years ago
comment:77

I guess the recreation of my_awesome_sage_ring can be avoided, because Singular certainly has methods to return a previously defined ring when you know the name of that ring. So, no need that Sage keeps a pointer to it, because Singular already does.

The segfault in sage/rings/polynomial/multi_polynomial_ideal.pyx can be avoided by setting ref=1 rather than ref=0 in the function singular_ring_new. However, I wouldn't call it a "fix", because I think this would mean the ring could never be rKilled.

So, I suspect that the reference count goes wrong in a different location.

simon-king-jena commented 12 years ago
comment:78

Replying to @simon-king-jena:

I guess the recreation of my_awesome_sage_ring can be avoided, because Singular certainly has methods to return a previously defined ring when you know the name of that ring.

It is ggetid.

simon-king-jena commented 12 years ago
comment:79

Replying to @simon-king-jena:

It is ggetid.

Using it does solve the problem with the warning message in comment:74.

In addition, it makes the segfault in sage -t devel/sage/sage/rings/polynomial/multi_polynomial_ideal.pyx disappear - which is a bad news, because instead the test hangs during some call to a singular_function...

simon-king-jena commented 12 years ago

Attachment: trac_13447-attempted_improvement.patch.gz

Experiments towards fixing some problems

simon-king-jena commented 12 years ago
comment:80

I have posted a new patch attachment: trac_13447-attempted_improvement.patch. According to Hans, it may happen that Singular functions return NULL, which means that there has been an error. Therefore, I suggest to actually raise a RuntimeError if the return value is NULL. Apart from that, I postpone another ggetid call to a location where currRing is guaranteed to exist; however, according to Hans, that call to ggetid (namely in the case of a library functions) should be fine also without currRing.

Here is the problem:

sage -t --verbose devel/sage/sage/rings/polynomial/multi_polynomial_ideal.py
...
Trying:
    R = QQ['a, b']; (a, b,) = R._first_ngens(2); I = R.ideal(a**Integer(2)+b**Integer(2)-Integer(1))###line 3581:_sage_    >>> R.<a,b> = QQ[]; I = R.ideal(a^2+b^2-1)
Expecting nothing
ok
Trying:
    Q = QuotientRing(R,I); K = Frac(Q)###line 3582:_sage_    >>> Q = QuotientRing(R,I); K = Frac(Q)
Expecting nothing
// ** char_series returns 0 x 0 matrix from 2 input polys (0)
I[1,1]=b2+a2-1*** *** Error: TIMED OUT! PROCESS KILLED! *** ***

         [360.3 s]

I have absolutely no idea why that happens.

The error occurs in the Singular library function primdecSY. I did check that currRing is definitely not NULL when calling the function, currRingHdl.data.uring==currRing, and there are plenty of references to currRing. Moreover, the error does not occur if one copies the whole example (which is from the groebner_basis method) into an interactive session.

primdecSY hangs, there is no return value (not even NULL).

Since there is no segfault, I guess gdb would not help here. Since currRing is referenced several times, I guess that it has not been double-freed - but who knows? Can it be that a ring was double-freed, then a new ring was created in the same location, and then an error occurs when working with that new ring?

For now:

Apply trac_13447-sanitise_ring_refcount.patch trac_13447-attempted_improvement.patch

simon-king-jena commented 12 years ago
comment:81

By the way, has anyone preserved an old version of attachment: trac_13447-sanitise_ring_refcount.patch? It used to work well, although Nils had some objections.

Here is the result of the test suite:

        sage -t  -force_lib devel/sage/sage/rings/polynomial/multi_polynomial_ideal.py # Time out
        sage -t  -force_lib devel/sage/doc/de/tutorial/tour_advanced.rst # Time out
        sage -t  -force_lib devel/sage/doc/en/tutorial/tour_advanced.rst # Time out
        sage -t  -force_lib devel/sage/doc/fr/tutorial/tour_advanced.rst # Time out
        sage -t  -force_lib devel/sage/doc/ru/tutorial/tour_advanced.rst # Time out
        sage -t  -force_lib devel/sage/sage/libs/singular/function.pyx # 5 doctests failed
        sage -t  -force_lib devel/sage/sage/schemes/generic/algebraic_scheme.py # Time out

Looks like a lot of work. Fortunately, the errors in function.pyx seem to be a bit more "expressive" than the timeouts.

simon-king-jena commented 12 years ago
comment:82

I found something fishy with the refcount during sage.libs.singular.function.call_function.

I changed the code as follows:


    # In the Singular interpreter, we must ensure that currRing->ref > 0.
    si_ring.ref += 1  
    cdef int orig_ref = si_ring.ref   # CHANGE: store the reference count
    try:
        with opt_ctx: # we are preserving the global options state here
            if signal_handler:
                sig_on()
                _res = self.call_handler.handle_call(argument_list, si_ring)
                sig_off()
            else:
                _res = self.call_handler.handle_call(argument_list, si_ring)

        if myynest:
            myynest = 0

        if currentVoice:
            currentVoice = NULL

        if errorreported:
            errorreported = 0
            raise RuntimeError("Error in Singular function call '%s':\n %s"%
                (self._name, "\n ".join(error_messages)))
        if si_ring.ref!=orig_ref:  # CHANGE: Test if Singular has changed the refcount
            ff = file("/Users/SimonKing/blubber","a")
            ff.write( "Here is something fishy!\n")
            ff.close()

        res = argument_list.to_python(_res)
        ...

When running the failing sage -t devel/sage/sage/rings/polynomial/multi_polynomial_ideal.py, the refcount changes exactly once before the test hangs.

It is definitely not supposed to happen that Singular changes the refcount of currRing. Changes in the ref counter should only happen in the next line, when the results are converted to python (any polynomial will increase the refcount by one).

I don't know yet what function is causing the trouble - but I hope I'll find out soon...