sagemath / sage

Make libsingular multivariate polynomial rings collectable #13447

Open nbruin opened 12 years ago

nbruin commented 12 years ago

Presently, #715 + #11521 help to avoid permanently keeping parents in memory. In the process we uncovered a problem with the collection of MPolynomialRing_libsingular that is hard to study but consistently triggerable. We have only observed the problem on bsd.math.washington.edu, MacOSX 10.6 on x86_64.

The present work-around is to permanently store references to these upon creation, thus preventing collection. It would be nice if we could properly solve the problem (or at least establish that the problem is specific to bsd.math)

Depends on #11521

CC: @simon-king-jena @malb @vbraun @gagern @robertwb @sagetrac-ylchapuy @jpflori @burcin

Component: memleak

Author: Simon King

Branch/Commit: public/make_libsingular_multivariate_polynomial_rings_collectable @ b4df239

Reviewer: Travis Scrimshaw

Issue created by migration from https://trac.sagemath.org/ticket/13447

nbruin commented 12 years ago
comment:1

On 5.4-beta0 + #715 + #11521, there is a doctest failure on bsd.math.washington.edu, an x86_64 machine running MacOSX 10.6:

bash-3.2$ ../../sage -t sage/misc/cachefunc.pyx
sage -t  "devel/sage-main/sage/misc/cachefunc.pyx"          
The doctested process was killed by signal 11
         [12.7 s]

----------------------------------------------------------------------
The following tests failed:

        sage -t  "devel/sage-main/sage/misc/cachefunc.pyx" # Killed/crashed
Total time for all tests: 12.7 seconds

The segmentation fault happens reliably, but is hard to study because

The segfault happens in the doctest for CachedMethodCaller._instance_call (line 1038 in the sage source; example_27 in the file ~/.sage/tmp/cachefunc_*.py left after doctesting), in the line

            sage: P.<a,b,c,d> = QQ[]

Further instrumentation showed that the segfault happens in sage/libs/singular/ring.pyx, in singular_ring_new, in the part that copies the strings over.

+    sys.stderr.write("before _names allocation\n")
     _names = <char**>omAlloc0(sizeof(char*)*(len(names)))
+    sys.stderr.write("after _names allocation\n")

     for i from 0 <= i < n:
         _name = names[i]
+        sys.stderr.write("calling omStrDup for i=%s with name=%s\n"%(i,names[i]))
         _names[i] = omStrDup(_name)
+        sys.stderr.write("after omStrDup\n")

The call _omStrDup segfaults for i=1. Unwinding the _omStrDup call:

     for i from 0 <= i < n:
         _name = names[i]
+        sys.stderr.write("calling omStrDup for i=%s with name=%s\n"%(i,names[i]))
-        _names[i] = omStrDup(_name)
+        j = 0
+        while <bint> _name[j]:
+            j+=1
+        j+=1     #increment to include the 0
+        sys.stderr.write("string length (including 0) seems to be %s\n"%j)
+        copiedname =  <char*>omAlloc(sizeof(char)*(j+perturb))
+        sys.stderr.write("Done reserving memory buffer; got address %x\n"%(<long>copiedname))
+        for 0 <= offset < j:
+            sys.stderr.write("copying character nr %s\n"%offset)
+            copiedname[offset] = _name[offset]
+        _names[i] = copiedname
+        sys.stderr.write("after omStrDup\n")

shows that it's actually the omAlloc call segfaulting. For perturb=7 or higher, the segfault does not happen. For lower values of perturb it does. Given that the omAlloc addresses returned on earlier calls do not seem close to a page boundary, the only way omAlloc can fail is basically by a corrupted freelist in an 8-byte bin. Likely culprits:

Note the <char*> to <long> cast in the print statement. With an <int>, the compiler complains about loss of precision, but not with <long>. I haven't checked whether <long> is really 64 bits on this machine, though.

I have tried and the problem seems to persist with the old singular (5.4b0 has a recently upgraded singular).

It would help a lot if someone could build singular to use plain malloc throughout and then use valgrind or a similar tool, which should be able to immediately catch a double free or out-of-bounds error. If the root of the problem is not OSX-specific, this would even show up on other architectures.
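
The corrupted-freelist hypothesis can be illustrated with a toy model. The sketch below is plain Python, not omalloc code: `BinAllocator` and its fields are hypothetical names, and the only assumption is that a bin keeps its free list inside the free blocks themselves. It shows why a write to an already freed block can go unnoticed at first and only blow up a couple of allocations later.

```python
class BinAllocator:
    """Toy fixed-size bin (hypothetical model, not omalloc itself).

    While a block is free, its slot stores the index of the next free block,
    so the free list lives inside the free blocks.
    """

    END = -1  # sentinel marking the end of the free list

    def __init__(self, nblocks):
        self.memory = [i + 1 for i in range(nblocks)]
        self.memory[-1] = self.END
        self.head = 0

    def alloc(self):
        block = self.head
        if block == self.END:
            raise MemoryError("bin exhausted")
        if not 0 <= block < len(self.memory):
            # A real allocator would dereference a wild pointer here.
            raise RuntimeError("segfault: freelist head is %r" % (block,))
        self.head = self.memory[block]
        return block

    def free(self, block):
        self.memory[block] = self.head
        self.head = block


bin8 = BinAllocator(4)
a = bin8.alloc()
bin8.free(a)
bin8.memory[a] = 12345        # write-after-free overwrites the "next" link
first = bin8.alloc()          # still succeeds and hands out the same block
try:
    bin8.alloc()              # follows the corrupted link
    crashed = False
except RuntimeError:
    crashed = True
```

In this model the failure surfaces in an allocation that has nothing to do with the faulty write, which is consistent with the crash showing up in an unrelated omAlloc call.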

See also #715, comment:295 and below for some more details on how the diagnosis above was obtained.

nbruin commented 12 years ago
comment:3

OK, I did a little experiment and tried to build singular with plain malloc rather than omalloc. In principle, omalloc has an --with-emulate... flag, but the API offered in that mode is woefully incomplete for singular use. I tried to extend it a little. Very rough result:

http://sage.math.washington.edu/home/nbruin/singular-3-1-5.malloc.spkg

One problem is supplying a memdup, which needs to know the size of an allocated block from its pointer. On BSD, you can use malloc_size for that. On linux one could use malloc_usable_size. The rest is a whole swath of routines that need to be supplied.
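
As a rough illustration of the size lookup mentioned above, here is a ctypes sketch. It assumes a Unix-like system where `ctypes.CDLL(None)` exposes the C library, and it picks `malloc_size` on OSX and `malloc_usable_size` elsewhere. The Python `memdup` below is a hypothetical stand-in for the idea, not the routine from the spkg.

```python
import ctypes
import sys

libc = ctypes.CDLL(None)  # symbols of the C library already loaded in this process
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.restype = None
libc.free.argtypes = [ctypes.c_void_p]

# Assumption: malloc_size on OSX/BSD, malloc_usable_size on Linux.
size_fn_name = "malloc_size" if sys.platform == "darwin" else "malloc_usable_size"
block_size = getattr(libc, size_fn_name)
block_size.restype = ctypes.c_size_t
block_size.argtypes = [ctypes.c_void_p]


def memdup(ptr):
    """Duplicate a malloc'ed block knowing only its address (hypothetical helper)."""
    n = block_size(ptr)          # may be larger than the size originally requested
    copy = libc.malloc(n)
    if not copy:
        raise MemoryError("malloc failed")
    ctypes.memmove(copy, ptr, n)
    return copy, n


src = libc.malloc(5)
ctypes.memmove(src, b"abcd\0", 5)
dup, dup_size = memdup(src)
dup_text = ctypes.string_at(dup)  # reads up to the terminating 0 byte
libc.free(src)
libc.free(dup)
```

Both size functions report the usable size of the block, which can exceed the requested size, so a copy made this way may carry some padding bytes along.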

The package above is very dirty, but on bsd.math it did provide me with an apparently working libsingular. The singular executable produced didn't seem usable, so keep your old one!

The doctest passes! Not exactly what we were hoping for, but it does make a double-free unlikely. That would have been detected. Corruption after freeing could still be possible, since malloc allocates way bigger blocks, so freelist data is likely missed.

There is of course also the possibility of some routine writing out of bounds, which is less likely to trigger problems with malloc too.

Singular people might be interested in incorporating the changes to omalloc (and preferably extend them a little bit) so that --with-emulate... becomes a viable option for Singular debugging. Then you can valgrind singular code.

simon-king-jena commented 12 years ago

Upstream: Reported upstream. No feedback yet.

simon-king-jena commented 12 years ago
comment:5

I have contacted Hans Schönemann.

simon-king-jena commented 12 years ago
comment:6

I think Martin should be Cc for this as well.

I am not sure if changing to malloc is an acceptable option for Singular. If I understand correctly, omalloc is very important for having a good performance in Gröbner basis computations.

nbruin commented 12 years ago
comment:7

Replying to @simon-king-jena:

I am not sure if changing to malloc is an acceptable option for Singular.

I am sure it is not acceptable for production, but being able to swap out omalloc for debugging can be very useful. That's why I tried. I understand that there are great tools to do memory management debugging and omalloc puts them all out of play because it hides all memory alloc/free activity.

It seems omalloc has its own tools but I wasn't able to get them working, I've seen indications that they don't work on 64 bits, and there's a good chance they're not as good as the general ones because they're for a smaller market.

I'm sure someone more familiar with the Singular and omalloc code bases can make a more informed decision on whether having the option of straight malloc memory management for debugging is worthwhile. My initial finding is that it quite likely can be done with relatively small modifications. I got it to more or less work in an evening, while being unfamiliar with the code.

nbruin commented 12 years ago
comment:8

At least the problem is a real one. I've found a similar iMac:

    Hardware Overview:

      Model Name: iMac
      Model Identifier: iMac10,1
      Processor Name: Intel Core 2 Duo
      Processor Speed: 3.06 GHz
      Number Of Processors: 1
      Total Number Of Cores: 2
      L2 Cache: 3 MB
      Memory: 4 GB
      Bus Speed: 1.07 GHz
      Boot ROM Version: IM101.00CC.B00
      SMC Version (system): 1.52f9

    System Software Overview:

      System Version: Mac OS X 10.6.8 (10K549)
      Kernel Version: Darwin 10.8.0
      64-bit Kernel and Extensions: No
      Time since boot: 5 days 8:46

and bsd.math.washington.edu:

    Hardware Overview:

      Model Name: Mac Pro
      Model Identifier: MacPro5,1
      Processor Name: Quad-Core Intel Xeon
      Processor Speed: 2.4 GHz
      Number Of Processors: 2
      Total Number Of Cores: 8
      L2 Cache (per core): 256 KB
      L3 Cache (per processor): 12 MB
      Memory: 32 GB
      Processor Interconnect Speed: 5.86 GT/s
      Boot ROM Version: MP51.007F.B03
      SMC Version (system): 1.39f11
      SMC Version (processor tray): 1.39f11

    System Software Overview:

      System Version: Mac OS X 10.6.8 (10K549)
      Kernel Version: Darwin 10.8.0
      64-bit Kernel and Extensions: Yes
      Time since boot: 16 days 6:14

Both these machines exhibit the same problem that on 5.4b0 + #715 + #11521, the doctest for cachefunc.pyx segfaults in the same spot. Note that the iMac claims to not have a 64-bit kernel. Sage is compiled to be 64-bit on that machine, though (and seems to work).

Have we actually established that this bug does not occur on newer OSX versions?

simon-king-jena commented 12 years ago
comment:9

Replying to @nbruin:

Have we actually established that this bug does not occur on newer OSX versions?

And have we actually established that this problem does not occur with older Singular versions? I am not totally sure, but I think the problem with #715+#11521 first emerged in sage-5.4.beta0, when Singular-3-1-5 was merged.

nbruin commented 12 years ago
comment:10

Replying to @simon-king-jena:

And have we actually established that this problem does not occur with older Singular versions?

Quoting from comment:1

I have tried and the problem seems to persist with the old singular (5.4b0 has a recently upgraded singular).

In the meantime, a bit of googling led me to OSX's "GuardMalloc". While sage+singular-malloc does not crash on the doctest, it does crash when run with

export DYLD_INSERT_LIBRARIES=/usr/lib/libgmalloc.dylib

Since gmalloc is a memory manager that places each allocation on its own page with protected/unmapped memory as close as possible around the block and that unmaps the block as soon as freed (I'm just parroting the manpage), a segfault is likely due to an access-after-free or access-out-of-bounds -- the one that would normally cause the corruption and then the segfault much later. (that's the whole idea of replacing omalloc -- I don't think it's doable to get omalloc to segfault on an access-after-free). This all comes at a significant speed penalty of course, so experiments are painful and I wouldn't even be able to interpret the backtrace/coredump if I got it (I'd hope that the gmalloc-induced segfault would be reproducible in gdb). It would really be useful if the test file would be pared down to an absolute minimum. That's basically just a backtracking search on which elements can be removed while still triggering a segfault.

However, I think this is a strong indication that there is a real memory violation at the base of this and that it is traceable.
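
The paring-down of the test file suggested above can be sketched as a greedy reduction loop. This is only an outline under assumptions: `minimize` and `still_fails` are hypothetical names, and in practice the predicate would write the remaining examples to a file, run it in a subprocess, and report whether the process was killed by signal 11. Here a stand-in predicate is used so the sketch runs on its own.

```python
def minimize(items, still_fails):
    """Greedily drop items for as long as still_fails(remaining) stays true."""
    current = list(items)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                current = candidate   # item i was not needed for the failure
                changed = True
                break
    return current


# Stand-in predicate (an assumption for illustration): pretend the crash
# needs exactly these two examples to be present.
def fake_crash(examples):
    return "example_4" in examples and "example_27" in examples


all_examples = ["example_%d" % i for i in range(30)]
reduced = minimize(all_examples, fake_crash)
```

With a fickle bug like this one, the real predicate may need to run each candidate several times before concluding that it no longer fails.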

simon-king-jena commented 12 years ago
comment:11

I tried to track the problem as follows:

diff --git a/sage/libs/singular/ring.pyx b/sage/libs/singular/ring.pyx
--- a/sage/libs/singular/ring.pyx
+++ b/sage/libs/singular/ring.pyx
@@ -16,6 +16,8 @@

 include "../../ext/stdsage.pxi"

+import sys
+
 from sage.libs.gmp.types cimport __mpz_struct
 from sage.libs.gmp.mpz cimport mpz_init_set_ui, mpz_init_set

@@ -495,6 +497,8 @@
     cdef object r = wrap_ring(existing_ring)
     refcount = ring_refcount_dict.pop(r)
     ring_refcount_dict[r] = refcount+1
+    sys.stderr.write("reference %d to %d, wrapper %d\n"%(refcount+1,<size_t>existing_ring, id(r)))
+    sys.stderr.flush()
     return existing_ring

@@ -536,6 +540,8 @@

     cdef ring_wrapper_Py r = wrap_ring(doomed)
     refcount = ring_refcount_dict.pop(r)
+    sys.stderr.write("dereference level %d of %d, wrapper %d\n"%(refcount-1,<size_t>doomed, id(r)))
+    sys.stderr.flush()
     if refcount > 1:
         ring_refcount_dict[r] = refcount-1
         return
diff --git a/sage/rings/polynomial/multi_polynomial_libsingular.pyx b/sage/rings/polynomial/multi_polynomial_libsingular.pyx
--- a/sage/rings/polynomial/multi_polynomial_libsingular.pyx
+++ b/sage/rings/polynomial/multi_polynomial_libsingular.pyx
@@ -151,6 +151,7 @@
     sage: b-j*c
     b - 1728*c
 """
+import sys

 # The Singular API is as follows:
 #
@@ -242,7 +243,7 @@

 import sage.libs.pari.gen
 import polynomial_element
-
+from sage.libs.singular.ring cimport wrap_ring
 cdef class MPolynomialRing_libsingular(MPolynomialRing_generic):

     def __cinit__(self):
@@ -364,6 +365,8 @@
         from sage.rings.polynomial.polynomial_element import PolynomialBaseringInjection
         base_inject = PolynomialBaseringInjection(base_ring, self)
         self.register_coercion(base_inject)
+        sys.stderr.write("At %d, creating %s\n"%(<size_t>self._ring, self))
+        sys.stderr.flush()

     def __dealloc__(self):
         r"""
@@ -390,6 +393,16 @@
             sage: _ = gc.collect()
         """
         if self._ring != NULL:  # the constructor did not raise an exception
+            from sage.libs.singular.ring import ring_refcount_dict
+            try:
+                level = ring_refcount_dict[wrap_ring(self._ring)]
+            except KeyError:
+                level = -1
+            if level > 1:
+                sys.stderr.write("WARNING: %d\n"%(<size_t>self._ring))
+            else:
+                sys.stderr.write("__dealloc__: %s\n"%(<size_t>self._ring))
+            sys.stderr.flush()
             singular_ring_delete(self._ring)

     def __copy__(self):

Then, I ran python -t on the segfaulting test. Observation: It happens precisely twice that "WARNING" is printed, i.e., the __dealloc__ method is called even though there remain multiple references to the underlying libsingular ring.

In both cases it is QQ['a','b','c','d']. Here is a snippet from the output:

reference 2 to 4409548912, wrapper 4302568952
reference 3 to 4409548912, wrapper 4302569000
reference 4 to 4409548912, wrapper 4302568952
At 4409548912, creating Multivariate Polynomial Ring in a, b, c, d over Rational Field
reference 5 to 4409548912, wrapper 4302569000
reference 6 to 4409548912, wrapper 4302568952
reference 7 to 4409548912, wrapper 4302569000
reference 8 to 4409548912, wrapper 4302568952
reference 9 to 4409548912, wrapper 4302569000
reference 10 to 4409548912, wrapper 4302568952
reference 2 to 4409549416, wrapper 4302568928
reference 3 to 4409549416, wrapper 4302569000
dereference level 9 of 4409548912, wrapper 4302568928
dereference level 8 of 4409548912, wrapper 4302568952
dereference level 7 of 4409548912, wrapper 4302568928
dereference level 6 of 4409548912, wrapper 4302568952
dereference level 5 of 4409548912, wrapper 4302568928
dereference level 4 of 4409548912, wrapper 4302568952
dereference level 3 of 4409548912, wrapper 4302568928
WARNING: 4409548912
dereference level 2 of 4409548912, wrapper 4302568952
dereference level 1 of 4409548912, wrapper 4302568928
dereference level 0 of 4409548912, wrapper 4302568952

However, I am not totally sure whether this indicates a problem, because in both cases the remaining references are immediately removed. Also, it is always the case that 4 references are set to the libsingular ring before actually creating the polynomial ring in Sage.
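
Reading such a log by eye is tedious, so a small script can do the bookkeeping. The sketch below assumes exactly the message formats produced by the instrumentation patch above; `premature_deallocs` is a hypothetical helper name. For every WARNING line it reports the ring address together with the last reference level logged for that ring.

```python
import re

REF = re.compile(r"^reference (\d+) to (\d+), wrapper \d+$")
DEREF = re.compile(r"^dereference level (\d+) of (\d+), wrapper \d+$")
WARN = re.compile(r"^WARNING: (\d+)$")


def premature_deallocs(lines):
    """Return (address, last logged level) for every WARNING line."""
    level = {}      # ring address -> most recently logged reference level
    flagged = []
    for line in lines:
        line = line.strip()
        m = REF.match(line) or DEREF.match(line)
        if m:
            level[int(m.group(2))] = int(m.group(1))
            continue
        m = WARN.match(line)
        if m:
            addr = int(m.group(1))
            flagged.append((addr, level.get(addr)))
    return flagged


sample = """\
dereference level 4 of 4409548912, wrapper 4302568952
dereference level 3 of 4409548912, wrapper 4302568928
WARNING: 4409548912
dereference level 2 of 4409548912, wrapper 4302568952
""".splitlines()

report = premature_deallocs(sample)
```

Lines that match none of the three patterns, such as the "At ..., creating ..." messages, are simply skipped.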

One last observation: You may notice a libsingular ring at address 4409549416 that is referenced here as well, apparently in the middle of the construction of QQ['a','b','c','d']. It is later used for QQ['x','y','z']. The last report before the segfault is

reference 32 to 4409549416, wrapper 4302568952

Seems like a wild-goose chase to me, though.

simon-king-jena commented 12 years ago
comment:12

Replying to @simon-king-jena:

One last observation: You may notice a libsingular ring at address 4409549416 that is referenced here as well, apparently in the middle of the construction of QQ['a','b','c','d']. It is later used for QQ['x','y','z']. The last report before the segfault is

reference 32 to 4409549416, wrapper 4302568952

And this ring is in fact currRing when it crashes.

nbruin commented 12 years ago
comment:13

OK! good progress. Instrumenting sagedoc.py a little bit we can indeed see the order in which the doctests are executed:

__main__
__main__.change_warning_output
__main__.check_with_tolerance
__main__.example_0
__main__.example_1
__main__.example_10
__main__.example_11
__main__.example_12
__main__.example_13
__main__.example_14
__main__.example_15
__main__.example_16
__main__.example_17
__main__.example_18
__main__.example_19
__main__.example_2
__main__.example_20
__main__.example_21
__main__.example_22
__main__.example_23
__main__.example_24
__main__.example_25
__main__.example_26
__main__.example_27
Unhandled SIGSEGV

so that indeed seems to be alphabetical order.
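
For reference, plain string sorting reproduces the order in the listing. A minimal check, assuming the examples are named `example_0` through `example_27` as in the generated test file:

```python
names = ["example_%d" % i for i in range(28)]
order = sorted(names)                  # lexicographic, not numeric

position_of_27 = order.index("example_27")
position_of_3 = order.index("example_3")
```

Under this ordering `example_27` is the 21st example to run, and `example_3` through `example_9` come after it, so they would not have been reached before the crash.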

Now let's run the doctests with singular-using-malloc. Result: No segfault. OSX comes with gmalloc, which is a guarded malloc for debugging purposes. It places every allocation on a separate page and unmaps that page upon freeing. So, any access-after-free leads to a segfault. Now we do get a segfault and it happens a lot sooner than example_27. In fact, now the segfault survives in gdb. The error happens when executing

G = I.groebner_basis()###line 921:_sage_    >>> G = I.groebner_basis()

Here's a session with gdb once the segfault has happened. I think I have been able to extract enough data to point at the probable problem.

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x00000001850dbf44
__pyx_f_4sage_4libs_8singular_8function_call_function (__pyx_v_self=0x190ab8960, __pyx_v_args=0x190a8e810, __pyx_v_R=0x19c39be70, __pyx_optional_args=<value temporarily unavailable, due to optimizations>) at sage/libs/singular/function.cpp:13253
13253       currRingHdl->data.uring->ref = (currRingHdl->data.uring->ref - 1);
####NB: This is line 1410 in sage/libs/singular/function.pyx
(gdb) print currRingHdl
$1 = (idhdl) 0x17c2b5fd0
(gdb) print currRingHdl->data
$2 = {
  i = -2062696816,
  uring = 0x1850dbe90,
  p = 0x1850dbe90,
  n = 0x1850dbe90,
  uideal = 0x1850dbe90,
  umap = 0x1850dbe90,
  umatrix = 0x1850dbe90,
  ustring = 0x1850dbe90 <Address 0x1850dbe90 out of bounds>,
  iv = 0x1850dbe90,
  bim = 0x1850dbe90,
  l = 0x1850dbe90,
  li = 0x1850dbe90,
  pack = 0x1850dbe90,
  pinf = 0x1850dbe90
}
(gdb) print currRingHdl->data.uring
$3 = (ring) 0x1850dbe90
(gdb) print currRingHdl->data.uring->ref
Cannot access memory at address 0x1850dbf44
(gdb) print  *__pyx_v_si_ring
$10 = {
  idroot = 0x0, 
  order = 0x19c3cbff0, 
  block0 = 0x19c3cdff0, 
  block1 = 0x19c3cfff0, 
  parameter = 0x0, 
  minpoly = 0x0, 
  minideal = 0x0, 
  wvhdl = 0x19c3c9fe0, 
  names = 0x19c3bdfe0, 
  ordsgn = 0x19c3ddfe0, 
  typ = 0x19c3dffd0, 
  NegWeightL_Offset = 0x0, 
  VarOffset = 0x19c3d9ff0, 
  qideal = 0x0, 
  firstwv = 0x0, 
  PolyBin = 0x104ee8440, 
  ringtype = 0, 
  ringflaga = 0x0, 
  ringflagb = 0, 
  nr2mModul = 0, 
  nrnModul = 0x0, 
  options = 100663424, 
  ch = 0, 
  ref = 0, 
  float_len = 0, 
  float_len2 = 0, 
  N = 3, 
  P = 0, 
  OrdSgn = 1, 
  firstBlockEnds = 3, 
  real_var_start = 0, 
  real_var_end = 0, 
  isLPring = 0, 
  VectorOut = 0, 
  ShortOut = 0, 
  CanShortOut = 1, 
  LexOrder = 0, 
  MixedOrder = 0, 
  ComponentOrder = -1, 
  ExpL_Size = 3, 
  CmpL_Size = 3, 
  VarL_Size = 1, 
  BitsPerExp = 20, 
  ExpPerLong = 3, 
  pCompIndex = 2, 
  pOrdIndex = 0, 
  OrdSize = 1, 
  VarL_LowIndex = 1, 
  MinExpPerLong = 3, 
  NegWeightL_Size = 0, 
  VarL_Offset = 0x19c3e3ff0, 
  bitmask = 1048575, 
  divmask = 1152922604119523329, 
  p_Procs = 0x19c3e7f80, 
  pFDeg = 0x104a80150 <pDeg(spolyrec*, sip_sring*)>, 
  pLDeg = 0x104a80920 <pLDegb(spolyrec*, int*, sip_sring*)>, 
  pFDegOrig = 0x104a80150 <pDeg(spolyrec*, sip_sring*)>, 
  pLDegOrig = 0x104a80920 <pLDegb(spolyrec*, int*, sip_sring*)>, 
  p_Setm = 0x104a7ff40 <p_Setm_TotalDegree(spolyrec*, sip_sring*)>, 
  cf = 0x11e487e70, 
  algring = 0x0, 
  _nc = 0x0
}
(gdb) print __pyx_v_si_ring
$11 = (ip_sring *) 0x19c3c5e90
(gdb) print ((struct __pyx_obj_4sage_5rings_10polynomial_28multi_polynomial_libsingular_MPolynomialRing_libsingular *)__pyx_v_R)->_ring
$12 = (ip_sring *) 0x19c3c5e90
(gdb) print ((struct __pyx_obj_4sage_5rings_10polynomial_6plural_NCPolynomialRing_plural *)__pyx_v_R)->_ring
$13 = (ip_sring *) 0x10019ff30
####NB: so PY_TYPE_CHECK(R, MPolynomialRing_libsingular) is true
(gdb) print (__pyx_v_si_ring != currRing)
$15 = false
####NB: does this mean that rChangeCurrRing(si_ring) got executed or that si_ring already equalled currRing?
(gdb) print (currRingHdl->data.uring != currRing)
$16 = true
####NB: of course, that's why we segfault on the statement that follows:
####NB:       currRingHdl.data.uring.ref -= 1
(gdb) print *(currRingHdl->data.uring)
Cannot access memory at address 0x1850dbe90
####NB: It looks like currRingHdl.data.uring has been unbound.
####NB: naturally, changing a field on that pointer will corrupt memory (or in this case
####NB: because gmalloc has unmapped the page, cause a segfault)
####NB: Could it be that the code here should really test for uring being still valid?
####NB: (if it can do that at all)?

So I think the issue is in sage.libs.singular.function.call_function:

...
    if currRingHdl.data.uring!= currRing:
        currRingHdl.data.uring.ref -= 1
        currRingHdl.data.uring = currRing # ref counting?
        currRingHdl.data.uring.ref += 1
...

The evidence points absolutely to currRingHdl.data.uring pointing to unallocated (probably freed) memory. The access then of course can have all kinds of effects. At this point it is probably doable for a LibSingular expert to reason about the code whether uring should always be valid at this point (I suspect not).

It looks suspicious to me that sage.libs.singular.ring.singular_ring_delete does do a careful dance to zero out the currRing variable but doesn't seem to care about currRingHdl. I also find it worrying that there apparently is a refcount system right on the ring structures (as you can see above) and yet in singular_ring_delete a separate refcounting dict is used. One would think the same refcounting system should be borrowed by singular_ring_new and singular_ring_delete. It looks to me as if the code above thinks that by increasing ...uring.ref the reference is protected, but singular_ring_delete doesn't seem to take this refcount into account. It could well be that I'm misinterpreting the code and that this is all perfectly safe, though.
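
The suspected interaction can be modelled in a few lines of plain Python. This is a sketch of the hypothesis only, not the Sage code: `Ring`, `ring_new`, `ring_delete` and `handle` are hypothetical stand-ins for the ring struct, singular_ring_new, singular_ring_delete and currRingHdl.data.uring. The model assumes that deletion consults only the separate dict, while call_function counts on the ring's own `ref` field.

```python
class Ring:
    """Stand-in for a ring struct whose memory can be freed."""

    def __init__(self, name):
        self.name = name
        self.freed = False
        self._ref = 0            # models the ref field on the ring itself

    @property
    def ref(self):
        if self.freed:
            raise RuntimeError("access after free: %s" % self.name)
        return self._ref

    @ref.setter
    def ref(self, value):
        if self.freed:
            raise RuntimeError("access after free: %s" % self.name)
        self._ref = value


ring_refcount_dict = {}          # the separate, dict-based count
handle = {"uring": None}         # models currRingHdl.data.uring


def ring_new(name):
    r = Ring(name)
    ring_refcount_dict[r] = 1
    return r


def ring_delete(r):
    ring_refcount_dict[r] -= 1
    if ring_refcount_dict[r] == 0:
        del ring_refcount_dict[r]
        r.freed = True           # assumption: r.ref is never consulted here


def call_function(r):
    if handle["uring"] is not r:
        if handle["uring"] is not None:
            handle["uring"].ref -= 1
        handle["uring"] = r
        r.ref += 1


a = ring_new("A")
call_function(a)                 # handle -> A, A.ref == 1
ring_delete(a)                   # dict count hits 0, A is freed although A.ref == 1
b = ring_new("B")
try:
    call_function(b)             # decrements A.ref through the stale handle
    outcome = "ok"
except RuntimeError as exc:
    outcome = str(exc)
```

In the real process the stale write does not raise; it lands in freed memory, which matches the gdb session where the write to `ref` is the faulting instruction under gmalloc.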

Libsingular specialists: Keep in mind that in principle, singular code can get executed in rather awkward moments, possibly as part of clean-ups of circular garbage and call-backs on weakref cleanup, where equality might be tested of objects that are soon to be deallocated themselves.

The fickleness of the bug is consistent with this condition arising during a cyclic garbage collection with just the right amount of junk around. That would make the occurrence of the bug depend on just about everything in memory. Or at least if you depend on the corruption leading to a segfault, it depends on which location exactly gets corrupted.

I think we might be getting close to a badge for debugging excellence here!

nbruin commented 12 years ago

Attachment: trac_13447-double_refcount.patch.gz

take into account both refcount_dict and ring*.ref fields on deletion.

nbruin commented 12 years ago

Attachment: trac_13447-consolidated_refcount.patch.gz

Consolidate two refcount systems (cruft not yet removed from patch)

nbruin commented 12 years ago
comment:14

OK, two independent patches. Either prevents the segfault. I may just have removed the symptom, but not the cause.

If I'm correctly understanding the problem, attachment: trac_13447-consolidated_refcount.patch should be the preferred solution. However, my unfamiliarity with (lib)singular's intricacies might have caused an oversight somewhere. I think my interpretation is consistent with the use in sage.libs.singular.function.call_function, which is my main source of inspiration.

If people agree, we can clean out the cruft remaining from the refcounting method implemented locally.

nbruin commented 12 years ago

Work Issues: Input from libsingular experts

simon-king-jena commented 12 years ago
comment:16

Replying to @nbruin:

If I'm correctly understanding the problem, attachment: trac_13447-consolidated_refcount.patch should be the preferred solution.

I didn't test the patch yet. However, it seems very straightforward to me: There already is a refcounting, and thus one should use it. I am Cc'ing Volker Braun and Martin von Gagern, the authors of #11339. Does attachment: trac_13447-consolidated_refcount.patch make sense to you as well?

Keeping a double refcount (as with attachment: trac_13447-double_refcount.patch) seems suspicious to me.

Perhaps one should let the patchbots test it? Thus, I'll add this as dependency for #715, and for the patchbot:

Apply trac_13447-consolidated_refcount.patch

PS: You really deserve a badge for debugging excellence! Do I understand correctly that the bug is not on the side of Singular? I'll inform Hans accordingly.

simon-king-jena commented 12 years ago

Description changed:

--- 
+++ 
@@ -1,3 +1,5 @@
 Presently, #715 + #11521 help not permanently keeping parent in memory. In the process we uncovered a hard-but-consistently triggerable problem with the collection of `MPolynomialRing_libsingular`. We have only observed the problem on `bsd.math.washington.edu`, MacOSX 10.6 on x86_64.

 The present work-around is to permanently store references to these upon creation, thus preventing collection. It would be nice if we could properly solve the problem (or at least establish that the problem is specific to `bsd.math`)
+
+Apply [attachment: trac_13447-consolidated_refcount.patch](https://github.com/sagemath/sage-prod/files/10656328/trac_13447-consolidated_refcount.patch.gz)
nbruin commented 12 years ago
comment:17

With the new refcounting, I think it could be that currRingHdl.data.uring holds the last reference to a ring. In fact, it seems that was the source of the segfaults. If that reference is removed in call_function, shouldn't we delete the ring? The naive solution

...
        currRingHdl.data.uring.ref -= 1
        if currRingHdl.data.uring.ref == 0:
            rDelete(currRingHdl.data.uring)
        currRingHdl.data.uring = currRing # ref counting?
        currRingHdl.data.uring.ref += 1
...

seems to have no ill effect (I put a print statement there that did produce some output, so it does happen), but perhaps I'm overlooking something. Are there other places where references are liable to be lost?
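
To make the scenario concrete, here is a toy model in plain Python with a single counter per ring. It is a sketch under assumptions, not the patch: `acquire`, `release` and `current` are hypothetical names, and setting `freed` stands in for rDelete. It shows the global handle ending up as the last owner of a ring, which is then released when call_function switches rings.

```python
class Ring:
    def __init__(self, name):
        self.name = name
        self.ref = 0             # the only counter in this model
        self.freed = False


current = {"uring": None}        # models currRingHdl.data.uring


def acquire(ring):
    ring.ref += 1
    return ring


def release(ring):
    ring.ref -= 1
    if ring.ref == 0:
        ring.freed = True        # stands in for rDelete(ring)


def call_function(ring):
    old = current["uring"]
    if old is not ring:
        current["uring"] = acquire(ring)
        if old is not None:
            release(old)         # may drop the last reference to the old ring


a = acquire(Ring("A"))           # reference held by the Python parent
call_function(a)                 # the handle holds a second reference
release(a)                       # parent is collected; the handle keeps A alive
alive_while_current = not a.freed

b = acquire(Ring("B"))
call_function(b)                 # the handle lets go of A, which is now deleted
```

Without the deletion step in `release`, ring A would stay allocated forever in this model once the handle moved on.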

simon-king-jena commented 12 years ago
comment:18

For the record, the following tests fail:

        sage -t  -force_lib devel/sage/sage/libs/singular/ring.pyx # 6 doctests failed
        sage -t  -force_lib devel/sage/sage/modular/modsym/ambient.py # 1 doctests failed
        sage -t  -force_lib devel/sage/sage/rings/polynomial/multi_polynomial_libsingular.pyx # 1 doctests failed

with

$ hg qa
trac_715_combined.patch
trac_715_local_refcache.patch
trac_715_safer.patch
trac_715_specification.patch
trac_11521_homset_weakcache_combined.patch
trac_11521_callback.patch
trac_13447-consolidated_refcount.patch

So, not all is good, but almost...

vbraun commented 12 years ago
comment:19

I don't know if ring.ref has any meaning to Singular. If we are indeed free to use that field for reference counting in Sage then I'm fine with trac_13447-consolidated_refcount.patch.

Upstream plans to get rid of the whole currRing global variable eventually, for the record.

simon-king-jena commented 12 years ago
comment:20

Two errors mentioned in comment:18 look (again) difficult.

The first one:

sage -t  -force_lib devel/sage/sage/modular/modsym/ambient.py
**********************************************************************
File "/scratch/sking/sage-5.4.beta0/devel/sage-main/sage/modular/modsym/ambient.py", line 1351:
    sage: ModularSymbols(20,2).boundary_space().dimension()
Expected:
    6
Got:
    0

Hence, the way one refcounts libsingular rings influences the dimension of Hecke modules. Strange at least...

Note, however, that the value returned by the "dimension()" method above is not constant, because it only returns a lower bound (if I recall correctly) that is increased when one learns more about the Hecke module. Hence, it could very well be that ModularSymbols(20,2).boundary_space() used to be cached but is now garbage collected, so that information on the dimension is lost.
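
The suspected amnesia can be reproduced in miniature with a weak cache. This is a sketch under the assumption stated above, namely that the dimension is a lower bound refined over time; `Space`, `learn` and `boundary_space` are hypothetical names and not the Sage classes.

```python
import gc
import weakref


class Space:
    """Object that accumulates knowledge (a lower bound) over its lifetime."""

    def __init__(self):
        self.known_dimension = 0

    def learn(self, d):
        self.known_dimension = max(self.known_dimension, d)


_cache = weakref.WeakValueDictionary()   # entries vanish with their objects


def boundary_space(key):
    try:
        return _cache[key]
    except KeyError:
        space = Space()
        _cache[key] = space
        return space


s = boundary_space((20, 2))
s.learn(6)
while_referenced = boundary_space((20, 2)).known_dimension

del s                                    # drop the only strong reference
gc.collect()
after_collection = boundary_space((20, 2)).known_dimension
```

Keeping a strong reference to the object, as suggested for the doctest, is what prevents the second lookup from starting over with a fresh object.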

The second error is apparently ignored and only printed to stderr:

sage -t  -force_lib devel/sage/sage/rings/polynomial/multi_polynomial_ring.py
         [2.1 s]
sage -t  -force_lib devel/sage/sage/rings/polynomial/multi_polynomial.pyx
         [4.3 s]
sage -t  -force_lib devel/sage/sage/rings/polynomial/groebner_fan.py
         [7.8 s]
sage -t  -force_lib devel/sage/sage/rings/polynomial/multi_polynomial_libsingular.pyx
Exception AttributeError: AttributeError('PolynomialRing_field_with_category' object has no attribute '_modulus',) in  ignored
Exception AttributeError: AttributeError('PolynomialRing_field_with_category' object has no attribute '_modulus',) in  ignored
**********************************************************************
File "/scratch/sking/sage-5.4.beta0/devel/sage-main/sage/rings/polynomial/multi_polynomial_libsingular.pyx", line 409:
    sage: len(ring_refcount_dict) == n + 1
Expected:
    True
Got:
    False
**********************************************************************
1 items had failures:
   1 of  19 in __main__.example_4
***Test Failed*** 1 failures.
For whitespace errors, see the file /Users/SimonKing/.sage/tmp/bsd.math.washington.edu-96119/multi_polynomial_libsingular_8098.py
         [4.2 s]
sage -t  -force_lib devel/sage/sage/rings/polynomial/multi_polynomial_ring_generic.pyx
         [2.3 s]
sage -t  -force_lib devel/sage/sage/rings/polynomial/polydict.pyx
         [2.0 s]

Since these were parallel tests, I can't tell where the ignored attribute errors actually came from.

simon-king-jena commented 12 years ago
comment:21

PS: When consolidating refcounters, we must not forget #13145, which only got merged in sage-5.4.beta1.

nbruin commented 12 years ago
comment:22

Replying to @simon-king-jena:

sage -t  -force_lib devel/sage/sage/modular/modsym/ambient.py
**********************************************************************
File "/scratch/sking/sage-5.4.beta0/devel/sage-main/sage/modular/modsym/ambient.py", line 1351:
    sage: ModularSymbols(20,2).boundary_space().dimension()
Expected:
    6
Got:
    0

I have seen that error before, with other work-arounds (and I think also with singular-malloc), so if it's indeed only a lower bound, then sage has merely changed. It's not an error. If you're worried you can see where that dimension is computed and put a hard ref in the creation of the relevant object. If garbage collection is the cause of the observed amnesia, a hard ref should "solve" it. In that case you can just change the doctest answer.

The second error is apparently ignored and only printed to stderr:

Exception AttributeError: AttributeError('PolynomialRing_field_with_category' object has no attribute '_modulus',) in  ignored
Exception AttributeError: AttributeError('PolynomialRing_field_with_category' object has no attribute '_modulus',) in  ignored

This is a worrisome error because it's fickle. On a linux x86_64 box, I get this reliably in sage/rings/polynomial/multi_polynomial_libsingular.pyx. When I let it print the lines it's doctesting I get:

set_random_seed(0L)
change_warning_output(sys.stdout)
F = GF(Integer(7)**Integer(2), names=('a',)); (a,) = F._first_ngens(1)###line 1913:_sage_    >>> F.<a> = GF(7^2)
R = F['x, y']; (x, y,) = R._first_ngens(2)###line 1914:_sage_    >>> R.<x,y> = F[]
p = a*x**Integer(2) + y + a**Integer(3); p###line 1915:_sage_    >>> p = a*x^2 + y + a^3; p
q = copy(p)###line 1917:_sage_    >>> q = copy(p)
p == q###line 1918:_sage_    >>> p == q
p is q###line 1920:_sage_    >>> p is q
lst = [p,q];###line 1922:_sage_    >>> lst = [p,q];
matrix(ZZ, Integer(2), Integer(2), lambda i,j: bool(lst[i]==lst[j]))###line 1923:_sage_    >>> matrix(ZZ, 2, 2, lambda i,j: bool(lst[i]==lst[j]))
Exception AttributeError: AttributeError('PolynomialRing_field_with_category' object has no attribute '_modulus',) in  ignored
Exception AttributeError: AttributeError('PolynomialRing_field_with_category' object has no attribute '_modulus',) in  ignored
matrix(ZZ, Integer(2), Integer(2), lambda i,j: bool(lst[i] is lst[j]))###line 1926:_sage_    >>> matrix(ZZ, 2, 2, lambda i,j: bool(lst[i] is lst[j]))
sig_on_count()

so it happens when doctesting line 1923. These are probably errors encountered during a dealloc, so it might be happening in a garbage collection. It could also be a WeakValueDict deletion callback that's trying to do a comparison that fails. Googling shows that you've asked about that exact error message on cython-users on 27 January, 2012, so if you solved the bug that led to that question then, perhaps you can also solve this one. It could also be a straight memory corruption. [edit:] OK that was on #11521. You didn't really find that error. You just made it go away by inserting a garbage collection. The good news is that this makes it not so likely that the patch here is causing a new memory corruption. It's more likely a lingering issue that once again gets triggered.
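
For readers unfamiliar with "Exception ... ignored" messages: an exception raised while an object is being torn down cannot propagate to the caller, so the interpreter only reports it on stderr. The following is a minimal pure-Python sketch of that behaviour, assuming CPython 3.8 or later (for `sys.unraisablehook`). The `Parent` and `Element` classes are hypothetical stand-ins, not Sage classes, and `__del__` is used here only to imitate what a Cython `__dealloc__` would do.

```python
import sys

captured = []  # exception types reported as "unraisable"


def record_unraisable(unraisable):
    # Store only the type, so we do not keep the dying object alive.
    captured.append(unraisable.exc_type)


class Parent:
    """Toy parent that never got (or already lost) its _modulus attribute."""


class Element:
    def __init__(self, parent):
        self.parent = parent

    def __del__(self):
        # Imitates a destructor that needs data stored on the parent.
        self.parent._modulus  # raises AttributeError here


old_hook = sys.unraisablehook
sys.unraisablehook = record_unraisable
try:
    e = Element(Parent())
    del e  # in CPython the refcount hits zero and __del__ runs right away
finally:
    sys.unraisablehook = old_hook

print(captured)
```

The `del e` statement itself does not raise; the error is only visible through the hook (or on stderr with the default hook), which matches the way the messages above show up in the doctest output without failing the test.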

simon-king-jena commented 12 years ago
comment:23

Replying to @nbruin:

Replying to @simon-king-jena:

sage -t  -force_lib devel/sage/sage/modular/modsym/ambient.py
**********************************************************************
File "/scratch/sking/sage-5.4.beta0/devel/sage-main/sage/modular/modsym/ambient.py", line 1351:
    sage: ModularSymbols(20,2).boundary_space().dimension()
Expected:
    6
Got:
    0

I have seen that error before, with other work-arounds (and I think also with singular-malloc), so if it's indeed only a lower bound, then sage has merely changed. It's not an error.

I am not a number theorist, but I have learnt from the code that the dimension is computed from the number of "cusps". Hence, if one adds the computation of cusps to that test and assigns the involved Hecke modules to variables, then the tests pass:

            sage: M = ModularSymbols(20, 2)
            sage: B = M.boundary_space(); B
            Space of Boundary Modular Symbols for Congruence Subgroup Gamma0(20) of weight 2 and over Rational Field
            sage: M.cusps()
            [Infinity, 0, -1/4, 1/5, -1/2, 1/10]            
            sage: M.dimension()
            7
            sage: B.dimension()
            6

I think this would be a good solution.

The second error is apparently ignored and only printed to stderr:

Exception AttributeError: AttributeError('PolynomialRing_field_with_category' object has no attribute '_modulus',) in  ignored
Exception AttributeError: AttributeError('PolynomialRing_field_with_category' object has no attribute '_modulus',) in  ignored

This is a worrisome error because it's fickle. ... so it happens when doctesting line 1923. These are probably errors encountered during a dealloc, so it might be happening in a garbage collection. It could also be a WeakValueDict deletion callback that's trying to do a comparison that fails.

Agreed.

Googling shows that you've asked about that exact error message on cython-users on 27 January, 2012, so if you solved the bug that led to that question then, perhaps you can also solve this one.

Yes, but that question was a pure Cython question, namely like: "Wouldn't it be a good idea to print the function name in which an error was ignored, rather than printing an empty string? That would help debugging."

It could also be a straight memory corruption. [edit:] OK that was on #11521. You didn't really find that error.

Yes. But if it surfaces again, we should now solve it for good. I guess deletion from a weak dictionary is a likely candidate.

simon-king-jena commented 12 years ago
comment:24

I only find two files in sage/rings/ where the string "._modulus" occurs: polynomial_ring.py and polynomial_zz_pex.pyx:

devel/sage/sage/rings/polynomial/polynomial_zz_pex.pyx:    c = parent._modulus
devel/sage/sage/rings/polynomial/polynomial_zz_pex.pyx:                d = parent._modulus.ZZ_pE(list(x.polynomial()))
devel/sage/sage/rings/polynomial/polynomial_zz_pex.pyx:                d = parent._modulus.ZZ_pE(list(e_polynomial))
devel/sage/sage/rings/polynomial/polynomial_zz_pex.pyx:        d = self._parent._modulus.ZZ_pE(list(left.polynomial()))
devel/sage/sage/rings/polynomial/polynomial_zz_pex.pyx:        _a = self._parent._modulus.ZZ_pE(list(a.polynomial()))
devel/sage/sage/rings/polynomial/polynomial_zz_pex.pyx:        self._parent._modulus.restore()
devel/sage/sage/rings/polynomial/polynomial_zz_pex.pyx:        self._parent._modulus.restore()
devel/sage/sage/rings/polynomial/polynomial_zz_pex.pyx:        left._parent._modulus.restore()
devel/sage/sage/rings/polynomial/polynomial_zz_pex.pyx:        self._parent._modulus.restore()
devel/sage/sage/rings/polynomial/polynomial_ring.py:            self._modulus = ntl_ZZ_pEContext(ntl_ZZ_pX(list(base_ring.polynomial()), p))

Hence, it should be easy to find out which of the few locations is actually involved.

simon-king-jena commented 12 years ago
comment:25

I found that the attribute error occurs in the cdef function get_cparent in sage/rings/polynomial/polynomial_zz_pex.pyx. Next question is then, of course: At what point is get_cparent called?

simon-king-jena commented 12 years ago
comment:26

Printing messages to stderr, it seems to me that the error occurs during deallocation of a polynomial template, namely in sage/rings/polynomial/polynomial_template.pxi:

    def __dealloc__(self):
        """
        EXAMPLE::

            sage: P.<x> = GF(2)[]
            sage: del x
        """
        celement_destruct(&self.x, get_cparent((<Polynomial_template>self)._parent))

Is the cparent of self deallocated too early (perhaps because the refcounting is still not accurate)?

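Or is it a nasty race condition? Namely:

  • A polynomial p in a polynomial ring R is about to be garbage collected.
  • All python stuff is deleted first. In particular, p's reference to its parent R is gone.
  • Incidentally, because the reference from p to R is gone, R can now be collected as well.
  • When R gets deleted, its underlying libsingular ring is deallocated.
  • Now, p.__dealloc__ is finally called, and tries to access the underlying libsingular ring - but it is too late.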

Question: If a polynomial is created, will the reference counter to the underlying libsingular ring be incremented?

nbruin commented 12 years ago
comment:27

Or is it a nasty race condition? Namely:

  • A polynomial p in a polynomial ring R is about to be garbage collected.
  • All python stuff is deleted first. In particular, p's reference to its parent R is gone.
  • Incidentally, because the reference from p to R is gone, R can now be collected as well.
  • When R gets deleted, its underlying libsingular ring is deallocated.
  • Now, p.__dealloc__ is finally called, and tries to access the underlying libsingular ring - but it is too late.

Question: If a polynomial is created, will the reference counter to the underlying libsingular ring be incremented?

From what I understand, __dealloc__ methods cannot assume that python attributes are still valid. They fundamentally cannot, because otherwise it wouldn't be possible to clean up cyclic garbage (__del__ methods are run when all attributes are valid and hence if they are present in cyclic garbage, it is not cleared).

So, I think that if the libsingular ring pointer is necessary during deallocation of a polynomial, then it should store it in a c-variable. Then it would indeed need to increase the reference counter.
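
A toy model of that idea, in plain Python rather than Cython, may make the bookkeeping clearer. Everything here is an assumption for illustration: `ToyCRing` stands in for the C-level ring struct with a `ref` field, and `ring_reference` / `ring_delete` are hypothetical helpers that only mimic the counting, they are not the Sage functions. The point is that the element keeps its own counted handle and never goes through its parent during teardown.

```python
class ToyCRing:
    """Stand-in for a C-level ring with its own reference counter."""

    def __init__(self):
        self.ref = 0
        self.freed = False


def ring_reference(cring):
    # Take one counted reference to the C-level ring.
    cring.ref += 1
    return cring


def ring_delete(cring):
    # Drop one reference; free the ring when nobody holds it any more.
    cring.ref -= 1
    if cring.ref == 0:
        cring.freed = True


class ToyRing:
    def __init__(self):
        self._cring = ring_reference(ToyCRing())

    def teardown(self):
        ring_delete(self._cring)


class ToyElement:
    def __init__(self, parent):
        self._parent = parent
        # Own counted handle: teardown will not need self._parent at all.
        self._cring = ring_reference(parent._cring)

    def teardown(self):
        if self._cring.freed:
            raise RuntimeError("C ring freed before its element")
        ring_delete(self._cring)


R = ToyRing()
p = ToyElement(R)
cring = R._cring

R.teardown()  # the parent goes first, as in the problematic ordering
state_after_ring = (cring.ref, cring.freed)

p.teardown()  # still safe: the element's reference kept the ring alive
state_after_element = (cring.ref, cring.freed)
```

In this model the ring data survives the parent's teardown because the element still holds one reference, and it is freed only when the last element lets go.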

You'd initially think that you could store a "c level" pointer to the python polynomial ring and manually increase the refcount. That would ensure that the python polynomial ring is still alive when __dealloc__ gets called. However, it would also mean that there is an extra refcount that the cyclic garbage detector wouldn't be able to explain, so polynomial rings would always appear to have an "external" reference and hence never be eligible for garbage collection. Since rings tend to cache 0 and 1, such references would always be present and all your work would be for naught: Polynomial rings would exist forever.

So I think you have to bite the bullet and ensure that get_cparent doesn't access any python attributes or that you can avoid calling it in a __dealloc__.

EDIT: Or perhaps not. While looking at the code a bit I concluded I don't understand a bit of it, due to the templating. I think what I wrote above has some truth to it, but I honestly cannot say whether it has any relevance to the problem at hand. It seems to explain what you're experiencing.

We have:

    def __dealloc__(self):
        celement_destruct(&self.x, get_cparent((<Polynomial_template>self)._parent))

and for us:

get_cparent(parent) == <ntl_ZZ_pEContext_class>(parent._modulus)

The _parent attribute is a cython slot. However, it holds a reference to a python-managed object, so I think cython ensures it's properly taken into account in GC cycle counting. But that would suggest to me python could clear this slot to break cycles! So in that case, Polynomial_template is never safe. It could be I'm wrong, however.

I haven't been able to locate what parent._modulus is in this case. However,

sage: K.<a>=GF(next_prime(2**60)**3)
sage: R.<x> = PolynomialRing(K,implementation='NTL')
sage: '_modulus' in  R.__dict__.keys()
True

suggests this attribute is stored in a dictionary. It's set in sage.rings.polynomial.polynomial_ring.PolynomialRing_field.__init__:1367

        if implementation == "NTL" and is_FiniteField(base_ring) and not(sparse):
            from sage.libs.ntl.ntl_ZZ_pEContext import ntl_ZZ_pEContext
            from sage.libs.ntl.ntl_ZZ_pX import ntl_ZZ_pX
            from sage.rings.polynomial.polynomial_zz_pex import Polynomial_ZZ_pEX

            p=base_ring.characteristic()
            self._modulus = ntl_ZZ_pEContext(ntl_ZZ_pX(list(base_ring.polynomial()), p))
            element_class = Polynomial_ZZ_pEX

I guess we've just found that this is not a very good place to store _modulus. Where else, though? Would it be enough to have a cythonized version of PolynomialRing_field so that _modulus can be tied a little tighter to the parent? It seems to me the parent is the right place to store this information. We just need to convince the parent to hold on to its information for a bit longer.

simon-king-jena commented 12 years ago
comment:29

Replying to @nbruin:

We have:

    def __dealloc__(self):
        celement_destruct(&self.x, get_cparent((<Polynomial_template>self)._parent))

and for us:

get_cparent(parent) == <ntl_ZZ_pEContext_class>(parent._modulus)

The _parent attribute is a cython slot.

Interestingly, there is no complaint about a missing attribute _parent. It is _modulus that is missing.

However, it holds a reference to a python-managed object, so I think cython ensures it's properly taken into account in GC cycle counting. But that would suggest to me python could clear this slot to break cycles! So in that case, Polynomial_template is never safe. It could be I'm wrong, however.

I think you are right. The __dealloc__ of Polynomial_template is unsafe, unless polynomial rings will stay in memory forever. But I'd love to hear that we are wrong, because otherwise each polynomial would need a pointer to the c-data expected to be returned by get_cparent((<Polynomial_template>self)._parent), and we'd need to take into account reference counting for the c-parent during creation and deletion of polynomials.

Or perhaps there is a way out. We have a polynomial ring R and we have some elements a,b,c,... Each element points to R, and R points to some of its elements, namely to its generators. The problem is that deallocation of the elements is only possible as long as R is alive.

If we'd manually incref R upon creation of an element x, decrefing R when x gets deallocated, then we would ensure that R will survive until the last of its elements is deleted. Or would that mean that the elements will survive as well, because of the reference from R to its generators? Edit: Yes it would.
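
The effect can be seen in a small pure-Python sketch, assuming CPython's cyclic collector. The `Ring` and `Element` classes are hypothetical toys. A manual incref cannot be written in plain Python, so the module-level list `pinned` stands in for it: in both cases the ring has a reference that the cycle detector cannot attribute to the cycle itself, so the whole cycle stays alive.

```python
import gc
import weakref


class Ring:
    def __init__(self):
        # ring -> generator: the ring caches one of its elements
        self.gens = [Element(self)]


class Element:
    def __init__(self, parent):
        # generator -> ring: this closes the reference cycle
        self.parent = parent


pinned = []  # stand-in for an extra, manually added reference to the ring


def make_ring(pin):
    R = Ring()
    if pin:
        pinned.append(R)
    # Only a weak reference escapes, so it does not keep R alive itself.
    return weakref.ref(R)


free_ref = make_ring(pin=False)
pinned_ref = make_ring(pin=True)
gc.collect()

print(free_ref() is None)    # the plain ring/generator cycle is collected
print(pinned_ref() is None)  # the ring with the extra reference is not
```

So a ring that is only referenced by its own cached generators can be collected, while one extra reference per element would keep ring and generators alive indefinitely.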

nbruin commented 12 years ago
comment:30

Why don't you do something along the lines of:

cdef cparent get_cparent(parent) except? NULL:
    if parent is None:
        return NULL
    cdef ntl_ZZ_pEContext_class c 
    try:
        c = parent._modulus
    except AttributeError:
        c = <some vaguely appropriate proxy value>
    return &(c.x)

This should really only be happening upon deletion anyway and I'd be surprised if having the correct _modulus is very important at that point. It's a dirty hack but alternatives probably mean a full reimplementation of these polynomial rings.

Looking in ntl_ZZ_pEX_linkage seems to indicate it only leads to

if parent != NULL: parent[0].restore()

so perhaps you can just use NULL as a proxy value!
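
As a plain-Python analogue of that fallback (a sketch only, with `None` playing the role of NULL): `get_context`, `destruct`, `ToyContext` and `ToyParent` are hypothetical names, not the functions in polynomial_zz_pex.pyx or ntl_ZZ_pEX_linkage.

```python
class ToyContext:
    """Stand-in for the NTL modulus context."""

    def __init__(self):
        self.restored = 0

    def restore(self):
        self.restored += 1


class ToyParent:
    """Stand-in for a polynomial ring; may or may not have _modulus."""


def get_context(parent):
    # Return None both for a missing parent and for a parent that has
    # already lost its _modulus attribute during teardown.
    if parent is None:
        return None
    return getattr(parent, "_modulus", None)


def destruct(coeffs, context):
    # Mirrors "if parent != NULL: parent[0].restore()" before freeing.
    if context is not None:
        context.restore()
    coeffs.clear()


healthy = ToyParent()
healthy._modulus = ToyContext()
torn_down = ToyParent()  # no _modulus: simulates a half-deleted parent

a = [1, 2, 3]
b = [4, 5]
destruct(a, get_context(healthy))
destruct(b, get_context(torn_down))  # no AttributeError, data still freed
```

Whether skipping the `restore()` call is harmless in the real NTL wrapper is exactly the open question here; the sketch only shows the control flow.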

nbruin commented 12 years ago

return NULL instead of raising AttributeError

nbruin commented 12 years ago
comment:31

Attachment: trac_13447-modulus_fix.patch.gz

YAY! indeed, returning NULL seems to solve the problem. I don't know whether there are any other ill effects, but since NULL was already returned upon missing parent, I think that with an incomplete parent it's an appropriate value too. It seems the NTL wrapper was already written with the possibility in mind of not having a valid parent around.

simon-king-jena commented 12 years ago

Changed upstream from Reported upstream. No feedback yet. to None of the above - read trac for reasoning.

simon-king-jena commented 12 years ago
comment:32

Great! I didn't expect that NULL would work here, because, after all, c-data of a polynomial is supposed to be deallocated with the help of c-data of a polynomial ring, via celement_destruct(&self.x, get_cparent((<Polynomial_template>self)._parent)). But if someone has already thought of the possibility that the parent is invalid, then doing the same with an invalid parent._modulus seems the right thing to do.

While we are at it, I changed the "Reported Upstream" field, because it isn't an upstream bug, after all.

A bit later today, I will also provide a patch fixing the Hecke module dimensions, as in comment:23. Do you want me to ask a number theorist whether the fix ("Compute the cusps, which implies that the dimension is computed as well") is mathematically correct? Or is it enough for you that the same number as before (dimension 6) is obtained?

simon-king-jena commented 12 years ago

Author: Nils Bruin, Simon King

simon-king-jena commented 12 years ago

Changed work issues from Input from libsingular experts to Input from a libsingular expert

simon-king-jena commented 12 years ago

Description changed:

--- 
+++ 
@@ -2,4 +2,12 @@

 The present work-around is to permanently store references to these upon creation, thus preventing collection. It would be nice if we could properly solve the problem (or at least establish that the problem is specific to `bsd.math`)

-Apply [attachment: trac_13447-consolidated_refcount.patch](https://github.com/sagemath/sage-prod/files/10656328/trac_13447-consolidated_refcount.patch.gz)
+**Unmerge** #13145
+
+Apply
+
+* [attachment: trac_13447-consolidated_refcount.patch](https://github.com/sagemath/sage-prod/files/10656328/trac_13447-consolidated_refcount.patch.gz)
+* [attachment: trac_13447-modulus_fix.patch](https://github.com/sagemath/sage-prod/files/10656329/trac_13447-modulus_fix.patch.gz)
+* [attachment: trac_13447-rely_on_singular_refcount.patch](https://github.com/sagemath/sage-prod/files/10656330/trac_13447-rely_on_singular_refcount.patch.gz)
+
+**Merge together with** #715, #11521
simon-king-jena commented 12 years ago
comment:33

I have provided a new patch, that removes the custom refcounter, using Singular's refcounter (ring.ref) instead.

As I have announced, I also fixed the failing modular symbols test, by computing the dimension before displaying it: The test previously worked only because a computation happened in a different test that happened to be executed early enough, that side effect being possible because Hecke modules would stay in memory permanently.
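
The kind of hidden dependency described here can be reproduced with a toy weak cache, assuming CPython. All names are hypothetical, and the "cusps" are placeholder values rather than anything computed: the sketch only shows how a lazily filled attribute is lost once the cached object is collected and rebuilt.

```python
import gc
import weakref

_cache = weakref.WeakValueDictionary()


class ToyBoundarySpace:
    def __init__(self, level):
        self.level = level
        self._known_cusps = set()

    def find_cusps(self):
        # Placeholder: pretend the computation discovers six cusps.
        self._known_cusps.update(range(6))

    def dimension(self):
        # Only counts what has been discovered so far (a lower bound).
        return len(self._known_cusps)


def boundary_space(level):
    # Weakly cached constructor: objects vanish when nobody holds them.
    try:
        return _cache[level]
    except KeyError:
        B = ToyBoundarySpace(level)
        _cache[level] = B
        return B


# Without a strong reference, the earlier work is forgotten.
boundary_space(20).find_cusps()
gc.collect()
forgotten = boundary_space(20).dimension()

# Holding the object in a variable keeps the computed state.
B = boundary_space(20)
B.find_cusps()
remembered = boundary_space(20).dimension()
```

With a permanent (strong) cache both calls would return the same value, which is why the old doctest passed only as long as Hecke modules were never collected.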

I did not run the full test suite yet. But sage/rings/polynomial/plural.pyx and sage/rings/polynomial/multi_polynomial_libsingular.pyx and sage/modular/modsym/ambient.py all work.

Problems for the release manager and the reviewer:

Apply trac_13447-consolidated_refcount.patch trac_13447-modulus_fix.patch trac_13447-rely_on_singular_refcount.patch

simon-king-jena commented 12 years ago

Dependencies: #11521

simon-king-jena commented 12 years ago

Description changed:

--- 
+++ 
@@ -4,7 +4,9 @@

 **Unmerge** #13145

-Apply
+**Apply**
+
+#715, #11521, and then

 * [attachment: trac_13447-consolidated_refcount.patch](https://github.com/sagemath/sage-prod/files/10656328/trac_13447-consolidated_refcount.patch.gz)
 * [attachment: trac_13447-modulus_fix.patch](https://github.com/sagemath/sage-prod/files/10656329/trac_13447-modulus_fix.patch.gz)
simon-king-jena commented 12 years ago
comment:35

Good news! With the new patches, i.e.

$ hg qa
trac_715_combined.patch
trac_715_local_refcache.patch
trac_715_safer.patch
trac_715_specification.patch
trac_11521_homset_weakcache_combined.patch
trac_11521_callback.patch
trac_13447-consolidated_refcount.patch
trac_13447-modulus_fix.patch
trac_13447-rely_on_singular_refcount.patch

there is only one crash with make ptest, namely

sage -t  -force_lib devel/sage/sage/libs/singular/groebner_strategy.pyx # Killed/crashed

The crash seems harmless, this time: It occurs at strat = GroebnerStrategy(None), and I suspect that the attempt to incref "None" is a bad idea...

While we are at it: Perhaps it would be better to not unmerge #13145, but to use it as a dependency.

simon-king-jena commented 12 years ago
comment:36

In order to get nice tests, I think one should introduce a function that returns the refcount of a commutative or noncommutative libsingular ring.

simon-king-jena commented 12 years ago
comment:37

Having a function that shows the refcount really is a good idea! I already found that elements of a non-commutative ring do not increment the refcount to the underlying "libplural" ring.

Anyway, a new test in the commutative setting shall be:

            sage: import gc
            sage: gc.collect()  # random
            sage: R.<x,y,z> = GF(5)[]
            sage: R._get_refcount()
            7
            sage: p = x*y+z
            sage: R._get_refcount()
            8
            sage: del p
            sage: gc.collect()  # random
            sage: R._get_refcount()
            7

Of course, the question is whether we really need to incref the ring if we create an element. I think, in the commutative case, it is needed, because deallocation of an element refers to the cparent.

It could be that in the non-commutative case we have already a work-around:

    def __dealloc__(self):
        # TODO: Warn otherwise!
        # for some mysterious reason, various things may be NULL in some cases
        if self._parent is not <ParentWithBase>None and (<NCPolynomialRing_plural>self._parent)._ring != NULL and self._poly != NULL:
            p_Delete(&self._poly, (<NCPolynomialRing_plural>self._parent)._ring)

I think we could leave it like that, for now. If someone feels it is needed, then he/she may change NCPolynomial_plural to use templates.

nbruin commented 12 years ago
comment:38

I think all the wrap_ring and ring_wrapper stuff can go from polynomial_libsingular. I think this was only there to provide a dictionary key for the ring_refcount_dictionary. Any code that uses it is liable to require change anyway, so deleting it is probably a good thing.

simon-king-jena commented 12 years ago
comment:39

Replying to @nbruin:

I think all the wrap_ring and ring_wrapper stuff can go from polynomial_libsingular. I think this was only there to provide a dictionary key for the ring_refcount_dictionary. Any code that uses it is liable to require change anyway, so deleting it is probably a good thing.

Sure. I am about to prepare a new patch version that uses singular_ring_reference and singular_ring_delete consistently (and not with a manual ..._ring.ref += 1, as in sage.libs.singular.function).

simon-king-jena commented 12 years ago
comment:40

There is one nasty detail with singular_function. If one sets ring = singular_function('ring') and then uses the singular_function to create a ring, then its reference counter is not incremented, even though the following function is called in this case:

cdef inline RingWrap new_RingWrap(ring* r):
    cdef RingWrap ring_wrap_result = PY_NEW(RingWrap)
    ring_wrap_result._ring = r
    ring_wrap_result._ring.ref += 1

    return ring_wrap_result

I do not understand it, yet. But anyway, that's the problem I am currently dealing with.

simon-king-jena commented 12 years ago

Attachment: trac_13447-rely_on_singular_refcount.patch.gz

Use Singular's refcounter for refcounting

simon-king-jena commented 12 years ago
comment:41

The problems that I mentioned are now solved with the new patch version.

Note that the new patch is relative to #13145, which I made a new dependency for #715.

So, for the record:

$ hg qa
trac_715_combined.patch
trac_715_local_refcache.patch
trac_715_safer.patch
trac_715_specification.patch
trac_11521_homset_weakcache_combined.patch
trac_11521_callback.patch
13145.patch
trac_13447-consolidated_refcount.patch
trac_13447-modulus_fix.patch
trac_13447-rely_on_singular_refcount.patch

Let's keep the fingers crossed that the tests pass this time.

Apply trac_13447-consolidated_refcount.patch trac_13447-modulus_fix.patch trac_13447-rely_on_singular_refcount.patch

simon-king-jena commented 12 years ago

Description changed:

--- 
+++ 
@@ -2,11 +2,9 @@

 The present work-around is to permanently store references to these upon creation, thus preventing collection. It would be nice if we could properly solve the problem (or at least establish that the problem is specific to `bsd.math`)

-**Unmerge** #13145
-
 **Apply**

-#715, #11521, and then
+#13145, #715, #11521, and then

 * [attachment: trac_13447-consolidated_refcount.patch](https://github.com/sagemath/sage-prod/files/10656328/trac_13447-consolidated_refcount.patch.gz)
 * [attachment: trac_13447-modulus_fix.patch](https://github.com/sagemath/sage-prod/files/10656329/trac_13447-modulus_fix.patch.gz)