Closed jonassmedegaard closed 6 months ago
Huh. I can't imagine. One problem is that we run the python tests before running the plain-C tests; I would like to know if the plain-C tests pass or not. If they passed, I would then point my finger at something in python and/or swig, which is generating the python bindings.
I guess that the way to test this is to build a container with a 32-bit distro in it.
BTW, just before spotting this issue, I spun a brand new version 5.10.1 to fix a break when the perl bindings are enabled (they should probably remain disabled, since most users should probably be using the Lingua::LinkParser bindings instead.)
I will check that ASAP. (In the last time a similar thing happened on 32-bit CPUs, it was due to an uninitialized struct field that happened to contain zero on 64-bit due to different field sizes and the use of unions).
@linas not sure why you mention the 5.10.1 release in relation to this issue (seems unrelated to me), but in case it is helpful, that newer release seems to fails just the same. Details are at https://buildd.debian.org/status/package.php?p=link-grammar
Please do tell if you guys think this is an issue with the testsuite rather than with the link-grammar code itself - then I can simply ignore the failures on 32bit architectures for now.
OK, So ... I checked. Indeed, I see exactly the same crash. It is occurring in a test of some brand-new code. Stubbing out that test: YGenerationTestCase
lines 1099 to 1111, here: https://github.com/opencog/link-grammar/blob/a613961489bbeaf4333a53c6139ab6a9adb363a9/bindings/python-examples/tests.py#L1099-L1111 allows the unit test to pass. Basically, link-grammar works fine on 32-bit, except for something strange in some brand new code that no one except me uses.
The actual malloc crash is quite bizzare, as the arguments passed to it appear to be just fine. @ampli I'm stumped as to what the actual bug is.
Thanks, Linas.
I will then simply skip those two tests when targeting 32bit archs.
--
Tlf.: +45 40843136 Website: http://dr.jones.dk/
[x] quote me freely [ ] ask before reusing [ ] keep private
I will then simply skip those two tests when targeting 32bit archs.
Yes, please.
Meanwhile: @ampli, I unleashed valgrind on this. I used the suppressions file here: https://github.com/opencog/atomspace/blob/master/lib/valgrind.python.suppressions and then I did this:
PYTHONPATH=/usr/local/./lib/python3.9/site-packages/linkgrammar valgrind --suppressions=valgrind.python.suppressions python3 tests.py
I get the following output:
valgrind: m_mallocfree.c:303 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 32, hi = 0.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata. If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away. Please try that before reporting this as a bug.
Thread 1: status = VgTs_Runnable (lwpid 16346)
==16346== at 0x4038C47: realloc (vg_replace_malloc.c:834)
==16346== by 0x52F633F: eliminate_duplicate_disjuncts (disjunct-utils.c:449)
==16346== by 0x53011C2: prepare_to_parse (preparation.c:207)
==16346== by 0x5300088: classic_parse (parse.c:373)
==16346== by 0x52DB82C: sentence_parse (api.c:711)
==16346== by 0x52B9F77: _wrap_sentence_parse (lg_python_wrap.cc:5855)
==16346== by 0x81791C9: ??? (in /usr/bin/python3.9)
==16346== by 0x8159E2E: _PyObject_MakeTpCall (in /usr/bin/python3.9)
==16346== by 0x8153CF0: _PyEval_EvalFrameDefault (in /usr/bin/python3.9)
==16346== by 0x8163FCF: _PyFunction_Vectorcall (in /usr/bin/python3.9)
==16346== by 0x8171AFD: ??? (in /usr/bin/python3.9)
==16346== by 0x8159D2B: _PyObject_MakeTpCall (in /usr/bin/python3.9)
Note the line number 449 is off, because I added some print statements: it should be line 444.
... and I found the bug. Dohhh. It's easy,
Fixed in commit d57493eed849eccd2c7ff6220d5997b24c2c7766
I will be spinning a version 5.10.2
if the Apple Homebrew folks verifiy the python install paths. (issue #1246) This new version will be very thin, just fixing this bug and the homebrew bug. Otherwise, shipping version 5.10.0 or 5.10.1 is just fine.
Closing, cause I think this is now resolved.
Sorry for not responding in a timely manner. This fix seems good. Meanwhile, by looking at the code around the fix I found ways to make it slightly faster (added to my TODO list).
Sorry, but release 5.10.1 with commits ca8c0e0 and d57493e succeeds on i386 but still fails on armel and armhf (other archs still uncertain): https://buildd.debian.org/status/package.php?p=link-grammar
Compared to original report, code now succeeds on ia32 arches (i386, hurd-i386, x32 - unknown about kfreebsd-i386), but continues to fail on armel, armhf and mipsel.
When you say "succeeds" I assume you mean "succeeds when the python test is disabled" ... The armel, armhf and mipsel builds fail for exactly the same reason as the ia32 builds. (Any 32-bit system will fail the same way.)
I can spin a version 5.10.2 with the required fix later today, maybe that is the easiest path?
By "succeeds" I mean building release 5.10.1 with commits ca8c0e0 and d57493e completes without failing its full testsuite. Here is the full build log for x32: https://buildd.debian.org/status/fetch.php?pkg=link-grammar&arch=x32&ver=5.10.1%7Edfsg-3&stamp=1631714789&raw=0
By "fails" I mean building release 5.10.1 with commits ca8c0e0 and d57493e does not complete without failing its testsuite. Here is the full build log for armhf: https://buildd.debian.org/status/fetch.php?pkg=link-grammar&arch=mipsel&ver=5.10.1%7Edfsg-3&stamp=1631714920&raw=0
(Any 32-bit system will fail the same way.)
No: armel, armhf and mipsel seemingly fails the same as they did with 5.10.0 and with 5.10.1 with no additional commits applied, and same as ia32 failed with 5.10.0 and 5.10.1 with no additional commits applied - but not the same as identical build for ia32, because ia32 builds now succeeds.
OK, so that is .. unexpected, and I cannot guess at the moment. Could you build with the following patch applied:
--- a/link-grammar/disjunct-utils.c
+++ b/link-grammar/disjunct-utils.c
@@ -441,6 +441,12 @@ Disjunct *eliminate_duplicate_disjuncts(Disjunct *dw, bool multi_string)
if (dx->num_categories == dx->num_categories_alloced - 1)
{
dx->num_categories_alloced *= 2;
+printf("duude addr=%p struct-sz=%zu ncat = %u newsize=%zu\n",
+dx->category,
+sizeof(*(dx->category)),
+dx->num_categories_alloced,
+sizeof(*(dx->category)) * dx->num_categories_alloced);
+fflush(stdout);
dx->category = realloc(dx->category,
sizeof(*(dx->category)) * dx->num_categories_alloced);
}
Another possibility would be this; it assumes that sizeof()
does not work as expected on these arches.
--- a/link-grammar/disjunct-utils.c
+++ b/link-grammar/disjunct-utils.c
@@ -441,8 +441,14 @@ Disjunct *eliminate_duplicate_disjuncts(Disjunct *dw, bool multi_string)
if (dx->num_categories == dx->num_categories_alloced - 1)
{
dx->num_categories_alloced *= 2;
+printf("duude addr=%p struct-sz=%zu ncat = %u newsize=%zu\n",
+dx->category,
+sizeof(Category_cost),
+dx->num_categories_alloced,
+sizeof(Category_cost) * dx->num_categories_alloced);
+fflush(stdout);
dx->category = realloc(dx->category,
- sizeof(*(dx->category)) * dx->num_categories_alloced);
+ sizeof(Category_cost) * dx->num_categories_alloced);
}
dassert((d->category[0].num > 0) && (d->category[0].num < 64*1024),
"Insane category %u", d->category[0].num);
Oops, I forget the fflush(stdout)
it will fail to print without that. Editing above post now.
It was also a problem with the initial size of dx->category
(for the same reason).
I tested it on armv7l virtual machine (armhf architecture).
... And is fixed in #1250 .. I will spin a 5.10.2 shortly.
... and version 5.10.2 is now available.
I can confirm that this issue no longer occurs: https://buildd.debian.org/status/logs.php?pkg=link-grammar&ver=5.10.2~dfsg-1
(another different but possibly related error occurs on powerpc - the failures on kfreebsd-amd64 and sh4 are Debian-specific).
I looked at the powerpc build log: https://buildd.debian.org/status/fetch.php?pkg=link-grammar&arch=powerpc&ver=5.10.2%7Edfsg-1&stamp=1631821070&raw=0 it says the multi-dict
unit test failed. This is not python... it tests the use of multiple dictionaries in multiple threads. No clue what the problem may be.
I no longer have a powerpc login on which to test. One of the universities - OSU-USL - Oklahoma State University Unix Systems Labs - use to provide free logins for porting software. @ampli you have worked magic on stuff like this .. are you using QEMU to do that?
Debian has hardware and I am happy to endorse requests to grant access to some of you as needed. Details of the procedure is at https://dsa.debian.org/doc/guest-account/ and general info on using the hardware is at https://wiki.debian.org/PorterBoxHowToUse
are you using QEMU to do that?
Yes. For the armhf architecture I used a Ubuntu cloud image with a virt-install
command line.
Closing. I retested on i386 Debian 11 aka bullseye and all unit tests pass.
Building link-grammar 5.10.0 on Debian fails on architectures armel, armhf, i386, mipsel, hurd-i386, kfreebsd-i386, and x32 - i.e. all little-endian 32-bit architectures if I am not mistaken. Overview here: https://buildd.debian.org/status/package.php?p=link-grammar (but please ignore kfreebsd-amd64 failing for a different Debian-specific reason). Excerpt from build log of armhf build:
Final message is same for mentioned architectures except i386-based ones. On i386, hurd-i386, and x32 it is
malloc(): invalid size (unsorted)
. On kfreebsd-i386 it is*** Error in
/usr/bin/python3': malloc(): memory corruption: 0x09361120 ***`.