opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1
388 stars 119 forks source link

python test fails on little-endian 32bit archs: realloc(): invalid next size #1247

Closed jonassmedegaard closed 6 months ago

jonassmedegaard commented 3 years ago

Building link-grammar 5.10.0 on Debian fails on architectures armel, armhf, i386, mipsel, hurd-i386, kfreebsd-i386, and x32 - i.e. all little-endian 32-bit architectures if I am not mistaken. Overview here: https://buildd.debian.org/status/package.php?p=link-grammar (but please ignore kfreebsd-amd64 failing for a different Debian-specific reason). Excerpt from build log of armhf build:

Running by: /usr/bin/python3
Running ./tests.py in: /<<PKGBUILDDIR>>/bindings/python-examples
PYTHONPATH=./../python:../python:../python/.libs
srcdir=.
LINK_GRAMMAR_DATA=./../../data
Compiled with: gcc __VERSION__="10.3.0"  
OS: linux-gnueabihf __unix__ 
Standards: __STDC_VERSION__=201112L _POSIX_C_SOURCE=200809L 
Configuration (source code):
    CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2
    CFLAGS=-D_DEFAULT_SOURCE -std=c11 -D_BSD_SOURCE -D_SVID_SOURCE -D_GNU_SOURCE -D_ISOC11_SOURCE -g -ffile-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -O3 -fvisibility=hidden
Configuration (features):
    DICTIONARY_DIR=/usr/share/link-grammar
    -DPACKAGE_NAME="link-grammar" -DPACKAGE_TARNAME="link-grammar" -DPACKAGE_VERSION="5.10.0" -DPACKAGE_STRING="link-grammar 5.10.0" -DPACKAGE_BUGREPORT="https://github.com/opencog/link-grammar" -DPACKAGE_URL="https://www.abisource.com/projects/link-grammar" -DPACKAGE="link-grammar" -DVERSION="5.10.0" -DYYTEXT_POINTER=1 -DHAVE_STDIO_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_STRINGS_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_UNISTD_H=1 -DHAVE_THREADS_H=1 -DSTDC_HEADERS=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=".libs/" -DHAVE_STRNDUP=1 -DHAVE_STRTOK_R=1 -DHAVE_SIGACTION=1 -DHAVE_ALIGNED_ALLOC=1 -DHAVE_POSIX_MEMALIGN=1 -DHAVE_ALLOCA_H=1 -DHAVE_ALLOCA=1 -DHAVE_FORK=1 -DHAVE_VFORK=1 -DHAVE_WORKING_VFORK=1 -DHAVE_WORKING_FORK=1 -DHAVE_PRCTL=1 -D__STDC_FORMAT_MACROS=1 -D__STDC_LIMIT_MACROS=1 -DTLS=_Thread_local -DHAVE_PTHREAD_PRIO_INHERIT=1 -DHAVE_PTHREAD=1 -DHAVE_VISIBILITY=1 -DHAVE_LOCALE_T_IN_LOCALE_H=1 -DHAVE_STDATOMIC_H=1 -DUSE_WORDGRAPH_DISPLAY=1 -DHAVE_SQLITE3=1 -DHAVE_HUNSPELL=1 -DHUNSPELL_DICT_DIR="/usr/share/hunspell" -DHAVE_EDITLINE=1 -DHAVE_WIDECHAR_EDITLINE=1 -DHAVE_PCRE2_H=1 -DHAVE_MAYBE_UNINITIALIZED=1 -DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1 -DSTRERROR_R_CHAR_P=1
Using <module 'linkgrammar' from '/<<PKGBUILDDIR>>/bindings/python/linkgrammar.py'>
Using <module 'clinkgrammar' from '/<<PKGBUILDDIR>>/bindings/python/clinkgrammar.py'>
Using <module '_clinkgrammar' from '/<<PKGBUILDDIR>>/bindings/python/.libs/_clinkgrammar.so'>
Using <module 'lg_testutils' from '/<<PKGBUILDDIR>>/bindings/python-examples/lg_testutils.py'>
............................................s...............s.s.......................s.s..realloc(): invalid next size
FAIL tests.py (exit status: 134)

Final message is same for mentioned architectures except i386-based ones. On i386, hurd-i386, and x32 it is malloc(): invalid size (unsorted). On kfreebsd-i386 it is *** Error in/usr/bin/python3': malloc(): memory corruption: 0x09361120 ***`.

linas commented 3 years ago

Huh. I can't imagine. One problem is that we run the python tests before running the plain-C tests; I would like to know if the plain-C tests pass or not. If they passed, I would then point my finger at something in python and/or swig, which is generating the python bindings.

I guess that the way to test this is to build a container with a 32-bit distro in it.

BTW, just before spotting this issue, I spun a brand new version 5.10.1 to fix a break when the perl bindings are enabled (they should probably remain disabled, since most users should probably be using the Lingua::LinkParser bindings instead.)

ampli commented 3 years ago

I will check that ASAP. (In the last time a similar thing happened on 32-bit CPUs, it was due to an uninitialized struct field that happened to contain zero on 64-bit due to different field sizes and the use of unions).

jonassmedegaard commented 3 years ago

@linas not sure why you mention the 5.10.1 release in relation to this issue (seems unrelated to me), but in case it is helpful, that newer release seems to fails just the same. Details are at https://buildd.debian.org/status/package.php?p=link-grammar

jonassmedegaard commented 3 years ago

Please do tell if you guys think this is an issue with the testsuite rather than with the link-grammar code itself - then I can simply ignore the failures on 32bit architectures for now.

linas commented 3 years ago

OK, So ... I checked. Indeed, I see exactly the same crash. It is occurring in a test of some brand-new code. Stubbing out that test: YGenerationTestCase lines 1099 to 1111, here: https://github.com/opencog/link-grammar/blob/a613961489bbeaf4333a53c6139ab6a9adb363a9/bindings/python-examples/tests.py#L1099-L1111 allows the unit test to pass. Basically, link-grammar works fine on 32-bit, except for something strange in some brand new code that no one except me uses.

The actual malloc crash is quite bizzare, as the arguments passed to it appear to be just fine. @ampli I'm stumped as to what the actual bug is.

jonassmedegaard commented 3 years ago

Thanks, Linas.

I will then simply skip those two tests when targeting 32bit archs.

--

linas commented 3 years ago

I will then simply skip those two tests when targeting 32bit archs.

Yes, please.

Meanwhile: @ampli, I unleashed valgrind on this. I used the suppressions file here: https://github.com/opencog/atomspace/blob/master/lib/valgrind.python.suppressions and then I did this:

PYTHONPATH=/usr/local/./lib/python3.9/site-packages/linkgrammar valgrind --suppressions=valgrind.python.suppressions python3 tests.py

I get the following output:

valgrind: m_mallocfree.c:303 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 32, hi = 0.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata.  If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away.  Please try that before reporting this as a bug.

Thread 1: status = VgTs_Runnable (lwpid 16346)
==16346==    at 0x4038C47: realloc (vg_replace_malloc.c:834)
==16346==    by 0x52F633F: eliminate_duplicate_disjuncts (disjunct-utils.c:449)
==16346==    by 0x53011C2: prepare_to_parse (preparation.c:207)
==16346==    by 0x5300088: classic_parse (parse.c:373)
==16346==    by 0x52DB82C: sentence_parse (api.c:711)
==16346==    by 0x52B9F77: _wrap_sentence_parse (lg_python_wrap.cc:5855)
==16346==    by 0x81791C9: ??? (in /usr/bin/python3.9)
==16346==    by 0x8159E2E: _PyObject_MakeTpCall (in /usr/bin/python3.9)
==16346==    by 0x8153CF0: _PyEval_EvalFrameDefault (in /usr/bin/python3.9)
==16346==    by 0x8163FCF: _PyFunction_Vectorcall (in /usr/bin/python3.9)
==16346==    by 0x8171AFD: ??? (in /usr/bin/python3.9)
==16346==    by 0x8159D2B: _PyObject_MakeTpCall (in /usr/bin/python3.9)

Note the line number 449 is off, because I added some print statements: it should be line 444.

... and I found the bug. Dohhh. It's easy,

linas commented 3 years ago

Fixed in commit d57493eed849eccd2c7ff6220d5997b24c2c7766

I will be spinning a version 5.10.2 if the Apple Homebrew folks verifiy the python install paths. (issue #1246) This new version will be very thin, just fixing this bug and the homebrew bug. Otherwise, shipping version 5.10.0 or 5.10.1 is just fine.

Closing, cause I think this is now resolved.

ampli commented 3 years ago

Sorry for not responding in a timely manner. This fix seems good. Meanwhile, by looking at the code around the fix I found ways to make it slightly faster (added to my TODO list).

jonassmedegaard commented 3 years ago

Sorry, but release 5.10.1 with commits ca8c0e0 and d57493e succeeds on i386 but still fails on armel and armhf (other archs still uncertain): https://buildd.debian.org/status/package.php?p=link-grammar

jonassmedegaard commented 3 years ago

Compared to original report, code now succeeds on ia32 arches (i386, hurd-i386, x32 - unknown about kfreebsd-i386), but continues to fail on armel, armhf and mipsel.

linas commented 3 years ago

When you say "succeeds" I assume you mean "succeeds when the python test is disabled" ... The armel, armhf and mipsel builds fail for exactly the same reason as the ia32 builds. (Any 32-bit system will fail the same way.)

I can spin a version 5.10.2 with the required fix later today, maybe that is the easiest path?

jonassmedegaard commented 3 years ago

By "succeeds" I mean building release 5.10.1 with commits ca8c0e0 and d57493e completes without failing its full testsuite. Here is the full build log for x32: https://buildd.debian.org/status/fetch.php?pkg=link-grammar&arch=x32&ver=5.10.1%7Edfsg-3&stamp=1631714789&raw=0

By "fails" I mean building release 5.10.1 with commits ca8c0e0 and d57493e does not complete without failing its testsuite. Here is the full build log for armhf: https://buildd.debian.org/status/fetch.php?pkg=link-grammar&arch=mipsel&ver=5.10.1%7Edfsg-3&stamp=1631714920&raw=0

(Any 32-bit system will fail the same way.)

No: armel, armhf and mipsel seemingly fails the same as they did with 5.10.0 and with 5.10.1 with no additional commits applied, and same as ia32 failed with 5.10.0 and 5.10.1 with no additional commits applied - but not the same as identical build for ia32, because ia32 builds now succeeds.

linas commented 3 years ago

OK, so that is .. unexpected, and I cannot guess at the moment. Could you build with the following patch applied:

--- a/link-grammar/disjunct-utils.c
+++ b/link-grammar/disjunct-utils.c
@@ -441,6 +441,12 @@ Disjunct *eliminate_duplicate_disjuncts(Disjunct *dw, bool multi_string)
                                if (dx->num_categories == dx->num_categories_alloced - 1)
                                {
                                        dx->num_categories_alloced *= 2;
+printf("duude addr=%p struct-sz=%zu ncat = %u newsize=%zu\n",
+dx->category,
+sizeof(*(dx->category)),
+dx->num_categories_alloced,
+sizeof(*(dx->category)) * dx->num_categories_alloced);
+fflush(stdout);
                                        dx->category = realloc(dx->category,
                                           sizeof(*(dx->category)) * dx->num_categories_alloced);
                                }

Another possibility would be this; it assumes that sizeof() does not work as expected on these arches.

--- a/link-grammar/disjunct-utils.c
+++ b/link-grammar/disjunct-utils.c
@@ -441,8 +441,14 @@ Disjunct *eliminate_duplicate_disjuncts(Disjunct *dw, bool multi_string)
                                if (dx->num_categories == dx->num_categories_alloced - 1)
                                {
                                        dx->num_categories_alloced *= 2;
+printf("duude addr=%p struct-sz=%zu ncat = %u newsize=%zu\n",
+dx->category,
+sizeof(Category_cost),
+dx->num_categories_alloced,
+sizeof(Category_cost) * dx->num_categories_alloced);
+fflush(stdout);
                                        dx->category = realloc(dx->category,
-                                          sizeof(*(dx->category)) * dx->num_categories_alloced);
+                                          sizeof(Category_cost) * dx->num_categories_alloced);
                                }
                                dassert((d->category[0].num > 0) && (d->category[0].num < 64*1024),
                                        "Insane category %u", d->category[0].num);
linas commented 3 years ago

Oops, I forget the fflush(stdout) it will fail to print without that. Editing above post now.

ampli commented 3 years ago

It was also a problem with the initial size of dx->category (for the same reason). I tested it on armv7l virtual machine (armhf architecture).

linas commented 3 years ago

... And is fixed in #1250 .. I will spin a 5.10.2 shortly.

linas commented 3 years ago

... and version 5.10.2 is now available.

jonassmedegaard commented 3 years ago

I can confirm that this issue no longer occurs: https://buildd.debian.org/status/logs.php?pkg=link-grammar&ver=5.10.2~dfsg-1

(another different but possibly related error occurs on powerpc - the failures on kfreebsd-amd64 and sh4 are Debian-specific).

linas commented 3 years ago

I looked at the powerpc build log: https://buildd.debian.org/status/fetch.php?pkg=link-grammar&arch=powerpc&ver=5.10.2%7Edfsg-1&stamp=1631821070&raw=0 it says the multi-dict unit test failed. This is not python... it tests the use of multiple dictionaries in multiple threads. No clue what the problem may be.

I no longer have a powerpc login on which to test. One of the universities - OSU-USL - Oklahoma State University Unix Systems Labs - use to provide free logins for porting software. @ampli you have worked magic on stuff like this .. are you using QEMU to do that?

jonassmedegaard commented 3 years ago

Debian has hardware and I am happy to endorse requests to grant access to some of you as needed. Details of the procedure is at https://dsa.debian.org/doc/guest-account/ and general info on using the hardware is at https://wiki.debian.org/PorterBoxHowToUse

ampli commented 3 years ago

are you using QEMU to do that?

Yes. For the armhf architecture I used a Ubuntu cloud image with a virt-install command line.

linas commented 6 months ago

Closing. I retested on i386 Debian 11 aka bullseye and all unit tests pass.