rtoy / maxima

A Clone of Maxima's repo

Repeated run_testsuite leads to out of memory error #4206

Open rtoy opened 4 months ago

rtoy commented 4 months ago

Imported from SourceForge on 2024-07-09 20:03:24 Created by robert_dodier on 2023-12-29 23:34:22 Original: https://sourceforge.net/p/maxima/bugs/4236


Working with Maxima built from commit 3a26747 with SBCL 2.3.7 on Ubuntu 16.04. Running the test suite twice, the second time with share_tests = true, leads to an out-of-memory error in either rtest_ctensor or rtest_itensor (which one seems to vary).

(%i1) run_testsuite (); run_testsuite (share_tests = true);
Testsuite run for SBCL 2.3.7:
Running tests in rtest_sqdnst: 13/13 tests passed
Running tests in rtest_extensions: 18/18 tests passed
Running tests in rtest_rules: 210/210 tests passed
[... etc etc ...]
Running tests in rtest_ilt: 31/31 tests passed
Running tests in ulp_tests: 63/63 tests passed

No unexpected errors found out of 13,463 tests.
Evaluation took:
  108.034 seconds of real time
  107.751332 seconds of total run time (102.396675 user, 5.354657 system)
  [ Real times consist of 4.411 seconds GC time, and 103.623 seconds non-GC time. ]
  [ Run times consist of 4.406 seconds GC time, and 103.346 seconds non-GC time. ]
  99.74% CPU
  9,620 forms interpreted
  12,149 lambdas converted
  248,913,931,560 processor cycles
  37,916,342,224 bytes consed

(%o0)                                done
(%i1) Testsuite run for SBCL 2.3.7:
Running tests in rtest_sqdnst: 13/13 tests passed
Running tests in rtest_extensions: 18/18 tests passed
Running tests in rtest_rules: 210/210 tests passed
[... etc etc ...]
Running tests in rtest_bernstein: 44/44 tests passed
Running tests in rtest_atensor: 20/20 tests passed
Running tests in rtest_ctensor: Thread local storage exhausted.
fatal error encountered in SBCL pid 15290 tid 15290:
%PRIMITIVE HALT called; the party is over.

Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb>    

I tried some variations of run_testsuite, and that combination is the one I found that seems to reproduce the error repeatably.

I haven't tried to figure out what operation in rtest_ctensor or rtest_itensor is the immediate cause of the out of memory error. Possibly simplification rules? Just a wild guess.

rtoy commented 4 months ago

Imported from SourceForge on 2024-07-09 20:03:25 Created by robert_dodier on 2023-12-30 06:46:52 Original: https://sourceforge.net/p/maxima/bugs/4236/#9ff2


Yeah, I think I've bumped into similar errors before. I'd like to at least identify the source of the error, even if it's something that we can't or won't fix.

I bumped into this error because I've been doing a lot of testing for the Unicode pretty printer. It is a bit of a nuisance to run into unrelated errors while trying to test some new code ...

rtoy commented 4 months ago

Imported from SourceForge on 2024-07-09 20:03:28 Created by robert_dodier on 2023-12-30 06:47:45 Original: https://sourceforge.net/p/maxima/bugs/4236/#14ad


rtoy commented 4 months ago

Imported from SourceForge on 2024-07-09 20:03:32 Created by robert_dodier on 2023-12-30 06:47:45 Original: https://sourceforge.net/p/maxima/bugs/4236/#dcc4


I've edited the title to omit mention of the tensor tests.

rtoy commented 4 months ago

Imported from SourceForge on 2024-07-09 20:03:36 Created by robert_dodier on 2023-12-30 06:48:14 Original: https://sourceforge.net/p/maxima/bugs/4236/#5911


rtoy commented 4 months ago

Imported from SourceForge on 2024-07-09 20:03:39 Created by peterpall on 2023-12-30 06:52:37 Original: https://sourceforge.net/p/maxima/bugs/4236/#a64b/69a7


My guess is that this out-of-memory error actually indicates a real problem, even if I don't know whether it should be called a bug: Maxima loves special variables, but SBCL reserves only a small amount of memory for the thread-local storage they are placed in. In my daily work I run out of thread-local storage every few months, and matchdeclare seems to cause such out-of-memory errors quickly.
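The hypothesis above can be made concrete with a toy model in bash (4.0 or later, for associative arrays). It assumes, without having checked SBCL's sources, that each distinct symbol that is ever dynamically bound permanently claims one slot of a fixed-size table. Under that assumption, rebinding the same special variable costs nothing, while freshly generated symbols leak slots. `bind_special`, `run_suite` and the slot counts are hypothetical and chosen only for illustration.

```shell
#!/usr/bin/env bash
# Toy model, NOT SBCL's implementation: assumes every distinct symbol that is
# ever dynamically bound takes one slot of a fixed-size thread-local storage
# table and keeps that slot forever.

TLS_LIMIT=4096          # slots in the toy table (the default mentioned in this thread)
declare -A TLS_INDEX=() # symbol name -> slot number
NEXT_FREE=0             # first unused slot

# bind_special NAME: give NAME a slot if it has none; fail when the table is full.
bind_special() {
    local name=$1
    if [[ -n ${TLS_INDEX[$name]+set} ]]; then
        return 0        # already has a slot, so rebinding costs nothing
    fi
    if (( NEXT_FREE >= TLS_LIMIT )); then
        echo "Thread local storage exhausted." >&2
        return 1
    fi
    TLS_INDEX[$name]=$NEXT_FREE
    NEXT_FREE=$(( NEXT_FREE + 1 ))
}

# run_suite RUN FIXED FRESH: one simulated test-suite run that rebinds FIXED
# ordinary specials and binds FRESH newly generated names (hypothetical
# stand-ins for symbols created anew on every run).
run_suite() {
    local run=$1 fixed=$2 fresh=$3 i
    for (( i = 0; i < fixed; i++ )); do
        bind_special "special-$i" || return 1
    done
    for (( i = 0; i < fresh; i++ )); do
        bind_special "gensym-$run-$i" || return 1
    done
    return 0
}

for run in 1 2; do
    if run_suite "$run" 1000 2000; then
        echo "run $run ok: $NEXT_FREE of $TLS_LIMIT slots used"
    else
        echo "run $run failed: $NEXT_FREE of $TLS_LIMIT slots used"
    fi
done
```

When run as is, it prints `run 1 ok: 3000 of 4096 slots used` and then `run 2 failed: 4096 of 4096 slots used`, plus the exhaustion message on stderr. The first run fits and the second exhausts the table, which matches the shape of the failure reported here. With no fresh names per run (for example `run_suite N 3000 0`), any number of runs would fit. That is why it matters whether symbols are really generated per run.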

SBCL 1.5.2 claims to have raised the size of that storage to 4096 entries and added a command-line switch (--tls-limit) that allows increasing it further. 4096 slots for both variables and leaked entries doesn't look like much, so perhaps our build system should raise that number if SBCL is new enough to understand that command-line switch.

rtoy commented 4 months ago

Imported from SourceForge on 2024-07-09 20:03:43 Created by peterpall on 2023-12-30 07:18:39 Original: https://sourceforge.net/p/maxima/bugs/4236/#5911/d256


configure.ac already seems to raise SBCL's limit on thread-local symbols to a value that allows the test suite to run once:

# The default of 4096 is sometimes too little for the test suite.
if test x"${sbcl}" = xtrue ; then
   AC_MSG_CHECKING(if sbcl complains if we try to enlarge the thread-local storage)
   echo "(quit)" | ${SBCL_NAME} --tls-limit 8192 > /dev/null 2>&1
   if test "$?" = "0" ; then
    SBCL_EXTRA_ARGS="--tls-limit 8192"
    AC_MSG_RESULT(Yes)
   else
    SBCL_EXTRA_ARGS=""
    AC_MSG_RESULT(No)
   fi
fi

The question now is: should we increase that value from 8192 to 16384, or should we try to find out whether we are unnecessarily generating such symbols somewhere, so that we can fix this problem at its root?
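If the project did decide to try a larger value, one low-risk way to experiment is to keep the existing probe but make the limit a parameter. The bash sketch below rewrites that probe under stated assumptions. The function names and the `TLS_LIMIT` environment variable are hypothetical, not existing options. It only checks that SBCL accepts the flag. It does not show that a larger limit actually fixes the repeated test-suite run.

```shell
#!/usr/bin/env bash
# Sketch only: the same probe as the configure.ac check, but with the limit
# passed in, so 8192 and 16384 can be compared without editing the check.
# TLS_LIMIT is a hypothetical environment variable, not an existing option.

# sbcl_accepts_tls_limit SBCL_CMD LIMIT: succeed if SBCL starts and quits
# cleanly when given "--tls-limit LIMIT".
sbcl_accepts_tls_limit() {
    echo "(quit)" | "$1" --tls-limit "$2" > /dev/null 2>&1
}

# sbcl_extra_args SBCL_CMD LIMIT: print the extra SBCL arguments (or nothing).
sbcl_extra_args() {
    if sbcl_accepts_tls_limit "$1" "$2"; then
        echo "--tls-limit $2"
    fi
}

SBCL_EXTRA_ARGS=$(sbcl_extra_args "${SBCL_NAME:-sbcl}" "${TLS_LIMIT:-8192}")
echo "SBCL_EXTRA_ARGS=\"$SBCL_EXTRA_ARGS\""
```

Saved as a script and run with `TLS_LIMIT=16384`, it would print `SBCL_EXTRA_ARGS="--tls-limit 16384"` if the installed SBCL accepts the flag. Otherwise it prints an empty value, which mirrors the fallback branch in configure.ac.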

rtoy commented 4 months ago

Imported from SourceForge on 2024-07-09 20:03:46 Created by robert_dodier on 2023-12-30 07:22:20 Original: https://sourceforge.net/p/maxima/bugs/4236/#a64b/69a7/f139


I think we need to know in more detail what is going on with thread-local storage for special variables, or something like that. If that is the origin of the problem, or at least a contributing factor, it should be possible to measure the storage allocation every now and then (I don't know how to do that for SBCL, but I assume it is possible) and show that it increases until it fails with an error.
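Suppose some per-run measure of storage use becomes available. This sketch leaves open how to read that measure from SBCL, just as the comment above does. Checking the suspected pattern is then simple arithmetic: the samples should strictly increase, and the average growth per run gives a rough count of how many more runs fit under the limit. The bash function below is a hypothetical helper that does only that. `check_growth` and its output format are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: given usage samples taken after each test-suite run (how to obtain
# them from SBCL is left open), report whether usage strictly grows and, if
# so, roughly how many more runs fit before LIMIT is reached.

# check_growth LIMIT SAMPLE...   (needs at least two samples)
check_growth() {
    local limit=$1
    shift
    if (( $# < 2 )); then
        echo "need at least two samples" >&2
        return 2
    fi
    local first=$1 prev="" s
    for s in "$@"; do
        if [[ -n $prev ]] && (( s <= prev )); then
            echo "not growing: $prev -> $s"
            return 1
        fi
        prev=$s
    done
    # Average growth per run, rounded down; at least 1 because samples strictly increase.
    local per_run=$(( (prev - first) / ($# - 1) ))
    local runs_left=$(( (limit - prev) / per_run ))
    echo "grows by ~$per_run per run; ~$runs_left more run(s) before $limit"
}
```

For example, with made-up numbers, `check_growth 4096 1200 2300 3400` prints `grows by ~1100 per run; ~0 more run(s) before 4096`. Samples that level off would instead be reported as `not growing`, which would point away from a per-run leak.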

rtoy commented 4 months ago

Imported from SourceForge on 2024-07-09 20:03:50 Created by peterpall on 2023-12-30 11:00:12 Original: https://sourceforge.net/p/maxima/bugs/4236/#5911/d256/3d8d


Weird: in my local Maxima copy I increased the size of the thread-local storage from 8192 to 16384. Now running the test suite a second time causes a call stack overflow, so we might be trapped in an infinitely recursive function call or something similar.

rtoy commented 4 months ago

Imported from SourceForge on 2024-07-09 20:03:53 Created by robert_dodier on 2023-12-30 18:13:21 Original: https://sourceforge.net/p/maxima/bugs/4236/#5911/d256/47da


> The question now is: should we increase that value from 8192 to 16384, or should we try to find out whether we are unnecessarily generating such symbols somewhere, so that we can fix this problem at its root?

It is quite unclear what exactly is going on, so it is too early to start adjusting configuration parameters in hopes of avoiding the problem.

If you are interested in pursuing the possibility that thread-local memory allocation, generated symbols, or any other specific cause is at the root of the problem, please investigate with whatever tools are available (I don't know what those might be) and report what you find, and we'll go from there.