vermaseren / form

The FORM project for symbolic manipulation of very big expressions
GNU General Public License v3.0
1.15k stars 136 forks source link

RFC: better debugging #526

Open jodavies opened 4 months ago

jodavies commented 4 months ago

Here are a few ideas which may help with debugging in case of crashes, particularly for longer running code and crashes which other people can't reproduce. The first uses eu-addr2line to print a stack trace in Terminate. The second replaces Terminate everywhere with a macro which inserts file, line and function information into the error message, so one can see where exactly Terminate was called. The third retains debugging symbols in the form build. This "probably" doesn't have a measurable performance impact (needs to be tested), but it means you get a useful trace from form crashes, and not just vorm.

Any thoughts?

coveralls commented 4 months ago

Coverage Status

coverage: 49.976% (-0.02%) from 49.999% when pulling 4dfffb182c7b59795578cd0f147574411743553c on jodavies:backtrace into 83e3d4185efca2e5938c665a6df9d67d6d9492ca on vermaseren:master.

jodavies commented 3 months ago

In the current state, a FORM crash looks like:

FORM 4.1 (May 31 2024, v4.1-20131025-641-g7062bd7)  Run: Tue Jun  4 10:49:26 2024
    Symbol x;
    CFunction f;
    Local test = 1/x;
    Multiply replace_(x,0);
    .end
Division by zero during normalization
Program terminating at crash.frm Line 4 -->
Terminate called from normal.c:4135 (Normalize). Backtrace:
#0: Terminate at startup.c:1799:20
#1: Normalize at normal.c:4139:2
#2: Generator at proces.c:3272:20
#3: Generator at proces.c:4216:8
#4: Generator at proces.c:4216:8
#5: Processor at proces.c:406:11
#6: DoExecute at execute.c:858:67
#7: PreProcessor at pre.c:1041:17
#8: main at startup.c:1692:2
#9: __libc_start_call_main at libc_start_call_main.h:58:16
#10: call_init at libc-start.c:128:20
#10:  (inlined by) __libc_start_main_impl at libc-start.c:379:5
#11: _start at ??:0
  0.00 sec out of 0.29 sec

or

TFORM 4.1 (May 31 2024, v4.1-20131025-641-g7062bd7) 4 workers  Run: Tue Jun  4 10:51:03 2024
    Symbol x;
    CFunction f;
    Local test = 1/x;
    Multiply replace_(x,0);
    .end
Division by zero during normalization
Program terminating in thread 1 at crash.frm Line 4 -->
Terminate called from normal.c:4135 (Normalize). Backtrace:
#0: Terminate at startup.c:1799:20
#1: Normalize at normal.c:4138:2
#2: Generator at proces.c:3272:20
#3: Generator at proces.c:4216:8
#4: Generator at proces.c:4216:8
#5: RunThread at threads.c:1393:10
#6: start_thread at pthread_create.c:442:8
#7: __clone3 at clone3.S:83
  0.00 sec + 0.00 sec: 0.00 sec out of 0.23 sec
jodavies commented 3 months ago

I cleaned this up a bit more, eg, it doesn't print any functions below main (or start_thread for tform).

I don't see any performance regressions due to keeping debug symbols and frame pointers in the release build.

My opinion would be to always print the trace by default, but if anyone strongly opposes this I would add a setup parameter to control it.

tueda commented 3 months ago

It doesn't look nice when eu-addr2line is not available...

TFORM 5.0.0-beta.1 (Jun 13 2024, v5.0.0-beta.1-69-g4c9b872) 4 workers  Run: Thu Jun 13 19:26:48 2024
    Symbol x;
    CFunction f;
    Local test = 1/x;
    Multiply replace_(x,0);
    .end
Division by zero during normalization
Program terminating in thread 1 at test.frm Line 4 -->
Terminate called from ../../sources/normal.c:4135 (Normalize)
Backtrace:
# 0: sh: 1: eu-addr2line: not found
# 1: sh: 1: eu-addr2line: not found
# 2: sh: 1: eu-addr2line: not found
# 3: sh: 1: eu-addr2line: not found
# 4: sh: 1: eu-addr2line: not found
# 5: sh: 1: eu-addr2line: not found
# 6: sh: 1: eu-addr2line: not found
# 7: sh: 1: eu-addr2line: not found
  0.00 sec + 0.00 sec: 0.00 sec out of 0.08 sec

I wonder if we really need to replace Terminate with TERMINATE. In principle, Terminate can be macronized so that it calls the original Terminate (renamed to, say, TerminateImpl) using the current source file name and the line number. (I'm not sure which is better.)

jodavies commented 3 months ago

For TERMINATE I was vaguely following the "convention" that macros have uppercase names, but in the form sources it is true that plenty of macros do not have this. For me it would be fine to rename TERMINATE as you suggest.

As for eu-addr2line, we could check for it (with which eu-addr2line? Edit: command can be used for this, and is POSIX?) and print a useful message that the user might want to install it.

tueda commented 3 months ago

I think command -v is in POSIX.

Another question would be: do we want FORM to depend on glibc? backtrace is neither POSIX nor Linux system call.

jodavies commented 3 months ago

FORM already links libc from somewhere, I did not need to provide the linker with -lc. Whether this is actually a required link, I am not sure.

tueda commented 3 months ago

I'm not referring to libc, but rather to glibc, which is usually linked to programs on GNU/Linux. But, there are other implementations like musl libc on Alpine Linux.

jodavies commented 3 months ago

I see, sorry. Then as I understand the "proper" way to do this is for the configure script to check for backtrace, and then we include this code only on platforms where it exists?

tueda commented 3 months ago

Yes, I think so. Or, optionally, we can consider a new configure script switch like --enable-backtrace, which is automatically enabled if backtrace is available.

tueda commented 3 months ago

About Terminate vs. TERMINATE, I would like to hear others' opinions.

jodavies commented 3 months ago

These should implement the changes we discussed. The last can be reverted if TERMINATE is preferred after all.

jodavies commented 3 months ago

I thought the last commit broke the static build but actually it went wrong before. But github gave it a green tick regardless...? Anyway it looks like https://sourceware.org/bugzilla/show_bug.cgi?id=15648

If I link with -lpthread instead of -pthread it works but I am not really sure what the difference is there.

tueda commented 3 months ago

The multiple __lll_lock_wait_private issue will be resolved if we bump the Ubuntu version from 20.04 to 22.04, where the glibc version is 2.35.

jodavies commented 3 months ago

This is true, but the build will still be broken for users who want to do their own static build (for whatever reason) on older systems. Edit: though of course in this (rare?) situation one can just --disable-backtrace.