vermaseren / form

The FORM project for symbolic manipulation of very big expressions
GNU General Public License v3.0

Mincer crash #214

Open jodavies opened 7 years ago

jodavies commented 7 years ago

While testing various disk/compression settings with mincer, I get a strange crash:

Time =      17.38 sec    Generated terms =    4700985
            d65c         Terms in output =    3136630
      Prepare first loop Bytes used      = 1181651608
Program terminating in thread 1 at ACCU2 Line 8 --> 
Floating point exception (core dumped)

This is using the mincer test in the speedtest dir. I can provide the files if you don't have them. The crash happens for #define POW "7" and higher, but ONLY if compression is switched off: I comment out the on,compress,gzip,6; line in treatgzgz.prc and add Off compress; at the top of calcdia.frm.

It runs fine with On Compress; in calcdia.frm.

Sorry, this is probably not a particularly useful bug report. Can someone try to reproduce this? I have the same behavior on more than one machine. FORM 4.2.0.

EDIT: v4.1-20131025 also crashes.

jodavies commented 7 years ago

GDB spam for tvorm: (seems weird, I don't see how one can get SIGFPE from any code in Simplify...)

Time =      22.33 sec    Generated terms =    4700985
            d65c         Terms in output =    3136630
      Prepare first loop Bytes used      = 1181651608
Error while reading scratch file in GetOneTerm (3)
Program terminating in thread 7 at ACCU2 Line 8 --> 

Thread 3 "tvorm-git" received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7ffe9e04a700 (LWP 3760)]
0x00000000004d74f4 in Simplify (B=B@entry=0x7ffe340008c0, a=a@entry=0x7ffe34332068, na=na@entry=0x7ffe9e0498e4, 
    b=b@entry=0x7ffe34380288, nb=nb@entry=0x7ffe9e0498d8) at reken.c:555
555                             do { y1 = y2 % y3; y2 = y3; } while ( ( y3 = y1 ) != 0 );
(gdb) bt
#0  0x00000000004d74f4 in Simplify (B=B@entry=0x7ffe340008c0, a=a@entry=0x7ffe34332068, na=na@entry=0x7ffe9e0498e4, 
    b=b@entry=0x7ffe34380288, nb=nb@entry=0x7ffe9e0498d8) at reken.c:555
#1  0x00000000004d846b in MulRat (B=B@entry=0x7ffe340008c0, a=<optimised out>, na=<optimised out>, b=<optimised out>, 
    b@entry=0x7ffe1c2fc4e0, nb=<optimised out>, nb@entry=1, c=c@entry=0x7ffe1c2fc554, nc=0x7ffe9e049974) at reken.c:428
#2  0x00000000004c0b6d in PrepPoly (B=B@entry=0x7ffe340008c0, term=term@entry=0x7ffe1c2fc390, par=par@entry=0) at proces.c:4892
#3  0x00000000004c276a in Generator (B=B@entry=0x7ffe340008c0, term=term@entry=0x7ffe1c2fc390, level=74, level@entry=73)
    at proces.c:3132
#4  0x00000000004ca27d in Deferred (B=B@entry=0x7ffe340008c0, term=term@entry=0x7ffe1c2ae190, level=73) at proces.c:4596
#5  0x00000000004c2692 in Generator (B=B@entry=0x7ffe340008c0, term=term@entry=0x7ffe1c2ae190, level=73) at proces.c:3115
#6  0x00000000004c443f in Generator (B=B@entry=0x7ffe340008c0, term=term@entry=0x7ffe1c2ae110, level=61) at proces.c:3931
#7  0x00000000004c443f in Generator (B=B@entry=0x7ffe340008c0, term=term@entry=0x7ffe1c287010, level=2, level@entry=0) at proces.c:3931
#8  0x0000000000522af5 in RunThread (dummy=<optimised out>) at threads.c:1384
#9  0x00007ffff6e866ba in start_thread (arg=0x7ffe9e04a700) at pthread_create.c:333
#10 0x00007ffff6bbc3dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

It also crashes with form (takes 10 mins or so). I will start running with valgrind...

vermaseren commented 7 years ago

Little guess: division by zero. Or: y3 = 0. The main question: how can that happen?

Cheers

Jos

jodavies commented 7 years ago

Well, those variables are all WORD, right? Integer divisions cannot cause a SIGFPE. My valgrind run will probably need to go overnight; I will let you know if it shows anything interesting.

vermaseren commented 7 years ago

I do not know which circuits on the chip execute that division, or which exception is raised when there is an error like division by zero. I could imagine, though, that all such errors end up in the same interrupt routine. One would have to look at the machine (assembler) code to see which instruction is executed. In the past there were chips that did some of the integer stuff in floating point. Maybe that happens here as well (although I doubt that).

Jos

tueda commented 7 years ago

Actually, SIGFPE can be raised by an integer zero division. https://en.wikipedia.org/wiki/Unix_signal#Relationship_with_hardware_exceptions says

For example, if a process attempted integer divide by zero on an x86 CPU, a divide error exception would be generated and cause the kernel to send the SIGFPE signal to the process.

I guess a floating-point number zero division may just give an infinity.

vermaseren commented 7 years ago

From asm-generic/siginfo.h /*

This seems to indicate that SIGFPE is a generic flag, covering 8 cases.

Jos

jodavies commented 7 years ago

Valgrind just gave


Time =   21142.30 sec    Generated terms =    4700985
            d65c         Terms in output =    3136630
      Prepare first loop Bytes used      = 1181651608
==20395== Warning: set address range perms: large range [0x572af040, 0x74f85540) (defined)
==20395== Warning: set address range perms: large range [0x572af040, 0x74f85540) (defined)
Program terminating at ACCU2 Line 8 -->
==20395== Invalid read of size 4
==20395==    at 0x506556: Crash (tools.c:3711)
==20395==    by 0x4EB9DE: Terminate (startup.c:1707)
==20395==    by 0x4EC0E6: onErrSig (startup.c:1476)
==20395==    by 0x5BAA4AF: ??? (in /lib/x86_64-linux-gnu/libc-2.23.so)
==20395==    by 0x4CCE0A: Simplify (reken.c:553)
==20395==    by 0x4CDC5D: MulRat (reken.c:428)
==20395==    by 0x4B8109: PrepPoly (proces.c:4892)
==20395==    by 0x4B9BDC: Generator (proces.c:3132)
==20395==    by 0x4C0944: Deferred (proces.c:4596)
==20395==    by 0x4B9B1C: Generator (proces.c:3115)
==20395==    by 0x4BB2B7: Generator (proces.c:3931)
==20395==    by 0x4BB2B7: Generator (proces.c:3931)
==20395==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==20395==
  21144.36 sec out of 21146.09 sec