teuben / nemo

a Stellar Dynamics Toolbox (Not Everybody Must Observe)
https://astronemo.readthedocs.io
GNU General Public License v2.0

hackcode1 segfaults intermittently #98

Open teuben opened 2 years ago

teuben commented 2 years ago

hackcode1 now intermittently segfaults. This was already the case on Ubuntu 20 and persists on Ubuntu 22. Slight correction: it's actually a bus error.

teuben commented 2 years ago

It seems that on the machines where it crashes, the default build gives a core-dumping program, but the re-compile

         mknemo -t -T hackcode1

gives a working code.

teuben commented 1 year ago

The branch issue98 has a script to simplify triggering the bug. Making some progress there, but this sure is a hard nut to crack.

jcldc commented 1 year ago

With the latest (NEMO) master version from github, on the macOS platform with clang, hackcode1 segfaults, which makes io_nemo_test (make check) fail.

teuben commented 1 year ago

So far I was not able to crash it on an AMD, but I agree I could crash it on Mac as well. A compiler bug on Intel? I was able to run it in gdb and see the structure with members pointing to random 64-bit values, where it then segfaults. I need a rainy day.

jcldc commented 1 year ago

I too need a rainy day to dig into it.

teuben commented 1 year ago

For the record: the script crash100 in src/nbody/evolve/hackcode/hackcode1 is what I've been using to trigger a crash.

I also just realized zeno's treecode is essentially the same code as hackcode1. Need a snowy day for that.

jcldc commented 1 year ago

"snowy day " :D

teuben commented 1 year ago

On an amusing note, this past weekend it was raining a lot, and I installed Mac in a virtual (QEMU) box via the sosumi tool. It also fails in this environment. No surprise, since it also died on native Mac.

On the other hand, I also tried the zeno 'treecode', and it never crashed. Also ran another 100 compilations of NEMO on an AMD. It did not crash.

teuben commented 1 year ago

Ran into a case where the bug was also triggered in hackforce; replacing it with hackforce_qp solved it.

Note added: the crash100 script will also make hackcode1_qp fail eventually.

teuben commented 1 year ago

Using typedef long atype; instead of a short did not resolve the bug.

jcldc commented 1 year ago

Actually, after a fresh install of NEMO, I again got the segmentation fault (core dumped) from hackforce (during the io_nemo test suite).

I was able to find the faulty line: line 125 in src/nbody/evolve/hackcode/hackcode1/grav.c

121 
122 local bool subdivp(nodeptr p,      /* body/cell to be tested */
123                    real dsq)       /* size of cell squared */
124 {
125     if (Type(p) == BODY)                        /* at tip of tree?          */
126         return (FALSE);                         /*   then cant subdivide    */

The debugger says that the pointer p is not null, but it is probably not pointing to an allocated part of memory, which is why it crashes.

Then I recompiled hackforce with the "-O2" option turned off in $NEMOLIB/makedefs, and there were no errors (no core dumped) when running hackforce.

Finally I put the "-O2" option back in the makedefs file, and the error/core dump vanished! No more core dumps when running hackforce.

That's really really weird.

teuben commented 1 year ago

Was yours a segfault or a bus error? A bus error points to an alignment problem, and the body/node has a "short type" member, which made me suspicious. I made it a long, but this didn't fix it. I also tried single precision NEMO, which also didn't solve it. It has something to do with casting between body and node, and overlaying those structs (see defs.h).

I've documented some more cases I tested in the crash100 script. Everything is just bizarre. As I said, the mother of all bugs.

And on single precision NEMO the error was a segfault, not a bus error as in the default double precision.
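
For readers unfamiliar with the "overlay" idiom mentioned above, here is a minimal sketch of how a body and a cell can share a common node header so that a Type(p) test works on either. The struct members, the NSUB constant, and the count_bodies helper are assumptions made for illustration; the real definitions in defs.h may well differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real definitions live in defs.h. */
enum { BODY = 1, CELL = 2 };
#define NSUB 8                    /* assumed number of sub-cells per cell */

typedef struct node {
    short type;                   /* BODY or CELL, shared by both kinds */
    double mass;
} node;

typedef struct {
    node header;                  /* must be first so body* <-> node* casts are valid */
    double vel[3];                /* body-only data */
} body;

typedef struct {
    node header;                  /* same shared prefix as body */
    node *subp[NSUB];             /* cell-only data: descendants, NULL if empty */
} cell;

/* Read the type tag through the common header. */
#define Type(p) (((node *)(p))->type)

/* Walk the tree and count bodies; a corrupt type tag trips the assert
 * instead of being silently treated as a cell. */
int count_bodies(node *p)
{
    if (p == NULL)
        return 0;
    if (Type(p) == BODY)          /* at tip of tree */
        return 1;
    assert(Type(p) == CELL);
    cell *c = (cell *) p;
    int n = 0;
    for (int i = 0; i < NSUB; i++)
        n += count_bodies(c->subp[i]);
    return n;
}
```

The point of the sketch is that every cast relies on all compiled code agreeing on the exact struct layout. If p holds a garbage address, or the layout is not what the reader expects, the very first Type(p) read is where the program dies, which matches the crash location reported at line 125 of grav.c.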

jcldc commented 1 year ago

In my case it was a bus error (core dumped), but it vanished once I recompiled the code....

teuben commented 1 year ago

Robert Zhang noted that flipping the quad and subp[] members in the cell typedef made it work. This hinted that hackcode1 was not being linked with the right .o file, which pointed to a Makefile that was not strict enough.

Thus, consider this bug fixed; a pull request will follow.
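
A minimal sketch of why a stale object file can produce exactly this kind of crash: if two object files were compiled against cell typedefs that order quad and subp[] differently, they disagree on where subp[] lives in memory. The two layouts below, the member types, and the helper names are assumptions for illustration only, not the actual NEMO definitions.

```c
#include <stddef.h>
#include <string.h>

#define NSUB 8                    /* assumed number of sub-cells */

/* Layout A: subp[] before quad (hypothetical ordering). */
typedef struct {
    short type;
    void *subp[NSUB];
    double quad[6];               /* assumed size of the quadrupole data */
} cell_a;

/* Layout B: same members with quad and subp[] flipped. */
typedef struct {
    short type;
    double quad[6];
    void *subp[NSUB];
} cell_b;

/* Offset of subp[] as seen by code compiled against each layout. */
size_t subp_offset_a(void) { return offsetof(cell_a, subp); }
size_t subp_offset_b(void) { return offsetof(cell_b, subp); }

/* Emulate code compiled against layout B reading subp[0] from memory
 * that was filled in using layout A. memcpy keeps the read well defined. */
void *read_subp0_with_layout_b(const cell_a *c)
{
    void *p;
    memcpy(&p, (const char *) c + offsetof(cell_b, subp), sizeof p);
    return p;
}
```

Code using the wrong layout picks up whatever bytes happen to sit at its idea of the subp[] offset and then follows them as a pointer. That is consistent with the earlier gdb observation of members pointing to random 64-bit values, and with the crash disappearing after a full recompile.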

jcldc commented 1 year ago

weird...