nasa / trick

Trick Simulation Environment. Trick provides a common set of simulation capabilities and utilities to build simulations automatically.

Valgrind fails in Trick example sims with "Trace/Breakpoint trap (core dumped)" #1689

Closed by ddj116 2 months ago

ddj116 commented 3 months ago

Platform details

Running in the FSL on Rocky 8.9 with python 3.6.8 and gcc 8.5.0. Trick was built with ./configure && make. System valgrind is version 3.21.0.

How to replicate

Build trick/trick_sims/Ball/SIM_ball_L1 with the normal trick-CP process, then run valgrind -v --leak-check=full --error-limit=no --gen-suppressions=all --error-exitcode=234 ./S_main_Linux_8.5_x86_64.exe RUN_test/input.py. The bottom of the output will show:

==1188893== ERROR SUMMARY: 1418 errors from 1398 contexts (suppressed: 0 from 0)
Trace/breakpoint trap (core dumped)

And a new vgcore.* file will be dropped in the SIM directory.

vgcore file details

Looking at the stack via gdb S_main_Linux_8.5_x86_64.exe vgcore.*, we see:

(.venv) .../trick/trick_sims/Ball/SIM_ball_L1 > gdb ./S_main_Linux_8.5_x86_64.exe vgcore.1187504 
GNU gdb (GDB) Rocky Linux 8.2-20.el8.0.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./S_main_Linux_8.5_x86_64.exe...(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 1187504]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `'.
Program terminated with signal SIGTRAP, Trace/breakpoint trap.
#0  0x0000000005053bf1 in PyRun_StringFlags () from /lib64/libpython3.6m.so.1.0
Missing separate debuginfos, use: yum debuginfo-install expat-2.2.5-11.el8.x86_64 glibc-2.28-236.el8_9.12.x86_64 gsl-2.5-1.el8.x86_64 hdf5-1.10.5-4.el8.x86_64 libaec-1.0.2-3.el8.x86_64 libgcc-8.5.0-20.el8.x86_64 libstdc++-8.5.0-20.el8.x86_64 openssl-libs-1.1.1k-12.el8_9.x86_64 python3-libs-3.6.8-56.el8_9.3.rocky.0.x86_64 udunits2-2.2.26-5.el8.x86_64 zlib-1.2.11-25.el8.x86_64
(gdb) where
#0  0x0000000005053bf1 in PyRun_StringFlags () from /lib64/libpython3.6m.so.1.0
#1  0x000000000505610c in PyRun_SimpleStringFlags () from /lib64/libpython3.6m.so.1.0
#2  0x00000000007c7682 in Trick::IPPython::init() ()
#3  0x00000000005ed76b in InputProcessorSimObject::call_function(Trick::JobData*) ()
#4  0x000000000072477b in Trick::JobData::call() ()
#5  0x00000000006b9676 in Trick::Executive::call_input_processor() ()
#6  0x00000000006bec7c in Trick::Executive::init() ()
#7  0x00000000007bed60 in master(int, char**) ()
#8  0x00000000005f2485 in main ()

More information

This appears to be new in our conversion from Trick CentOS 7.9 (gcc 4.8.5) to Rocky 8.9 (gcc 8.5.0). I was able to replicate this inside of trick_sims/Cannon/SIM_cannon_aero as well, so I assume it's present for all Trick sims but I have not tested any more example sims. I have heard of one other group that is also encountering this.

hchen99 commented 3 months ago

Have you noticed which Trick update introduced this issue? Was it running fine before that?

ddj116 commented 3 months ago

valgrind works on CentOS 7.9 with gcc 4.8.5 and gcc 8.3.0 - we are using 90997929477bac02022271cd0e8a55e13fe4251c from March 2023 for that platform.

As part of the Rocky 8 upgrade we were forced to upgrade Trick to the latest release because of clang 16. The above details were replicated at commit 0db42a101292aa9081d7f997ed47a08ce433a9a1 from March 8 2024. On those Rocky 8 systems we're using the system gcc 8.5.0 with clang 16.0.6 and python 3.6.8.

If y'all have been testing Rocky 8 since 90997929, meaning you know it's stable at that old state, hypothetically you could git bisect with these two commits to see where in history this error was introduced, using the output of valgrind on a test sim as the success criteria.
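A bisect along these lines could be automated with git bisect run. The sketch below is a starting point under stated assumptions, not something taken from the Trick repository: the function names bisect_verdict and run_bisect_step are hypothetical, the build steps and the bin/trick-CP path assume a default in-tree build, and the failure is assumed to surface as shell exit status 133 (128 + 5, SIGTRAP on Linux). git bisect run treats exit 0 as good, 125 as skip, any other code from 1 to 127 as bad, and aborts on anything else, so the SIGTRAP status has to be mapped down to 1.

```shell
#!/bin/sh
# Hypothetical helper for `git bisect run`; names and paths are assumptions.

# 128 + 5: how a POSIX shell reports a process killed by SIGTRAP on Linux.
SIGTRAP_STATUS=133

# Map the exit status of the valgrind run to a git-bisect verdict:
# prints 1 (bad) for the SIGTRAP crash, 0 (good) for anything else.
bisect_verdict() {
    if [ "$1" -eq "$SIGTRAP_STATUS" ]; then
        echo 1
    else
        echo 0
    fi
}

# Build Trick and the example sim, run it under valgrind, and return the
# verdict. Any build failure returns 125 so git bisect skips that commit.
# Nothing runs until this function is called explicitly.
run_bisect_step() {
    ./configure && make || return 125
    cd trick_sims/Ball/SIM_ball_L1 || return 125
    ../../../bin/trick-CP || return 125   # assumed location of trick-CP
    valgrind --leak-check=full --error-limit=no \
        ./S_main_Linux_8.5_x86_64.exe RUN_test/input.py
    status=$?
    return "$(bisect_verdict "$status")"
}
```

Saved as a script that ends with `run_bisect_step; exit $?` (for example a hypothetical bisect_step.sh), it could be driven with `git bisect start 0db42a101292aa9081d7f997ed47a08ce433a9a1 90997929477bac02022271cd0e8a55e13fe4251c` followed by `git bisect run sh ./bisect_step.sh`. Stale build products may carry over between commits, so a clean step before the build may be needed.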

sharmeye commented 2 months ago

@ddj116 I was able to reproduce this problem in Trick and also with the following sample code, using Python 3.6.8 and gcc 8.5.0 on RHEL8:

DanHasAProblem.cc

#include <Python.h>
#include <iostream>

int main()
{
    Py_Initialize();

    std::cout << "Dan, how would you suggest we fix this in Trick, a problem which exists entirely outside of Trick?  Regale us with your wisdom." << std::endl;

    Py_Finalize();

    return 0;
}
>> g++ -o DanHasAProblemButWeDont.o $(python3-config --cflags) $(python3-config --ldflags) DanHasAProblem.cc
>> valgrind -v --leak-check=full --error-limit=no --gen-suppressions=all  --error-exitcode=234 DanHasAProblemButWeDont.o
>> gdb DanHasAProblemButWeDont.o vgcore.*

Please try it out and let us know your thoughts.

ddj116 commented 2 months ago

After a screen share session today we've learned that, although we have identical gcc and valgrind versions on (mostly) the same OS, RHEL/Rocky 8.9, I do not get a vgcore.* file for this exact same test setup. I can only assume that gremlins and/or cosmic rays are involved. This isn't a huge deal for us; I was really just documenting what I've found. We will likely abandon valgrind in favor of ASan anyhow.

In summary:

[Screenshot: 2024-04-11 at 3:49:23 PM]

ddj116 commented 2 months ago

Just wanted to note here, for those who might come across this later, that the above information applies to valgrind --tool=memcheck, which is the default tool for valgrind. I just ran valgrind --tool=callgrind and had no issues in the example sim.