memtt / malt

MALT is a MALloc Tracker to find where and how you made your memory allocations in C/C++/Fortran applications.
https://memtt.github.io/malt/

LD_PRELOAD seg fault #86

Open aulwes opened 9 months ago

aulwes commented 9 months ago

I followed the instructions for fixing the LD_PRELOAD problem by building malt with -DLIBUNWIND_PREFIX=/path/to/libunwind, but I still get a segfault, which occurs after the app finishes running. If I add the malt runtime option '-s libunwind', I get the LD_PRELOAD segfault immediately. Is there something else I can try?

aulwes commented 9 months ago

Some more information: I modified one of the tests, TestSimpleStackTracker, by parallelizing it with MPI_Init/MPI_Finalize. I then rebuilt malt using mpicc/mpic++ and Cray MPICH version 8.1.25, and tested it on one of our Cray clusters that uses the Slurm resource manager. I ran with 'srun -n 1 /path/to/malt --mpi ./src/lib/tests/TestSimpleStackTracker' and got this error:

TestSimpleStackTracker: /users/rta/workspace/malt/src/lib/common/SimpleAllocator.cpp:179: void MALT::SimpleAllocator::free(void *): Assertion `unusedMemory <= totalMemory' failed.
/usr/projects/perfeng/utils/malt/ro/rta/bin/malt: line 458: 134989 Aborted (core dumped) LD_PRELOAD="${MPI_WRAPPER_DIR}/libmaltmpi.so:${MALT_LIB}:${LD_PRELOAD}" "$@"

svalat commented 9 months ago

Hello, thanks for reporting the issue.

I would ask two things to help debugging:

  1. Can you extract the values of unusedMemory and totalMemory, to see whether one of them is 0 or an obviously wrong value?
  2. For the segfault, could you get a core dump showing where it occurs, at least with the symbol name or, better, the source line?

aulwes commented 9 months ago

For 1., I see

unusedMemory = 262328, totalMemory = 262144

Let me work on answering 2.
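These values break the asserted invariant by 262328 - 262144 = 184 bytes: the allocator believes more memory is unused than it ever owned (262144 bytes is 256 KiB). As an illustration only, the sketch below uses a hypothetical `BookkeepingPool` class (not MALT's actual `SimpleAllocator`; the names `allocate`, `release`, and `consistent` are invented here) to show one way such a counter can overshoot: a block is returned to the pool that the pool never handed out, or the same block is returned twice.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical bookkeeping pool; NOT MALT's real SimpleAllocator.
// It only tracks how many bytes of a fixed-size pool are currently unused.
class BookkeepingPool {
public:
    explicit BookkeepingPool(std::size_t total)
        : totalMemory(total), unusedMemory(total) {}

    // Reserve `size` bytes; returns false if the pool is exhausted.
    bool allocate(std::size_t size) {
        if (size > unusedMemory) return false;
        unusedMemory -= size;
        return true;
    }

    // Give `size` bytes back. Nothing checks that they came from this pool,
    // so a foreign block or a double release inflates unusedMemory.
    void release(std::size_t size) { unusedMemory += size; }

    // The same invariant as the assertion in the report.
    bool consistent() const { return unusedMemory <= totalMemory; }

    std::size_t total() const { return totalMemory; }
    std::size_t unused() const { return unusedMemory; }

private:
    std::size_t totalMemory;
    std::size_t unusedMemory;
};
```

For example, with a 262144-byte pool, allocating and releasing 184 bytes keeps `consistent()` true, but releasing the same 184 bytes a second time leaves `unused()` at 262328 and `consistent()` false. Whether MALT's overshoot has this cause is not established in this thread.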

aulwes commented 9 months ago

Here is the backtrace I get from a core dump:

#0  0x000014ceba351cbb in raise () from /lib64/libc.so.6
#1  0x000014ceba353355 in abort () from /lib64/libc.so.6
#2  0x000014ceba349cba in __assert_fail_base () from /lib64/libc.so.6
#3  0x000014ceba349d42 in __assert_fail () from /lib64/libc.so.6
#4  0x000014cebc2bad75 in MALT::SimpleAllocator::free(void*) () from /usr/projects/perfeng/utils/malt/ro/rta/lib64/libmalt.so
#5  0x000014cebc2b4cf0 in std::_Rb_tree<MALT::StackSTLHashMap::Key, std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo>, std::_Select1st<std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo> >, std::less<MALT::StackSTLHashMap::Key>, MALT::STLInternalAllocator<std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo> > >::_M_erase(std::_Rb_tree_node<std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo> >*) () from /usr/projects/perfeng/utils/malt/ro/rta/lib64/libmalt.so
#6  0x000014cebc2b4ce1 in std::_Rb_tree<MALT::StackSTLHashMap::Key, std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo>, std::_Select1st<std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo> >, std::less<MALT::StackSTLHashMap::Key>, MALT::STLInternalAllocator<std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo> > >::_M_erase(std::_Rb_tree_node<std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo> >*) () from /usr/projects/perfeng/utils/malt/ro/rta/lib64/libmalt.so
#7  0x000014cebc2b4ce1 in std::_Rb_tree<MALT::StackSTLHashMap::Key, std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo>, std::_Select1st<std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo> >, std::less<MALT::StackSTLHashMap::Key>, MALT::STLInternalAllocator<std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo> > >::_M_erase(std::_Rb_tree_node<std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo> >*) () from /usr/projects/perfeng/utils/malt/ro/rta/lib64/libmalt.so
#8  0x000014cebc2b4ce1 in std::_Rb_tree<MALT::StackSTLHashMap::Key, std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo>, std::_Select1st<std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo> >, std::less<MALT::StackSTLHashMap::Key>, MALT::STLInternalAllocator<std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo> > >::_M_erase(std::_Rb_tree_node<std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo> >*) () from /usr/projects/perfeng/utils/malt/ro/rta/lib64/libmalt.so
#9  0x000014cebc2b4ce1 in std::_Rb_tree<MALT::StackSTLHashMap::Key, std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo>, std::_Select1st<std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo> >, std::less<MALT::StackSTLHashMap::Key>, MALT::STLInternalAllocator<std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo> > >::_M_erase(std::_Rb_tree_node<std::pair<MALT::StackSTLHashMap::Key const, MALT::CallStackInfo> >*) () from /usr/projects/perfeng/utils/malt/ro/rta/lib64/libmalt.so
#10 0x000014cebc2c0972 in AllocWrapperGlobal::onExit() () from /usr/projects/perfeng/utils/malt/ro/rta/lib64/libmalt.so
#11 0x000014cebc5c4743 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#12 0x000014ceba354ae9 in __run_exit_handlers () from /lib64/libc.so.6
#13 0x000014ceba354c7a in exit () from /lib64/libc.so.6
#14 0x000014ceba33c2a4 in __libc_start_main () from /lib64/libc.so.6
#15 0x00000000004175fa in _start () at ../sysdeps/x86_64/start.S:120
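In this backtrace, the assertion fires while AllocWrapperGlobal::onExit() tears down a std::map whose nodes come from MALT::STLInternalAllocator: libstdc++'s _Rb_tree::_M_erase recurses through the tree and hands every node back to the allocator, which ends up in SimpleAllocator::free. The C++17 sketch below reproduces only that call pattern, using a hypothetical `CountingAllocator` with invented counters `g_liveBytes` and `g_deallocCalls` (none of these are MALT code), so that each node freed during map destruction shows up as one `deallocate()` call.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <new>
#include <utility>

// Hypothetical counters standing in for an internal allocator's bookkeeping.
inline std::size_t g_liveBytes = 0;
inline std::size_t g_deallocCalls = 0;

// Minimal STL allocator (NOT MALT's STLInternalAllocator) that records traffic.
template <typename T>
struct CountingAllocator {
    using value_type = T;
    CountingAllocator() = default;
    template <typename U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}  // needed for rebind to tree nodes

    T* allocate(std::size_t n) {
        g_liveBytes += n * sizeof(T);
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t n) noexcept {
        g_liveBytes -= n * sizeof(T);
        ++g_deallocCalls;
        ::operator delete(p);
    }
};

template <typename T, typename U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

using CountingMap = std::map<int, int, std::less<int>,
                             CountingAllocator<std::pair<const int, int>>>;

// Fill a map, let it go out of scope, and report how many node
// deallocations its destructor performed.
std::size_t destroyMapAndCountFrees(int entries) {
    g_deallocCalls = 0;
    {
        CountingMap m;
        for (int i = 0; i < entries; ++i) m[i] = i;
    }  // destructor walks the tree (in libstdc++, via _Rb_tree::_M_erase) and frees each node
    return g_deallocCalls;
}
```

With five entries, `destroyMapAndCountFrees(5)` performs five deallocations and leaves `g_liveBytes` back at 0. In the reported run, it is at this per-node free step that SimpleAllocator's bookkeeping check fails.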

aulwes commented 6 months ago

Hi, I'm continuing to get this LD_PRELOAD segfault, but not on all apps that I run with malt. Is there anything else I can try?

aulwes commented 6 months ago

I think I've found the problem. One of the apps we're profiling is built with Intel compilers. When I build malt with icx/icpx from the Intel 2021 compilers, I don't get the segfault. For the other apps, I used gcc 10.

svalat commented 6 months ago

Hi, sorry, I haven't had time to investigate yet.

But as you pointed out, the problem could come from mixing C++ libraries (Intel / GNU).

Have you tried also compiling MALT with icpc, so that everything (MALT and the app) is built with Intel?
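To confirm that MALT and the application really share a toolchain, one option is to compile a small probe with each compiler and compare what it reports. The sketch below is a hypothetical helper (the functions `compilerName` and `cxxLibraryName` are not part of MALT) that relies only on standard predefined macros: `__INTEL_LLVM_COMPILER` (icx/icpx), `__INTEL_COMPILER` (classic icc/icpc), `__clang__`, `__GNUC__`, and the library macros `__GLIBCXX__` (libstdc++) and `_LIBCPP_VERSION` (libc++). The Intel checks come first because icpx also defines `__clang__` and `__GNUC__`.

```cpp
#include <cassert>
#include <string>  // also pulls in the C++ library's configuration macros

// Which compiler built this translation unit (hypothetical helper).
std::string compilerName() {
#if defined(__INTEL_LLVM_COMPILER)
    return "Intel icx/icpx " + std::to_string(__INTEL_LLVM_COMPILER);
#elif defined(__INTEL_COMPILER)
    return "Intel classic icc/icpc " + std::to_string(__INTEL_COMPILER);
#elif defined(__clang__)
    return "Clang " + std::string(__clang_version__);
#elif defined(__GNUC__)
    return "GCC " + std::to_string(__GNUC__) + "." + std::to_string(__GNUC_MINOR__);
#else
    return "unknown compiler";
#endif
}

// Which C++ standard library headers were used (hypothetical helper).
std::string cxxLibraryName() {
#if defined(_LIBCPP_VERSION)
    return "libc++ " + std::to_string(_LIBCPP_VERSION);
#elif defined(__GLIBCXX__)
    return "libstdc++ " + std::to_string(__GLIBCXX__);  // a release date, e.g. YYYYMMDD
#else
    return "unknown C++ library";
#endif
}
```

On Linux, icpx normally builds against a GCC-provided libstdc++, so comparing the reported library date can be more telling than comparing compiler names alone.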

aulwes commented 6 months ago

Yes, I compiled a version using Intel icpx/icx and that worked. Thank you!

svalat commented 6 months ago

Hmm, thanks very much for reporting this; that's good to know.

Until now I was under the impression that this combination caused no issues, but apparently it does.