Closed luojw-dwr closed 1 month ago
The segmentation fault now appears to be random but very frequent: an occasional fault-free exit does happen on my machine.
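One way to put a number on "random but very frequent" is to run the reproducer repeatedly in fresh interpreters and tally how each run ends. The sketch below is only an illustration: `classify_exit` and `count_outcomes` are hypothetical helper names, and the command passed in stands in for whatever script triggers the fault. It assumes a POSIX system, where `subprocess` reports death by a signal as a negative return code.

```python
import signal
import subprocess
import sys


def classify_exit(returncode):
    """Map a subprocess return code to a short label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative code means the child was killed by that signal.
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return f"exit {returncode}"


def count_outcomes(cmd, runs=20):
    """Run cmd `runs` times and count how often each outcome occurs."""
    counts = {}
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        label = classify_exit(result.returncode)
        counts[label] = counts.get(label, 0) + 1
    return counts


if __name__ == "__main__":
    # Placeholder command: replace "pass" with the actual reproducer script.
    print(count_outcomes([sys.executable, "-c", "pass"], runs=5))
```

A run that segfaults at interpreter exit would show up under the `SIGSEGV` label, so the ratio of `SIGSEGV` to `ok` gives a rough failure rate.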
Hi @luojw-dwr,
thanks for the detailed report! I tried to reproduce the issue, but unfortunately without success yet.
Some followup questions that could help for reproduction:
It could also help if you try running MtKaHyPar in debug mode and report the output. I.e., call cmake in the build folder as follows (note: disabling address sanitizer since it doesn't seem to work with python):
cmake .. -DCMAKE_BUILD_TYPE=DEBUG -DKAHYPAR_ADD_ADDRESS_SANITIZER=Off
Then build the python library again and see whether the output changes (e.g., if we are lucky there might be a failing assertion instead of the segmentation fault).
Note: I didn't build with the same gcc version yet, I will probably try this next week.
Hi @N-Maas
I am working on Ubuntu 24.04. The problem exists both in normal execution and in the REPL. The minimal reproducible input a.hmetis is:
1 2
1 2
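For readers unfamiliar with the hMetis format: the header line gives the number of hyperedges followed by the number of vertices, and each following line lists the 1-based vertex ids of one hyperedge. The input above therefore describes a single hyperedge connecting two vertices. The following is a minimal sketch of a reader for the unweighted case only; `parse_hmetis` is a hypothetical helper and not part of Mt-KaHyPar.

```python
def parse_hmetis(text):
    """Parse an unweighted hMetis hypergraph; return (num_vertices, hyperedges)."""
    # Skip blank lines and '%' comment lines.
    rows = [
        line.split()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("%")
    ]
    header = rows[0]
    if len(header) > 2 and header[2] != "0":
        raise ValueError("weighted hMetis files are not handled in this sketch")
    num_edges, num_vertices = int(header[0]), int(header[1])

    hyperedges = [[int(v) for v in row] for row in rows[1:1 + num_edges]]
    if len(hyperedges) != num_edges:
        raise ValueError("fewer hyperedge lines than the header announces")
    for edge in hyperedges:
        for vertex in edge:
            if not 1 <= vertex <= num_vertices:
                raise ValueError(f"vertex id {vertex} out of range")
    return num_vertices, hyperedges


if __name__ == "__main__":
    print(parse_hmetis("1 2\n1 2\n"))
```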
Larger inputs with ~1e7 vertices and ~1e7 hyperedges also reproduce the fault, with or without weights.
The segmentation fault seems to disappear if the program after this code piece runs long enough (e.g. tens of minutes).
Sorry, I am working in a shared environment with fragile dependencies. Thank you for trying to reproduce it locally. I will try to figure it out on my side.
Thank you for your time and reply.
During make after cmake .. -DCMAKE_BUILD_TYPE=DEBUG in an empty build folder, I get the following log (where $HOME is due to my handle replacement in the log):
Running main() from $HOME/kahypar_home/mt-kahypar/external_tools/googletest/googletest/src/gtest_main.cc
[==========] Running 67 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 13 tests from MtKaHyPar
[ RUN ] MtKaHyPar.ReadHypergraphFile
$HOME/intel/oneapi/tbb/2021.13/include/oneapi/tbb/parallel_invoke.h:40:5: runtime error: reference binding to misaligned address 0x7eb45b00a460 for type 'struct __as_base ', which requires 64 byte alignment
0x7eb45b00a460: note: pointer points here
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
$HOME/intel/oneapi/tbb/2021.13/include/oneapi/tbb/parallel_invoke.h:40:5: runtime error: member access within misaligned address 0x7eb45b00a460 for type 'struct function_invoker', which requires 64 byte alignment
0x7eb45b00a460: note: pointer points here
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
$HOME/intel/oneapi/tbb/2021.13/include/oneapi/tbb/parallel_invoke.h:42:33: runtime error: member access within misaligned address 0x7eb45b00a460 for type 'struct function_invoker', which requires 64 byte alignment
0x7eb45b00a460: note: pointer points here
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
$HOME/intel/oneapi/tbb/2021.13/include/oneapi/tbb/parallel_invoke.h:42:33: runtime error: constructor call on misaligned address 0x7eb45b00a460 for type 'struct task', which requires 64 byte alignment
0x7eb45b00a460: note: pointer points here
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
$HOME/intel/oneapi/tbb/2021.13/include/oneapi/tbb/detail/_task.h:214:31: runtime error: member access within misaligned address 0x7eb45b00a460 for type 'struct task', which requires 64 byte alignment
0x7eb45b00a460: note: pointer points here
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
$HOME/intel/oneapi/tbb/2021.13/include/oneapi/tbb/detail/_task.h:214:31: runtime error: member access within misaligned address 0x7eb45b00a460 for type 'struct task', which requires 64 byte alignment
0x7eb45b00a460: note: pointer points here
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
$HOME/intel/oneapi/tbb/2021.13/include/oneapi/tbb/detail/_task.h:214:31: runtime error: member access within misaligned address 0x7eb45b00a460 for type 'struct task', which requires 64 byte alignment
0x7eb45b00a460: note: pointer points here
00 00 00 00 88 4c 1c c6 60 62 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
$HOME/intel/oneapi/tbb/2021.13/include/oneapi/tbb/detail/_task.h:214:31: runtime error: member access within misaligned address 0x7eb45b00a460 for type 'struct task', which requires 64 byte alignment
0x7eb45b00a460: note: pointer points here
00 00 00 00 88 4c 1c c6 60 62 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
$HOME/intel/oneapi/tbb/2021.13/include/oneapi/tbb/parallel_invoke.h:42:33: runtime error: member access within misaligned address 0x7eb45b00a460 for type 'struct function_invoker', which requires 64 byte alignment
0x7eb45b00a460: note: pointer points here
00 00 00 00 88 4c 1c c6 60 62 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
$HOME/intel/oneapi/tbb/2021.13/include/oneapi/tbb/parallel_invoke.h:41:9: runtime error: member access within misaligned address 0x7eb45b00a460 for type 'struct function_invoker', which requires 64 byte alignment
0x7eb45b00a460: note: pointer points here
00 00 00 00 70 19 1c 7f b4 7e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
$HOME/intel/oneapi/tbb/2021.13/include/oneapi/tbb/parallel_invoke.h:41:9: runtime error: member access within misaligned address 0x7eb45b00a460 for type 'struct function_invoker', which requires 64 byte alignment
0x7eb45b00a460: note: pointer points here
00 00 00 00 70 19 1c 7f b4 7e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
$HOME/intel/oneapi/tbb/2021.13/include/oneapi/tbb/parallel_invoke.h:42:9: runtime error: member access within misaligned address 0x7eb45b00a460 for type 'struct function_invoker', which requires 64 byte alignment
0x7eb45b00a460: note: pointer points here
00 00 00 00 70 19 1c 7f b4 7e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
$HOME/intel/oneapi/tbb/2021.13/include/oneapi/tbb/parallel_invoke.h:42:9: runtime error: member access within misaligned address 0x7eb45b00a460 for type 'struct function_invoker', which requires 64 byte alignment
0x7eb45b00a460: note: pointer points here
00 00 00 00 70 19 1c 7f b4 7e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
$HOME/intel/oneapi/tbb/2021.13/include/oneapi/tbb/parallel_invoke.h:40:5: runtime error: reference binding to misaligned address 0x7eb45b00a520 for type 'struct __as_base ', which requires 64 byte alignment
0x7eb45b00a520: note: pointer points here
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
$HOME/intel/oneapi/tbb/2021.13/include/oneapi/tbb/parallel_invoke.h:40:5: runtime error: member access within misaligned address 0x7eb45b00a520 for type 'struct function_invoker', which requires 64 byte alignment
0x7eb45b00a520: note: pointer points here
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
$HOME/intel/oneapi/tbb/2021.13/include/oneapi/tbb/parallel_invoke.h:42:33: runtime error: member access within misaligned address 0x7eb45b00a520 for type 'struct function_invoker', which requires 64 byte alignment
0x7eb45b00a520: note: pointer points here
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
... and many other such lines, and at last:
Assertion is_aligned(p, alignment == 0 ? alignof(T) : alignment) failed (located in the assert_pointer_valid function, line in file: 217)
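The two addresses in the log can be checked by hand against the 64-byte requirement. The sketch below only does that arithmetic; `misalignment` is a hypothetical helper, and the result says nothing by itself about why the memory was placed there.

```python
def misalignment(address, alignment=64):
    """Return how many bytes `address` lies past the previous aligned boundary."""
    return address % alignment


if __name__ == "__main__":
    # Addresses copied from the sanitizer log above.
    for addr in (0x7EB45B00A460, 0x7EB45B00A520):
        print(hex(addr), misalignment(addr, 64), misalignment(addr, 32))
```

Both addresses are 32 bytes past a 64-byte boundary, so they satisfy 32-byte alignment but not the 64-byte alignment that the TBB task types request.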
This looks strange to me, since such logs didn't appear in my first build, which I compiled as RELEASE. I will try another DEBUG build with the sanitizer disabled.
I noticed that my mkl and tbb have different version strings, which may be the cause. I will try and report.
Edit: I reinstalled Intel oneAPI and recompiled from scratch. The version strings remain the same, and the behavior has not changed.
FYI, with a DEBUG build with the sanitizer disabled in another empty folder, I get the following log during make:
[ RUN ] APartitioner.PartitionsAGraphInTwoBlocksWithQualityPreset
[multilevel_coarsener.h:387:matchVertices]: Assertion `_matching_state[rep] == static_cast<uint8_t>(MatchingState::MATCHED)` failed:
Aborted (core dumped)
This seems to be unrelated and, more strangely, the alignment assertion is now silent.
With the new Python .so from this build, the segmentation fault still reproduces silently.
Can you try cmake .. -DKAHYPAR_DOWNLOAD_TBB=On -DCMAKE_BUILD_TYPE=Debug -DKAHYPAR_ADD_ADDRESS_SANITIZER=Off
and let me know if this changes anything? This downloads a specified TBB version and builds mt-kahypar with it. Don't worry, that TBB version is not installed globally. We've had some trouble with different TBB versions in the past.
Hi @larsgottesbueren, thank you for your reply. I want to make sure in advance: what happens if the TBB found via LD_LIBRARY_PATH differs from the downloaded TBB?
Hi @larsgottesbueren,
I tried your cmake command in another empty build folder. The behavior is the same as with cmake .. -DCMAKE_BUILD_TYPE=DEBUG -DKAHYPAR_ADD_ADDRESS_SANITIZER=Off, as reported above: during make, the test APartitioner.PartitionsAGraphInTwoBlocksWithQualityPreset aborts on the failed assertion `_matching_state[rep] == static_cast<uint8_t>(MatchingState::MATCHED)` in multilevel_coarsener.h:387 (matchVertices), which seems to be unrelated, and the alignment assertion stays silent. With the new Python .so from this build, the segmentation fault still reproduces silently.
Given that we are not able to reproduce the issue on our side, let's discuss some possible steps to debug this issue.
1) Have you tried getting this to work on a different machine? Laptop or university lab machine, whatever is available.
2) Would you be open to debugging this in C++ yourself?
Regarding the question of what happens if the TBB found via LD_LIBRARY_PATH differs from the downloaded TBB:
With a clean build, it should automatically use the downloaded version. The -DKAHYPAR_DOWNLOAD_TBB option doesn't work well with a partial build, so a clean build is most likely required. You can check by running ldd on the created .so file and seeing whether it points into the build directory for libtbb.
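The ldd check can also be scripted. The following is a rough sketch, assuming a Linux system with `ldd` on the PATH; `tbb_paths_from_ldd` and `tbb_paths_for` are hypothetical helper names, and the path of the shared object is whatever your build produced.

```python
import subprocess


def tbb_paths_from_ldd(ldd_output):
    """Extract {library name: resolved path} for TBB libraries from ldd output."""
    resolved = {}
    for line in ldd_output.splitlines():
        name, arrow, target = line.strip().partition(" => ")
        if not arrow or not name.startswith("libtbb"):
            continue
        # Drop the trailing load address, e.g. " (0x00007f...)".
        resolved[name] = target.rsplit(" (", 1)[0]
    return resolved


def tbb_paths_for(shared_object):
    """Run ldd on a shared object and report where its TBB libraries resolve."""
    output = subprocess.run(
        ["ldd", shared_object], capture_output=True, text=True, check=True
    ).stdout
    return tbb_paths_from_ldd(output)
```

If the reported libtbb path lies outside the build directory, the library is still being linked against a previously installed TBB rather than the downloaded one.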
I also can't reproduce the issue using the same gcc version. However, I now get the mentioned alignment errors in the tests. So the latter might be related to the compiler version.
Some additional questions:
Hi @larsgottesbueren and @N-Maas,
I am using x86_64. My environment does not have a reliable clang.
Sorry, I am under time pressure on another project, so debugging this is currently not my highest priority. I will try debugging it later. Thank you for your time and patience.
By the way, I'd like to highlight once more the phenomenon from my first DEBUG build (cmake .. -DCMAKE_BUILD_TYPE=DEBUG in an empty build folder), in the hope that it gives a hint: during make, the log quoted in full above shows many "misaligned address ... which requires 64 byte alignment" runtime errors from TBB's parallel_invoke.h and detail/_task.h, and finally the failed assertion is_aligned(p, alignment == 0 ? alignof(T) : alignment). This looks strange to me, since such logs didn't appear in my first build, which I compiled as RELEASE.
Hi @N-Maas,
My trial with the downloaded TBB, configured with cmake .. -DKAHYPAR_DOWNLOAD_TBB=On -DCMAKE_BUILD_TYPE=Debug -DKAHYPAR_ADD_ADDRESS_SANITIZER=Off, is the one I reported above in my reply to @larsgottesbueren: the behavior is the same as with cmake .. -DCMAKE_BUILD_TYPE=DEBUG -DKAHYPAR_ADD_ADDRESS_SANITIZER=Off. The matchVertices assertion fails during make, the alignment assertion stays silent, and the segmentation fault still reproduces silently with the new Python .so.
Unfortunately, the administrator of my server refuses to provide other compilers.
Fortunately, mtkhp.initializeThreadPool(1) avoids the segmentation fault. Thanks for all your time and patience.
edit: For those who are concerned with g++-13, -fno-ipa-stack-alignment does not help. Please refer to #192.
edit: The problem is now confirmed to be strongly coupled with the TBB version.
Testing with clang-18 and clang++-18 in Release mode replicates the segmentation fault.
edit: A Debug build (with ASAN) in clang++ does not report misalignment. Instead, ASAN reports memory leaks, which seem to be unrelated.
edit: A Debug build (without ASAN) in clang++ passes the tests during make without errors. I'm trying with the Python binding, but struggling with an undefined ubsan symbol in the python3 REPL.
-2: Ubuntu 24.04, x86_64 -1: mt-kahypar: git hash 39e06c755cb909387a66eed24ee385ba5ad393d4
I will try -DKAHYPAR_DOWNLOAD_TBB=On. Note that cmake prioritizes the previously installed TBB over the downloaded TBB, which is confirmed by ldd.
With the downloaded TBB, python3 now correctly exits. Thank you for all your time and patience :-D
A deterministic segmentation fault on Python exit is observed on my machine. Here is a minimal example:
MtKaHyPar
.del HG
even if I call gc.collect()
after that; however, the segmentation fault still happens at exit in these two cases.
My dependency versions are: -1: mt-kahypar: git hash 035e595b762bb490f71ad1668ad00547e64653b4
2024.2.1.105 2021.13 (it seems my mkl and tbb differ in version)
Please help to see if it is reproducible on the developers' side.