mkre opened this issue 3 years ago
@mkre is it a correct understanding that this is the stack trace and the segfault is in starccm?
libStarNeo.so: SignalHandler::signalHandlerFunction(int, siginfo_t*, void*),
libpthread.so.0(+0xf5d0),
libStarGraphPartitioning.so(+0x26e44),
libStarGraphPartitioning.so(+0x33cf1),
libStarGraphPartitioning.so(+0x2bb4a),
libStarGraphPartitioning.so(+0x2c56d),
libStarGraphPartitioning.so: ParMetisDriver<int>::parMETIS_partKway(int const*, int const*, int const*, int const*, int const*, int const&, int const&, int const&, int const&, float const*, float const*, int const*, int&, int*, int const&),
libStarGraphPartitioning.so: CsrGraph<int>::partitionGraph(int const&, std::vector<float, std::allocator<float> > const&, std::vector<int, std::allocator<int> >&),
libStarPartition.so: PostRestoreFvGraphPartitioner::partition(int const&, PartitionMaps&),
libStarPartition.so: FvRepPartitioner::partitionGraphFromFile(PartitionMaps&),
libStarPartition.so: FvRepPartitioner::partitionFromFile(),
libStarPartition.so: PartitionConfigManager::updateBeforeLoad(),
libStarSolve.so: SolverManager::initializeSolutionObserverUpdate(),
libStarNeo.so: Subject::notify(Properties&),
libStarNeo.so: Subject::notify(),
libStarCommon.so: Solution::initializeSolution(),
libStarSolve.so: SimulationIterator::startSimulation(RunnableSolver*, int, SimulationIterator::RunMode, bool),
libStarSolve.so(+0x2d2f47),
libStarNeo.so: Controller::executeCommand(Command&, Properties const&, Properties&),
libStarNeo.so: Controller::executeCommand(Properties const&, Properties&),
libStarMachine.so: CommandController::SlaveCommandLoop::start(),
libStarMachine.so: CommandController::processCommands(),
libStarMachine.so: ServerSession::main(int, char**)]
Right @yosefe, it's in a routine doing MPI communication, so it might be that the UCX setting corrupts some memory buffers or something along these lines...?
@mkre according to the log file of ud_mlx5, the application creates ucp_context_h and immediately destroys it, without creating worker or endpoints. Does it make sense according to the stack trace of the application?
@yosefe not sure if that makes sense. We are not using UCX directly, but only through Open MPI, so I'm not familiar with the specifics of the UCX function calls. I'm also not sure why the behavior should be different depending on the selection of ud or ud_mlx5.
@mkre maybe OpenMPI unloads UCX component. Can you pls run the failing case with "-mca pml_base_verbose 100 -mca pml_ucx_verbose 100" ?
Upd: Does the failure happen when UCX is disabled, when adding "-mca pml ^ucx" ?
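For concreteness, a minimal sketch of how these options might be combined on a launch line (the mpirun invocation, process count, and application name are placeholders, not taken from this report):

    # Hypothetical launch lines; adjust -np and the application to the actual setup.
    # 1) Verbose PML selection plus UCX PML output:
    mpirun -np 16 -mca pml_base_verbose 100 -mca pml_ucx_verbose 100 ./my_app

    # 2) Same case with the UCX PML excluded, to check whether the failure persists:
    mpirun -np 16 -mca pml '^ucx' ./my_app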
@yosefe
Can you pls run the failing case with "-mca pml_base_verbose 100 -mca pml_ucx_verbose 100" ?
ucx-1.11.1-ud_mlx5-ucxverbose.txt
Upd: Does the failure happen when UCX is disabled, when adding "-mca pml ^ucx" ?
Good point! The failure also happens when UCX is disabled. So it might not be a UCX issue in the end...? I'm still confused about the different behavior with ud/ud_mlx5, though...
-mca pml_ucx_tls rc_verbs,ud_verbs,rc_mlx5,dc_mlx5,cuda_ipc,rocm_ipc,ud_mlx5
@karasevb can you pls fix this in Open MPI (all relevant branches)?
[wvhpc02n059:120077] common_ucx.c:303 posix/memory: did not match transport list
[wvhpc02n059:120077] common_ucx.c:303 sysv/memory: did not match transport list
[wvhpc02n059:120077] common_ucx.c:303 self/memory0: did not match transport list
[wvhpc02n059:120077] common_ucx.c:303 ud_mlx5/mlx5_0:1: did not match transport list <--- here
[wvhpc02n059:120077] common_ucx.c:303 cma/memory: did not match transport list
[wvhpc02n059:120077] common_ucx.c:303 knem/memory: did not match transport list
[wvhpc02n059:120077] common_ucx.c:311 support level is none
[wvhpc02n059:120077] select: init returned failure for component ucx
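Until that fix lands, a possible workaround (a sketch, under the assumption that overriding the MCA parameter with the extended list above keeps the UCX PML selectable; launch details are placeholders):

    # Hypothetical workaround: supply the extended pml_ucx_tls list quoted above
    # so the UCX PML no longer disqualifies itself when UCX_TLS selects ud_mlx5.
    mpirun -np 16 \
        -mca pml_ucx_tls rc_verbs,ud_verbs,rc_mlx5,dc_mlx5,cuda_ipc,rocm_ipc,ud_mlx5 \
        -x UCX_TLS=self,sm,ud_mlx5 \
        ./my_app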
Thanks @yosefe. I just realized that the segfault actually seems to be related to Libfabric/OFI, which was selected because the Open MPI bug you pointed out prevented UCX from being used. Everything makes sense now. Thanks for your help! Feel free to close this issue. I guess there is no need for opening a ticket at the Open MPI bug tracker since @karasevb is already aware of the bug, right?
I guess there is no need for opening a ticket at the Open MPI bug tracker since @karasevb is already aware of the bug, right?
Yes, that is right. Let's keep this one open for now.
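As a side note, a sketch (not from the thread) of how this kind of silent fallback can be surfaced: require the UCX PML explicitly, so the job aborts if UCX cannot be selected instead of another component such as OFI taking over unnoticed.

    # Hypothetical check: with the UCX PML required, an unusable UCX configuration
    # aborts the launch rather than silently falling back to another component.
    mpirun -np 16 -mca pml ucx -x UCX_TLS=self,sm,ud_mlx5 ./my_app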
Describe the bug
Our application is giving a segfault when setting UCX_TLS=self,sm,ud_mlx5. Everything is working fine when replacing ud_mlx5 with ud_verbs or ud. Interestingly, setting UCX_LOG_LEVEL=Info UCX_TLS=self,sm,ud shows that ud seems to be translated to ud_mlx5 (but this setting is not giving a segfault, as stated above). So, maybe this is not an issue with the ud_mlx5 transport itself, but rather with the processing of the explicit transport setting?
Steps to Reproduce
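A minimal sketch of how the failing and working settings above could be compared (application name, process count, and launch details are placeholders, not the original reproduction command):

    # Hypothetical comparison of the transport selections described above.
    # Fails:
    mpirun -np 16 -x UCX_TLS=self,sm,ud_mlx5 ./my_app

    # Work:
    mpirun -np 16 -x UCX_TLS=self,sm,ud_verbs ./my_app
    mpirun -np 16 -x UCX_TLS=self,sm,ud ./my_app

    # Shows ud being resolved to ud_mlx5 in the info-level log, yet still works:
    mpirun -np 16 -x UCX_LOG_LEVEL=info -x UCX_TLS=self,sm,ud ./my_app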
Setup and versions
Additional information (depending on the issue)
ucx_info -d: ucx_info_d.txt
UCX_TLS=self,sm,ud_mlx5 (fails): ucx-1.11.1-ud_mlx5.txt
UCX_TLS=self,sm,ud (works): ucx-1.11.1-ud.txt.gz
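For reference, a sketch of how diagnostics like the attachments above might be regenerated (file names follow the attachments; the launch line is a placeholder):

    # Device/transport capabilities, as attached in ucx_info_d.txt:
    ucx_info -d > ucx_info_d.txt

    # Run log for the failing selection, captured to a file:
    mpirun -np 16 -x UCX_TLS=self,sm,ud_mlx5 ./my_app > ucx-1.11.1-ud_mlx5.txt 2>&1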