openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Segfault with explicit ud_mlx5 transport setting #7392

Open mkre opened 3 years ago

mkre commented 3 years ago

Describe the bug

Our application segfaults when setting UCX_TLS=self,sm,ud_mlx5. Everything works fine when replacing ud_mlx5 with ud_verbs or ud. Interestingly, setting UCX_LOG_LEVEL=Info UCX_TLS=self,sm,ud shows that ud seems to be translated to ud_mlx5 (although, as stated above, that setting does not segfault):

[1631532434.573400] [wvhpc02n060:56712:0]          parser.c:1893 UCX  INFO  UCX_* env variables: UCX_TLS=self,sm,ud UCX_LOG_LEVEL=Info
[1631532434.586584] [wvhpc02n059:74568:0]          parser.c:1893 UCX  INFO  UCX_* env variables: UCX_TLS=self,sm,ud UCX_LOG_LEVEL=Info
[1631532434.594920] [wvhpc02n059:74568:0]      ucp_worker.c:1776 UCX  INFO    ep_cfg[0]: tag(self/memory0 knem/memory);
[1631532434.595645] [wvhpc02n060:56712:0]      ucp_worker.c:1776 UCX  INFO    ep_cfg[0]: tag(self/memory0 knem/memory);
[1631532434.596525] [wvhpc02n059:74568:0]      ucp_worker.c:1776 UCX  INFO    ep_cfg[1]: tag(ud_mlx5/mlx5_0:1);
[1631532434.597239] [wvhpc02n060:56712:0]      ucp_worker.c:1776 UCX  INFO    ep_cfg[1]: tag(ud_mlx5/mlx5_0:1);

So, maybe this is not an issue with the ud_mlx5 transport itself, but rather with the processing of the explicit transport setting?
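(Aside, not from the original report: one way to check which UD transports a given UCX build actually exposes, and hence whether ud resolves to ud_mlx5 on this system, is to list the available transports with ucx_info; the grep pattern is only illustrative and the exact output format varies between UCX versions.)

> ucx_info -d | grep -i transport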

Steps to Reproduce

# UCT version=1.11.1 revision c58db6b
# configured with: --prefix=/u/ydfb4q/tpl/ucx/build/1.11.1/install --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --disable-static --with-verbs=/u/ydfb4q/tpl/ucx/mofed-4.6/usr --with-rdmacm=/u/ydfb4q/tpl/ucx/mofed-4.6/usr --with-knem=/u/ydfb4q/tpl/ucx/mofed-4.6/opt/knem-1.1.3.90mlnx1 --without-java --with-gdrcopy=/u/ydfb4q/.gradle/caches/cda/tpls/gdrcopy-2.1-linux-x86_64 --with-cuda=/u/ydfb4q/.gradle/caches/cda/tpls/cuda_toolkit-11.0.2-full-linux-x86_64
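(A minimal launch sketch of the failing vs. working configurations, assuming an Open MPI mpirun launcher; the application name and rank count are placeholders, since the report does not include the actual command line:)

mpirun -np 2 -x UCX_TLS=self,sm,ud_mlx5 ./app    # segfaults
mpirun -np 2 -x UCX_TLS=self,sm,ud_verbs ./app   # works
mpirun -np 2 -x UCX_TLS=self,sm,ud ./app         # works, ud resolves to ud_mlx5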

Setup and versions

> cat /etc/centos-release
CentOS Linux release 7.6.1810 (Core)
> ofed_info -s
MLNX_OFED_LINUX-4.6-1.0.1.1:
> ibstat
CA 'mlx5_0'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.25.6000
        Hardware version: 0
        Node GUID: 0xb8599f030000a554
        System image GUID: 0xb8599f030000a554
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 9
                LMC: 0
                SM lid: 2
                Capability mask: 0x2651e848
                Port GUID: 0xb8599f030000a554
                Link layer: InfiniBand

Additional information (depending on the issue)

yosefe commented 3 years ago

@mkre is my understanding correct that this is the stack trace, and that the segfault is in starccm?

libStarNeo.so: SignalHandler::signalHandlerFunction(int, siginfo_t*, void*), 
libpthread.so.0(+0xf5d0), 
libStarGraphPartitioning.so(+0x26e44), 
libStarGraphPartitioning.so(+0x33cf1), 
libStarGraphPartitioning.so(+0x2bb4a), 
libStarGraphPartitioning.so(+0x2c56d), 
libStarGraphPartitioning.so: ParMetisDriver<int>::parMETIS_partKway(int const*, int const*, int const*, int const*, int const*, int const&, int const&, int const&, int const&, float const*, float const*, int const*, int&, int*, int const&), 
libStarGraphPartitioning.so: CsrGraph<int>::partitionGraph(int const&, std::vector<float, std::allocator<float> > const&, std::vector<int, std::allocator<int> >&), 
libStarPartition.so: PostRestoreFvGraphPartitioner::partition(int const&, PartitionMaps&), 
libStarPartition.so: FvRepPartitioner::partitionGraphFromFile(PartitionMaps&), 
libStarPartition.so: FvRepPartitioner::partitionFromFile(), 
libStarPartition.so: PartitionConfigManager::updateBeforeLoad(), 
libStarSolve.so: SolverManager::initializeSolutionObserverUpdate(), 
libStarNeo.so: Subject::notify(Properties&), 
libStarNeo.so: Subject::notify(), 
libStarCommon.so: Solution::initializeSolution(), 
libStarSolve.so: SimulationIterator::startSimulation(RunnableSolver*, int, SimulationIterator::RunMode, bool), 
libStarSolve.so(+0x2d2f47), 
libStarNeo.so: Controller::executeCommand(Command&, Properties const&, Properties&), 
libStarNeo.so: Controller::executeCommand(Properties const&, Properties&), 
libStarMachine.so: CommandController::SlaveCommandLoop::start(), 
libStarMachine.so: CommandController::processCommands(), 
libStarMachine.so: ServerSession::main(int, char**)]
mkre commented 3 years ago

Right @yosefe, it's in a routine doing MPI communication, so it might be that the UCX setting corrupts some memory buffers or something along these lines...?

yosefe commented 3 years ago

@mkre according to the ud_mlx5 log file, the application creates ucp_context_h and immediately destroys it, without creating any worker or endpoints. Does that make sense given the application's stack trace?
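(Aside: one option for cross-checking this from the UCX side is to re-run the failing case with a higher UCX log level and capture the output. UCX_LOG_LEVEL is a standard UCX variable, but note that the build above was configured with --disable-logging, so levels above info may be compiled out, and the exact context/worker creation messages depend on the UCX build. Launcher and application are placeholders:)

mpirun -np 2 -x UCX_TLS=self,sm,ud_mlx5 -x UCX_LOG_LEVEL=debug ./app 2>&1 | tee ucx_debug.log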

mkre commented 3 years ago

@yosefe not sure if that makes sense. We are not using UCX directly, only through Open MPI, so I can't say much about the specific UCX calls being made. I'm also not sure why the behavior should differ depending on whether ud or ud_mlx5 is selected.

yosefe commented 3 years ago

@mkre maybe Open MPI unloads the UCX component. Can you pls run the failing case with "-mca pml_base_verbose 100 -mca pml_ucx_verbose 100" ?

Upd: Does the failure happen when UCX is disabled, when adding "-mca pml ^ucx" ?
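(The two suggestions above as full command sketches; launcher, rank count, and application are placeholders:)

mpirun -np 2 -mca pml_base_verbose 100 -mca pml_ucx_verbose 100 ./app   # extra PML/UCX selection output
mpirun -np 2 -mca pml ^ucx ./app                                        # UCX PML disabled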

mkre commented 3 years ago

@yosefe

Can you pls run the failing case with "-mca pml_base_verbose 100 -mca pml_ucx_verbose 100" ?

ucx-1.11.1-ud_mlx5-ucxverbose.txt

Upd: Does the failure happen when UCX is disabled, when adding "-mca pml ^ucx" ?

Good point! The failure also happens when UCX is disabled, so it might not be a UCX issue after all... I'm still confused about the different behavior with ud/ud_mlx5, though.

yosefe commented 3 years ago
  1. The UCX component (in ompi) disqualified itself because ud_mlx5 is missing from https://github.com/open-mpi/ompi/blob/29b4b984766eb1151c377d4dc60c1315f44009ba/opal/mca/common/ucx/common_ucx.c#L48. As a workaround, you can add -mca pml_ucx_tls rc_verbs,ud_verbs,rc_mlx5,dc_mlx5,cuda_ipc,rocm_ipc,ud_mlx5 (see the command sketch after this list). @karasevb can you pls fix this in Open MPI (all relevant branches)?
[wvhpc02n059:120077] common_ucx.c:303 posix/memory: did not match transport list
[wvhpc02n059:120077] common_ucx.c:303 sysv/memory: did not match transport list
[wvhpc02n059:120077] common_ucx.c:303 self/memory0: did not match transport list
[wvhpc02n059:120077] common_ucx.c:303 ud_mlx5/mlx5_0:1: did not match transport list   <--- here
[wvhpc02n059:120077] common_ucx.c:303 cma/memory: did not match transport list
[wvhpc02n059:120077] common_ucx.c:303 knem/memory: did not match transport list
[wvhpc02n059:120077] common_ucx.c:311 support level is none
[wvhpc02n059:120077] select: init returned failure for component ucx
  2. UCX_TLS=ud enabled both ud_verbs and ud_mlx5, so ud_verbs allowed the UCX component to be selected in the first place.
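(Command sketch for the workaround named in item 1; launcher, rank count, and application are placeholders:)

mpirun -np 2 -mca pml_ucx_tls rc_verbs,ud_verbs,rc_mlx5,dc_mlx5,cuda_ipc,rocm_ipc,ud_mlx5 -x UCX_TLS=self,sm,ud_mlx5 ./app
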
mkre commented 3 years ago

Thanks @yosefe. I just realized that the segfault actually seems to be related to Libfabric/OFI, which was selected because the Open MPI bug you pointed out prevented UCX from being used. Everything makes sense now. Thanks for your help! Feel free to close this issue. I guess there is no need to open a ticket in the Open MPI bug tracker since @karasevb is already aware of the bug, right?
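(Aside: one way to test the Libfabric/OFI hypothesis is to exclude the OFI components as well and see whether the failure disappears; which OFI component is actually in use (mtl ofi vs. btl ofi) depends on the Open MPI build, so the flags below are an assumption:)

mpirun -np 2 -mca pml ^ucx -mca mtl ^ofi -mca btl ^ofi ./app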

yosefe commented 3 years ago

I guess there is no need for opening a ticket at the Open MPI bug tracker since @karasevb is already aware of the bug, right?

Yes, that is right. Let's keep this one open for now.