openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Segmentation fault in HCOLL/UCX #7391

Open benmenadue opened 3 years ago

benmenadue commented 3 years ago

Describe the bug

Some applications are failing with a segfault in the HCOLL callback mcast_ucx_recv_completion_cb. This was reported to Mellanox Support (case 00956842, since that callback is part of HCOLL), and they suggested opening an issue here as well. The traceback is:

[gadi-gpu-v100-0036:2629319:0:2629319] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x39)
==== backtrace (tid:2629319) ====
 0 0x0000000000012b20 .annobin_sigaction.c()  sigaction.c:0
 1 0x000000000001de09 mcast_ucx_recv_completion_cb()  ???:0
 2 0x000000000005ec69 ucp_eager_only_handler()  ???:0
 3 0x000000000004de0c uct_dc_mlx5_iface_progress_ll()  :0
 4 0x0000000000038cda ucp_worker_progress()  ???:0
 5 0x0000000000015c76 hmca_bcol_ucx_p2p_progress_fast()  bcol_ucx_p2p_component.c:0
 6 0x0000000000063223 hcoll_ml_progress_impl()  ???:0
 7 0x00000000001f2ec3 opal_progress()  /jobfs/26573579.gadi-pbs/0/openmpi/4.1.1/gcc-opt/opal/../../source/openmpi-4.1.1/opal/runtime/opal_progress.c:231
 8 0x0000000000008c10 wait_callback()  vmc.c:0
 9 0x000000000001f0d4 mcast_p2p_recv()  bcol_ucx_p2p_module.c:0
10 0x000000000000c44d do_bcast()  vmc.c:0
11 0x000000000000d6c1 vmc_bcast_multiroot()  ???:0
12 0x00000000000030c0 hmca_mcast_vmc_bcast_multiroot()  mcast_vmc.c:0
13 0x0000000000013710 hmca_bcol_ucx_p2p_bcast_mcast_multiroot()  ???:0
14 0x00000000000156cf hmca_bcol_ucx_p2p_barrier_selector_init()  bcol_ucx_p2p_barrier.c:0
15 0x0000000000049a05 hmca_coll_ml_barrier_intra()  ???:0
16 0x00000000001b4f1a mca_coll_hcoll_barrier()  /jobfs/26573579.gadi-pbs/0/openmpi/4.1.1/gcc-opt/ompi/../../../../../source/openmpi-4.1.1/ompi/mca/coll/hcoll/coll_hcoll_ops.c:29
17 0x00000000002254b8 PMPI_Barrier()  /jobfs/26573579.gadi-pbs/0/openmpi/4.1.1/gcc-opt/ompi/pbarrier.c:74
18 0x00000000002254b8 PMPI_Barrier()  /jobfs/26573579.gadi-pbs/0/openmpi/4.1.1/gcc-opt/ompi/pbarrier.c:40
19 0x00000000004b5710 amrex::ParallelDescriptor::Barrier()  ???:0
20 0x000000000046071e AMRSimulation<ShellProblem>::evolve()  ???:0
21 0x0000000000423c98 problem_main()  ???:0
22 0x000000000041b110 main()  ???:0
23 0x0000000000023493 __libc_start_main()  ???:0
24 0x000000000041f66e _start()  ???:0
=================================
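
For context, the failing path in the backtrace is an MPI_Barrier issued through Open MPI's hcoll collective component, which progresses HCOLL's UCX point-to-point BCOL. Below is a minimal sketch that exercises the same call path; it is not the original application, it assumes HCOLL is the active collective component, and it likely needs comparable scale and multicast settings to actually trigger the crash.

/* Hypothetical reduced test, not the application from this report: the
 * backtrace ends in PMPI_Barrier -> mca_coll_hcoll_barrier, so a tight
 * barrier loop run with hcoll enabled drives the same code path. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Repeatedly hit the collective path shown in the traceback. */
    for (int i = 0; i < 100000; i++)
        MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("all barriers completed\n");

    MPI_Finalize();
    return 0;
}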

Steps to Reproduce

For the above traceback:

Setup and versions

ibv_devinfo.txt (attached)

yosefe commented 3 years ago

@benmenadue For now, we'll continue working on this through Mellanox Support as an HCOLL issue.

vspetrov commented 2 years ago

@benmenadue Is it possible to get a reproducer? How many nodes are used in the run?

BenWibking commented 2 years ago

You should be able to reproduce this by running https://github.com/BenWibking/quokka/blob/development/scripts/shell-64nodes.pbs (without the flag to disable hcoll multicast, of course). This was a 64-node run with 4 GPUs per node.

vspetrov commented 2 years ago

Is it possible to run the app on CPU? Is it only reproduced when running on GPU?

BenWibking commented 2 years ago

It can be run on CPU as well. I haven't tried running at that scale on CPU. The crashes also appear to be somewhat nondeterministic.