open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

OMPI 5.0 segfault at startup with "uct_md.c:203 UCX ERROR" error unless -mca btl ^uct is added #12224

Open · tmandrus opened this issue 8 months ago

tmandrus commented 8 months ago

Hi there, my MPI program hits a segfault when running on a single InfiniBand-enabled node. I'm trying to understand whether it's related to this issue: https://github.com/open-mpi/ompi/issues/6666.

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

Open MPI 5.0.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Built from source and installed in a local directory as a non-root user. These are my configure options:

openMpiDependencies=/path/to/dependencies
${openMpiRoot}/configure \
    --prefix=/path/to/install \
    CC=/path/to/gcc \
    CXX=/path/to/g++ \
    CFLAGS=-O3 \
    CXXFLAGS=-O3 \
    --with-ofi=$openMpiDependencies/libfabric-1.19.0/install \
    --with-ofi-libdir=$openMpiDependencies/libfabric-1.19.0/install/lib \
    --with-ucx=$openMpiDependencies/ucx-1.15.0/install \
    --with-ucx-libdir=$openMpiDependencies/ucx-1.15.0/install/lib \
    --with-hwloc=$openMpiDependencies/hwloc-2.10.0 \
    --with-hwloc-libdir=$openMpiDependencies/hwloc-2.10.0/install/lib \
    --enable-sparse-groups \
    --enable-oshmem \
    --enable-show-load-errors-by-default \
    --with-libevent=internal \
    --with-pmix=internal \
    --with-prrte=internal \
    --enable-mca-dso \
    --enable-shared
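As a sanity check (not output I have captured here), ompi_info from the install prefix above should confirm whether the UCX/UCT and OFI components were actually built:

# list the MCA components compiled into this install, filtered for UCX/UCT/OFI
/path/to/install/bin/ompi_info | grep -iE "ucx|uct|ofi"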

Please describe the system on which you are running


Details of the problem

When my MPI program starts up, I see a segfault and errors from UCT. Judging from the truncated names in the errors, I'm guessing the transports should be mlx5 and mlx5_0:1, which makes me wonder whether this is a parsing issue, or whether the segfault ("address not mapped to object at address") is the same problem as the linked issue above.

In the meantime, I'll use --mca btl ^uct as a workaround. Just hoping to understand what's happening. Happy to provide any additional information.
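For completeness, this is the workaround applied to the launch command shown further down (the script path and -mpi flag are our placeholders):

# exclude only the uct BTL; the UCX PML (pml/ucx) is a separate component and is not affected
mpirun --mca btl ^uct --np 4 /path/to/launch/script -mpi openmpi5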


[1704839545.770669] [nodeName:62868:0]         uct_md.c:203  UCX  ERROR Transport 'lx5' does not exist
[1704839545.770880] [nodeName:62868:0]         uct_md.c:203  UCX  ERROR Transport '' does not exist
[1704839545.770916] [nodeName:62868:0]         uct_md.c:203  UCX  ERROR Transport 'x5_0:1' does not exist
[1704839545.770975] [nodeName:62868:0]         uct_md.c:203  UCX  ERROR Transport ':1' does not exist
[1704839545.771005] [nodeName:62868:0]         uct_md.c:203  UCX  ERROR Transport '' does not exist
[nodeName:62868:0:62868] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x83)
==== backtrace (tid:  58532) ====
 0 0x000000000004d455 ucs_debug_print_backtrace()  ???:0
 1 0x000000000001a0ea uct_ib_iface_event_fd_get()  ???:0
 2 0x000000000000c563 mca_btl_uct_query_tls()  ???:0
 3 0x00000000000059ae mca_btl_uct_component_init()  btl_uct_component.c:0
 4 0x00000000000ca789 mca_btl_base_select()  ???:0
 5 0x00000000000031c2 mca_bml_r2_component_init()  ???:0
 6 0x00000000000b790c mca_bml_base_init()  ???:0
 7 0x0000000000007fca mca_pml_ob1_component_init()  pml_ob1_component.c:0
 8 0x000000000011214e mca_pml_base_select()  ???:0
 9 0x000000000006f93e ompi_mpi_instance_init_common()  instance.c:0
10 0x00000000000700d2 ompi_mpi_instance_init()  ???:0
11 0x0000000000061f48 ompi_mpi_init()  ???:0
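If it helps, I can also compare the truncated names above against what UCX itself detects, using ucx_info from the UCX 1.15.0 install referenced in the configure line (a suggested check, not output I have collected yet):

# list the memory domains, transports, and devices UCX detects on this node
$openMpiDependencies/ucx-1.15.0/install/bin/ucx_info -d
# quick view of just the transport/device lines, to see whether mlx5 / mlx5_0:1 appear intact
$openMpiDependencies/ucx-1.15.0/install/bin/ucx_info -d | grep -E "Transport|Device"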
janjust commented 7 months ago

Looks like a parsing issue to me. What's your runtime command?

tmandrus commented 7 months ago

That's what I was thinking too.

We execute a launch script with mpirun; the script just calls our program executable and tells it to run with Open MPI 5:

mpirun --np 4 /path/to/launch/script -mpi openmpi5

I can point my MPI program at other MPIs (Open MPI 4, Intel MPI) and those work as expected, so the node environment is fine. I'm guessing it's either my build of OMPI 5 or a bug.
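If it would help narrow down where the transport string gets mangled, I can also dump the uct BTL's MCA parameters from my install (a suggested check on my side, not something I have run yet), which might show whether the transport list the component is fed already looks wrong:

# show every MCA parameter of the uct BTL at the most verbose level
/path/to/install/bin/ompi_info --param btl uct --level 9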

janjust commented 7 months ago

Any chance you could try v5.0.1?

tmandrus commented 7 months ago

Yes I can. I'll rebuild and report back.
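A rough sketch of that rebuild, assuming the v5.0.1 tarball from the standard Open MPI release location and reusing the configure options quoted earlier (with a different prefix so the two installs don't collide):

# download and unpack the 5.0.1 release tarball
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.1.tar.bz2
tar xf openmpi-5.0.1.tar.bz2
cd openmpi-5.0.1
# same configure options as the 5.0.0 build above, just a new prefix
./configure --prefix=/path/to/install-5.0.1   # append the same --with-*/--enable-* options as before
make -j 8 all
make install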