Open tmandrus opened 8 months ago
Looks like a parsing issue to me. What's your runtime command?
That's what I was thinking too.
We execute a launch script with mpirun, and the script just calls our program executable and tells it to run with Open MPI 5.
mpirun --np 4 /path/to/launch/script -mpi openmpi5
I can point my MPI program to other MPIs (Open MPI 4, Intel MPI) and these work as expected, so the node environment is good. I'm guessing it's either my build of OMPI5 or a bug.
Any chance you could try v5.0.1?
Yes I can. I'll rebuild and report back.
Hi there, my MPI program hits a seg fault when running on a single InfiniBand-enabled node. I'm trying to understand whether it's related to this issue: https://github.com/open-mpi/ompi/issues/6666.
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
Open MPI 5.0.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built and installed in a local directory as a non-root user. These are my config options:
openMpiDependencies=/path/to/dependencies
${openMpiRoot}/configure \
    --prefix=/path/to/install \
    CC=/path/to/gcc \
    CXX=/path/to/g++ \
    CFLAGS=-O3 \
    CXXFLAGS=-O3 \
    --with-ofi=$openMpiDependencies/libfabric-1.19.0/install \
    --with-ofi-libdir=$openMpiDependencies/libfabric-1.19.0/install/lib \
    --with-ucx=$openMpiDependencies/ucx-1.15.0/install \
    --with-ucx-libdir=$openMpiDependencies/ucx-1.15.0/install/lib \
    --with-hwloc=$openMpiDependencies/hwloc-2.10.0 \
    --with-hwloc_libdir=$openMpiDependencies/hwloc-2.10.0/install/lib \
    --enable-sparse-groups \
    --enable-oshmem \
    --enable-show-load-errors-by-default \
    --with-libevent=internal \
    --with-pmix=internal \
    --with-prrte=internal \
    --enable-mca-dso \
    --enable-shared
Please describe the system on which you are running
Details of the problem
When my MPI program starts up, I see a segfault and errors from uct. I'm guessing the transport should be mlx5 or mlx5_0:1, so it makes me wonder whether it's a parsing issue? Or whether the seg fault message "address not mapped to object at address" is the same as the linked issue above.

In the meantime, I'll use --mca btl ^uct as a workaround. Just hoping to understand what's happening. Happy to provide any additional information.
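For anyone wanting to try the same workaround, here is a rough sketch of how the --mca btl ^uct option might be combined with the mpirun command quoted in this thread. The variable names (launch_script, np, workaround_cmd) are made up for illustration, and the path and the -mpi openmpi5 argument are the placeholders from the report, not real values. The snippet only prints the command (a dry run) and does not launch anything.

```shell
# Placeholder values taken from the command quoted in this thread.
launch_script=/path/to/launch/script
np=4

# "^uct" asks the MCA framework to exclude the uct BTL component.
# The value is kept inside double quotes so the caret is passed through literally.
workaround_cmd="mpirun --np ${np} --mca btl ^uct ${launch_script} -mpi openmpi5"

# Dry run: show the command line instead of executing it.
echo "${workaround_cmd}"
```

The same exclusion can usually be set through the environment instead (export OMPI_MCA_btl=^uct before calling mpirun), which may be more convenient when the mpirun line lives inside a script that is awkward to edit.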