Open yurivict opened 2 weeks ago
Please provide all the information from the debug issue template; thanks!
https://github.com/open-mpi/ompi/blob/main/.github/ISSUE_TEMPLATE/bug_report.md
I added missing bits of information.
The root cause could be insufficient space in /tmp (unlikely per your description), or something went wrong when checking the size.
Try running
env OMPI_MCA_shmem_base_verbose=100 ./hello-world-1
and check the output (useful messages might have been compiled out, though).
If there is nothing useful, you can run
strace -o hw.strace -s 512 ./hello-world-1
then compress hw.strace and upload it.
env OMPI_MCA_shmem_base_verbose=100 ./hello-world-1
This didn't produce anything relevant.
strace -o hw.strace -s 512 ./hello-world-1
BSDs have ktrace instead. Here is the ktrace dump: https://freebsd.org/~yuri/openmpi-kernel-dump.txt
51253 hello-world-1 CALL fstatat(AT_FDCWD,0x1b0135402080,0x4c316d20,0)
51253 hello-world-1 NAMI "/tmp/ompi.yv.0/jf.0/2909405184"
51253 hello-world-1 RET fstatat -1 errno 2 No such file or directory
51253 hello-world-1 CALL open(0x1b0135402080,0x120004<O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC>)
51253 hello-world-1 NAMI "/tmp/ompi.yv.0/jf.0/2909405184"
51253 hello-world-1 RET open -1 errno 2 No such file or directory
It looks like some directories were not created.
What if you run mpirun -np 1 ./hello-world-1 instead?
sudo mpirun -np 1 ./hello-world-1
prints the same error message:
It appears as if there is not enough space for /dev/shm/sm_segment.yv.0.9f060000.0 (the shared-memory backing
file). It is likely that your MPI job will now either abort or experience
performance degradation.
The log doesn't contain any mkdir operations, so "/tmp/ompi.yv.0" was never created.
Well, this is a different message than the one reported when opening this issue, and this one is self-explanatory.
Anyway, what if you run
env OMPI_MCA_shmem_mmap_backing_file_base_dir=/tmp ./hello-world-1
Alternatively, you can simply increase the size of /dev/shm.
sudo OMPI_MCA_shmem_mmap_backing_file_base_dir=/tmp ./hello-world-1
produces the same error messages.
This message is for a regular user:
$ OMPI_MCA_shmem_mmap_backing_file_base_dir=/tmp ./hello-world-1
[yv.noip.me:88431] shmem: mmap: an error occurred while determining whether or not /tmp/ompi.yv.1001/jf.0/1653407744/sm_segment.yv.1001.628d0000.0 could be created.
> Hello world from processor yv.noip.me, rank 0 out of 1 processors (pid=88431)
< Hello world from processor yv.noip.me, rank 0 out of 1 processors (pid=88431)
This message is for root:
# OMPI_MCA_shmem_mmap_backing_file_base_dir=/tmp ./hello-world-1
--------------------------------------------------------------------------
It appears as if there is not enough space for /dev/shm/sm_segment.yv.0.ee540000.0 (the shared-memory backing
file). It is likely that your MPI job will now either abort or experience
performance degradation.
Local host: yv
Space Requested: 16777216 B
Space Available: 1024 B
--------------------------------------------------------------------------
> Hello world from processor yv.noip.me, rank 0 out of 1 processors (pid=88929)
< Hello world from processor yv.noip.me, rank 0 out of 1 processors (pid=88929)
I see.
Try adding OMPI_MCA_btl_sm_backing_directory=/tmp and see how it works.
The error messages disappear when OMPI_MCA_btl_sm_backing_directory=/tmp is used.
We have seen and responded to this problem many times; I believe it is covered in the docs somewhere. The problem is that BSD (mostly as seen on Mac) has created a default TMPDIR that is incredibly long. So when we add our tmpdir prefix (to avoid stepping on other people's tmp), the result is longer than the path length limits.
Solution: set TMPDIR in your environment to point to some shorter path, typically something like $HOME/tmp.
[...] a default TMPDIR that is incredibly long [...]
What do you mean by TMPDIR? In our case TMPDIR is just /tmp.
Indeed, it seems the root cause is something fishy related to /dev/shm. What do you get from
df -h /dev/shm
both as a user and as root?
$ df -h /dev/shm
Filesystem    Size    Used   Avail Capacity  Mounted on
devfs         1.0K      0B    1.0K     0%    /dev
# df -h /dev/shm
Filesystem    Size    Used   Avail Capacity  Mounted on
devfs         1.0K      0B    1.0K     0%    /dev
That's indeed a small /dev/shm.
I still do not understand why running as a user does not get you the user-friendly message you get when running as root.
Can you run ktrace as a non-root user so we can figure out where the failure occurs?
It seems regular users do not have write access to the (small) /dev/shm, and we do not display a friendly error message about it:
45163 hello-world-1 CALL access(0x4e3d8d33,0x2<W_OK>)
45163 hello-world-1 NAMI "/dev/shm"
45163 hello-world-1 RET access -1 errno 13 Permission denied
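The two failures seen in this thread (a backing directory that a regular user cannot write to, and one that reports only 1024 B available when 16777216 B are requested) can be checked by hand before launching a job. The following is a minimal sketch of such a check. check_backing_dir is a hypothetical helper, not part of Open MPI, and it only approximates whatever checks Open MPI performs internally.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): report whether a directory
# could plausibly hold a shared-memory backing file of the given size.
# Usage: check_backing_dir DIR BYTES_NEEDED
check_backing_dir() {
    dir=$1
    need_bytes=$2

    if [ ! -d "$dir" ]; then
        echo "missing"
        return 1
    fi
    # Roughly the same question as the access(W_OK) call in the ktrace output.
    if [ ! -w "$dir" ]; then
        echo "not-writable"
        return 1
    fi
    # df -P prints one POSIX-format line per filesystem; with -k the
    # fourth column is the available space in 1024-byte blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    # Round the request up to whole 1024-byte blocks.
    need_kb=$(( (need_bytes + 1023) / 1024 ))
    if [ "$avail_kb" -lt "$need_kb" ]; then
        echo "too-small"
        return 1
    fi
    echo "ok"
}

# Example: check_backing_dir /dev/shm 16777216
```

With the df output shown earlier (1.0K available), this sketch would be expected to report not-writable for a regular user and too-small for root, which lines up with the two different messages in this thread.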
Unless you change the permissions on /dev/shm, your best bet is probably to add
btl_sm_backing_directory=/tmp
to your $PREFIX/etc/openmpi-mca-params.conf
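As a sketch of how that setting could be made persistent, the snippet below appends the parameter to an MCA parameter file only if it is not already assigned there. set_mca_param is a hypothetical helper, and the path in the usage comment is an assumption; substitute the $PREFIX of your own installation.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): add "NAME = VALUE" to an
# MCA parameter file unless NAME is already assigned there.
# Usage: set_mca_param FILE NAME VALUE
set_mca_param() {
    conf=$1
    name=$2
    value=$3

    # Leave an existing (uncommented) assignment untouched.
    if [ -f "$conf" ] && grep -q "^[[:space:]]*${name}[[:space:]]*=" "$conf"; then
        echo "unchanged"
        return 0
    fi
    printf '%s = %s\n' "$name" "$value" >> "$conf"
    echo "added"
}

# Example (path is an assumption; adjust $PREFIX, and run as root if needed):
#   set_mca_param "$PREFIX/etc/openmpi-mca-params.conf" btl_sm_backing_directory /tmp
```

For a single run, the environment variable form already confirmed above (OMPI_MCA_btl_sm_backing_directory=/tmp) does the same thing without touching any file.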
Is direct access to /dev/shm new in OpenMPI? It used to work fine on FreeBSD.
How does this work on Linux? Is everybody allowed write access to /dev/shm there?
See the program below.
---program---
Version: openmpi-5.0.5_1
Describe how Open MPI was installed: FreeBSD package
Computer hardware: Intel CPU
Network type: Ethernet/IP (irrelevant)
Available space in /tmp: 64GB
FreeBSD 14.1