intel / oneapi-containers

mpirun BAD TERMINATION (Segmentation fault) when my application needs more memory #42

Closed: shmilee closed this issue 1 year ago

shmilee commented 1 year ago

I compiled a simulation application in an Intel HPC container built from the Dockerfile below:

FROM intel/oneapi-hpckit:2023.0.0-devel-ubuntu20.04
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8 \
    UBUNTU_CODENAME=focal \
    UBUNTU_MIRROR=http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ \
    GTC_VENDOR=intel \
    GTC_HOME=/opt/gtc

RUN echo "deb $UBUNTU_MIRROR $UBUNTU_CODENAME main restricted universe multiverse" > /etc/apt/sources.list \
    && echo "deb $UBUNTU_MIRROR $UBUNTU_CODENAME-updates main restricted universe multiverse" >> /etc/apt/sources.list \
    && apt-get update \
    && apt-get install -y --no-install-recommends --no-install-suggests \
        make libncurses5-dev python \
        gfortran \
        zlib1g-dev libcurl4-openssl-dev \
    && ln -s mpif90 /opt/intel/oneapi/mpi/2021.8.0/bin/mpifort \
    && apt-get -y autoremove && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && useradd -u 1000 -g 100 -m -d ${GTC_HOME} gtc
USER gtc
WORKDIR ${GTC_HOME}
CMD ["/bin/bash"]

Then I start the container with: docker run --rm -i -t --name gtc_worker --shm-size=64gb XXX/image:tag bash. The --shm-size option was added to fix an earlier Bus error.

The application runs fine with small grids such as 50x300, but it crashes when the grid is 50x310.

After setting I_MPI_DEBUG=10, I get the following output:

[0] MPI startup(): shm segment size (128 MB per rank) * (32 local ranks) = 4125 MB total
[16] impi_shm_mbind_local(): mbind(p=0x7f92665cb000, size=1073741824) error=1 "Operation not permitted"
[0] impi_shm_mbind_local(): mbind(p=0x7f541cd0b000, size=1073741824) error=1 "Operation not permitted"
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
[0] MPI startup(): max number of MPI_Request per vci: 67108864 (pools: 1)
[0] MPI startup(): libfabric provider: tcp;ofi_rxm

[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: -1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: progress_threads: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): tag bits available: 19 (TAG_UB value: 524287) 
[0] MPI startup(): source bits available: 20 (Maximal number of rank: 1048575) 
[0] MPI startup(): Rank    Pid      Node name     Pin cpu
[0] MPI startup(): 0       439      271de697c521  {0,1,48}

[0] MPI startup(): 30      469      271de697c521  {45,46,93}
[0] MPI startup(): 31      470      271de697c521  {47,94,95}
[0] MPI startup(): I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.8.0
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=10

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 439 RUNNING AT 271de697c521
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 16 PID 455 RUNNING AT 271de697c521
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

Would changing the shm segment size (128 MB per rank) solve the issue? If so, how can I do that?
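
Before tuning the segment size, it may be worth confirming that the container's /dev/shm can actually hold what the startup log reports. The following is a minimal sketch, not a fix: the helper names shm_required_mb and shm_total_mb are hypothetical, and the inputs (32 local ranks, 128 MB per rank) are taken from the log above. The log itself reports 4125 MB, slightly more than the plain product of 4096 MB, so the estimate should be read as a lower bound.

```shell
#!/bin/bash
# Hypothetical helper: lower-bound estimate of the shared memory requested
# for a number of local ranks, assuming a fixed per-rank segment size in MB.
shm_required_mb() {
    local ranks=$1
    local per_rank_mb=$2
    echo $((ranks * per_rank_mb))
}

# Hypothetical helper: total size of /dev/shm in MB as seen from inside
# the container (df -Pk prints sizes in 1024-byte blocks in column 2).
shm_total_mb() {
    df -Pk /dev/shm 2>/dev/null | awk 'NR == 2 { print int($2 / 1024) }'
}

required=$(shm_required_mb 32 128)   # values from the I_MPI_DEBUG log
total=$(shm_total_mb)

echo "estimated shm needed: ${required} MB"
if [ -z "$total" ]; then
    echo "/dev/shm not found"
elif [ "$total" -ge "$required" ]; then
    echo "/dev/shm (${total} MB) is large enough for the estimate"
else
    echo "/dev/shm (${total} MB) is smaller than the estimate"
fi
```

With --shm-size=64gb as in the docker run command above, the check should pass comfortably, which would suggest that the segmentation fault has a different cause than shared memory exhaustion.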

shmilee commented 1 year ago

Some URLs that may be useful:
https://jp.xlsoft.com/documents/intel/mpi/2021/mpi-devguide-linux.pdf
https://www.intel.com/content/www/us/en/docs/mpi-library/developer-guide-linux/2021-6/error-message-bad-termination.html

shmilee commented 1 year ago

solved:

shmilee commented 1 year ago

Debug info from core file:

Core was generated by `./gtc'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000663925 in lap2petsc () at poisson.F90:908
908   nindex0=nindexlap
(gdb) bt
#0  0x0000000000663925 in lap2petsc () at poisson.F90:908
Backtrace stopped: Cannot access memory at address 0x7ffe1af21138
(gdb) info locals
nindexlap = <error reading variable nindexlap (value requires 547524 bytes, which is more than max-value-size)>
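
The thread does not record what the fix was. One thing the backtrace makes worth ruling out is stack exhaustion: gdb cannot read a stack address, and nindexlap alone is a 547524-byte local. The sketch below is an assumption-laden check, not the author's solution. It compares the shell's soft stack limit with the size of that one array; the helper names stack_limit_kb and stack_fits are hypothetical, and the real stack frame of lap2petsc may be much larger than this single variable.

```shell
#!/bin/bash
# Hypothetical helper: soft stack limit of the current shell,
# in kB, or the literal string "unlimited".
stack_limit_kb() {
    ulimit -s
}

# Hypothetical helper: succeed if a stack limit (kB or "unlimited")
# is at least `bytes` bytes.
stack_fits() {
    local limit_kb=$1
    local bytes=$2
    [ "$limit_kb" = "unlimited" ] && return 0
    [ $((limit_kb * 1024)) -ge "$bytes" ]
}

limit=$(stack_limit_kb)
echo "soft stack limit: ${limit} kB"

# 547524 is the size gdb reported for nindexlap.
if stack_fits "$limit" 547524; then
    echo "limit exceeds this one array; the whole frame may still overflow"
else
    echo "limit is smaller than this one array"
fi
```

If the stack does turn out to be the problem, the limit can be raised with ulimit -s inside the container before calling mpirun, or at container start with the --ulimit option of docker run (for example --ulimit stack=67108864). Whether either applies to this case is an assumption.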

BTW, here is the Bus error backtrace from MPIDI_POSIX_eager_init:

Core was generated by `./flc/test_flc'.
Program terminated with signal 7, Bus error.
#0  MPIDI_POSIX_eager_init (global_rank=1, num_global=177087) at ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_init.h:2939
2939    ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_init.h: No such file or directory.
Missing separate debuginfos, use: debuginfo-install glibc-2.17-317.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 numactl-devel-2.0.12-5.el7.x86_64
(gdb) bt
#0  MPIDI_POSIX_eager_init (global_rank=1, num_global=177087) at ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_init.h:2939
#1  0x00007f789007e520 in MPIDI_POSIX_eager_init (rank=<optimized out>, size=<optimized out>) at ../../src/mpid/ch4/shm/posix/eager/include/posix_eager_impl.h:25
#2  MPIDI_POSIX_mpi_init_hook (rank=1, size=177087, n_vcis_provided=0x0, tag_bits=0x2514e) at ../../src/mpid/ch4/shm/posix/posix_init.c:133
#3  0x00007f789018a676 in MPIDI_SHMI_mpi_init_hook (rank=1, size=177087, n_vcis_provided=0x0, tag_bits=0x2514e) at ../../src/mpid/ch4/shm/src/shm_init.c:28
#4  0x00007f788fb8a93a in MPID_Init (argc=0x1, argv=0x2b3bf, requested=0, provided=0x2514e) at ../../src/mpid/ch4/src/ch4_init.c:1293
#5  0x00007f788fea41a3 in MPIR_Init_thread (argc=0x1, argv=0x2b3bf, required=0, provided=0x2514e) at ../../src/mpi/init/initthread.c:142
#6  0x00007f788fea371b in PMPI_Init (argc=0x1, argv=0x2b3bf) at ../../src/mpi/init/init.c:140
#7  0x00007f789129b85b in pmpi_init_ (ierr=0x1) at ../../src/binding/fortran/mpif_h/initf.c:275
#8  0x000000000041201f in test () at ./flc/test.F90:10
#9  0x0000000000405ce2 in main ()
#10 0x00007f788eca4555 in __libc_start_main () from /lib64/libc.so.6
#11 0x0000000000405be3 in _start ()
(gdb)