open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

mca_io_ompio_file_write_at_all() failed during parallel write of PnetCDF lib #10297

Open · dqwu opened this issue 2 years ago

dqwu commented 2 years ago

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

spack installation

Please describe the system on which you are running


Details of the problem

This issue occurs on a machine used by E3SM (e3sm.org): https://e3sm.org/model/running-e3sm/supported-machines/chrysalis-anl

modules used: gcc/9.2.0-ugetvbp openmpi/4.1.3-sxfyy4k parallel-netcdf/1.11.0-mirrcz7

The test to reproduce this issue was run with 512 MPI tasks on 8 nodes (64 tasks per node). The issue is also reproducible with modules built with the Intel compiler.

With the same test, the issue is not reproducible with Intel MPI on the same machine.

Error messages and backtrace in the output log file:

[1650397594.556260] [chr-0496:1587709:0]         cma_ep.c:87   UCX  ERROR process_vm_readv(pid=1587708 length=524288) returned -1: Bad address
[1650397596.163612] [chr-0500:609796:0]         cma_ep.c:87   UCX  ERROR process_vm_readv(pid=609795 length=524288) returned -1: No such process
...
[1650397606.028410] [chr-0496:1587764:0]         cma_ep.c:87   UCX  ERROR process_vm_readv(pid=1587763 length=524288) returned -1: No such process
srun: error: chr-0494: tasks 0,9,11,25: Killed
...
[1650397612.452233] [chr-0500:609772:0]         cma_ep.c:87   UCX  ERROR process_vm_readv(pid=609771 length=333312) returned -1: No such process
==== backtrace (tid:3089585) ====
 0 0x0000000000055969 ucs_debug_print_backtrace()  ???:0
 1 0x00000000000200a9 uct_ib_mlx5_completion_with_err()  ???:0
 2 0x0000000000042972 uct_dc_mlx5_ep_handle_failure()  ???:0
 3 0x0000000000024c1a ucp_worker_progress()  ???:0
 4 0x0000000000232b77 mca_pml_ucx_send_nbr()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/pml/ucx/pml_ucx.c:923
 5 0x0000000000232b77 mca_pml_ucx_send_nbr()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/pml/ucx/pml_ucx.c:923
 6 0x0000000000232b77 mca_pml_ucx_send()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/pml/ucx/pml_ucx.c:944
 7 0x00000000000d7c32 ompi_coll_base_sendrecv_actual()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/coll/base/coll_base_util.c:58
 8 0x00000000000d707b ompi_coll_base_sendrecv()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/coll/base/coll_base_util.h:133
 9 0x000000000010ced0 ompi_coll_tuned_allgatherv_intra_dec_fixed()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:1363
10 0x000000000016697a mca_fcoll_vulcan_file_write_all()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/fcoll/vulcan/fcoll_vulcan_file_write_all.c:418
11 0x00000000000c2b39 mca_common_ompio_file_write_at_all()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/common/ompio/common_ompio_file_write.c:452
12 0x00000000001aff57 mca_io_ompio_file_write_at_all()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/io/ompio/io_ompio_file_write.c:174
13 0x00000000000aaaae PMPI_File_write_at_all()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mpi/c/profile/pfile_write_at_all.c:75
14 0x000000000016fbc2 ncmpio_read_write()  ???:0
15 0x000000000016a7f6 mgetput()  ncmpio_wait.c:0
16 0x00000000001682bc req_aggregation()  ncmpio_wait.c:0
17 0x0000000000169e40 wait_getput()  ncmpio_wait.c:0
18 0x00000000001661a4 req_commit()  ncmpio_wait.c:0
19 0x0000000000166a0c ncmpio_wait()  ???:0
20 0x00000000000b727a ncmpi_wait_all()  ???:0
21 0x000000000046c5f9 flush_output_buffer()  ???:0
22 0x000000000042dc5e sync_file()  pio_file.c:0
23 0x000000000042df88 PIOc_closefile()  ???:0
24 0x0000000000414355 __piolib_mod_MOD_closefile()  ???:0
25 0x000000000040dd1c pioperformance_rearrtest.4019()  pioperformance_rearr.F90:0
26 0x000000000040ad15 MAIN__()  pioperformance_rearr.F90:0
27 0x0000000000411b49 main()  ???:0
28 0x00000000000237b3 __libc_start_main()  ???:0
29 0x000000000040a48e _start()  ???:0
=================================

A lock file was generated in addition to the expected output .nc file:

pioperf-rearr-2-ncomptasks-00512-niotasks-00512-stride-00001-iotype-1-nframes-00001-nvars-00500.nc
pioperf-rearr-2-ncomptasks-00512-niotasks-00512-stride-00001-iotype-1-nframes-00001-nvars-00500.nc-2399076352-875649.lock
dqwu commented 2 years ago

This test needs to be run on supercomputers or Linux clusters as it requires 512 MPI tasks.

Load an openmpi module (if one is available) for testing. If the system does not have an openmpi module available, build Open MPI from source and then run export PATH=/path/to/openmpi-installation/bin:$PATH
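If Open MPI has to be built from source, a rough sketch of a typical tarball build follows (the version, download location, UCX path, and install prefix are placeholders; configure options would need to match the system):

wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.3.tar.gz
tar zxf openmpi-4.1.3.tar.gz
cd openmpi-4.1.3

# --with-ucx is optional; point it at the system UCX installation if one is available
./configure --prefix=/path/to/openmpi-installation --with-ucx=/path/to/ucx
make -j 8
make install

export PATH=/path/to/openmpi-installation/bin:$PATH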

Build and install the PnetCDF library with Open MPI (if no suitable module is available), e.g. into /path/to/pnetcdf-installation
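A minimal sketch of a typical PnetCDF autotools build against the Open MPI compiler wrappers (the tarball name, version, and install prefix are placeholders, not the exact ones used on this machine):

tar zxf pnetcdf-1.11.0.tar.gz
cd pnetcdf-1.11.0

# Point the build at the MPI compiler wrappers from the Open MPI installation above
./configure MPICC=mpicc MPICXX=mpicxx MPIF77=mpifort MPIF90=mpifort \
            --prefix=/path/to/pnetcdf-installation
make -j 8
make install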

Steps to build the test

git clone --branch dqwu/test_openmpi https://github.com/E3SM-Project/scorpio.git
cd scorpio

tar zxf ne30_F_case_48602x72_512p.tar.gz

mkdir build
cd build

CC=mpicc CXX=mpicxx FC=mpifort \
cmake -Wno-dev \
-DWITH_NETCDF=OFF \
-DPnetCDF_PATH=/path/to/pnetcdf-installation \
-DPIO_USE_MALLOC=ON \
-DPIO_ENABLE_TESTS=ON \
..

make
cd tests/performance
make pioperf_rearr

cp ../../../ne30_F_case_48602x72_512p.dat ./

Submit a batch job or use an interactive job to execute the following command

mpirun -n 512 ./pioperf_rearr --pio-niotasks=512 --pio-nframes=1 --pio-nvars=500 \
--pio-decompfiles=\"./ne30_F_case_48602x72_512p.dat\" --pio-types=pnetcdf --pio-rearrangers=2
brminich commented 2 years ago

@dqwu, is there any error like Transport retry count exceeded on ...? Also, are these errors seen on specific nodes, or can the failing nodes vary from run to run? In general this looks like a networking issue (DC communication failed, which led to the process termination and CMA errors from its peers).

dqwu commented 2 years ago

> @dqwu, is there any error like Transport retry count exceeded on ...? Also, are these errors seen on specific nodes, or can the failing nodes vary from run to run? In general this looks like a networking issue (DC communication failed, which led to the process termination and CMA errors from its peers).

@brminich Yes, the log file also contains this error message: [chr-0495:3089585:0:3089585] ib_mlx5_log.c:143 Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)

The failing nodes can vary from run to run, as this issue was reproduced with both the debug queue and the compute queue.

jsquyres commented 2 years ago

This sounds like it is turning into a UCX issue...? FYI @open-mpi/ucx

hoopoepg commented 2 years ago

Hi, thank you for the bug report.

It seems UCX detects an error on the network and reports it to OMPI. As a result, OMPI terminates the failed rank, but a neighboring rank (on the same node) may still try to communicate with the terminated rank via the SHM transport using CMA and gets an error on the process_vm_readv call.

Is it possible to get logs from the same run in which both the Transport retry count exceeded and process_vm_readv messages appear?

Thank you

dqwu commented 2 years ago

> Hi, thank you for the bug report.
>
> It seems UCX detects an error on the network and reports it to OMPI. As a result, OMPI terminates the failed rank, but a neighboring rank (on the same node) may still try to communicate with the terminated rank via the SHM transport using CMA and gets an error on the process_vm_readv call.
>
> Is it possible to get logs from the same run in which both the Transport retry count exceeded and process_vm_readv messages appear?

@hoopoepg I have rerun the test and the full log can be viewed at the following URL: https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.193176.out

hoopoepg commented 2 years ago

Hi @dqwu, we don't see timestamps on the error messages... were they cut by a grep-like utility?

dqwu commented 2 years ago

> Hi @dqwu, we don't see timestamps on the error messages... were they cut by a grep-like utility?

@hoopoepg test_openmpi.193176.out is the original output log file of the Slurm job submitted to run this test. No changes have been made to this log file. The submitted job script is:

#!/bin/bash

#SBATCH -t 00:05:00
#SBATCH -A e3sm
#SBATCH --job-name=test_openmpi
#SBATCH --nodes=8
#SBATCH --output=test_openmpi.%j.out
#SBATCH --error=test_openmpi.%j.out
#SBATCH --partition=debug

cd $SLURM_SUBMIT_DIR

source /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/gcc-9.3.0/lmod-8.3-5be73rg/lmod/lmod/init/sh
module purge 
module load gcc/9.2.0-ugetvbp openmpi/4.1.3-sxfyy4k parallel-netcdf/1.11.0-mirrcz7

srun -n 512 -N 8 ./pioperf_rearr --pio-niotasks=512 --pio-nframes=1 --pio-nvars=500 \
--pio-decompfiles=\"./ne30_F_case_48602x72_512p.dat\" --pio-types=pnetcdf --pio-rearrangers=2
hoopoepg commented 2 years ago

Hmmm, timestamps are missing... not good. OK, could you run the application with the variables UCX_TLS=ib UCX_LOG_LEVEL=info

and check whether the error Transport retry count exceeded is present in the logs (and post the logs here)?

thank you
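Since the job is launched with srun, these UCX settings are plain environment variables that can be exported in the batch script shown earlier, before the srun line. A minimal sketch (the srun arguments are unchanged from the script above):

# Added to the Slurm batch script, after the module loads and before srun
export UCX_TLS=ib
export UCX_LOG_LEVEL=info

srun -n 512 -N 8 ./pioperf_rearr --pio-niotasks=512 --pio-nframes=1 --pio-nvars=500 \
--pio-decompfiles=\"./ne30_F_case_48602x72_512p.dat\" --pio-types=pnetcdf --pio-rearrangers=2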

dqwu commented 2 years ago

> UCX_TLS=ib UCX_LOG_LEVEL=info

Here is the latest log file: https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195239.out

yosefe commented 2 years ago

@dqwu can you pls try the following (separate) experiments to understand the problem:

  1. Set UCX_TLS=rc to disable the DC transport, to see if the issue is DC-related
  2. Set UCX_RC_MAX_RD_ATOMIC=1 to see if it's related to exceeding the max number of RDMA_READ operations on the QP
  3. Set UCX_RC_PATH_MTU=1024 to see if it's related to different MTU settings
  4. Check the FW version with ibv_devinfo -v | grep fw_ver to see if a FW upgrade could help
dqwu commented 2 years ago

> @dqwu can you pls try the following (separate) experiments to understand the problem:
>
> 1. Set `UCX_TLS=rc` to disable the DC transport, to see if the issue is DC-related
>
> 2. Set `UCX_RC_MAX_RD_ATOMIC=1` to see if it's related to exceeding the max number of RDMA_READ operations on the QP
>
> 3. Set `UCX_RC_PATH_MTU=1024` to see if it's related to different MTU settings
>
> 4. Check the FW version with `ibv_devinfo -v | grep fw_ver` to see if a FW upgrade could help

Experiments 1-3 all failed; please see the log files below.

[UCX_TLS=rc] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195609.out

[UCX_RC_MAX_RD_ATOMIC=1] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195610.out

[UCX_RC_PATH_MTU=1024] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195611.out

FW version

ibv_devinfo -v | grep fw_ver
    fw_ver:             20.29.2002
    fw_ver:             16.27.2008
    fw_ver:             16.27.2008
yosefe commented 2 years ago

@dqwu can you please also try with UCX_TLS=sm,tcp and UCX_TLS=tcp, and post the full output of ibv_devinfo -v?

dqwu commented 2 years ago


These two tests also failed.

[UCX_TLS=sm,tcp] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195622.out

[UCX_TLS=tcp] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195628.out

[ibv_devinfo] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/ibv_devinfo.txt

yosefe commented 2 years ago

Since it fails on TCP as well, it seems not related to the IB fabric. Can you try these separate experiments:

  1. -mca pml ob1 -mca btl self,vader,tcp -mca pml ^ucx to run Open MPI without the UCX component. In this case you may need to add -mca btl_tcp_if_include <netif> to specify the Ethernet device to use
  2. -x UCX_TLS=tcp -mca coll ^hcoll to disable the hcoll collective component
dqwu commented 2 years ago

The test is run with Slurm srun instead of mpirun. Should I use environment variables?

[-mca pml ob1 -mca btl self,vader,tcp -mca pml ^ucx]
export OMPI_MCA_pml=ob1,^ucx
export OMPI_MCA_btl=self,vader,tcp
Is this correct?

[-x UCX_TLS=tcp -mca coll ^hcoll]
export UCX_TLS=tcp
export OMPI_MCA_coll=^hcoll
Is this correct?

yosefe commented 2 years ago

1. export OMPI_MCA_pml=ob1
   export OMPI_MCA_btl=self,vader,tcp

2. export OMPI_MCA_coll=^hcoll
   export UCX_TLS=tcp
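For clarity, a sketch of how these would look in the batch script, with one set of exports per submitted job (the pioperf_rearr arguments are the same as in the script above):

# Job 1: ob1 PML with the self/vader/tcp BTLs (the UCX PML is not used)
export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=self,vader,tcp
srun -n 512 -N 8 ./pioperf_rearr ...   # same pioperf_rearr arguments as above

# Job 2 (submitted separately): UCX over TCP with hcoll disabled
export OMPI_MCA_coll=^hcoll
export UCX_TLS=tcp
srun -n 512 -N 8 ./pioperf_rearr ...   # same pioperf_rearr arguments as above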

dqwu commented 2 years ago

> 1. export OMPI_MCA_pml=ob1
>    export OMPI_MCA_btl=self,vader,tcp
>
> 2. export OMPI_MCA_coll=^hcoll
>    export UCX_TLS=tcp

These two tests still failed.

[OMPI_MCA_pml=ob1 OMPI_MCA_btl=self,vader,tcp] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195638.out

[OMPI_MCA_coll=^hcoll UCX_TLS=tcp] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195640.out

edgargabriel commented 2 years ago

Can you also try two things to change some settings for the parallel I/O part?

  1. Force using a different fcoll component, e.g. export OMPI_MCA_fcoll=dynamic

  2. Force using a different io component, e.g. export OMPI_MCA_io=^ompio

dqwu commented 2 years ago

[OMPI_MCA_fcoll=dynamic] This test still failed https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195680.out

[OMPI_MCA_io=^ompio] This test passed https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195718.out

edgargabriel commented 2 years ago

OK, thank you. Maybe you can use this last flag as a workaround for now. Is there a way to reproduce this issue with a smaller process count as well? It would help me a lot in trying to identify the problem. Since two fcoll components (vulcan, dynamic) show this issue, I doubt the problem is in the fcoll component itself; it is more likely at another level (e.g. setting the file view or similar).
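For anyone hitting this in the meantime, a sketch of the workaround applied to the batch script from earlier in the thread (excluding ompio should make MPI-IO fall back to the ROMIO-based io component, assuming it was built in):

# Workaround from this thread: do not use the ompio io component
export OMPI_MCA_io=^ompio

srun -n 512 -N 8 ./pioperf_rearr --pio-niotasks=512 --pio-nframes=1 --pio-nvars=500 \
--pio-decompfiles=\"./ne30_F_case_48602x72_512p.dat\" --pio-types=pnetcdf --pio-rearrangers=2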

dqwu commented 2 years ago

> OK, thank you. Maybe you can use this last flag as a workaround for now. Is there a way to reproduce this issue with a smaller process count as well? It would help me a lot in trying to identify the problem. Since two fcoll components (vulcan, dynamic) show this issue, I doubt the problem is in the fcoll component itself; it is more likely at another level (e.g. setting the file view or similar).

@edgargabriel I do not have a simpler reproducer so far. You can try to follow the detailed steps provided in the first comment.

jayeshkrishna commented 2 years ago

Just checking, has this issue been fixed? (If not, do you have an approximate timeline for when it will be fixed?)

edgargabriel commented 2 years ago

I am really sorry, but I don't have a platform to reproduce the issue, and parallel I/O is at this point not part of my job description at my new position, so my resources for resolving this issue are somewhat limited. I would recommend sticking with the workaround mentioned previously on this ticket (export OMPI_MCA_io=^ompio).