dqwu opened this issue 2 years ago
This test needs to be run on a supercomputer or Linux cluster, as it requires 512 MPI tasks.
Load an openmpi module (if one is available) for testing. If the system does not have an openmpi module available, build Open MPI from source and then run
export PATH=/path/to/openmpi-installation/bin:$PATH
Build and install the PnetCDF library with Open MPI (if there is no available module), e.g. to /path/to/pnetcdf-installation
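A quick sanity check that the shell picks up the intended Open MPI installation (a minimal sketch; the installation path is a placeholder):
which mpicc        # should resolve to /path/to/openmpi-installation/bin/mpicc
mpirun --version   # should report the expected Open MPI release (e.g. 4.1.x)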
Steps to build the test
git clone --branch dqwu/test_openmpi https://github.com/E3SM-Project/scorpio.git
cd scorpio
tar zxf ne30_F_case_48602x72_512p.tar.gz
mkdir build
cd build
CC=mpicc CXX=mpicxx FC=mpifort \
cmake -Wno-dev \
-DWITH_NETCDF=OFF \
-DPnetCDF_PATH=/path/to/pnetcdf-installation \
-DPIO_USE_MALLOC=ON \
-DPIO_ENABLE_TESTS=ON \
..
make
cd tests/performance
make pioperf_rearr
cp ../../../ne30_F_case_48602x72_512p.dat ./
Submit a batch job or use an interactive job to execute the following command
mpirun -n 512 ./pioperf_rearr --pio-niotasks=512 --pio-nframes=1 --pio-nvars=500 \
--pio-decompfiles=\"./ne30_F_case_48602x72_512p.dat\" --pio-types=pnetcdf --pio-rearrangers=2
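One way to obtain such an interactive allocation on a Slurm system is sketched below; the account, partition, and wall time are assumptions taken from the batch script shown later in this thread:
salloc -A e3sm -p debug -N 8 -n 512 -t 00:30:00
# then run the mpirun command above from inside the allocation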
@dqwu, is there any error like Transport retry count exceeded on ...?
Also, are these errors seen on specific nodes, or can the failing nodes vary from run to run?
In general this looks like a networking issue (DC communication failed, which led to the process termination and CMA errors from its peers).
@brminich Yes, the log file also contains this error message:
[chr-0495:3089585:0:3089585] ib_mlx5_log.c:143 Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
The failing nodes can vary from run to run, as this issue was reproduced with both the debug queue and the compute queue.
This sounds like it is turning into a UCX issue...? FYI @open-mpi/ucx
hi, thank you for the bug report.
it seems UCX detects an error on the network and reports it to OMPI. As a result, OMPI terminates the failed rank, but a neighbor rank (on the same node) may try to communicate with the terminated rank via the SHM transport using CMA and gets an error on the process_vm_readv call.
is it possible to get logs from the same run where the Transport retry count exceeded on and process_vm_readv messages appear?
thank you
@hoopoepg I have rerun the test and the full log can be viewed at the following URL: https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.193176.out
hi @dqwu we don't see timestamps on error messages... was it cut by any grep-like utility?
@hoopoepg test_openmpi.193176.out is the original output log file of the Slurm job submitted to run this test. No changes have been made to this log file.
#!/bin/bash
#SBATCH -t 00:05:00
#SBATCH -A e3sm
#SBATCH --job-name=test_openmpi
#SBATCH --nodes=8
#SBATCH --output=test_openmpi.%j.out
#SBATCH --error=test_openmpi.%j.out
#SBATCH --partition=debug
cd $SLURM_SUBMIT_DIR
source /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/gcc-9.3.0/lmod-8.3-5be73rg/lmod/lmod/init/sh
module purge
module load gcc/9.2.0-ugetvbp openmpi/4.1.3-sxfyy4k parallel-netcdf/1.11.0-mirrcz7
srun -n 512 -N 8 ./pioperf_rearr --pio-niotasks=512 --pio-nframes=1 --pio-nvars=500 \
--pio-decompfiles=\"./ne30_F_case_48602x72_512p.dat\" --pio-types=pnetcdf --pio-rearrangers=2
hmmm, timestamps are missing... not good. ok, could you run the application with the variables UCX_TLS=ib UCX_LOG_LEVEL=info and check if the error Transport retry count exceeded is present in the logs (and post the logs here)?
thank you
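A sketch of how these variables could be added to the batch script above (assuming the same Slurm setup; the srun line is unchanged):
export UCX_TLS=ib
export UCX_LOG_LEVEL=info
srun -n 512 -N 8 ./pioperf_rearr --pio-niotasks=512 --pio-nframes=1 --pio-nvars=500 \
--pio-decompfiles=\"./ne30_F_case_48602x72_512p.dat\" --pio-types=pnetcdf --pio-rearrangers=2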
Here is the latest log file: https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195239.out
@dqwu can you please try the following (separate) experiments to help understand the problem:
1. Set `UCX_TLS=rc` to disable the DC transport, to see if the issue is DC-related
2. Set `UCX_RC_MAX_RD_ATOMIC=1` to see if it is related to exceeding the maximum number of RDMA_READ operations on the QP
3. Set `UCX_RC_PATH_MTU=1024` to see if it is related to different MTU settings
4. Check the firmware version with `ibv_devinfo -v | grep fw_ver` to see if a FW upgrade could help
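Each of the first three experiments would be a separate job that exports one setting before the same srun command used in the batch script above (a sketch, not from the original thread; enable one variable at a time):
export UCX_TLS=rc                # experiment 1: restrict UCX to the RC transport (disables DC)
# export UCX_RC_MAX_RD_ATOMIC=1  # experiment 2: limit outstanding RDMA_READ operations per QP
# export UCX_RC_PATH_MTU=1024    # experiment 3: force a smaller path MTU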
All three experiments failed; please see the log files below.
[UCX_TLS=rc] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195609.out
[UCX_RC_MAX_RD_ATOMIC=1] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195610.out
[UCX_RC_PATH_MTU=1024] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195611.out
FW version
ibv_devinfo -v | grep fw_ver
fw_ver: 20.29.2002
fw_ver: 16.27.2008
fw_ver: 16.27.2008
@dqwu can you please also try with UCX_TLS=sm,tcp and UCX_TLS=tcp, and post the full output from ibv_devinfo -v?
These two tests also failed.
[UCX_TLS=sm,tcp] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195622.out
[UCX_TLS=tcp] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195628.out
[ibv_devinfo] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/ibv_devinfo.txt
Since it fails on TCP as well, this seems not related to the IB fabric. Can you try these separate experiments:
1. -mca pml ob1 -mca btl self,vader,tcp -mca pml ^ucx to run Open MPI without the UCX component. In this case you may also need to add -mca btl_tcp_if_include <netif> to specify the Ethernet device to use.
2. -x UCX_TLS=tcp -mca coll ^hcoll to disable the hcoll collective component.
The test is run with SLURM srun instead of mpirun. Should I use environment variables instead?
[-mca pml ob1 -mca btl self,vader,tcp -mca pml ^ucx] export OMPI_MCA_pml=ob1,^ucx and export OMPI_MCA_btl=self,vader,tcp. Is this correct?
[-x UCX_TLS=tcp -mca coll ^hcoll] export UCX_TLS=tcp and export OMPI_MCA_coll=^hcoll. Is this correct?
1. export OMPI_MCA_pml=ob1; export OMPI_MCA_btl=self,vader,tcp
2. export OMPI_MCA_coll=^hcoll; export UCX_TLS=tcp
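A sketch of what the two runs might look like in the batch script, relying on the Open MPI convention of setting MCA parameters through OMPI_MCA_* environment variables (the srun line stays the same as in the script above):
# Run A: ob1 PML with the self/vader/tcp BTLs, bypassing UCX
export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=self,vader,tcp
# Run B (separate job): keep UCX on TCP only and disable hcoll
# export OMPI_MCA_coll=^hcoll
# export UCX_TLS=tcp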
These two tests still failed.
[OMPI_MCA_pml=ob1 OMPI_MCA_btl=self,vader,tcp] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195638.out
[OMPI_MCA_coll=^hcoll UCX_TLS=tcp] https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195640.out
Can you also try two things to change some settings for the parallel I/O part?
1. Force a different fcoll component, e.g. export OMPI_MCA_fcoll=dynamic
2. Force a different io component, e.g. export OMPI_MCA_io=^ompio
[OMPI_MCA_fcoll=dynamic] This test still failed: https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195680.out
[OMPI_MCA_io=^ompio] This test passed: https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195718.out
ok, thank you. Maybe you can use this last flag as a workaround for now. Is there a way to reproduce this issue with a smaller process count as well? It would help me a lot in trying to identify the problem. Since two fcoll components (vulcan, dynamic) show this issue, I doubt that the problem is in the fcoll component itself; it is probably on another level (e.g. setting the file view or similar).
@edgargabriel I do not have a simpler reproducer so far. You can try to follow the detailed steps provided in the first comment.
Just checking, has this issue been fixed? (If not, do you have an approximate timeline for when it will be fixed?)
I am really sorry, but I don't have a platform to reproduce the issue, and parallel I/O is at this point also not part of my work description at my new job, so my resources for resolving this issue are somewhat limited. I would recommend sticking with the workaround that was mentioned previously on the ticket.
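For reference, a sketch of applying that workaround in the batch script used in this issue (the only change is the added export; excluding ompio should make Open MPI fall back to its ROMIO-based io component):
export OMPI_MCA_io=^ompio
srun -n 512 -N 8 ./pioperf_rearr --pio-niotasks=512 --pio-nframes=1 --pio-nvars=500 \
--pio-decompfiles=\"./ne30_F_case_48602x72_512p.dat\" --pio-types=pnetcdf --pio-rearrangers=2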
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.3
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
spack installation
Please describe the system on which you are running
Details of the problem
This issue occurs on a machine used by E3SM (e3sm.org): https://e3sm.org/model/running-e3sm/supported-machines/chrysalis-anl
modules used: gcc/9.2.0-ugetvbp openmpi/4.1.3-sxfyy4k parallel-netcdf/1.11.0-mirrcz7
The test to reproduce this issue was run with 512 MPI tasks on 8 nodes (64 tasks per node). This issue is also reproducible with modules built with the Intel compiler.
With the same test, this issue is not reproducible with Intel MPI on the same machine.
Error messages and backtrace in the output log file:
A lock file was generated in addition to the expected output .nc file: