paboyle / Grid

Data parallel C++ mathematical object library
GNU General Public License v2.0

MPI2 romio321 library fails when reading >= 2GB per rank #381

Open mmphys opened 2 years ago

mmphys commented 2 years ago

Git commit

develop HEAD 135808dcfa767edf988976ae31d2876bb6389f8b

Target Platform

University of Edinburgh Extreme Scaling system “Tursa”
Each node: 2 x AMD ROME EPYC 32, Nvidia A100 (40GB), 1TB RAM
Linux tursa-login1 4.18.0-305.10.2.el8_4.x86_64 #1 SMP Mon Jul 12 04:43:18 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

Configure

../configure --enable-comms=mpi --enable-simd=GPU --enable-shm=nvlink --enable-gen-simd-width=64 --enable-accelerator=cuda --enable-accelerator-cshift --enable-unified \
--with-gmp=/mnt/lustre/tursafs1/home/dp207/dp207/shared/env/spack/opt/spack/linux-rhel8-zen2/gcc-9.4.0/gmp-6.2.1-4qzl4yfdllwmf42zewg44gb4y54bgy2d \
--with-mpfr=/mnt/lustre/tursafs1/home/dp207/dp207/shared/env/spack/opt/spack/linux-rhel8-zen2/gcc-9.4.0/mpfr-4.1.0-agsa52nljiqbbrzrpln5ebgclzxesm7a \
--with-fftw=/mnt/lustre/tursafs1/home/dp207/dp207/shared/env/spack/opt/spack/linux-rhel8-zen2/gcc-9.4.0/fftw-3.3.10-bdpumbnknoewgtzgirxrvy3weveminw3 \
--with-hdf5=/mnt/lustre/tursafs1/home/dp207/dp207/shared/env/spack/opt/spack/linux-rhel8-zen2/gcc-9.4.0/hdf5-1.10.7-qld75yuu7gpncparpqq46hvuqzz4s6zx \
--with-lime=/mnt/lustre/tursafs1/home/dp207/dp207/shared/env/spack/opt/spack/linux-rhel8-zen2/gcc-9.4.0/c-lime-2-3-9-ie76iwlrgadc24aniq57wz5rv7dmt4b4 \
CXX=nvcc \
CXXFLAGS='-ccbin mpicxx -gencode arch=compute_80,code=sm_80 -std=c++14 -cudart shared -I/mnt/lustre/tursafs1/apps/basestack/cuda-11.4/openmpi/4.1.1-cuda11.4/include' \
LDFLAGS='-cudart shared -L/mnt/lustre/tursafs1/apps/basestack/cuda-11.4/openmpi/4.1.1-cuda11.4/lib' \
LIBS='-lrt -lmpi' \
--prefix=/mnt/lustre/tursafs1/home/dp207/dp207/shared/runs/semilep/code/3/Prefix

Attachments

config.log grid.configure.summary.log GridMakeV1.txt MPIRead32.cpp.txt Bad.log Good.log GaugeLoad.cpp.txt GridBad.log GridGood.log

Issue Description

When MPI2 is configured to use the romio321 library for I/O, MPI_File_read_all() fails when reading >=2GB into a single MPI rank.
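
For orientation, the shape of the failing call is sketched below; this is only an illustration (the file name, element size and count are assumptions), the actual reproducer is MPIRead32.cpp, attached and described below. Because the count argument of MPI_File_read_all is an int, a per-rank read of 2 GB or more is normally expressed as a modest count of a large derived datatype, and it is that case which fails under romio321:

// Sketch only, not MPIRead32.cpp: illustrates the call pattern that hits the limit.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    // One "element" is 1 MiB of bytes, so the int count stays small while
    // the per-rank byte total exceeds 2 GiB (2^31 bytes).
    MPI_Datatype mib;
    MPI_Type_contiguous(1 << 20, MPI_BYTE, &mib);
    MPI_Type_commit(&mib);

    const int count = 2100;                               // ~2.1 GiB per rank
    std::vector<char> buf(static_cast<size_t>(count) << 20);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "a.out", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    MPI_Status status;
    // With --mca io romio321 this collective read fails once the per-rank
    // byte total reaches 2 GiB; with --mca io ompio it succeeds.
    int rc = MPI_File_read_all(fh, buf.data(), count, mib, &status);
    if (rc != MPI_SUCCESS)
        std::fprintf(stderr, "MPI_File_read_all failed, rc=%d\n", rc);

    MPI_File_close(&fh);
    MPI_Type_free(&mib);
    MPI_Finalize();
    return 0;
}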

Issue Workaround

Other MPI2 I/O libraries do not have this limit / bug. Switching to ompio, for example, resolves the issue on Tursa.

Note: romio321 is currently the recommended MPI2 I/O library on Tursa, and the commissioning performance tests were carried out using romio321. I see a performance hit when using ompio (~5 GB/s) instead of romio321 (~10 GB/s) on a single node, but I have not tested how this scales.
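
For reference, the I/O component can be selected either per-run with the --mca io flag on the mpirun command line (as in the reproducer commands below) or via Open MPI's standard MCA environment-variable mechanism, e.g.:

export OMPI_MCA_io=ompio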

Minimal reproducer -- MPIRead32.cpp

MPIRead32.cpp (https://github.com/mmphys/MPIRead32) is the minimal code needed to reproduce the issue. Note that this is independent of Grid.

To demonstrate the issue we run the following command on Tursa:

mpirun --mca io romio321 -np 2 MPIRead32 a.out 0 2.1 2304.4608 &> Bad.log

Re-running the same command, but this time choosing the ompio I/O library works around the issue:

mpirun --mca io    ompio -np 2 MPIRead32 a.out 0 2.1 2304.4608 > Good.log

Grid reproducer -- GaugeLoad.cpp

The issue was first noticed on Tursa when using Grid to load a Gauge field.
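
GaugeLoad.cpp is attached; the sketch below shows the general shape of such a loader, following the pattern of Grid's NERSC I/O tests rather than reproducing the attachment verbatim. Splitting the 48.48.48.96 lattice over only two ranks means each rank reads 2 GB or more of the configuration, which is where romio321 fails:

// Sketch of a Grid gauge-field load (assumed to be close to the attached
// GaugeLoad.cpp; based on Grid's NERSC I/O test pattern, not copied from it).
#include <Grid/Grid.h>

using namespace Grid;

int main(int argc, char **argv) {
    Grid_init(&argc, &argv);   // picks up --grid and --mpi from the command line

    Coordinate latt = GridDefaultLatt();
    Coordinate simd = GridDefaultSimd(Nd, vComplexD::Nsimd());
    Coordinate mpi  = GridDefaultMpi();
    GridCartesian grid(latt, simd, mpi);

    LatticeGaugeFieldD Umu(&grid);
    FieldMetaData header;
    std::string file(argv[1]);   // e.g. ckpoint_EODWF_lat.200

    // The collective read inside readConfiguration is where
    // MPI_File_read_all fails under romio321 when a rank reads >= 2GB.
    NerscIO::readConfiguration(Umu, header, file);

    Grid_finalize();
    return 0;
}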

To demonstrate the issue we run the following command on Tursa:

mpirun --mca io romio321 -np 2 GaugeLoad /mnt/lustre/tursafs1/home/dp207/dp207/shared/dwf_2+1f/F1M/ckpoint_EODWF_lat.200 --grid 48.48.48.96 --mpi 2.1.1.1 &> GridBad.log

Re-running the same command, but this time choosing the ompio I/O library works around the issue:

mpirun --mca io ompio    -np 2 GaugeLoad /mnt/lustre/tursafs1/home/dp207/dp207/shared/dwf_2+1f/F1M/ckpoint_EODWF_lat.200 --grid 48.48.48.96 --mpi 2.1.1.1  > GridGood.log


roblatham00 commented 2 years ago

Sorry to hear you are running into problems with ROMIO from MPICH-3.2.1.

The patch which promotes the offending datatype to a 64-bit value is this one: https://github.com/pmodels/mpich/commit/3a479ab0, though it might not be worth backporting to whichever version of Open MPI you are running: Open MPI has updated its bundled ROMIO to 3.4.1, which should contain the fix.
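
For context, the arithmetic behind the 2 GB threshold is simply that the per-rank byte total no longer fits in a 32-bit signed int, which is what that patch addresses by widening the internal value to 64 bits. A rough illustration with made-up numbers (not the actual ROMIO internals):

#include <cstdint>
#include <cstdio>
#include <limits>

int main() {
    const std::int64_t count  = 2100;      // elements per rank (cf. MPIRead32 ~2.1 GiB)
    const std::int64_t extent = 1 << 20;   // 1 MiB per element
    const std::int64_t bytes  = count * extent;

    // 2 202 009 600 > 2 147 483 647, so any 32-bit signed intermediate in
    // the I/O path cannot hold the per-rank byte count.
    std::printf("per-rank bytes = %lld, INT32_MAX = %d\n",
                static_cast<long long>(bytes),
                std::numeric_limits<std::int32_t>::max());
    return 0;
}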

mmphys commented 2 years ago

Thanks for the pointer to the fix. Will ask whether we can update Tursa to Open MPI's ROMIO 3.4.1.