Status: Open. Opened by mmphys 2 years ago.
Sorry to hear you are running into problems with ROMIO from MPICH-3.2.1.
The patch that promotes the offending datatype to a 64-bit value is https://github.com/pmodels/mpich/commit/3a479ab0. It might not be worth backporting to whichever version of Open MPI you are running, though: Open MPI has updated its ROMIO to 3.4.1, which should contain the fix.
Thanks for the pointer to the fix. Will ask whether we can update Tursa to Open MPI's ROMIO 3.4.1.
Git commit
develop HEAD 135808dcfa767edf988976ae31d2876bb6389f8b
Target Platform
University of Edinburgh Extreme Scaling system “Tursa”. Each node: 2 x AMD ROME EPYC 32, Nvidia A100 (40GB), 1TB RAM
Linux tursa-login1 4.18.0-305.10.2.el8_4.x86_64 #1 SMP Mon Jul 12 04:43:18 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Configure
Attachments
Issue Description
When MPI2 is configured to use the romio321 library for I/O, MPI_File_read_all() fails if a single MPI rank reads >=2GB.
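The 2GB threshold is what one would expect if a byte count passes through a signed 32-bit integer somewhere along the I/O path, which matches the fix described in the thread (promoting a datatype to 64 bits). The sketch below only illustrates that arithmetic. `fits_in_int32` and `total_bytes` are hypothetical helper names, not part of MPI or ROMIO, and nothing here reproduces the actual ROMIO code.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not an MPI or ROMIO function): true when a byte
// count can be held in a signed 32-bit integer without loss.
constexpr bool fits_in_int32(std::int64_t bytes) {
    return bytes >= 0 && bytes <= std::numeric_limits<std::int32_t>::max();
}

// Hypothetical helper: size in bytes of `count` elements of `elem_size`
// bytes each, computed in 64-bit arithmetic.
inline std::int64_t total_bytes(std::int64_t count, std::int64_t elem_size) {
    assert(count >= 0 && elem_size > 0);
    return count * elem_size;
}

// The largest signed 32-bit value is 2^31 - 1, one byte short of 2 GiB.
static_assert(fits_in_int32(2147483647LL), "2 GiB - 1 byte fits");
static_assert(!fits_in_int32(2147483648LL), "2 GiB does not fit");
```

For example, 268435456 doubles (8 bytes each) come to exactly 2147483648 bytes, which `fits_in_int32` rejects. A 32-bit variable holding that count would no longer represent the intended size.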
Issue Workaround
Other MPI2 I/O libraries do not have this limit / bug. Switching to ompio for example resolves the issue on Tursa.
Note: romio321 is currently the recommended MPI2 I/O library on Tursa. Commissioning performance tests were carried out using romio321. I see a performance hit when using ompio (~5 GBPS) instead of romio321 (~10 GBPS) on a single node, but I have not tested to see how this scales.
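Given the performance cost of switching libraries, another possible application-side mitigation is to split each rank's read into pieces below 2GB. This is a sketch only: it assumes the failure depends solely on the number of bytes requested per call, which has not been verified here. `Chunk` and `plan_chunks` are hypothetical names, and the code only plans the pieces without calling MPI.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One piece of a larger read (hypothetical type, for illustration only).
struct Chunk {
    std::int64_t offset;  // byte offset from the start of this rank's region
    std::int64_t length;  // bytes in this piece, never more than max_chunk
};

// Split `total` bytes into consecutive pieces of at most `max_chunk` bytes.
// Returns an empty plan when total is zero.
std::vector<Chunk> plan_chunks(std::int64_t total, std::int64_t max_chunk) {
    assert(total >= 0 && max_chunk > 0);
    std::vector<Chunk> chunks;
    for (std::int64_t off = 0; off < total; off += max_chunk) {
        chunks.push_back({off, std::min(max_chunk, total - off)});
    }
    return chunks;
}
```

With a 1 GiB limit, `plan_chunks((2LL << 30) + 1, 1LL << 30)` yields three pieces, the last of length 1. Because MPI_File_read_all() is collective, every rank would have to issue the same number of calls, so ranks with fewer pieces would need to take part with zero-length reads.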
Minimal reproducer -- MPIRead32.cpp
MPIRead32.cpp (https://github.com/mmphys/MPIRead32) is the minimal code to reproduce the issue. Note that this is independent of Grid.
To demonstrate the issue we run the following command on Tursa:
Re-running the same command, but this time choosing the ompio I/O library works around the issue:
Grid reproducer -- GaugeLoad.cpp
The issue was first noticed on Tursa when using Grid to load a gauge field.
To demonstrate the issue we run the following command on Tursa:
Re-running the same command, but this time choosing the ompio I/O library works around the issue:
config.log, grid.configure.summary.log, GridMakeV1.txt, MPIRead32.cpp.txt, Bad.log, Good.log, GaugeLoad.cpp.txt, GridBad.log, GridGood.log