Open jkbk2004 opened 6 months ago
@uturuncoglu @RatkoVasic-NOAA This could be an issue with OpenMPI (especially with an old version of GNU) on Hera. It is worth noting that the issue became visible at the call ESMF_InfoBroadcast(info, rootPet=fcstPetList(1), rc=rc).
A ticket about this issue was created with ESMF support.
An update for Hera GNU:
spack-stack 1.5.1 and 1.6.0, with packages for ufs-weather-model and ufs-srweather-app, have been built on Hera with the GNU/13.3.0 compiler. spack-stack v1.6.0 was built with ESMF/8.6.1 and MAPL/2.46.0.
A first check of running the RTs: some pass and some fail.
A couple of tests fail with memory issues (spack-stack v1.5.1).
More testing is needed, possibly on the specific failing tests.
Locations of the spack-stacks (NB: packages for UFS-WM and UFS-SRW only!)
/scratch2/NCEPDEV/stmp1/role.epic/spack-stack/spack-stack-1.6.0_gnu13.3/envs/ufs-wm-srw-rocky8
/scratch2/NCEPDEV/stmp1/role.epic/spack-stack/spack-stack-1.5.1/envs/ufs-wm-srw-rocky8/
My WM tests with spack-stack-1.6.0 are in /scratch1/NCEPDEV/nems/Natalie.Perlin/ufs-weather-model
and with spack-stack-1.5.1 (run with -w option) are in /scratch1/NCEPDEV/nems/Natalie.Perlin/ufs-weather-model2/
A modulefile for using spack-stack-1.6.0: /scratch1/NCEPDEV/nems/Natalie.Perlin/_ufs-weather-model/modulefiles/ufshera.gnu.lua
help([[
loads UFS Model prerequisites for Hera/GNU
]])
prepend_path("MODULEPATH", "/scratch2/NCEPDEV/stmp1/role.epic/installs/gnu/modulefiles")
prepend_path("MODULEPATH", "/scratch2/NCEPDEV/stmp1/role.epic/installs/openmpi/modulefiles")
prepend_path("MODULEPATH", "/scratch2/NCEPDEV/stmp1/role.epic/spack-stack/spack-stack-1.6.0_gnu13.3/envs/ufs-wm-srw-rocky8/install/modulefiles/Core")
stack_gnu_ver=os.getenv("stack_gnu_ver") or "13.3.0"
load(pathJoin("stack-gcc", stack_gnu_ver))
stack_openmpi_ver=os.getenv("stack_openmpi_ver") or "4.1.6"
load(pathJoin("stack-openmpi", stack_openmpi_ver))
cmake_ver=os.getenv("cmake_ver") or "3.23.1"
load(pathJoin("cmake", cmake_ver))
load("ufs_common")
nccmp_ver=os.getenv("nccmp_ver") or "1.9.0.1"
load(pathJoin("nccmp", nccmp_ver))
prepend_path("CPPFLAGS", " -I/apps/slurm_hera/23.11.3/include/slurm"," ")
prepend_path("LD_LIBRARY_PATH", "/apps/slurm_hera/23.11.3/lib")
setenv("CC", "mpicc")
setenv("CXX", "mpic++")
setenv("FC", "mpif90")
setenv("CMAKE_Platform", "hera.gnu")
whatis("Description: UFS build environment")
The ufs_common.lua for use with spack-stack-1.6.0:
whatis("Description: UFS build environment common libraries")
help([[Load UFS Model common libraries]])
local ufs_modules = {
{["jasper"] = "2.0.32"},
{["zlib"] = "1.2.13"},
{["libpng"] = "1.6.37"},
{["hdf5"] = "1.14.0"},
{["netcdf-c"] = "4.9.2"},
{["netcdf-fortran"] = "4.6.1"},
{["parallelio"] = "2.5.10"},
{["esmf"] = "8.6.1"},
{["fms"] = "2023.04"},
{["bacio"] = "2.4.1"},
{["crtm"] = "2.4.0.1"},
{["g2"] = "3.4.5"},
{["g2tmpl"] = "1.10.2"},
{["ip"] = "4.3.0"},
{["sp"] = "2.5.0"},
{["w3emc"] = "2.10.0"},
{["gftl-shared"] = "1.6.1"},
{["mapl"] = "2.46.0-esmf-8.6.1"},
{["scotch"] = "7.0.4"},
}
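-- Load each module, letting an environment variable such as
-- netcdf_c_ver override the default version listed above.
for i = 1, #ufs_modules do
  for name, default_version in pairs(ufs_modules[i]) do
    local env_version_name = string.gsub(name, "-", "_") .. "_ver"
    load(pathJoin(name, os.getenv(env_version_name) or default_version))
  end
end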
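For anyone who wants to check which environment variable overrides which default, the naming rule used by the loop in ufs_common.lua can be mirrored outside Lmod. The following is only a rough Python sketch of that logic, not part of the modulefiles: the function names `env_var_name` and `module_spec` are made up for illustration, and the `8.5.0` override in the usage note is an arbitrary example value, not a tested configuration.

```python
import os


def env_var_name(name):
    """Mirror string.gsub(name, "-", "_") .. "_ver" from ufs_common.lua."""
    return name.replace("-", "_") + "_ver"


def module_spec(name, default_version, env=None):
    """Return the "name/version" string that the Lua loop would pass to load().

    `env` defaults to the real environment; a plain dict can be passed instead
    (hypothetical convenience for experimenting, not something Lmod offers).
    """
    env = os.environ if env is None else env
    value = env.get(env_var_name(name))
    # In Lua, `os.getenv(x) or default` falls back only when the variable is
    # unset (nil); an empty string is still truthy there, so test for None.
    version = default_version if value is None else value
    return f"{name}/{version}"


if __name__ == "__main__":
    # A few of the defaults from the table above.
    for name, default in [("netcdf-c", "4.9.2"), ("gftl-shared", "1.6.1"), ("esmf", "8.6.1")]:
        print(env_var_name(name), "->", module_spec(name, default))
```

For instance, calling `module_spec("esmf", "8.6.1", env={"esmf_ver": "8.5.0"})` yields `esmf/8.5.0`, while an empty `env` yields the default `esmf/8.6.1`. Hyphenated names such as `netcdf-fortran` are overridden through `netcdf_fortran_ver`.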
A modulefile for using spack-stack-1.5.1: /scratch1/NCEPDEV/nems/Natalie.Perlin/_ufs-weather-model2/modulefiles/ufshera.gnu.lua
help([[
loads UFS Model prerequisites for Hera/GNU
]])
prepend_path("MODULEPATH", "/scratch2/NCEPDEV/stmp1/role.epic/installs/gnu/modulefiles")
prepend_path("MODULEPATH", "/scratch2/NCEPDEV/stmp1/role.epic/installs/openmpi/modulefiles")
prepend_path("MODULEPATH", "/scratch2/NCEPDEV/stmp1/role.epic/spack-stack/spack-stack-1.5.1/envs/ufs-wm-srw-rocky8/install/modulefiles/Core")
stack_gnu_ver=os.getenv("stack_gnu_ver") or "13.3.0"
load(pathJoin("stack-gcc", stack_gnu_ver))
stack_openmpi_ver=os.getenv("stack_openmpi_ver") or "4.1.6"
load(pathJoin("stack-openmpi", stack_openmpi_ver))
cmake_ver=os.getenv("cmake_ver") or "3.23.1"
load(pathJoin("cmake", cmake_ver))
load("ufs_common")
nccmp_ver=os.getenv("nccmp_ver") or "1.9.0.1"
load(pathJoin("nccmp", nccmp_ver))
prepend_path("CPPFLAGS", " -I/apps/slurm_hera/23.11.3/include/slurm"," ")
prepend_path("LD_LIBRARY_PATH", "/apps/slurm_hera/23.11.3/lib")
setenv("CC", "mpicc")
setenv("CXX", "mpic++")
setenv("FC", "mpif90")
setenv("CMAKE_Platform", "hera.gnu")
whatis("Description: UFS build environment")
I tested @natalie-perlin's installation, and the tests that were failing on Hera with the GNU compiler now work. There are many other tests still to be done. @jkbk2004 I suggest that the weather-model group run the tests, because some of them fail only because the results are not bit-identical (which is expected).
All the regression tests with the gnu/13.3.0 compiler and spack-stack/1.6.0 have passed for the weather model. Please see the full comment: https://github.com/ufs-community/ufs-weather-model/pull/2093#issuecomment-2143694396
Description
The tests passed on Hercules. The hang on Hera is possibly caused by either the version of the GNU compiler or the version of the MPI library. The line that causes the hang was identified: https://github.com/NOAA-EMC/fv3atm/pull/775/files#diff-dc3da9b9c37c068b769128e69328ab808bb6a17947cae75342a9a462cebf63ebR1187
The test also works with the default number of MPI tasks on Hera. The issue needs to be followed up with the ESMF team.
Turned off the test case on hera in https://github.com/ufs-community/ufs-weather-model/pull/2128
To Reproduce:
Additional context
Failure message from the error log for the cpld_debug_p8 and cpld_control_p8 GNU tests:
The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this release. Workarounds are to run on a single node, or to use a system with an RDMA capable network such as Infiniband.
Output