Open GeorgeGayno-NOAA opened 3 weeks ago
@AlexanderRichert-NOAA - FYI
I'll look into this, but tagging @climbfuji who may have a more immediate answer on matters of OpenMPI on Hera.
All I can see in terms of system modules is openmpi/4.1.6_gnu9.2.0 ...
When I tried openmpi 4.1.6, other libraries would no longer load. I went around in circles before giving up.
I have v1.7 working. I will check in my branch so you can take a look.
Also tagging @RatkoVasic-NOAA in case he knows of recent changes -- it looks like the modification date on the openmpi module file is this last Tuesday the 11th.
I just created an issue for this under spack-stack: https://github.com/JCSDA/spack-stack/issues/1146
@GeorgeGayno-NOAA @AlexanderRichert-NOAA Yes. openmpi/4.1.5 was built on CentOS, and the new one (openmpi/4.1.6) was built on Rocky OS. Since that transition, some applications have not been working correctly with the new GNU (e.g. a couple of coupled tests in ufs-weather-model, which are still turned off). Natalie has been working on installing libraries using a newer version of GNU (13.x) + openmpi/4.1.6 and has had some success, but the work is not finished yet.
I'm confused. Why was it working a few days ago but isn't now? Did someone revert the configuration back to trying to use 4.1.5?
We haven't used 4.1.5 for some time (/scratch1/NCEPDEV/jcsda/jedipara/spack-stack/modulefiles/openmpi/4.1.5). In SRW we now use the following (going with spack-stack 1.6.0):
prepend_path("MODULEPATH", "/scratch2/NCEPDEV/stmp1/role.epic/installs/gnu/modulefiles")
prepend_path("MODULEPATH", "/scratch2/NCEPDEV/stmp1/role.epic/installs/openmpi/modulefiles")
prepend_path("MODULEPATH", "/scratch2/NCEPDEV/stmp1/role.epic/spack-stack/spack-stack-1.6.0_gnu13/envs/ufs-wm-srw-rocky8/install/modulefiles/Core")
load("stack-gcc/13.3.0")
load("stack-openmpi/4.1.6")
load("stack-python/3.10.13")
load("cmake/3.23.1")
load("srw_common")
load(pathJoin("nccmp", os.getenv("nccmp_ver") or "1.9.0.1"))
load(pathJoin("nco", os.getenv("nco_ver") or "5.1.6"))
load(pathJoin("openblas", os.getenv("openblas_ver") or "0.3.24"))
prepend_path("CPPFLAGS", " -I/apps/slurm_hera/23.11.3/include/slurm"," ")
prepend_path("LD_LIBRARY_PATH", "/apps/slurm_hera/23.11.3/lib")
setenv("LD_PRELOAD", "/scratch2/NCEPDEV/stmp1/role.epic/installs/gnu/13.3.0/lib64/libstdc++.so.6")
I don't follow. How is it that the modules/MODULEPATH settings in https://github.com/ufs-community/UFS_UTILS/blob/develop/modulefiles/build.hera.gnu.lua were working until a few days ago but aren't working now? Did something about the modulefiles change so that it's not pointing to the spack-stack-specific OpenMPI 4.1.5 installation?
I didn't know about UFS_UTILS... I was talking about WM and SRW. I can take a look at that modulefile.
@GeorgeGayno-NOAA try now
@AlexanderRichert-NOAA I manually added the line:
prepend_path("MODULEPATH", "/scratch1/NCEPDEV/jcsda/jedipara/spack-stack/modulefiles")
in
/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/modulefiles/gcc/9.2.0/stack-openmpi/4.1.5.lua
Thanks. I can now load the stack-openmpi module, and for that matter build UFS_UTILS@develop without any modifications.
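For anyone hitting a similar "module will not load" problem, a quick way to see which MODULEPATH entries actually provide a given modulefile is a small script like the one below. This is only a sketch: the helper name `find_modulefiles` is hypothetical, and it assumes the usual layout of `<root>/<name>/<version>.lua` (Lua modulefile) or `<root>/<name>/<version>` (Tcl modulefile). It does not reproduce the module system's full resolution rules (defaults, hidden modules, hierarchy).

```python
import os
from pathlib import Path


def find_modulefiles(name, version, modulepath=None):
    """Return matching modulefile paths, in MODULEPATH order.

    Hypothetical helper: assumes <root>/<name>/<version>.lua or
    <root>/<name>/<version> and ignores any other lookup rules.
    """
    if modulepath is None:
        modulepath = os.environ.get("MODULEPATH", "")
    hits = []
    for root in modulepath.split(os.pathsep):
        if not root:
            continue  # skip empty entries such as a trailing ':'
        base = Path(root) / name
        for candidate in (base / f"{version}.lua", base / version):
            if candidate.is_file():
                hits.append(str(candidate))
    return hits


if __name__ == "__main__":
    # Module name and version taken from this thread.
    for path in find_modulefiles("openmpi", "4.1.5"):
        print(path)
```

If this prints nothing after the stack modulefiles are loaded, no directory currently on MODULEPATH contains the requested modulefile, which would be consistent with the missing prepend_path line described above.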
UFS_UTILS now compiles, but the regression tests fail:
+ srun /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/UFS_UTILS/reg_tests/chgres_cube/../../exec/chgres_cube '1>&1' '2>&2'
[h22c32:2743796] mca_base_component_repository_open: unable to open mca_pmix_s1: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)
For more details, see this log file: /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/UFS_UTILS/reg_tests/chgres_cube/consistency.log01.fail
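The "libpmi.so.0: cannot open shared object file" message suggests the runtime loader cannot find that library in the job environment. A rough way to check whether any LD_LIBRARY_PATH entry contains it is sketched below. The helper name `find_shared_lib` is hypothetical, and the script only inspects LD_LIBRARY_PATH. The dynamic loader also consults RPATH/RUNPATH entries, the ld.so cache, and the default system directories, so an empty result here is a hint rather than proof.

```python
import os
from pathlib import Path


def find_shared_lib(soname, search_path=None):
    """Return every LD_LIBRARY_PATH directory entry that contains `soname`.

    Hypothetical helper: checks LD_LIBRARY_PATH only, not RPATH/RUNPATH,
    the ld.so cache, or the default system library directories.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    found = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # ignore empty entries
        candidate = Path(directory) / soname
        if candidate.exists():
            found.append(str(candidate))
    return found


if __name__ == "__main__":
    # Library name taken from the error message in this thread.
    matches = find_shared_lib("libpmi.so.0")
    if matches:
        print("\n".join(matches))
    else:
        print("libpmi.so.0 not found on LD_LIBRARY_PATH")
```

Running this inside the same batch environment as the failing srun step (after loading the build modules) would show whether the library is visible there.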
Can you try /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8-ompi416/install/modulefiles/Core? Note the openmpi version change to 4.1.6. This stack uses the Hera admin-provided openmpi (I'm not sure why this wasn't used in the rocky8 rebuild for 1.6.0).
Using ad8c76f, I was able to compile using Gnu on Hera. The unit tests passed. All regression tests (except one) ran to completion. Some passed. Some differed from the baseline, although the differences were very small.
The first global_cycle regression test had a seg fault in the sfcsub.F routine.
qc of snow
snow set to zero over open sea at 363185 points ( 61.575147840711807 percent)
performing qc of snow mode= 1 (0=count only, 1=replace)
set snow temp to tsfsmx if greater
performing qc of tsfc mode= 1 (0=count only, 1=replace)
performing qc of tsf2 mode= 1 (0=count only, 1=replace)
performing qc of albc mode= 1 (0=count only, 1=replace)
performing qc of albc mode= 1 (0=count only, 1=replace)
performing qc of albc mode= 1 (0=count only, 1=replace)
performing qc of albc mode= 1 (0=count only, 1=replace)
performing qc of zorc mode= 1 (0=count only, 1=replace)
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
srun: error: h35m50: tasks 4-5: Segmentation fault (core dumped)
srun: Terminating StepId=62230736.0
slurmstepd: error: *** STEP 62230736.0 ON h35m50 CANCELLED AT 2024-06-21T20:44:38 ***
srun: error: h35m50: tasks 0-3: Terminated
srun: Force Terminated StepId=62230736.0
+ export ERR=143
Fixing this seg fault is beyond the scope of this issue. I will make a note and open another issue to address it.
This test was repeated with 05b6fc2. The results were the same.
This test was repeated using 4dca77a. The results were the same.
The head of develop (2794d41) no longer compiles on Hera with Gnu. I get this error: