ufs-community / UFS_UTILS

Utilities for the NCEP models.

GNU build on Hera is failing #962

Open GeorgeGayno-NOAA opened 3 weeks ago

GeorgeGayno-NOAA commented 3 weeks ago

The head of develop (2794d41) no longer compiles on Hera with GNU. I get this error:

Lmod has detected the following error:  The following module(s) are unknown: "openmpi/4.1.5"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "openmpi/4.1.5"

Also make sure that all modulefiles written in TCL start with the string #%Module

Executing this command requires loading "openmpi/4.1.5" which failed while processing the following module(s):

    Module fullname      Module Filename
    ---------------      ---------------
    stack-openmpi/4.1.5  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/modulefiles/gcc/9.2.0/stack-openmpi/4.1.5.lua
    build.hera.gnu       /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/UFS_UTILS.upstream/modulefiles/build.hera.gnu.lua
While processing the following module(s):
    Module fullname      Module Filename
    ---------------      ---------------
    stack-openmpi/4.1.5  /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/modulefiles/gcc/9.2.0/stack-openmpi/4.1.5.lua
    build.hera.gnu       /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/UFS_UTILS.upstream/modulefiles/build.hera.gnu.lua
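
The Lmod message above means that no directory on `MODULEPATH` holds a modulefile for `openmpi/4.1.5`. A minimal sketch for checking that by hand, assuming bash; `find_modulefile` is a hypothetical helper, not an Lmod command, and it only looks for an exact `name/version` file rather than reproducing Lmod's full resolution rules:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Lmod): print every modulefile on a
# colon-separated module path that matches "name/version" exactly.
find_modulefile() {
    local target="$1"                    # e.g. openmpi/4.1.5
    local path_list="${2:-$MODULEPATH}"  # defaults to the live MODULEPATH
    local dir candidate found=1
    local IFS=':'
    for dir in $path_list; do
        [ -n "$dir" ] || continue
        # Modulefiles may be Lua (.lua suffix) or TCL (no suffix).
        for candidate in "$dir/$target.lua" "$dir/$target"; do
            if [ -f "$candidate" ]; then
                printf '%s\n' "$candidate"
                found=0
            fi
        done
    done
    return "$found"
}
```

Running `find_modulefile openmpi/4.1.5 || echo "not on MODULEPATH"` after loading the stack modules shows whether any directory currently provides that version. The `module spider` and `module --ignore_cache load` commands suggested in the error text remain the authoritative checks.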

GeorgeGayno-NOAA commented 3 weeks ago

@AlexanderRichert-NOAA - FYI

AlexanderRichert-NOAA commented 3 weeks ago

I'll look into this, but tagging @climbfuji who may have a more immediate answer on matters of OpenMPI on Hera.

All I can see in terms of system modules is openmpi/4.1.6_gnu9.2.0 ...

GeorgeGayno-NOAA commented 3 weeks ago

> I'll look into this, but tagging @climbfuji who may have a more immediate answer on matters of OpenMPI on Hera.
>
> All I can see in terms of system modules is openmpi/4.1.6_gnu9.2.0 ...

When I tried openmpi 4.1.6, other libraries would no longer load. I went around in circles before giving up.

GeorgeGayno-NOAA commented 3 weeks ago

I have v1.7 working. I will check in my branch so you can take a look.

AlexanderRichert-NOAA commented 3 weeks ago

Also tagging @RatkoVasic-NOAA in case he knows of recent changes. It looks like the modification date on the openmpi module file is last Tuesday, the 11th.

I just created an issue for this under spack-stack: https://github.com/JCSDA/spack-stack/issues/1146

RatkoVasic-NOAA commented 3 weeks ago

@GeorgeGayno-NOAA @AlexanderRichert-NOAA Yes. openmpi/4.1.5 was built on CentOS, and the new one (openmpi/4.1.6) was built on Rocky OS. Since that transition, some applications have not worked correctly with the new GNU (e.g., a couple of coupled tests in ufs-weather-model, which are still turned off). Natalie was (is) working on installing libraries using a newer version of GNU (13.x) + openmpi/4.1.6 and has had some success, but that work is not finished yet.

AlexanderRichert-NOAA commented 3 weeks ago

I'm confused-- Why was it working a few days ago but isn't now? Did someone revert the configuration back to trying to use 4.1.5..?

RatkoVasic-NOAA commented 3 weeks ago

We haven't used 4.1.5 for some time (/scratch1/NCEPDEV/jcsda/jedipara/spack-stack/modulefiles/openmpi/4.1.5). In SRW we now use the following (with spack-stack 1.6.0):

prepend_path("MODULEPATH", "/scratch2/NCEPDEV/stmp1/role.epic/installs/gnu/modulefiles")
prepend_path("MODULEPATH", "/scratch2/NCEPDEV/stmp1/role.epic/installs/openmpi/modulefiles")
prepend_path("MODULEPATH", "/scratch2/NCEPDEV/stmp1/role.epic/spack-stack/spack-stack-1.6.0_gnu13/envs/ufs-wm-srw-rocky8/install/modulefiles/Core")

load("stack-gcc/13.3.0")
load("stack-openmpi/4.1.6")
load("stack-python/3.10.13")
load("cmake/3.23.1")

load("srw_common")

load(pathJoin("nccmp", os.getenv("nccmp_ver") or "1.9.0.1"))
load(pathJoin("nco", os.getenv("nco_ver") or "5.1.6"))
load(pathJoin("openblas", os.getenv("openblas_ver") or "0.3.24"))

prepend_path("CPPFLAGS", " -I/apps/slurm_hera/23.11.3/include/slurm"," ")
prepend_path("LD_LIBRARY_PATH", "/apps/slurm_hera/23.11.3/lib")
setenv("LD_PRELOAD", "/scratch2/NCEPDEV/stmp1/role.epic/installs/gnu/13.3.0/lib64/libstdc++.so.6")

AlexanderRichert-NOAA commented 3 weeks ago

I don't follow. How is it that the modules/MODULEPATH settings in https://github.com/ufs-community/UFS_UTILS/blob/develop/modulefiles/build.hera.gnu.lua were working until a few days ago but aren't working now? Did something about the modulefiles change so that it's not pointing to the spack-stack-specific OpenMPI 4.1.5 installation?

RatkoVasic-NOAA commented 3 weeks ago

I didn't know about UFS_UTILS... I was talking about WM and SRW. I can take a look at that modulefile.

RatkoVasic-NOAA commented 3 weeks ago

@GeorgeGayno-NOAA try now

RatkoVasic-NOAA commented 3 weeks ago

@AlexanderRichert-NOAA I manually added this line:

RatkoVasic-NOAA commented 3 weeks ago

prepend_path("MODULEPATH", "/scratch1/NCEPDEV/jcsda/jedipara/spack-stack/modulefiles") in /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/modulefiles/gcc/9.2.0/stack-openmpi/4.1.5.lua
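
The fix works because prepending a directory to `MODULEPATH` puts the modulefiles under it in scope for later `load` calls. A rough shell sketch of that prepend step, assuming bash; `prepend_unique` is a hypothetical helper, and dropping an existing duplicate entry is a choice made for this sketch rather than a statement about Lmod's behavior:

```shell
#!/usr/bin/env bash
# Hypothetical helper: put a directory at the front of a colon-separated
# path list, dropping any earlier occurrence of the same directory.
prepend_unique() {
    local new="$1" current="$2"
    local result="$new" entry
    local IFS=':'
    for entry in $current; do
        [ -n "$entry" ] || continue          # skip empty entries
        [ "$entry" = "$new" ] && continue    # skip the duplicate
        result="$result:$entry"
    done
    printf '%s\n' "$result"
}
```

For example, `MODULEPATH=$(prepend_unique /scratch1/NCEPDEV/jcsda/jedipara/spack-stack/modulefiles "$MODULEPATH")` mirrors the line added to the modulefile. In an interactive session, `module use <dir>` is the standard command for the same effect.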

AlexanderRichert-NOAA commented 3 weeks ago

Thanks. I can now load the stack-openmpi module, and for that matter build UFS_UTILS@develop without any modifications.

GeorgeGayno-NOAA commented 2 weeks ago

UFS_UTILS now compiles, but the regression tests fail:

+ srun /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/UFS_UTILS/reg_tests/chgres_cube/../../exec/chgres_cube '1>&1' '2>&2'
[h22c32:2743796] mca_base_component_repository_open: unable to open mca_pmix_s1: libpmi.so.0: cannot open shared object file: No such file or directory (ignored)

For more details, see this log file: /scratch1/NCEPDEV/da/George.Gayno/ufs_utils.git/UFS_UTILS/reg_tests/chgres_cube/consistency.log01.fail
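
The `libpmi.so.0: cannot open shared object file` message points to a shared library that the dynamic loader could not resolve when OpenMPI opened its `mca_pmix_s1` component. A small sketch for listing unresolved libraries from `ldd` output, assuming bash and awk; `missing_libs` is a hypothetical helper, and the location of the component's `.so` file on Hera is not given in this thread:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ldd` output on stdin and print the name of
# every library that the dynamic loader reports as "not found".
missing_libs() {
    awk '/=> not found/ { print $1 }'
}
```

Usage would look like `ldd /path/to/component.so | missing_libs`, where the path is a placeholder for the OpenMPI component in question. An empty result means every dependency resolved under the current `LD_LIBRARY_PATH`.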

AlexanderRichert-NOAA commented 2 weeks ago

Can you try /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8-ompi416/install/modulefiles/Core? Note the openmpi version change to 4.1.6. This stack uses the Hera admin-provided openmpi (I'm not sure why this wasn't used in the rocky8 rebuild for 1.6.0).

GeorgeGayno-NOAA commented 2 weeks ago

Using ad8c76f, I was able to compile using GNU on Hera. The unit tests passed. All regression tests except one ran to completion: some passed, and some differed from the baseline, although the differences were very small.

The first global_cycle regression test hit a seg fault in the sfcsub.F routine:

 qc of snow
 snow set to zero over open sea at       363185  points (   61.575147840711807      percent)
 performing qc of snow     mode=           1 (0=count only, 1=replace)
 set snow temp to tsfsmx if greater
 performing qc of tsfc     mode=           1 (0=count only, 1=replace)
 performing qc of tsf2     mode=           1 (0=count only, 1=replace)
 performing qc of albc     mode=           1 (0=count only, 1=replace)
 performing qc of albc     mode=           1 (0=count only, 1=replace)
 performing qc of albc     mode=           1 (0=count only, 1=replace)
 performing qc of albc     mode=           1 (0=count only, 1=replace)
 performing qc of zorc     mode=           1 (0=count only, 1=replace)

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
srun: error: h35m50: tasks 4-5: Segmentation fault (core dumped)
srun: Terminating StepId=62230736.0
slurmstepd: error: *** STEP 62230736.0 ON h35m50 CANCELLED AT 2024-06-21T20:44:38 ***
srun: error: h35m50: tasks 0-3: Terminated
srun: Force Terminated StepId=62230736.0
+ export ERR=143

Fixing this seg fault is beyond the scope of this issue. I will make a note and open another issue to address it.
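
The `ERR=143` in the log follows the usual shell convention of 128 plus a signal number: 143 is 128 + 15 (SIGTERM), which matches the "tasks 0-3: Terminated" line, while a task killed by SIGSEGV would conventionally report 128 + 11 = 139. A minimal sketch for decoding such a status, assuming bash; `signal_from_status` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a shell exit status above 128 to the name of
# the signal it conventionally encodes (status = 128 + signal number).
signal_from_status() {
    local status="$1"
    if [ "$status" -gt 128 ]; then
        kill -l "$((status - 128))"   # prints the signal name, e.g. TERM
    else
        return 1                      # not a signal-style status
    fi
}
```

Calling `signal_from_status 143` names the signal behind the reported status, which helps separate the tasks that crashed from the ones Slurm terminated afterwards.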

GeorgeGayno-NOAA commented 1 week ago

This test was repeated with 05b6fc2. The results were the same.

GeorgeGayno-NOAA commented 4 days ago

This test was repeated using 4dca77a. The results were the same.