ufs-community / ufs-weather-model

UFS Weather Model

HPC-stack on Gaea built with new intel-classic/2022.2.1 and mpich/7.7.20: testing needed #1753

Closed natalie-perlin closed 1 year ago

natalie-perlin commented 1 year ago

Description

The updated C3 and C4 partitions on Gaea no longer have the same Intel compilers and cray-mpich that were used to build hpc-stack for the UFS-SRW. The new compiler and MPI modules available on both C3 and C4 are intel-classic/2022.2.1 and cray-mpich/7.7.20 (in addition to PrgEnv-intel/6.0.10-classic and craype/2.7.15).

A new stack has been built with these compilers. UPDATE: ESMF v8.4.2 is needed when using newer Intel compilers; the software stack has been updated to include esmf/8.4.2 and the corresponding mapl/2.35.2-esmf-8.4.2.

Testing: the RT control_p8 passes the regression test (log attached). The stack can be loaded as follows:

source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh 
module use /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-classic-2022.2.1/modulefiles/stack
module load hpc
module load hpc-intel-classic
module load hpc-cray-mpich
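
After loading the stack as above, it can help to confirm that the expected modules are actually loaded before building. The sketch below is one way to do that; the helper name check_loaded is hypothetical, and the sample input stands in for real output, which under Lmod is assumed to come from `module -t list 2>&1` (terse listing, one module per line).

```shell
#!/bin/sh
# Hypothetical helper (not part of hpc-stack or ufs-weather-model):
# read one loaded-module name per line on stdin and report which of the
# requested modules are present. Returns non-zero if any is missing.
check_loaded() {
    loaded=$(cat)
    status=0
    for want in "$@"; do
        # -x: match the whole line, -F: fixed string, -q: quiet
        if printf '%s\n' "$loaded" | grep -qxF "$want"; then
            echo "ok:      $want"
        else
            echo "missing: $want"
            status=1
        fi
    done
    return $status
}

# Sample input standing in for real `module -t list` output.
check_loaded hpc-intel-classic/2022.2.1 hpc-cray-mpich/7.7.20 <<'EOF'
hpc/1.2.0
hpc-intel-classic/2022.2.1
hpc-cray-mpich/7.7.20
EOF
```

On Gaea the same check would presumably be run as `module -t list 2>&1 | check_loaded hpc-intel-classic/2022.2.1 hpc-cray-mpich/7.7.20`.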

The built modules can be viewed with "module avail"; the output includes the following (not a complete list of all the modules available on the system):

---- /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-classic-2022.2.1/modulefiles/mpi/intel-classic/2022.2.1/cray-mpich/7.7.20 ----
   crtm/2.4.0                fms/2022.04            (D)    ncdiag/1.0.0           pio/2.5.7    (D)
   eckit/ecmwf-1.16.0        fms/2023.01                   ncio/1.1.2             pio/2.5.10
   esmf/8.4.2-debug          hdf5/1.10.6            (D)    nemsio/2.5.4    (D)    upp/10.0.10
   esmf/8.4.2         (D)    madis/4.3                     nemsiogfs/2.5.3        wrf_io/1.2.0
   fckit/ecmwf-0.9.2         mapl/2.35.2-esmf-8.4.2        netcdf/4.7.4    (D)

---------------------- /opt/cray/pe/lmod/modulefiles/mpi/intel/19.0/aries/1.0/cray-mpich/7.0 ----------------------
   cray-hdf5-parallel/1.12.1.3        cray-parallel-netcdf/1.12.2.3
   cray-libsci/22.05.1         (L)    cray-parallel-netcdf/1.12.3.3 (D)

---- /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-classic-2022.2.1/modulefiles/compiler/intel-classic/2022.2.1 ----
   bacio/2.4.1               gsl/2.7.1                    libpng/1.6.37          sigio/2.3.2
   bufr/11.7.0               hdf5/1.10.6                  met/10.1.2      (D)    sp/2.3.3
   cdo/1.9.8          (D)    hpc-cray-mpich/7.7.20 (L)    metplus/4.1.3          szip/2.1.1
   g2/3.4.5                  ip/3.3.3              (D)    nccmp/1.9.1.0   (D)    udunits/2.2.28
   g2c/1.6.4                 ip/4.0.0                     nco/5.0.6       (D)    w3emc/2.9.2
   g2tmpl/1.10.2             ip2/1.1.2                    nemsio/2.5.4           w3nco/2.4.1
   gfsio/1.4.1               jasper/2.0.25                netcdf/4.7.4           wgrib2/2.0.8   (D)
   gftl-shared/v1.5.0        jpeg/9.1.0                   prod_util/1.2.2        yafyaml/v0.5.1
   grib_util/1.2.4           landsfcutil/2.4.1            sfcio/1.4.1            zlib/1.2.11

---------------------------- /opt/cray/pe/lmod/modulefiles/comnet/intel/19.0/aries/1.0 ----------------------------
   cray-mpich-abi/7.7.20    cray-mpich/7.7.20 (L)

To Reproduce:

In ufs-weather-model/modulefiles, update ufs_gaea.intel.lua to replace or include the following:

prepend_path("MODULEPATH","/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-classic-2022.2.1/modulefiles/stack")
load(pathJoin("hpc", os.getenv("hpc_ver") or "1.2.0"))
load(pathJoin("intel-classic", os.getenv("intel_ver") or "2022.2.1"))
load(pathJoin("hpc-intel-classic", os.getenv("hpc_intel_ver") or "2022.2.1"))
load(pathJoin("hpc-cray-mpich", os.getenv("hpc_cray_mpich_ver") or "7.7.20"))
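
The `os.getenv("...") or "..."` idiom in the lines above lets an environment variable override each default version. A rough shell equivalent, useful for checking which module names would be requested before loading anything, might look like the sketch below. The helper name resolve_module is hypothetical. Lua's `or` falls back only when the variable is unset (an empty string is truthy in Lua), which corresponds to `${var-default}` rather than `${var:-default}`.

```shell
#!/bin/sh
# Hypothetical helper (not part of ufs-weather-model): print "name/version",
# taking the version from the variable named in $2 if it is set, otherwise
# from the default in $3. Mirrors Lua's `os.getenv(var) or default`.
resolve_module() {
    name=$1 var=$2 default=$3
    # ${x-default} (no colon) falls back only when x is unset.
    eval "version=\${$var-\$default}"
    printf '%s/%s\n' "$name" "$version"
}

resolve_module hpc               hpc_ver            1.2.0
resolve_module intel-classic     intel_ver          2022.2.1
resolve_module hpc-intel-classic hpc_intel_ver      2022.2.1
resolve_module hpc-cray-mpich    hpc_cray_mpich_ver 7.7.20
```

Setting one of the variables before running, for example intel_ver, changes only the corresponding line of output.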

ufs_common.lua needs to include the following software versions:

  {["esmf"]        = "8.4.2"},
  {["mapl"]        = "2.35.2-esmf-8.4.2"},

and if testing pio/2.5.10 is needed,

  {["pio"]         = "2.5.10"},

Testing: the RT control_p8 passes the regression test (log attached). The RT directory on Gaea with the logs is /lustre/f2/scratch/role.epic/FV3_RT/rt_17165/control_p8

RT test runs are submitted to the Gaea C4 partition by default, but the compile jobs are not. To run a compile job on the eslogin partition that corresponds to C4, it must be sent to gaea13 or gaea15. For example, the following can be added to the SBATCH directives in ./ufs-weather-model/tests/fv3_conf/compile_slurm.IN_gaea: #SBATCH --nodelist=gaea15
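
As a sketch of the edit described above, the following shell function adds a `--nodelist` directive after the last existing `#SBATCH` line of a job template, unless one is already present. The function name add_nodelist and the sample template contents are assumptions for illustration; the real compile_slurm.IN_gaea will differ.

```shell
#!/bin/sh
# Hypothetical helper: add "#SBATCH --nodelist=<node>" to a Slurm job template.
# Usage: add_nodelist <file> <node>
# If the file has no #SBATCH lines at all, it is left unchanged.
add_nodelist() {
    file=$1 node=$2
    # Do nothing if a nodelist directive is already present.
    if grep -q '^#SBATCH --nodelist=' "$file"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    # The file is read twice: the first pass records the last #SBATCH line,
    # the second pass prints every line and adds the directive after it.
    if awk -v node="$node" '
        NR == FNR { if (/^#SBATCH/) last = FNR; next }
        { print }
        FNR == last { print "#SBATCH --nodelist=" node }
    ' "$file" "$file" > "$tmp"; then
        # Overwrite in place so the original file permissions are kept.
        cat "$tmp" > "$file"
    fi
    rm -f "$tmp"
}

# Demonstration on a made-up template (not the real compile_slurm.IN_gaea).
demo=$(mktemp)
printf '%s\n' '#!/bin/sh' '#SBATCH --job-name=compile' '#SBATCH --time=30' \
    'echo compiling' > "$demo"
add_nodelist "$demo" gaea15
cat "$demo"
rm -f "$demo"
```

Running the function a second time on the same file makes no further change, so it is safe to call from a setup script.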

Additional context

Related discussions and issues: https://github.com/ufs-community/ufs-weather-model/discussions/1666 https://github.com/ufs-community/ufs-weather-model/issues/1724 (closed, but not sure if resolved)

Output

natalie-perlin commented 1 year ago

ESMF support response to an inquiry about ESMF/8.3.0b09 run-time failures: ESMF v8.4.2 is needed when using newer Intel compilers (such as intel-classic/2022.2.1).

The stack has been rebuilt to include esmf/8.4.2 and the corresponding mapl/2.35.2-esmf-8.4.2. This requires adapting ufs_common.lua to include these module changes. The control run control_p8 passes the regression tests (RegressionTests_gaea.intel.log file attached). More tests are underway.

The test cpld_control_p8_mixedmode does not compile with these changes because of errors in pio/2.5.7. With the pio/2.5.10 update, cpld_control_p8_mixedmode compiles but fails at runtime in MAPL. Logs can be found on Gaea in /lustre/f2/scratch/role.epic/FV3_RT/rt_15257/cpld_control_p8_mixedmode/

RegressionTests_gaea.intel.log_control_p8.txt

natalie-perlin commented 1 year ago

Closing the issue for the moment (see https://github.com/ufs-community/ufs-weather-model/issues/1755) and focusing on the ./intel-classic-2022.0.2/ version: /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-classic-2022.0.2/modulefiles/stack