ufs-community / ufs-weather-model

Updating hpc-stack modules and miniconda locations for Hera, Gaea, Cheyenne, Orion, Jet #1465

Closed · natalie-perlin closed this issue 1 year ago

natalie-perlin commented 1 year ago

Description

Update the locations of the hpc-stack modules and miniconda3 used for compiling and running the UFS Weather Model on NOAA HPC systems (Hera, Gaea, Cheyenne, Orion, Jet). The modules are installed under the role.epic account and placed in a common EPIC-managed space on each system. Gaea also uses an Lmod installed locally in the same common location (ufs-srweather-app/PR-352, ufs-srweather-app/PR-353) and needs to run a script to initialize Lmod before loading the modulefile ufs_gaea.intel.lua. While the ufs-weather-model uses python only to a limited extent, the UFS-srweather-app relies heavily on a conda environment.
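For Gaea, the required sequence is roughly the following (a minimal sketch: the ufs-weather-model checkout path is hypothetical; the Lmod_init.sh path is given later in this issue):

# initialize the locally installed Lmod first
source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh
# then point to the model's modulefiles (hypothetical clone location) and load them
module use /path/to/ufs-weather-model/modulefiles
module load ufs_gaea.intel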

For ease of maintaining these libraries on the NOAA HPC systems, both the ufs-weather-model and the ufs-srweather-app need to transition to the new module locations.

Solution

The ufs-weather-model repository is to be updated to point to the new miniconda and hpc-stack library locations.

Updated installation locations have been used to load the modules listed in /ufs-weather-model/modulefiles/ufs_common and to build the UFS model binaries.

UPD. 10/20/2022: Modules for Hera and Jet have been built for the already-tested compiler intel/2022.1.2. Modules for the intel/2022.2.0 compiler/impi also remain in place and can be used when an upgrade is needed.

UPD. 10/24/2022: Modules for the Hera gnu compilers (9.2.0, 10.2.0) with different mpich/openmpi combinations, plus an updated netcdf/4.9.0, have been prepared.

UPD. 12/07/2022: Added a gnu/10.1.0-based hpc-stack on Cheyenne with mpt/2.22, by request.

Cheyenne's Lmod has been upgraded system-wide to v8.7.13 after the system maintenance on 10/21/2022.
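For reference, a typical build against one of the new Hera stacks could look like the following (a sketch only: it assumes the repo's top-level build.sh driver and the ufs_hera.intel modulefile; the APP and CCPP suite selections are illustrative):

cd ufs-weather-model
module use ./modulefiles
module load ufs_hera.intel   # pulls in the role.epic hpc-stack locations listed below
CMAKE_FLAGS="-DAPP=ATM" CCPP_SUITES="FV3_GFS_v17_p8" ./build.sh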

Alternatives

An alternative would be to build the hpc libraries and modules in separate locations for the ufs-weather-model and the ufs-srweather-app. The request from EPIC management, however, was to use a common location for all the libraries.

Related to

PR-419 already exists in the ufs-srweather-app, and a new PR will be made to the current repo.

Updated locations of the conda/python and hpc-stack modules, and how to load them on each system:

Hera python/miniconda:
module use /scratch1/NCEPDEV/nems/role.epic/miniconda3/modulefiles
module load miniconda3/4.12.0

Hera intel/2022.1.2 + impi/2022.1.2:
module load intel/2022.1.2
module load impi/2022.1.2
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2

Hera intel/2022.1.2 + impi/2022.1.2 + netcdf-c 4.9.0:
module load intel/2022.1.2
module load impi/2022.1.2
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2_ncdf49/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2
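A quick way to confirm that the netcdf-c 4.9.0 variant is actually in effect after loading this stack (a hedged check: the exact module name inside the stack may be netcdf or netcdf-c):

module avail netcdf      # list the netcdf modules this stack provides
module load netcdf       # or netcdf-c/4.9.0, depending on the stack's naming
nc-config --version      # expect a 4.9.0 build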

Hera gnu/9.2 + mpich/3.3.2:
module load gnu/9.2
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/9.2
module load mpich/3.3.2
module load hpc-mpich/3.3.2

Hera gnu/10.2 + mpich/3.3.2:
module use /scratch1/NCEPDEV/nems/role.epic/gnu/modulefiles
module load gnu/10.2.0
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/10.2
module load mpich/3.3.2
module load hpc-mpich/3.3.2

Hera gnu/10.2 + openmpi/4.1.2:
module use /scratch1/NCEPDEV/nems/role.epic/gnu/modulefiles
module load gnu/10.2.0
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2_openmpi/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/10.2
module load openmpi/4.1.2
module load hpc-openmpi/4.1.2

Hera gnu/9.2 + mpich/3.3.2 + netcdf-c 4.9.0:
module load gnu/9.2
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_ncdf49/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/9.2
module load mpich/3.3.2
module load hpc-mpich/3.3.2

Hera gnu/10.2 + mpich/3.3.2 + netcdf-c/4.9.0:
module use /scratch1/NCEPDEV/nems/role.epic/gnu/modulefiles
module load gnu/10.2.0
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2_ncdf49/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/10.2
module load mpich/3.3.2
module load hpc-mpich/3.3.2
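After following one of the gnu recipes above, a short sanity check (assuming an interactive shell on Hera; Lmod prints module list to stderr, hence the redirect):

which gcc && gcc --version                               # expect the EPIC-installed gnu, not the system compiler
module list 2>&1 | grep -Ei 'gnu|mpich|openmpi|netcdf'   # confirm the intended combination is loaded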

Gaea miniconda:
module use /lustre/f2/dev/role.epic/contrib/modulefiles
module load miniconda3/4.12.0

Gaea intel: Lmod on Gaea must first be initialized by sourcing the following script:

source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh

Then:

module use /lustre/f2/dev/role.epic/contrib/modulefiles
module load miniconda3/4.12.0

module use /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0/modulefiles/stack
module load hpc/1.2.0
module load intel/2021.3.0
module load hpc-intel/2021.3.0
module load hpc-cray-mpich/7.7.11

Cheyenne miniconda:
module use /glade/work/epicufsrt/contrib/miniconda3/modulefiles
module load miniconda3/4.12.0

Cheyenne intel:
module use /glade/work/epicufsrt/contrib/miniconda3/modulefiles
module load miniconda3/4.12.0

module use /glade/work/epicufsrt/contrib/hpc-stack/intel2022.1/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1
module load hpc-mpt/2.25

Cheyenne gnu/10.1.0 + mpt/2.22:
module use /glade/work/epicufsrt/contrib/hpc-stack/gnu10.1.0_mpt2.22/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/10.1.0
module load hpc-mpt/2.22

Cheyenne gnu/10.1.0:
module use /glade/work/epicufsrt/contrib/hpc-stack/gnu10.1.0/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/10.1.0
module load hpc-mpt/2.25

Cheyenne gnu/11.2.0:
module use /glade/work/epicufsrt/contrib/hpc-stack/gnu11.2.0/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/11.2.0
module load hpc-mpt/2.25

Orion miniconda:
module use /work/noaa/epic-ps/role-epic-ps/miniconda3/modulefiles
module load miniconda3/4.12.0

Orion intel:
module use /work/noaa/epic-ps/role-epic-ps/miniconda3/modulefiles
module load miniconda3/4.12.0

module use /work/noaa/epic-ps/role-epic-ps/hpc-stack/libs/intel-2022.1.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2

Jet miniconda:
module use /mnt/lfs4/HFIP/hfv3gfs/role.epic/miniconda3/modulefiles
module load miniconda3/4.12.0

Jet intel:
module use /mnt/lfs4/HFIP/hfv3gfs/role.epic/miniconda3/modulefiles
module load miniconda3/4.12.0

module use /mnt/lfs4/HFIP/hfv3gfs/role.epic/hpc-stack/libs/intel-2022.1.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2

NB: There were comments in ufs-srweather-app PR-419 suggesting a roll-back to lower compiler versions for Cheyenne gnu (11.2.0 instead of 12.1.0), Hera intel (intel/2022.1.2 instead of 2022.2.0), and Jet intel (intel/2022.1.2 instead of intel/2022.2.0).

Either way would be OK for the SRW, and the libraries will be built for the lower-version compilers as suggested.

jkbk2004 commented 1 year ago

@natalie-perlin Can you make sure all compiler and library versions are confirmed against https://github.com/ufs-community/ufs-weather-model/tree/develop/modulefiles ?

jkbk2004 commented 1 year ago

@ulmononian can we coordinate on intel/gnu/openmpi for Hera in this issue?

natalie-perlin commented 1 year ago

@jkbk2004 The PRs addressing the modulefile changes for the ufs-weather-model have not been made yet; only the ufs-srweather-app PR exists.

natalie-perlin commented 1 year ago

The modulefiles for Hera and Jet have been built to use intel/2022.1.2 rather than the latest 2022.2.0. Updating the info in the top comment of this issue.

DusanJovic-NOAA commented 1 year ago

Can somebody please build the gnu hpc-stack on Hera and Cheyenne using openmpi? Thanks.

ulmononian commented 1 year ago

@DusanJovic-NOAA @jkbk2004 here is a build I did in the past w/ gnu-9.2.0 & openmpi-3.1.4 on Hera:
module use /scratch1/NCEPDEV/stmp2/Cameron.Book/hpcs_work/libs/gnu/stack_noaa/modulefiles/stack

DusanJovic-NOAA commented 1 year ago

> @DusanJovic-NOAA @jkbk2004 here is a build I did in the past w/ gnu-9.2.0 & openmpi-3.1.4 on Hera:
> module use /scratch1/NCEPDEV/stmp2/Cameron.Book/hpcs_work/libs/gnu/stack_noaa/modulefiles/stack

Thanks @ulmononian. I also have a gnu/openmpi stack built in my own space. What I was asking for is an installation in the officially supported location, so that we can update the modulefiles in the develop branch.

junwang-noaa commented 1 year ago

@ulmononian would you please also create an hpc-stack issue on the UPP repo (https://github.com/noaa-emc/upp)? Other workflows (global workflow, HAFS workflow) may also be impacted by this change. @WenMeng-NOAA @aerorahul @WalterKolczynski-NOAA @KateFriedman-NOAA @BinLiu-NOAA FYI.

jkbk2004 commented 1 year ago

@junwang-noaa @ulmononian @WenMeng-NOAA @aerorahul @WalterKolczynski-NOAA @KateFriedman-NOAA @BinLiu-NOAA @natalie-perlin I noticed that Kyle's old stack installations are still used by other applications and on some machines. I have started coordinating on the EPIC side; it may take a week or two to finish the full transition. I want to combine this issue with the other ongoing library-update follow-ups (netcdf/esmf, etc.).

WenMeng-NOAA commented 1 year ago

@jkbk2004 Can you install g2tmpl/1.10.2 for the UPP? Thanks!

jkbk2004 commented 1 year ago

> @jkbk2004 Can you install g2tmpl/1.10.2 for the UPP? Thanks!

@WenMeng-NOAA g2tmpl/1.10.2 is available (in the current ufs-wm modulefiles), but a backward-compatibility issue was captured in issue #1441.

natalie-perlin commented 1 year ago

@DusanJovic-NOAA - hpc-stacks with gnu/9.2.0 + mpich/3.3.2 and gnu/10.2.0 + mpich/3.3.2 have been installed on Hera under the role.epic account (EPIC-managed space). I am testing them with the ufs-weather-model RTs and plan to include these Hera gnu stacks in the module updates.

The stack installation locations are:
/scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2/
/scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2/

Exact modifications to the modulefiles (the paths needed for finding all the modules) will be listed in subsequent PR(s).

DusanJovic-NOAA commented 1 year ago

> @DusanJovic-NOAA - hpc-stacks with gnu/9.2.0 + mpich/3.3.2 and gnu/10.2.0 + mpich/3.3.2 have been installed on Hera under the role.epic account (EPIC-managed space). I am testing them with the ufs-weather-model RTs and plan to include these Hera gnu stacks in the module updates.
>
> The stack installation locations are:
> /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2/
> /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2/
>
> Exact modifications to the modulefiles (the paths needed for finding all the modules) will be listed in subsequent PR(s).

@natalie-perlin Is anyone going to provide gnu/openmpi stack?

jkbk2004 commented 1 year ago

> @DusanJovic-NOAA - hpc-stacks with gnu/9.2.0 + mpich/3.3.2 and gnu/10.2.0 + mpich/3.3.2 have been installed on Hera under the role.epic account (EPIC-managed space). I am testing them with the ufs-weather-model RTs and plan to include these Hera gnu stacks in the module updates. The stack installation locations are: /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2/ and /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2/. Exact modifications to the modulefiles will be listed in subsequent PR(s).
>
> @natalie-perlin Is anyone going to provide gnu/openmpi stack?

@ulmononian can you install gnu/openmpi parallel to the location above?

natalie-perlin commented 1 year ago

@jkbk2004 - do we need all four possible combinations of the compilers (gnu/9.2.0, gnu/10.2.0) with mpich/3.3.2 and openmpi/4.1.2?

jkbk2004 commented 1 year ago

> @jkbk2004 - do we need all four possible combinations of the compilers (gnu/9.2.0, gnu/10.2.0) with mpich/3.3.2 and openmpi/4.1.2?

@natalie-perlin I think @ulmononian has installed gnu10.1/openmpi. That should be good enough as a starting point for the openmpi option. But it makes sense to also make the openmpi installation available under the role-account path.

natalie-perlin commented 1 year ago

@jkbk2004, @ulmononian - HPC modules using different versions of gnu, mpich, and openmpi were installed, plus the new netcdf 4.9.0 versions (netcdf-c/4.9.0, netcdf-fortran/4.6.0, netcdf-cxx/4.3.1), for the following combinations (a lookup tip follows the list):

gnu/9.2.0 + mpich/3.3.2 + netcdf/4.7.4
gnu/9.2.0 + mpich/3.3.2 + netcdf/4.9.0
gnu/10.2.0 + mpich/3.3.2 + netcdf/4.7.4
gnu/10.2.0 + mpich/3.3.2 + netcdf/4.9.0
gnu/10.2.0 + openmpi/4.1.2 + netcdf/4.7.4
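With several parallel stacks installed, Lmod's module spider can locate which versions a given tree provides once its modulefiles path has been added (a hedged example using one of the stacks above):

module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2_ncdf49/modulefiles/stack
# show which netcdf versions are available and what must be loaded first
module spider netcdf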

The stack locations have been updated in the top comment of this issue (#1465).

natalie-perlin commented 1 year ago

Added a stack build with the intel compiler and netcdf4.9 on Hera (see the list of locations in the top comment)

ulmononian commented 1 year ago

@DusanJovic-NOAA @jkbk2004 @natalie-perlin I will install the stack w/ gnu-9.2 and openmpi-3.1.4 here /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs shortly, as well as w/ gnu-10.1 & openmpi-3.1.4 in the official location.

ulmononian commented 1 year ago

@DusanJovic-NOAA @jkbk2004 @natalie-perlin hpc-stack built w/ gnu-9.2 and openmpi-3.1.4 was installed successfully here: /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_openmpi-3.1.4.

DusanJovic-NOAA commented 1 year ago

I tried running the regression test using gnu-9.2_openmpi-3.1.4 stack but it failed because the debug version of esmf library is missing:

$ module load ufs_hera.gnu_debug
Lmod has detected the following error:  The following module(s) are
unknown: "esmf/8.3.0b09-debug"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "esmf/8.3.0b09-debug"

$ ls -l /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_openmpi-3.1.4/modulefiles/mpi/gnu/9.2.0/openmpi/3.1.4/esmf/
total 4
-rw-r--r-- 1 role.epic nems 1365 Oct 28 23:20 8.3.0b09.lua
lrwxrwxrwx 1 role.epic nems   12 Oct 28 23:20 default -> 8.3.0b09.lua
DusanJovic-NOAA commented 1 year ago

I also tried the 'gnu-10.2_openmpi' stack, but it looks like loading it does not actually load the gnu/10.2 module. I see:

$ module list

Currently Loaded Modules:
  1) miniconda3/3.7.3   10) libpng/1.6.37  19) g2tmpl/1.10.0
  2) sutils/default     11) hdf5/1.10.6    20) ip/3.3.3
  3) cmake/3.20.1       12) netcdf/4.7.4   21) sp/2.3.3
  4) hpc/1.2.0          13) pio/2.5.7      22) w3emc/2.9.2
  5) hpc-gnu/10.2       14) esmf/8.3.0b09  23) gftl-shared/v1.5.0
  6) openmpi/4.1.2      15) fms/2022.01    24) mapl/2.22.0-esmf-8.3.0b09
  7) hpc-openmpi/4.1.2  16) bacio/2.4.1    25) ufs_common
  8) jasper/2.0.25      17) crtm/2.4.0     26) ufs_hera.gnu
  9) zlib/1.2.11        18) g2/3.4.5

Note: there is no gnu/10.2 module loaded. When I run gcc, I see the compiler is the system default, version 4.8.5:

$ gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I think this is because, in gnu-10.2_openmpi/modulefiles/core/hpc-gnu/10.2.lua, two lines:

load(compiler)
prereq(compiler)

are missing:

$ cat gnu-10.2_openmpi/modulefiles/core/hpc-gnu/10.2.lua 

...
local compiler = pathJoin("gnu",pkgVersion)

local opt = os.getenv("HPC_OPT") or os.getenv("OPT") or "/opt/modules"
local mpath = pathJoin(opt,"modulefiles/compiler","gnu",pkgVersion)
prepend_path("MODULEPATH", mpath)
...

which are present in:

$ cat gnu-9.2_openmpi-3.1.4/modulefiles/core/hpc-gnu/9.2.0.lua 

...
local compiler = pathJoin("gnu",pkgVersion)
load(compiler)
prereq(compiler)

local opt = os.getenv("HPC_OPT") or os.getenv("OPT") or "/opt/modules"
local mpath = pathJoin(opt,"modulefiles/compiler","gnu",pkgVersion)
prepend_path("MODULEPATH", mpath)
...
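Once the missing load/prereq lines are restored in 10.2.lua, a quick check that the compiler module really gets pulled in could look like this (a sketch following the gnu/10.2 recipe from the issue description):

module purge
module use /scratch1/NCEPDEV/nems/role.epic/gnu/modulefiles
module load gnu/10.2.0
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2_openmpi/modulefiles/stack
module load hpc/1.2.0 hpc-gnu/10.2
gcc --version   # should now report 10.2.0, not the system 4.8.5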
DusanJovic-NOAA commented 1 year ago

There is also an unnecessary inconsistency in the naming of the hpc-gnu module between the two versions:

$ ll gnu-9.2_openmpi-3.1.4/modulefiles/core/hpc-gnu/
total 4
-rw-r--r-- 1 role.epic nems 749 Oct 28 22:07 9.2.0.lua
$ ll gnu-10.2_openmpi/modulefiles/core/hpc-gnu/
total 4
-rw-r--r-- 1 role.epic nems 717 Oct 24 12:59 10.2.lua

Why '10.2' and not '10.2.0'? Also, the 9.2 stack directory name includes the openmpi version, while the directory for the 10.2 stack does not.

ulmononian commented 1 year ago

> I tried running the regression test using gnu-9.2_openmpi-3.1.4 stack but it failed because the debug version of esmf library is missing:
>
> $ module load ufs_hera.gnu_debug
> Lmod has detected the following error:  The following module(s) are
> unknown: "esmf/8.3.0b09-debug"
>
> Please check the spelling or version number. Also try "module spider ..."
> It is also possible your cache file is out-of-date; it may help to try:
>   $ module --ignore_cache load "esmf/8.3.0b09-debug"
>
> $ ls -l /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_openmpi-3.1.4/modulefiles/mpi/gnu/9.2.0/openmpi/3.1.4/esmf/
> total 4
> -rw-r--r-- 1 role.epic nems 1365 Oct 28 23:20 8.3.0b09.lua
> lrwxrwxrwx 1 role.epic nems   12 Oct 28 23:20 default -> 8.3.0b09.lua

My apologies, @DusanJovic-NOAA. I will install esmf/8.3.0b09-debug in /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_openmpi-3.1.4 now and update you when it is finished. We will also address the inconsistency in the naming convention and look into the gnu-10.2 modulefile. Thank you for testing w/ these stacks.

ulmononian commented 1 year ago

@DusanJovic-NOAA the stack at /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_openmpi-3.1.4 has been updated to include esmf/8.3.0b09-debug. I was able to load ufs_common_debug.lua, so hopefully it works for you now!

natalie-perlin commented 1 year ago

@DusanJovic-NOAA, @ulmononian - please note that GNU 10.2.0 is not installed system-wide on Hera; it is only installed locally in the EPIC space. It could have been built under the hpc-stack tree for one particular compiler/mpi/netcdf combination, but because the compiler is shared between several such combinations, it was moved to a common location outside any single hpc-stack installation.

Please note that the directions in the first comment for loading the compilers and the stack spell out how the compiler is loaded! For example, Hera gnu/10.2 + mpich/3.3.2:

module use /scratch1/NCEPDEV/nems/role.epic/gnu/modulefiles
module load gnu/10.2.0
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/10.2
module load mpich/3.3.2
module load hpc-mpich/3.3.2

natalie-perlin commented 1 year ago

The modulefiles for GNU 10.2.0 had to be manually adjusted to allow a customized location for the gnu/10.2.0 compiler, a path that is only added to MODULEPATH when the hpc-stack is explicitly requested as shown above. The stack would not find the compiler "by default" because its modulepath is not known: it is neither the system-wide installation path nor inside the given hpc-stack combination.

I hope this resolves the questions about the use of the GNU 10.2.0 compiler!

natalie-perlin commented 1 year ago

@DusanJovic-NOAA - as to the questions about 9.2 vs. 9.2.0 and 10.2 vs. 10.2.0: that is purely for legacy reasons; previous hpc-stack installations used the XX.X abbreviation. The compiler itself, however, must be loaded with its full version, the way it is installed system-wide, which is 9.2.0 in this case. GNU 10.2.0 was installed in the EPIC space using the full XX.X.X form to match the system-wide gnu/9.2.0 convention. If there is a strong preference for XX.X.X in the stack directory names as well, that could be done relatively easily (reinstalled in a new location).

DusanJovic-NOAA commented 1 year ago

> @DusanJovic-NOAA the stack at /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_openmpi-3.1.4 has been updated to include esmf/8.3.0b09-debug. I was able to load ufs_common_debug.lua, so hopefully it works for you now!

@ulmononian Thanks for adding the debug build of esmf. I ran the control and control_debug regression tests; both finished successfully. The control test outputs are not bit-identical to the baseline; the control_debug outputs are identical. I guess this is expected due to the different MPI library.

DusanJovic-NOAA commented 1 year ago

> @DusanJovic-NOAA, @ulmononian - please note that GNU 10.2.0 is not installed system-wide on Hera; it is only installed locally in the EPIC space. It could have been built under the hpc-stack tree for one particular compiler/mpi/netcdf combination, but because the compiler is shared between several such combinations, it was moved to a common location outside any single hpc-stack installation.
>
> Please note that the directions in the first comment for loading the compilers and the stack spell out how the compiler is loaded! For example, Hera gnu/10.2 + mpich/3.3.2:
> module use /scratch1/NCEPDEV/nems/role.epic/gnu/modulefiles
> module load gnu/10.2.0
> module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2/modulefiles/stack
> module load hpc/1.2.0
> module load hpc-gnu/10.2
> module load mpich/3.3.2
> module load hpc-mpich/3.3.2

@natalie-perlin I tried to run the control and control_debug tests after loading the gnu module from the location above (thanks for explaining this; I missed it in the description). The control test compiled successfully, but failed at run time:

+ sleep 1                                                                                                                            
+ srun --label -n 160 ./fv3.exe                                                                                                      
  1: [h12c01:06674] OPAL ERROR: Unreachable in file ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112                  
 90: [h20c56:12037] OPAL ERROR: Unreachable in file ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112                  
 55: [h12c04:153910] OPAL ERROR: Unreachable in file ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112                 
144: [h21c53:84991] OPAL ERROR: Unreachable in file ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112                  
....
 38: [h12c01:06711] OPAL ERROR: Unreachable in file ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112                  
 43: --------------------------------------------------------------------------                                                      
 43: The application appears to have been direct launched using "srun",                                                              
 43: but OMPI was not built with SLURM's PMI support and therefore cannot                                                            
 43: execute. There are several options for building PMI support under                                                               
 43: SLURM, depending upon the SLURM version you are using:                                                                          
 43:                                                                                                                                 
 43:   version 16.05 or later: you can use SLURM's PMIx support. This                                                                
 43:   requires that you configure and build SLURM --with-pmix.                                                                      
 43:                                                                                                                                 
 43:   Versions earlier than 16.05: you must use either SLURM's PMI-1 or                                                             
 43:   PMI-2 support. SLURM builds PMI-1 by default, or you can manually                                                             
 43:   install PMI-2. You must then build Open MPI using --with-pmi pointing                                                         
 43:   to the SLURM PMI library location.                                                                                            
 43:                                                                                                                                 
 43: Please configure as appropriate and try again.                                                                                  
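A possible way to narrow this down (hedged: whether the PMIx plugin exists depends on how Slurm and this Open MPI were built on Hera):

srun --mpi=list                    # list the PMI interfaces this Slurm supports
srun --mpi=pmix -n 160 ./fv3.exe   # works only if Slurm was built --with-pmix
# or bypass srun and use Open MPI's own launcher inside the allocation:
mpirun -n 160 ./fv3.exe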
DusanJovic-NOAA commented 1 year ago

The debug version of esmf is also missing in the gnu-10.2_openmpi stack:

$ ls -l /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2_openmpi/modulefiles/mpi/gnu/10.2/openmpi/4.1.2/esmf/
total 4
-rw-r--r-- 1 role.epic nems 1365 Oct 24 14:36 8.3.0b09.lua
lrwxrwxrwx 1 role.epic nems   12 Oct 24 14:36 default -> 8.3.0b09.lua
MichaelLueken commented 1 year ago

@natalie-perlin The SRW App was tested on Hera using the intel/2022.1.2 + impi/2022.1.2 + netcdf-c 4.9.0 stack. All fundamental WE2E tests successfully ran. Testing using netcdf/4.9.0 with the /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack stack location causes the SRW WE2E tests to fail while running the forecast due to illegal characters in the NetCDF files.

It would be interesting to see the differences between the two stacks and see why one version works while the other doesn't.

grantfirl commented 1 year ago

I think that this is related to this issue: https://github.com/NCAR/ccpp-physics/discussions/980

MichaelLueken commented 1 year ago

Thanks, @grantfirl! Yes, I was seeing the same issue as described in NCAR/ccpp-physics#980. It is nice to see that this won't be an issue once the stack on Hera is transitioned to @natalie-perlin's new stack.

zach1221 commented 1 year ago

Hi @MichaelLueken I've tested Natalie's instructions above for loading conda/python and hpc-modules on Hera, Gaea, Cheyenne, Orion and Jet. I did not have any issues.

MichaelLueken commented 1 year ago

Thanks, @zach1221! That's great news! Once the new hpc-stack locations for Gaea, Cheyenne, Orion, and Jet are in place in the weather model, @natalie-perlin will be able to update the locations in the SRW App.

zach1221 commented 1 year ago

@natalie-perlin can crtm and gftl-shared be updated to crtm/2.4.0 and gftl-shared/v1.5.0 on Jet? Currently it seems your new module stack location has only crtm/2.3.0 and gftl-shared/1.3.3.

natalie-perlin commented 1 year ago

@MichaelLueken @zach1221 - all resolved for the intel/2022.1.2 on Jet!

zach1221 commented 1 year ago

Hi @natalie-perlin, @MichaelLueken and I continue to have issues with the new stack on Jet. Have you had a chance to try running any regression tests or the SRW yourself using the new stack?

natalie-perlin commented 1 year ago

@zach1221 @MichaelLueken - Please remember to recompile/rebuild the SRW or UFS WM with the new stack (a rebuild sketch follows at the end of this comment).

Yes, I ran the SRW tests with the new stack on Jet. The modulefiles, build directory, and SRW binaries:

/mnt/lfs4/HFIP/hfv3gfs/role.epic/sandbox/SRW/ufs-srw-hpc-noAVXs/modulefiles
/mnt/lfs4/HFIP/hfv3gfs/role.epic/sandbox/SRW/ufs-srw-hpc-noAVXs/build
/mnt/lfs4/HFIP/hfv3gfs/role.epic/sandbox/SRW/ufs-srw-hpc-noAVXs/exec

The four experiments:

/mnt/lfs4/HFIP/hfv3gfs/role.epic/sandbox/SRW/expt_dirs/grid_CONUScompact_25km_VJET_hpc_new/
/mnt/lfs4/HFIP/hfv3gfs/role.epic/sandbox/SRW/expt_dirs/grid_CONUScompact_25km_KJET_hpc_new/
/mnt/lfs4/HFIP/hfv3gfs/role.epic/sandbox/SRW/expt_dirs/grid_CONUScompact_25km_SJET_hpc_new/
/mnt/lfs4/HFIP/hfv3gfs/role.epic/sandbox/SRW/expt_dirs/grid_CONUScompact_25km_XJET_hpc_new/
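For reference, a minimal rebuild against the new stack might look like this (hedged: devbuild.sh and its flags are taken from the SRW App of that era and may differ by release):

cd ufs-srweather-app
./devbuild.sh --platform=jet --compiler=intel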
MichaelLueken commented 1 year ago

@natalie-perlin Thanks! I was able to successfully build and run the SRW App's fundamental WE2E tests on Jet using the new HPC-stack location (the run_fcst job even ran using vjet, which would have led to the job failing previously).

natalie-perlin commented 1 year ago

@jkbk2004 - Gaea modules were not updated