ufs-community / ufs-weather-model

UFS Weather Model

Esmf/pio issue on Gaea, causing failure of cpld_control_ciceC_p8 & cpld_control_c192_p8 #1683

Closed zach1221 closed 1 year ago

zach1221 commented 1 year ago

Description

When running the regression test suite on Gaea, cpld_control_ciceC_p8 and cpld_control_c192_p8 fail with an esmf/pio-related error.

These cases, along with cpld_restart_c192_p8, have been disabled for Gaea until the issue can be resolved.

To Reproduce:

What compilers/machines are you seeing this with? Intel

  1. Log into Gaea
  2. clone ufs-weather-model repo
  3. cd into ufs-weather-model/tests
  4. run ./rt.sh -n <test> for each of cpld_control_ciceC_p8 and cpld_control_c192_p8

Additional context

Output

Screenshots

Gaea_err [screenshot]: the error complains about the esmf/pio stack libraries.


DusanJovic-NOAA commented 1 year ago

I ran these two tests on the current develop branch (with #1633 merged in) using updated pio and esmf (pio/2.5.10, esmf/8.4.1), and the tests passed. The hpc-stack install directory is /lustre/f2/dev/Dusan.Jovic/hpc-stack/opt_intel_esmf_841/modulefiles

jkbk2004 commented 1 year ago

Awesome! @natalie-perlin can we follow up on this? Sounds like we need to re-install.

jkbk2004 commented 1 year ago

@natalie-perlin I mean we can make sure this issue is addressed in the next round of library updates. We clearly need new pio and esmf versions. Let's try to be on the same page about this issue.

natalie-perlin commented 1 year ago

@jkbk2004 , @DusanJovic-NOAA - Just to be clear, there are two issues:
  1. The test fails using the current hpc-stack, based on pio/2.5.7 and esmf/8.3.0b09 (?)
  2. These libraries need to be updated to pio/2.5.10 and esmf/8.4.1

Is the test failure in (1) caused by new code requirements that need higher versions, i.e., pio/2.5.10 and esmf/8.4.1?

For (2), there is a new installation on Gaea that uses hdf5/1.14.0 + netcdf/4.9.1 + pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2:
/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_ncdf49

Could you please run the test using this stack and let me know if this helps? (Note the new modules/versions to update in the modulefiles.) The S2SWA code compiles successfully with the new stack.

zach1221 commented 1 year ago

@natalie-perlin I can test out the new installation on Gaea and let you know if it is successful.

zach1221 commented 1 year ago

@natalie-perlin I'm having some issues testing with the new installation.

New installation: /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_ncdf49
Note: I'm testing with ufs-wm RT cases cpld_control_ciceC_p8 & cpld_control_c192_p8.
My working directory logs: /lustre/f2/pdata/ncep/Zachary.Shrader/ufs-weather-model/tests/log_gaea.intel
Experiment directory: /lustre/f2/scratch/Zachary.Shrader/FV3_RT/rt_12942

The compile finishes successfully, but the job fails right before run_test, so there is no specific error beyond the output below.

err3

I updated ufs_gaea.intel.lua in ufs-weather-model/modulefiles to include the new modulepath and updated ufs_common.lua to include the new versions of hdf5/1.14.0 + netcdf/4.9.1 +pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2.

natalie-perlin commented 1 year ago

@zach1221 , the error is in rt.sh (line 801): it cannot find the rt_*.log files.

natalie-perlin commented 1 year ago

Are you using the standard hpc-stack installation location in the original issue? It looks like there is indeed an issue with esmf: the stack points at a later esmf/8.5.0b17 installation instead of the standard esmf/8.3.0b09. Which esmf version are you loading in your test?

natalie-perlin commented 1 year ago

The issue has been identified and fixed. The default modulefile was pointing to a later installation (default -> 8.5.0b17.lua). It now points to the standard version (default -> 8.3.0b09.lua). This hopefully resolves the original issue with the stack in /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch
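For readers unfamiliar with this kind of fix, the `default` link described above can be inspected and repointed from the shell. The following is only a sketch in a scratch directory: the helper name `repoint_default` and the directory layout are assumptions, and only the two version file names come from the thread.

```shell
#!/bin/sh
# Hypothetical helper: make <dir>/default point at <version>.lua.
repoint_default() {
  dir=$1
  version=$2
  # Refuse to create a dangling link if the target modulefile is absent.
  [ -f "$dir/$version.lua" ] || return 1
  # -s: symbolic link, -f: replace an existing link, -n: do not follow the old link.
  ln -sfn "$version.lua" "$dir/default"
}

# Scratch stand-in for an esmf modulefile directory (not the real Gaea path).
mod_dir=$(mktemp -d)
touch "$mod_dir/8.3.0b09.lua" "$mod_dir/8.5.0b17.lua"

# Reproduce the reported state, then repoint to the standard version.
ln -s 8.5.0b17.lua "$mod_dir/default"
echo "before: default -> $(readlink "$mod_dir/default")"
repoint_default "$mod_dir" 8.3.0b09
echo "after:  default -> $(readlink "$mod_dir/default")"
rm -rf "$mod_dir"
```

Running `readlink` on the `default` entry before loading esmf is a quick way to see which version an unversioned `module load esmf` would pick up.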

zach1221 commented 1 year ago

@natalie-perlin still receiving "the error is in rt.sh (line 801) it cannot find rt_*.log files." with these two cases on Gaea. Trying to troubleshoot and dig up additional info.

natalie-perlin commented 1 year ago

@zach1221 - if you are testing only the build ("compile") but not the run, you may not have any rt_*.log files, which are created after the "run" phase. Adding a conditional check for whether such files are present could help avoid the error.
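The conditional check suggested above might look like the following minimal sketch. The function name `have_rt_logs`, the example log file name, and the scratch directory are hypothetical; this is not the actual code at line 801 of rt.sh.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if at least one rt_*.log file exists in a directory.
have_rt_logs() {
  log_dir=$1
  for f in "$log_dir"/rt_*.log; do
    # If the glob matched nothing, the pattern stays unexpanded and -e fails.
    [ -e "$f" ] && return 0
  done
  return 1
}

# Demonstration in a scratch directory (not the real log_gaea.intel path).
demo_dir=$(mktemp -d)
if have_rt_logs "$demo_dir"; then
  echo "found run logs"
else
  echo "no rt_*.log files yet; skipping log collection"
fi

# After a (simulated) run phase has written a log, the check succeeds.
touch "$demo_dir/rt_001_example.log"
if have_rt_logs "$demo_dir"; then
  echo "found run logs"
fi
rm -rf "$demo_dir"
```

Guarding the log-collection step this way lets a compile-only invocation finish cleanly instead of failing on a missing file.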

Your log file /lustre/f2/pdata/ncep/Zachary.Shrader/ufs-weather-model/tests/log_gaea.intel/compile_001.log reports test completed: 16 min. TEST 001 compile is COMPLETED, status: - jobid 75128296

zach1221 commented 1 year ago

@natalie-perlin I'm trying to run the case as well, after compiling, but it stops after the compile completes with the rt_*.log error. I'll investigate further and update when I've found the cause.

zach1221 commented 1 year ago

Apologies for the delay, @natalie-perlin . Tests cpld_control_ciceC_p8 & cpld_control_c192_p8 worked for me on Gaea using pio/2.5.10, and esmf/8.4.1 from @DusanJovic-NOAA's installation he mentioned above. Maybe this is the direction we should go in for updating esmf/pio on Gaea? I couldn't get the standard/current version of pio/2.5.7 & esmf/8.3.0b09 to work.

natalie-perlin commented 1 year ago

@zach1221 @DusanJovic-NOAA A new build of the hpc-stack on Gaea is available in the EPIC location: /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_ncdf492/ It includes hdf5/1.14.0 + netcdf/4.9.2 + pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2.

@DusanJovic-NOAA - does your stack build use netcdf/4.9.1 or netcdf/4.9.2?

DusanJovic-NOAA commented 1 year ago

netcdf/4.7.4

The install directory is here: /lustre/f2/dev/Dusan.Jovic/hpc-stack/opt_intel_esmf_841/modulefiles

zach1221 commented 1 year ago

Hi, @natalie-perlin I attempted cpld_control_ciceC_p8 using your latest installation of hpc-stack on Gaea, which includes hdf5/1.14.0 + netcdf/4.9.2 + pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2. The compile is successful, but the test run fails. Could the issue be related to the mapl version? [screenshot]

zach1221 commented 1 year ago

I did test again with the combination of hdf5/1.10.6 + netcdf/4.7.4 +pio/2.5.10 + esmf/8.4.1 + mapl/2.22.0, and the cases cpld_control_ciceC_p8 and cpld_control_c192_p8 pass fine. Could we use this configuration on Gaea currently or does hdf5, netcdf and mapl also need to be updated with esmf/pio?

DusanJovic-NOAA commented 1 year ago

The gocart failure looks similar to https://github.com/ufs-community/ufs-weather-model/issues/1629

zach1221 commented 1 year ago

Thanks, @DusanJovic-NOAA . It does look similar, and based on what I'm reading from issue 1621, there may be an outstanding problem.

@natalie-perlin There could currently be a GOCART-related issue with running some tests using the library stack based on (hdf5/1.14.0 + netcdf/4.9.2 + pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2). This may also be a reason to go with the alternative (hdf5/1.10.6 + netcdf/4.7.4 + pio/2.5.10 + esmf/8.4.1 + mapl/2.22.0).

natalie-perlin commented 1 year ago

@zach1221 @DusanJovic-NOAA @jkbk2004 - Preparing an additional configuration on Gaea as suggested by Zach (hdf5/1.10.6 + netcdf/4.7.4 +pio/2.5.10 + esmf/8.4.1 + mapl/2.22.0-esmf-8.4.1)...

natalie-perlin commented 1 year ago

@zach1221 - Ready for Gaea in the stack: /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/

Verifying loading the modules:

source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh
module use /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/modulefiles/stack/
module load hpc/1.2.0
module load hpc-intel
module load hpc-cray-mpich/7.7.11
module load hdf5/1.10.6
module load netcdf/4.7.4
module load pio/2.5.10
module load esmf/8.4.1
module load mapl/2.22.0-esmf-8.4.1
module list 

Currently Loaded Modules:
  1) modules/3.2.11.4
  2) CmrsEnv
  3) TimeZoneEDT
...
 25) hpc/1.2.0
 26) intel/2021.3.0
 27) hpc-intel/2021.3.0
 28) cray-mpich/7.7.11
 29) hpc-cray-mpich/7.7.11
 30) hdf5/1.10.6
 31) netcdf/4.7.4
 32) pio/2.5.10
 33) esmf/8.4.1
 34) mapl/2.22.0-esmf-8.4.1

zach1221 commented 1 year ago

Thanks @natalie-perlin I will test this today!

zach1221 commented 1 year ago

@natalie-perlin I've added modules/3.2.11.4 to the modulefile for Gaea, but it seems Lmod is unable to locate it. [screenshot]

Here's my modulefile setup: ufs_gaea.intel.txt

natalie-perlin commented 1 year ago

@zach1221 - There is no need to add this explicitly, as it is one of the system modules. All the system modules are loaded during Lmod initialization when the command source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh in the first line of my code snippet is executed.
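One way to confirm that the expected stack modules are present after initialization is to compare the text printed by `module list` against a list of required names. The sketch below is illustrative only: the helper `missing_modules` is hypothetical, and the sample listing is copied from the output earlier in this thread rather than captured live.

```shell
#!/bin/sh
# Hypothetical helper: print each expected module that does not appear as a
# whole token in text formatted like the output of `module list`.
missing_modules() {
  listing=$1
  shift
  for m in "$@"; do
    # Split the listing into one token per line, then require an exact match,
    # so that pio/2.5.1 would not be satisfied by pio/2.5.10.
    printf '%s\n' "$listing" | tr -s ' \t' '\n\n' | grep -qxF -- "$m" || echo "$m"
  done
}

# Sample taken from the `module list` output shown above.
sample=' 30) hdf5/1.10.6
 31) netcdf/4.7.4
 32) pio/2.5.10
 33) esmf/8.4.1
 34) mapl/2.22.0-esmf-8.4.1'

# Prints nothing when every expected module is present.
missing_modules "$sample" pio/2.5.10 esmf/8.4.1 mapl/2.22.0-esmf-8.4.1

# Prints the name of any module that is absent from the listing.
missing_modules "$sample" esmf/8.3.0b09
```

On a live system the listing would need to be captured first, for example with `module list 2>&1`, on the assumption that Lmod writes its listing to standard error as it normally does.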

zach1221 commented 1 year ago

Hi, @natalie-perlin

I've removed the "modules/3.2.11.4" module from the Gaea modulefile. My next attempt failed because it was unable to load esmf/8.4.1. [screenshot]

Steps to reproduce.

  1. clone ufs-wm community:dev repo
  2. cd ufs-weather-model/modulefiles
  3. edit ufs_common.lua change version numbers of mapl to "mapl/2.22.0-esmf-8.4.1", pio to "pio/2.5.10", and esmf to "esmf/8.4.1".
  4. edit ufs_gaea.intel.lua to add module path "/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/modulefiles/stack"
  5. cd ufs-weather-model/tests
  6. enable tests cpld_control_ciceC_p8 & cpld_control_c192_p8 to run on Gaea by editing rt.conf

natalie-perlin commented 1 year ago

@zach1221

Please verify that the Lmod initialization is run on Gaea before the modules are loaded. You could test that the modules load properly, after your steps 1-3, as follows:

export MACHINE_ID=gaea
source tests/module-setup.sh
module use modulefiles
module load ufs_gaea.intel

You could also build the code needed for the cpld_control_ciceC_p8 test, which builds with no issues:

export CMAKE_FLAGS="-DAPP=S2SWA 
 -DCCPP_SUITES=FV3_GFS_v17_coupled_p8,FV3_GFS_cpld_rasmgshocnsstnoahmp_ugwp"
 ./build.sh

It gives some warnings at the end of the build, but it builds the executable:

ld: /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/intel-2021.3.0/cray-mpich-7.7.11/esmf/8.4.1/lib/libesmf.a(ESMCI_MethodTable.o): in function `ESMCI::MethodElement::resolve()':
/lustre/f2/dev/role.epic/contrib/hpc-stack/src-intel-2021.3.0_noarch/pkg/v8.4.1/src/Superstructure/Component/src/ESMCI_MethodTable.C:400: warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
ld: /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/intel-2021.3.0/cray-mpich-7.7.11/esmf/8.4.1/lib/libesmf.a(ESMCI_VMKernel.o): in function `ESMCI::socketClientInit(char const*, int, double)':
/lustre/f2/dev/role.epic/contrib/hpc-stack/src-intel-2021.3.0_noarch/pkg/v8.4.1/src/Infrastructure/VM/src/ESMCI_VMKernel.C:7785: warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
[100%] Built target ufs_model

natalie-perlin commented 1 year ago

@zach1221 for testing purposes on Gaea, instead of sourcing module-setup.sh, you could just source the Lmod initialize script, and then load the ufs_gaea.intel:

source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh
module use <ufs-weather-model>/modulefiles
module load ufs_gaea.intel

natalie-perlin commented 1 year ago

Started rt.sh job as well, the compilation is finished successfully, see /lustre/f2/dev/role.epic/sandbox/UFS-WM/ufs-wm-dev/tests/log_gaea.intel/compile_001.log

zach1221 commented 1 year ago

@natalie-perlin I've got it working now. I'll update you here as soon as the tests pass.

zach1221 commented 1 year ago

@natalie-perlin latest stack installation at /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/ seems to have resolved the issue regarding cpld_control_ciceC_p8 & cpld_control_c192_p8 on Gaea. fyi @jkbk2004

jiandewang commented 1 year ago

@natalie-perlin thanks for the information, let me try again when GAEA is back

natalie-perlin commented 1 year ago

@zach1221 - Please update the status for the tests on Gaea.

A note regarding WM Issue-1724 (https://github.com/ufs-community/ufs-weather-model/issues/1724): it is a separate issue, and requires new stacks to be built with the compilers available on both C3 and C4 partitions.

jkbk2004 commented 1 year ago

@natalie-perlin is intel/2022.2.1 available on both C4 and C5?

zach1221 commented 1 year ago

@jkbk2004 @natalie-perlin cpld_control_ciceC_p8 & cpld_control_c192_p8 fail on c4 due to the inability to load intel/2021.3.0. I'm testing on c3 now.

zach1221 commented 1 year ago

Does the c3 partition not have the resources to run these tests? I receive the error below when attempting cpld_control_ciceC_p8 & cpld_control_c192_p8 on c3. [screenshot]

jkbk2004 commented 1 year ago

C3 has a different architecture.

jkbk2004 commented 1 year ago

It sounds like intel-2022.0.2/classic/oneapi will be the most practical option.

natalie-perlin commented 1 year ago

@natalie-perlin is intel/2022.2.1 available on both of C4 and C5?

C4 (gaea13 check): intel-classic/2022.2.1
C5 (gaea55 check): intel-classic/2022.2.1

Yes, same name on both C4 and C5 partitions

zach1221 commented 1 year ago

OK, I can confirm that cpld_control_ciceC_p8 now passes when testing on ufs-community:develop. However, cpld_control_c192_p8 still fails with the error below. Error log: /lustre/f2/scratch/Zachary.Shrader/FV3_RT/rt_20605/cpld_control_c192_p8 [screenshot]

zach1221 commented 1 year ago

Both cpld_control_ciceC_p8 and cpld_control_c192_p8 run successfully now on Gaea. Logs: /lustre/f2/pdata/ncep/Zachary.Shrader/ufs-weather-model/tests/logs/RegressionTests_gaea.log

Gaea has been re-enabled for these two tests in ufs-wm #1912 .