Closed zach1221 closed 1 year ago
I ran these two tests using the current develop branch (with #1633 merged in) and updated pio and esmf (pio/2.5.10, esmf/8.4.1), and the tests passed. The hpc-stack install directory is here: /lustre/f2/dev/Dusan.Jovic/hpc-stack/opt_intel_esmf_841/modulefiles
Awesome! @natalie-perlin can we follow up on this? Sounds like we need to re-install.
@natalie-perlin I mean we can make sure this issue is reflected in the next round of library updates. We clearly need new pio and esmf versions. Let's try to be on the same page about this issue.
@jkbk2004 , @DusanJovic-NOAA - Just to be clear on the issues, there are two things: 1) Failing of the test using the current hpc-stack, based on pio/2.5.7 and esmf/8.3.0b09 (?) 2) Need to update these libraries to pio/2.5.10 and esmf/8.4.1
Is the test failure in (1) caused by new code requirements that need higher versions, i.e., pio/2.5.10 and esmf/8.4.1?
For (2): there is a new installation on Gaea that uses hdf5/1.14.0 + netcdf/4.9.1 + pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2, located at /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_ncdf49
Could you please run the test using this stack and let me know if this helps (note the new modules/versions to update in the modulefiles). The S2SWA code compiles successfully with the new stack.
@natalie-perlin I can test out the new installation on Gaea and let you know if it is successful.
@natalie-perlin I'm having some issues testing with the new installation.
New installation: /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_ncdf49
Note: I'm testing with the ufs-wm RT cases cpld_control_ciceC_p8 & cpld_control_c192_p8.
My working directory logs: /lustre/f2/pdata/ncep/Zachary.Shrader/ufs-weather-model/tests/log_gaea.intel
Experiment directory: /lustre/f2/scratch/Zachary.Shrader/FV3_RT/rt_12942
The compile finishes successfully; however, it fails right before run_test, so there's no specific error, just the output below.
I updated ufs_gaea.intel.lua in ufs-weather-model/modulefiles to include the new modulepath and updated ufs_common.lua to include the new versions of hdf5/1.14.0 + netcdf/4.9.1 +pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2.
@zach1221 , the error is in rt.sh (line 801) it cannot find rt_*.log files.
Description
When running the regression test suite on Gaea, cpld_control_ciceC_p8 & cpld_control_c192_p8 fail due to an esmf/pio-related error.
These cases, along with cpld_restart_c192_p8, have been disabled for Gaea until the issue can be resolved.
To Reproduce:
What compilers/machines are you seeing this with? Intel, on Gaea.
- Log into Gaea
- clone ufs-weather-model repo
- cd into ufs-weather-model/tests
- run ./rt.sh -n cpld_control_ciceC_p8, and again for cpld_control_c192_p8
Additional context
Output
Screenshots
complains about esmf/pio stack libraries.
Are you using the standard hpc-stack installation location in the original issue? It looks like there is indeed an issue with esmf, which points at a later esmf/8.5.0b17 installation instead of the standard esmf/8.3.0b09. Which esmf version are you loading in your test?
The issue has been identified and fixed. The default modulefile was pointing to a later installation (default -> 8.5.0b17.lua). It now points to the standard version (default -> 8.3.0b09.lua).
This hopefully resolves the original issue with the stack in
/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch
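For anyone who wants to double-check which version a `default` link resolves to before loading esmf, here is a minimal sketch. It assumes the default is marked by a `default` symlink inside the esmf modulefile directory, as the `default -> 8.3.0b09.lua` listing above suggests. The function name `default_module_target` and the variable `ESMF_MODULE_DIR` are hypothetical; the exact modulefile directory is not given in this thread.

```shell
# Hypothetical helper: print the target of the `default` symlink in a
# modulefile directory, or fail if no such symlink exists.
default_module_target() {
  dir="$1"
  if [ -L "$dir/default" ]; then
    # readlink prints the link target, e.g. 8.3.0b09.lua
    readlink "$dir/default"
  else
    echo "no default symlink in $dir" >&2
    return 1
  fi
}

# Usage sketch (ESMF_MODULE_DIR is a placeholder, not a path from this thread):
# default_module_target "$ESMF_MODULE_DIR"
```

If this prints something other than the expected version, loading esmf without an explicit version would pick up the unexpected one.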
@natalie-perlin still receiving "the error is in rt.sh (line 801) it cannot find rt_*.log files." with these two cases on Gaea. Trying to troubleshoot and dig up additional info.
@zach1221 - if you are testing only the build ("compile") but not the run, you may not have any rt_*log files, which are created after the "run" phase. Maybe adding a conditional check if such files are present could help to avoid the error.
Your log file /lustre/f2/pdata/ncep/Zachary.Shrader/ufs-weather-model/tests/log_gaea.intel/compile_001.log reports test completed:
16 min. TEST 001 compile is COMPLETED, status: - jobid 75128296
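The conditional check suggested above could look roughly like the sketch below. It is not the actual code at rt.sh line 801, which is not shown in this thread; the function name `count_rt_logs` and the `LOG_DIR` variable are hypothetical. The idea is only to count rt_*.log files safely, so that a caller can skip the step that needs them when the run phase has not produced any.

```shell
# Hypothetical helper: count rt_*.log files in a directory without
# erroring out when there are none.
count_rt_logs() {
  dir="$1"
  count=0
  for f in "$dir"/rt_*.log; do
    # If the glob matches nothing, the shell leaves the pattern as-is,
    # so confirm the path really exists before counting it.
    if [ -e "$f" ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}

# Usage sketch (LOG_DIR is a placeholder):
# if [ "$(count_rt_logs "$LOG_DIR")" -gt 0 ]; then
#   cat "$LOG_DIR"/rt_*.log
# fi
```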
@natalie-perlin I'm trying to run the case as well, after compiling, but it gets caught after the compile completes with the rt_*log error. I'll investigate further and update when I've found the cause.
Apologies for the delay, @natalie-perlin . Tests cpld_control_ciceC_p8 & cpld_control_c192_p8 worked for me on Gaea using pio/2.5.10, and esmf/8.4.1 from @DusanJovic-NOAA's installation he mentioned above. Maybe this is the direction we should go in for updating esmf/pio on Gaea? I couldn't get the standard/current version of pio/2.5.7 & esmf/8.3.0b09 to work.
@zach1221 @DusanJovic-NOAA
A new build of the hpc-stack on Gaea in EPIC location is available:
/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_ncdf492/
It includes hdf5/1.14.0 + netcdf/4.9.2 + pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2
@DusanJovic-NOAA - does your stack build use netcdf/4.9.1 or netcdf/4.9.2?
netcdf/4.7.4
The install directory is here: /lustre/f2/dev/Dusan.Jovic/hpc-stack/opt_intel_esmf_841/modulefiles
Hi, @natalie-perlin I attempted cpld_control_ciceC_p8 using your latest installation of hpc-stack on Gaea, which includes hdf5/1.14.0 + netcdf/4.9.2 + pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2. The compile is successful, but the test run fails. It seems like the issue may be related to the mapl version?
I did test again with the combination of hdf5/1.10.6 + netcdf/4.7.4 +pio/2.5.10 + esmf/8.4.1 + mapl/2.22.0, and the cases cpld_control_ciceC_p8 and cpld_control_c192_p8 pass fine. Could we use this configuration on Gaea currently or does hdf5, netcdf and mapl also need to be updated with esmf/pio?
The gocart failure looks similar to https://github.com/ufs-community/ufs-weather-model/issues/1629
Thanks, @DusanJovic-NOAA . It does look similar, and based on what I'm reading from issue 1621, there may be an outstanding problem.
@natalie-perlin There could currently be a GOCART-related issue with running some tests using the library stack based on (hdf5/1.14.0 + netcdf/4.9.2 + pio/2.5.10 + esmf/8.4.1 + mapl/2.35.2). This may also be a reason to go with the alternative (hdf5/1.10.6 + netcdf/4.7.4 + pio/2.5.10 + esmf/8.4.1 + mapl/2.22.0).
@zach1221 @DusanJovic-NOAA @jkbk2004 - Preparing an additional configuration on Gaea as suggested by Zach (hdf5/1.10.6 + netcdf/4.7.4 +pio/2.5.10 + esmf/8.4.1 + mapl/2.22.0-esmf-8.4.1)...
@zach1221 - Ready for Gaea in the stack:
/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/
Verifying loading the modules:
source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh
module use /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/modulefiles/stack/
module load hpc/1.2.0
module load hpc-intel
module load hpc-cray-mpich/7.7.11
module load hdf5/1.10.6
module load netcdf/4.7.4
module load pio/2.5.10
module load esmf/8.4.1
module load mapl/2.22.0-esmf-8.4.1
module list
Currently Loaded Modules:
1) modules/3.2.11.4
2) CmrsEnv
3) TimeZoneEDT
...
25) hpc/1.2.0
26) intel/2021.3.0
27) hpc-intel/2021.3.0
28) cray-mpich/7.7.11
29) hpc-cray-mpich/7.7.11
30) hdf5/1.10.6
31) netcdf/4.7.4
32) pio/2.5.10
33) esmf/8.4.1
34) mapl/2.22.0-esmf-8.4.1
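To make the verification above repeatable, one could save the `module list` output and check it for the expected versions. This is only a sketch: the function name `missing_modules` and the file name `loaded.txt` are hypothetical, and it assumes the listing was captured from stderr, which is where Lmod normally prints it.

```shell
# Hypothetical helper: print each required module that does NOT appear
# in a saved `module list` output file.
missing_modules() {
  listing="$1"
  shift
  for mod in "$@"; do
    # -F: treat the name as a fixed string (dots are not wildcards)
    # -w: whole-word match, so pio/2.5.1 does not match pio/2.5.10
    # -q: quiet, only the exit status matters
    if ! grep -Fqw -- "$mod" "$listing"; then
      echo "$mod"
    fi
  done
}

# Usage sketch:
# module list 2> loaded.txt
# missing_modules loaded.txt hdf5/1.10.6 netcdf/4.7.4 pio/2.5.10 \
#   esmf/8.4.1 mapl/2.22.0-esmf-8.4.1
```

Empty output means every listed module was found. Note that the whole-word match treats `-` as a word boundary, so `mapl/2.22.0` would also match `mapl/2.22.0-esmf-8.4.1`.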
Thanks @natalie-perlin I will test this today!
@natalie-perlin I've added modules/3.2.11.4 to the modulefile for Gaea, but it seems Lmod is unable to locate it.
Here's my modulefile setup. ufs_gaea.intel.txt
@zach1221 -
There is no need to add this explicitly, as it is one of the system modules. All the system modules are loaded during Lmod initialization, when the command source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh in the first line of my code snippet is executed.
@zach1221
Please verify that the Lmod initialization is run on Gaea before the modules are loaded. You could test that the modules are loaded properly, after your steps 1-3, as follows:
export MACHINE_ID=gaea
source tests/module-setup.sh
module use modulefiles
module load ufs_gaea.intel
You could also build the code needed for the cpld_control_ciceC_p8 test, which builds with no issues:
export CMAKE_FLAGS="-DAPP=S2SWA -DCCPP_SUITES=FV3_GFS_v17_coupled_p8,FV3_GFS_cpld_rasmgshocnsstnoahmp_ugwp"
./build.sh
It emits some warnings at the end of the build, but it builds the executable:
ld: /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/intel-2021.3.0/cray-mpich-7.7.11/esmf/8.4.1/lib/libesmf.a(ESMCI_MethodTable.o): in function `ESMCI::MethodElement::resolve()':
/lustre/f2/dev/role.epic/contrib/hpc-stack/src-intel-2021.3.0_noarch/pkg/v8.4.1/src/Superstructure/Component/src/ESMCI_MethodTable.C:400: warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
ld: /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/intel-2021.3.0/cray-mpich-7.7.11/esmf/8.4.1/lib/libesmf.a(ESMCI_VMKernel.o): in function `ESMCI::socketClientInit(char const*, int, double)':
/lustre/f2/dev/role.epic/contrib/hpc-stack/src-intel-2021.3.0_noarch/pkg/v8.4.1/src/Infrastructure/VM/src/ESMCI_VMKernel.C:7785: warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
[100%] Built target ufs_model
@zach1221 for testing purposes on Gaea, instead of sourcing module-setup.sh, you could just source the Lmod initialize script, and then load the ufs_gaea.intel:
source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh
module use <ufs-weather-model>/modulefiles
module load ufs_gaea.intel
Hi, @natalie-perlin
I've removed the "modules/3.2.11.4" module from the Gaea modulefile. My next attempt failed as it was unable to load esmf/8.4.1.
Steps to reproduce.
- clone ufs-wm community:dev repo
- cd ufs-weather-model/modulefiles
- edit ufs_common.lua change version numbers of mapl to "mapl/2.22.0-esmf-8.4.1", pio to "pio/2.5.10", and esmf to "esmf/8.4.1".
- edit ufs_gaea.intel.lua to add module path "/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/modulefiles/stack"
- cd ufs-weather-model/tests
- enable tests cpld_control_ciceC_p8 & cpld_control_c192_p8 to run on Gaea by editing rt.conf
I started an rt.sh job as well, and the compilation finished successfully; see
/lustre/f2/dev/role.epic/sandbox/UFS-WM/ufs-wm-dev/tests/log_gaea.intel/compile_001.log
@natalie-perlin I've got it working now. I'll update you here as soon as the tests pass.
@natalie-perlin latest stack installation at /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0_noarch/ seems to have resolved the issue regarding cpld_control_ciceC_p8 & cpld_control_c192_p8 on Gaea. fyi @jkbk2004
@natalie-perlin thanks for the information, let me try again when GAEA is back
@zach1221 - Please update the status for the tests on Gaea. A note regarding WM Issue-1724 (https://github.com/ufs-community/ufs-weather-model/issues/1724): it is a separate issue, and requires new stacks to be built with the compilers available on both C3 and C4 partitions.
@natalie-perlin is intel/2022.2.1 available on both of C4 and C5?
@jkbk2004 @natalie-perlin cpld_control_ciceC_p8 & cpld_control_c192_p8 fail on c4 due to the inability to load intel/2021.3.0. I'm testing on c3 now.
Does the c3 partition not have the resources to run these tests? I receive the error below when attempting cpld_control_ciceC_p8 & cpld_control_c192_p8 on c3.
C3 has a different architecture.
It sounds like intel-2022.0.2/classic/oneapi will be the most practical option.
@natalie-perlin is intel/2022.2.1 available on both of C4 and C5?
C4 (gaea13 check): intel-classic/2022.2.1
C5 (gaea55 check): intel-classic/2022.2.1
Yes, same name on both C4 and C5 partitions
Ok, I can confirm that cpld_control_ciceC_p8 passes now when testing on ufs-community : develop. However, cpld_control_c192_p8 still fails with the error below. The err log is here: /lustre/f2/scratch/Zachary.Shrader/FV3_RT/rt_20605/cpld_control_c192_p8
Both cpld_control_ciceC_p8 and cpld_control_c192_p8 run successfully now on Gaea. Logs: /lustre/f2/pdata/ncep/Zachary.Shrader/ufs-weather-model/tests/logs/RegressionTests_gaea.log
Gaea has been re-enabled for these two tests in ufs-wm #1912 .