ufs-community / ufs-weather-model

UFS Weather Model

p8 / p5 tag issue on gaea: CPC experiment support #1755

Closed jkbk2004 closed 8 months ago

jkbk2004 commented 1 year ago

Description

Solution

jieshunzhu commented 1 year ago

Thanks for your help in advance.

I cloned tags/Prototype-P8 on Gaea. In tests/, I replaced module-setup.sh with the one from the develop branch. The main difference is "source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh". But when I compiled it, I got the following error:
++++++++++++
Lmod has detected the following error: The following module(s) are unknown: "intel/2021.3.0" "gcc/8.3.0" "intel/18.0.6.288" "PrgEnv-intel/6.0.5" "cray-python/3.7.3.2"
++++++++++++

My directory is /lustre/f2/scratch/ncep/JieShun.Zhu/FV3_RT/rt_22995/compile_001

In addition, my default shell is tcsh. Before compiling, I switched to bash by typing "bash".

natalie-perlin commented 1 year ago

@jieshunzhu - Following the recent Gaea updates, the modules "intel/2021.3.0" "gcc/8.3.0" "intel/18.0.6.288" "PrgEnv-intel/6.0.5" "cray-python/3.7.3.2" are no longer available on either the C3 or C4 partitions. Please see the notes on stack changes for Gaea in WM-issue #1753
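A quick way to confirm which of the requested modules have disappeared is to ask Lmod directly. The sketch below is only an illustration: it assumes Lmod's `module is-avail` subcommand is present on the system, and the helper names `check_module` and `missing_modules` are made up for this example.

```shell
# Return success if the named module can be loaded.
# `module is-avail` is an Lmod subcommand; on a machine without Lmod this
# simply fails, so every module is reported as missing.
check_module() {
    module is-avail "$1" >/dev/null 2>&1
}

# Print each requested module that is not available, one per line.
missing_modules() {
    for m in "$@"; do
        check_module "$m" || printf '%s\n' "$m"
    done
}

# Example with the versions from the error message above:
# missing_modules intel/2021.3.0 gcc/8.3.0 intel/18.0.6.288 \
#     PrgEnv-intel/6.0.5 cray-python/3.7.3.2
```

Any name it prints needs to be replaced in the modulefiles of the tag being built.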

jieshunzhu commented 1 year ago

@natalie-perlin Thanks for that. I am looking at #1753. BTW, I was able to compile the UFS develop branch (the version of 20230515). Do you know what else I should modify in P8, other than module-setup.sh?

jieshunzhu commented 1 year ago

I replaced /modulefiles with the one from the develop branch. When compiling, I got an error about w3nco (/lustre/f2/scratch/ncep/JieShun.Zhu/FV3_RT/rt_12517/compile_001/err):
++++++
CMake Error at CMakeLists.txt:135 (find_package):
  By not providing "Findw3nco.cmake" in CMAKE_MODULE_PATH this project has asked CMake to find a package configuration file provided by "w3nco", but CMake did not find one.

  Could not find a package configuration file provided by "w3nco" (requested version 2.4.0) with any of the following names:

    w3ncoConfig.cmake
    w3nco-config.cmake

  Add the installation prefix of "w3nco" to CMAKE_PREFIX_PATH or set "w3nco_DIR" to a directory containing one of the above files. If "w3nco" provides a separate development package or SDK, be sure it has been installed.
+++++

In the CMakeLists.txt of P8, I found "find_package(w3nco 2.4.0 REQUIRED)". But in the same file of the develop branch, I found "find_package(w3emc 2.9.2 REQUIRED)". Are w3nco and w3emc interchangeable?
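Before changing CMakeLists.txt, it can help to see which package config files the loaded stack actually provides. The following is a rough sketch, not CMake's real search procedure: it just runs `find` under each prefix in the colon-separated `CMAKE_PREFIX_PATH` environment variable, and `find_pkg_config` is a hypothetical helper name.

```shell
# Print any CMake package-config files for package $1 found under the
# colon-separated prefixes listed in CMAKE_PREFIX_PATH.
find_pkg_config() {
    pkg=$1
    old_ifs=$IFS
    IFS=:
    set -f                          # no glob expansion while splitting
    set -- ${CMAKE_PREFIX_PATH:-}
    set +f
    IFS=$old_ifs
    for prefix in "$@"; do
        [ -d "$prefix" ] || continue
        find "$prefix" -type f \
            \( -name "${pkg}Config.cmake" -o -name "${pkg}-config.cmake" \) \
            2>/dev/null
    done
}

# Example, after loading the stack modules:
# find_pkg_config w3nco
# find_pkg_config w3emc
```

If the first call prints nothing while the second prints a path, the stack provides only w3emc, which would match the error above.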

natalie-perlin commented 1 year ago

@jkbk2004 - Please note that the stack /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2022.0.2/ was built on the Gaea C3 partition before the C3 upgrade but after the C4 upgrade, and the stack attempted to address the different compilers, mpich, and Cray programming env. modules on C3 and C4. After the C3 upgrade, the modules on C3 and C4 appear to be identical; some questions remain as to whether that "intermediate stack" can be fully used.

The main difference is that before the C3 upgrade, the UFS weather-model compile jobs in regression tests were built on Gaea C3 login node, which would then use the same compilers and Cray prog. environment as used during the hpc-stack build time. Only the RT test binaries were run on C4.

After the C3 upgrade, the RT weather-model compile jobs use different modules and prog. environment from the time the ./hpc-stack/intel-2022.0.2/ was built. (It may or may not create issues during the runtime.)

natalie-perlin commented 1 year ago

An updated stack has been prepared with the same compilers as ./hpc-stack/intel-2022.0.2/, now adapted for the upgraded C3 and C4: ./hpc-stack/intel-classic-2022.0.2/

Log: [RegressionTests_gaea.intel.log.txt](https://github.com/ufs-community/ufs-weather-model/files/11508541/RegressionTests_gaea.intel.log.txt)

The ufs_gaea.intel.lua module loads the stack as follows:

prepend_path("MODULEPATH","/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-classic-2022.0.2/modulefiles/stack")
load(pathJoin("hpc", os.getenv("hpc_ver") or "1.2.0"))

load(pathJoin("intel-classic", os.getenv("intel_classic_ver") or "2022.0.2"))
load(pathJoin("cray-mpich", os.getenv("cray_mpich_ver") or "7.7.20"))
load(pathJoin("hpc-intel-classic", os.getenv("hpc_intel_classic_ver") or "2022.0.2"))
load(pathJoin("hpc-cray-mpich", os.getenv("hpc_cray_mpich_ver") or "7.7.20"))
load(pathJoin("libpng", os.getenv("libpng_ver") or "1.6.37"))

A subset of regression tests (from the "# ATM tests" line until the end of the list in rt.conf) has finished successfully; the log is attached.
Model setup: /lustre/f2/dev/role.epic/sandbox/UFS-WM/ufs-wm-dev1/tests
RT run directory: /lustre/f2/scratch/role.epic/FV3_RT/rt_32501

Closing the issue https://github.com/ufs-community/ufs-weather-model/issues/1753 for now; it concerned the stack for the higher-version compilers.

natalie-perlin commented 1 year ago

@jkbk2004 @zach1221 All the regression tests have passed on Gaea with the stack /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-classic-2022.0.2/modulefiles/stack/

The log from the remaining set of regression tests (coupled) is attached: RegressionTests_gaea.intel.log2.txt

jkbk2004 commented 1 year ago

@natalie-perlin can you add yafyaml/v0.5.1 to /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2022.0.2/modulefiles/stack ? We used yafyaml with the p8 tag that @jieshunzhu is trying to use.

jieshunzhu commented 1 year ago

By using intel-2022.0.2 and making some other minor changes, I was able to compile the P8 tag. For the regression tests, however, the baseline was missing. The same thing happened for tag GFSv17.HR1.

I will try the intel-classic-2022.0.2 stack that @natalie-perlin pointed to.

jkbk2004 commented 1 year ago

I agree that baselines for those tags might have gone missing during the OS transition. But we can compare a few cases by creating new baselines with the tag. A compiler change is likely to cause some differences at the white-noise level. We can confirm manually.

natalie-perlin commented 1 year ago

@jkbk2004 yafyaml/v0.5.1 is already part of the stack, for both hpc-intel/2022.0.2 and hpc-intel-classic/2022.0.2. Please let me know what might be missing. Is a different module name needed (v0.5.1 as opposed to 0.5.1)?

That's what you find when loading the ./intel-2022.0.2/ stack:

module use /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2022.0.2/modulefiles/stack
module load hpc
module load hpc-intel/2022.0.2
module show yafyaml
module avail yafyaml
------ /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2022.0.2/modulefiles/compiler/intel/2022.0.2 ------
   yafyaml/v0.5.1 (L)

... and when loading the ./intel-classic-2022.0.2/ stack:

module unload yafyaml hpc-intel/2022.0.2 hpc
module use /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-classic-2022.0.2/modulefiles/stack
module load hpc
module load hpc-intel-classic
module load yafyaml
module avail yafyaml
------ /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-classic-2022.0.2/modulefiles/compiler/intel-classic/2022.0.2 ------
   yafyaml/v0.5.1 (L)

UPD: Links created, modules are loadable either way, yafyaml/v0.5.1 or yafyaml/0.5.1.

jieshunzhu commented 1 year ago

Even though it might not matter for me (because I have got P8 and HR1 compiled by using intel-2022.0.2), I want to give you an update about the HR1 compilation with intel-classic-2022.0.2. I got an error related to the ESMF library.
+++++++++++++++++++++++++++++++++++
CMake Warning at CMakeModules/Modules/FindESMF.cmake:114 (message):
  ESMFMKFILE does not exist
Call Stack (most recent call first):
  CMakeLists.txt:122 (find_package)

CMake Error at /ncrc/sw/gaea-cle7/uasw/ncrc/envs/20200417/opt/linux-sles15-x86_64/gcc-7.5.0/cmake-3.20.1-w7tkahac22qulhbcbi6io54u5dfr36zs/share/cmake-3.20/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find ESMF (missing: ESMF_LIBRARY_LOCATION ESMF_INTERFACE_LINK_LIBRARIES ESMF_F90COMPILEPATHS) (Required is at least version "8.3.0")
Call Stack (most recent call first):
  /ncrc/sw/gaea-cle7/uasw/ncrc/envs/20200417/opt/linux-sles15-x86_64/gcc-7.5.0/cmake-3.20.1-w7tkahac22qulhbcbi6io54u5dfr36zs/share/cmake-3.20/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
  CMakeModules/Modules/FindESMF.cmake:121 (find_package_handle_standard_args)
  CMakeLists.txt:122 (find_package)
+++++++++++++++++++++++++++++++++++++++

More details are in /lustre/f2/scratch/ncep/JieShun.Zhu/FV3_RT/rt_7661/compile_001
My source code directory is /lustre/f2/dev/ncep/JieShun.Zhu/HR1/ufs-weather-model

natalie-perlin commented 1 year ago

@jieshunzhu - looking into it now! However, regression testing has been passing in another round of the full test suite (see https://github.com/ufs-community/ufs-weather-model/pull/1758).

natalie-perlin commented 1 year ago

@jieshunzhu - it doesn't look like you have hpc-cray-mpich module loaded... The modulefile /lustre/f2/dev/ncep/JieShun.Zhu/HR1/ufs-weather-model/modulefiles/ufs_gaea.intel.lua does not have all the modifications needed to load hpc-cray-mpich, as suggested in https://github.com/ufs-community/ufs-weather-model/issues/1755#issuecomment-1553042387

It needs to have the following:

load(pathJoin("cray-mpich", os.getenv("cray_mpich_ver") or "7.7.20"))

load(pathJoin("hpc-cray-mpich", os.getenv("hpc_cray_mpich_ver") or "7.7.20"))
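A modulefile can be checked for these two lines before starting a compile. This is a minimal sketch that assumes the `load(pathJoin("<name>", ...))` form shown above; `missing_loads` is a made-up helper, and it does not distinguish commented-out Lua lines from active ones.

```shell
# Print each module name from the argument list that the given Lua
# modulefile does not load via load(pathJoin("<name>", ...)).
missing_loads() {
    modulefile=$1
    shift
    for name in "$@"; do
        # -F: fixed string, so the parenthesis and quote are matched literally
        grep -qF "pathJoin(\"$name\"" "$modulefile" || printf '%s\n' "$name"
    done
}

# Example:
# missing_loads modulefiles/ufs_gaea.intel.lua cray-mpich hpc-cray-mpich
```

Because the pattern includes the opening quote, `cray-mpich` does not accidentally match the `hpc-cray-mpich` line.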

jieshunzhu commented 1 year ago

@natalie-perlin Thanks for the quick response. Got your idea. Let me try it again. I will update soon.

jieshunzhu commented 1 year ago

@natalie-perlin now both compilation and regression tests are done, but regression tests miss baseline.

jkbk2004 commented 1 year ago

> @natalie-perlin now both compilation and regression tests are done, but regression tests miss baseline.

Do you want us to create a baseline with the code you are testing, so that we can continue to follow along as you move forward?

jieshunzhu commented 1 year ago

@jkbk2004 not necessary if you are busy on other projects. Thanks for the help. Really appreciate it.

jieshunzhu commented 1 year ago

@jkbk2004 @natalie-perlin Could you please reopen the issue?

It looks like someone removed the hpc-stack which I used for building P8 months ago: /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2022.0.2/modulefiles/stack

Now, I tried to rebuild it with /lustre/f2/dev/wpo/role.epic/contrib/spack-stack/spack-stack-1.4.1-c4/envs/ufs-pio-2.5.10/install/modulefiles/Core. With the spack-stack-1.4.1-c4, I can compile develop branch.

But when building P8, I got errors about "PIO". Could you please help me take a look at it?
+++++++++++++++++++++++++++
CMake Error at /lustre/f2/dev/wpo/role.epic/contrib/spack-stack/spack-stack-1.4.1-c4/envs/unified-env/install/intel/2022.0.2/cmake-3.23.1-gteb7td/share/cmake-3.23/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find PIO (missing: C Fortran) (Required is at least version "2.5.3")
Call Stack (most recent call first):
  /lustre/f2/dev/wpo/role.epic/contrib/spack-stack/spack-stack-1.4.1-c4/envs/unified-env/install/intel/2022.0.2/cmake-3.23.1-gteb7td/share/cmake-3.23/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
  CMakeModules/Modules/FindPIO.cmake:184 (find_package_handle_standard_args)
  CMakeLists.txt:130 (find_package)
++++++++++++++++++++++++++++++

jieshunzhu commented 1 year ago

Forgot to mention my directory with the error message: /lustre/f2/scratch/ncep/JieShun.Zhu/FV3_RT/rt_42113/compile_001

natalie-perlin commented 1 year ago

@jkbk2004 @jieshunzhu - Please note that the hpc-stack location you mentioned most likely belonged to Gaea (C4) before the upgrade, and it would not work on the current Gaea system. The following post-upgrade hpc-stacks are currently available on Gaea:
/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-classic-2022.0.2/modulefiles/stack
/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-classic-2022.0.2_ncdf492/modulefiles/stack
/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-classic-2023.1.0/modulefiles/stack

They differ in the versions of hdf5, netcdf, esmf, and mapl. The first location listed above uses hdf5/1.10.5, netcdf/4.7.4, esmf/8.3.0b09, and mapl/2.22.0. The other two locations have hdf5/1.14.0, netcdf/4.9.2, esmf/8.4.2, and mapl/2.35.2.

Spack-stack is available in /lustre/f2/dev/wpo/role.epic/contrib/spack-stack/spack-stack-1.4.1-c4/envs/unified-env/install/modulefiles/Core and also under /lustre/f2/dev/wpo/role.epic/contrib/spack-stack/spack-stack-1.4.1-c4/envs/ufs-pio-2.5.10/install/modulefiles/Core, the latter specifically for parallelio/2.5.10.

Which ones would you want to test?

jkbk2004 commented 1 year ago

In my opinion, it's worth CPC developers trying to install their own hpc-stack version on C4 with /lustre/f2/dev/role.epic/contrib/hpc-stack/src-intel-classic-2022.0.2, since the CPC experiment is specifically based on the p8 tag dated around Aug 2022.

natalie-perlin commented 1 year ago

@jkbk2004 - It would be great to understand the needs of the developers and to get feedback from @jieshunzhu; it is very likely that all the libraries needed are already installed and tested in these locations!

jieshunzhu commented 1 year ago

@natalie-perlin Hi, Natalie, thanks for the information. As I just told @jkbk2004 Jong, we are also trying to maintain another run of ours with ufsp5. So, the best way for me might be to install hpc-stack in my own directory with a specific version for p5. I will give it a try and share my updates here.

jieshunzhu commented 1 year ago

@jkbk2004 @natalie-perlin when I tried to install hpc-stack using the source code at /lustre/f2/dev/role.epic/contrib/hpc-stack/src-intel-classic-2022.0.2, I got some errors.

1) The first problem is about prod_util (/lustre/f2/dev/ncep/JieShun.Zhu/util/hpc-stack/src-intel-classic-2022.0.2/outZJS). It says it cannot create the directory /prod_util/2.0.14. So I created the directory intel-classic-2022.0.2/prod_util in advance, which fixed the problem.

2) The second problem is about grib_util (/lustre/f2/dev/ncep/JieShun.Zhu/util/hpc-stack/src-intel-classic-2022.0.2/outZJS2). I do not know how to fix this problem. Could you please give me some instructions? Thanks.

natalie-perlin commented 1 year ago

@jieshunzhu - Some of the Gaea modules were installed later, or their scripts were adapted at a later stage. The source, several modules added at a later time, and a few updated build scripts are in this location as well:

/lustre/f2/dev/role.epic/contrib/hpc-stack/src-intel-classic-2022.0.2x/lib

The most up-to-date versions of the libraries are currently in PRs to both the https://github.com/NOAA-EMC/hpc-stack and https://github.com/NOAA-EPIC/hpc-stack repositories. However, NOAA-EMC no longer officially supports hpc-stack, so the best way to find the most up-to-date scripts is to follow PR-14 in https://github.com/NOAA-EPIC/hpc-stack/pull/14

Adjusting permissions for the prod_util installation is taken care of in the updated build_nceplibs.sh.

As for grib_util, it requires ip/3.3.3 and would not compile with ip/4.0.0.

jieshunzhu commented 1 year ago

@natalie-perlin Thanks. Let me try that.

jieshunzhu commented 1 year ago

I am now able to install hpc-stack based on both src-intel-classic-2022.0.2 and src-intel-classic-2022.0.2x. I am trying to compile P8 with them.

jieshunzhu commented 1 year ago

@jkbk2004 @natalie-perlin I am able to compile and run P8 using the hpc-stack built with src-intel-classic-2022.0.2. Thanks for your help. BTW, can we keep the ticket open so that I can ask for help here? Really appreciate it.

jkbk2004 commented 1 year ago

@jieshunzhu Great progress! Either way is fine: keep the issue open, or close and reopen it.

jieshunzhu commented 1 year ago

@jkbk2004 @natalie-perlin BTW, if hpc-stack for C5 is ready, can you please let me know? I would like to test it.

jieshunzhu commented 1 year ago

Hi @jkbk2004 @natalie-perlin, I have a question about /lustre/f2/dev/role.epic/contrib/Lmod_init.sh. It looks like this file does not reset the module list to the default list I have when I initially log in. Instead, if I already have additional modules loaded (a new list, different from the default list), the new list is kept after running Lmod_init.sh. Is this how Lmod_init.sh is supposed to work? Thanks.

natalie-perlin commented 1 year ago

Hi @jieshunzhu @jkbk2004 - The Lmod initialization script /lustre/f2/dev/role.epic/contrib/Lmod_init.sh is meant to clean the environment and to load only default modules. What is your current shell (echo $0), login shell (echo $SHELL), and how would I reproduce the error you see?

natalie-perlin commented 1 year ago

@jieshunzhu - Have you tested the current stacks that exist on Gaea? Is there anything else you need besides what is currently available (older or newer modules)?

Two hpc-stacks are:
/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-classic-2022.0.2/modulefiles/stack (built with intel-classic/2022.0.2 and cray-mpich/7.7.20 modules)
/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-classic-2023.1.0/modulefiles/stack (built with intel-classic/2023.1.0 and cray-mpich/7.7.20 modules)

jieshunzhu commented 1 year ago

Hi @natalie-perlin, my current shell is bash, and my login shell is bash as well (/bin/bash). My question about Lmod_init.sh can be reproduced with the following steps:
1) After logging in, run "module list" to see the default modules.
2) Load an additional module that is not in the default list, e.g., nco (confirm with "module list").
3) Run "source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh", then "module list" (you will see nco is still there).
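The three steps can be scripted so that the leftover modules are printed directly instead of compared by eye. This is a sketch under a few assumptions: Lmod is the module system, `module -t list` gives a terse one-per-line listing, and Lmod prints it on stderr (hence the `2>&1`). The helper names and the snapshot file names are invented for the example.

```shell
# Save the currently loaded modules, one per line, sorted.
# Requires Lmod; the terse listing is assumed to go to stderr.
snapshot_modules() {
    module -t list 2>&1 | sort
}

# Print lines that appear in the "after" snapshot ($2) but not in the
# "before" snapshot ($1). Exits non-zero when there are none.
extra_modules() {
    grep -vxF -f "$1" "$2"
}

# Reproduction on Gaea (run by hand in a fresh login shell):
#   snapshot_modules > default.txt
#   module load nco
#   . /lustre/f2/dev/role.epic/contrib/Lmod_init.sh
#   snapshot_modules > after.txt
#   extra_modules default.txt after.txt    # nco listed => not reset
```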

jieshunzhu commented 1 year ago

@natalie-perlin I have not tested the one with intel-classic-2023.1.0, but the intel-classic-2022.0.2 one works well. Yes, I need to ask you and Jong for more help in setting up a proper stack for (1) UFSp5 and (2) NG-GODAS.

natalie-perlin commented 1 year ago

@jieshunzhu - for Gaea C5, both hpc-stack and spack-stack/1.4.1 are available with intel-classic-2023.1.0 compilers.

/lustre/f2/dev/role.epic/contrib/C5/hpc-stack/intel-classic-2023.1.0/modulefiles/stack and /lustre/f2/dev/wpo/role.epic/contrib/spack-stack/c5/spack-stack-1.4.1/envs/unified-env-intel-2023.1.0/install/modulefiles/Core

natalie-perlin commented 1 year ago

Some experience from integrating these stacks with the intel-classic-2023.1.0 compilers on Gaea-C5 (the PR for the spack-stack is https://github.com/ufs-community/ufs-srweather-app/pull/941):

1) Compile environment: unload module darshan-runtime/3.4.0 (when using hpc-stack), or darshan-runtime/3.4.0 and cray-pmi/6.1.10 (when using spack-stack). These combinations are what actually worked for the SRW testing.
2) Runtime environment: you may or may not need the darshan-runtime/3.4.0 module at runtime. When this module is not loaded, runtime errors may occur while shutting down MPI communications (MPI_Finalize). In that case, load darshan-runtime/3.4.0 in the runtime environment.
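The compile-time part of this note could be captured in a small helper so that the build script does not hard-code the list twice. The function name `modules_to_unload` and the stack labels `hpc-stack` / `spack-stack` are illustrative choices; the module versions are the ones quoted above.

```shell
# Print the modules to unload before compiling on Gaea C5,
# depending on which software stack is in use.
modules_to_unload() {
    case $1 in
        hpc-stack)   echo "darshan-runtime/3.4.0" ;;
        spack-stack) echo "darshan-runtime/3.4.0 cray-pmi/6.1.10" ;;
        *)           echo "unknown stack: $1" >&2; return 1 ;;
    esac
}

# Example use in a build script (requires the `module` command):
#   for m in $(modules_to_unload spack-stack); do
#       module unload "$m"
#   done
```

For the runtime side, the advice above amounts to loading darshan-runtime/3.4.0 again only if MPI_Finalize errors show up.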

jieshunzhu commented 1 year ago

Thanks @natalie-perlin. I will give them a try and share my questions/updates here.

jieshunzhu commented 1 year ago

Hi @natalie-perlin, I tried to use the C5 spack-stack you pointed to above to compile UFSp8, but it failed. Here are my modulefiles with the updated ufs_gaea.intel.lua and ufs_common.lua (/lustre/f2/dev/ncep/JieShun.Zhu/ufsp8/p8_c5spack1.4.1/ufs-weather-model/modulefiles), and the updated module-setup.sh (../tests).

The err file is /lustre/f2/scratch/ncep/JieShun.Zhu/FV3_RT/rt_255109/compile_001/.

My default modules are loaded as follows.
1) craype-x86-rome
2) craype-network-ofi
3) perftools-base/23.03.0
4) xpmem/2.6.2-2.5_2.27__gd067c3f.shasta
5) intel-classic/2022.2.1
6) craype/2.7.20
7) cray-dsmml/0.2.2
8) cray-mpich/8.1.25
9) cray-libsci/23.02.1.1
10) PrgEnv-intel/8.3.3
11) cray-pmi/6.1.10
12) darshan-runtime/3.4.0
13) CmrsEnv/default
14) TimeZoneEDT/default
15) DefApps/default

Can you please take a look at my problem? Thanks a lot.

natalie-perlin commented 1 year ago

@jieshunzhu - from the list of modules activated, it looks like this was done from Gaea C4, not C5. Could you please make sure to log in to gaea-c5? The log file /lustre/f2/scratch/ncep/JieShun.Zhu/FV3_RT/rt_255109/compile_001/err has the following:
err:37:+ MACHINE_ID=gaea.intel

jieshunzhu commented 1 year ago

@natalie-perlin, thanks for your quick response. No, the compilation was done from c5. I am curious: what is wrong with "MACHINE_ID=gaea.intel"? Should it not be that value on c5?

natalie-perlin commented 1 year ago

@jieshunzhu - Gaea C5 is a different platform from C4, so it has to be specified as gaea-c5
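To see which platform a compile job actually picked up, the value can be pulled out of the compile log. This is a small sketch that assumes the log contains shell trace lines of the form `+ MACHINE_ID=...`, as in the excerpt quoted earlier in this thread; `machine_id_from_log` is a hypothetical helper name.

```shell
# Print the last MACHINE_ID value recorded in a compile log.
machine_id_from_log() {
    sed -n 's/.*MACHINE_ID=\([^[:space:]]*\).*/\1/p' "$1" | tail -n 1
}

# Example:
# machine_id_from_log /path/to/compile_001/err
```

If it prints gaea.intel for a build started on C5, the detection scripts of the tag are still treating the host as the older Gaea platform.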

jieshunzhu commented 1 year ago

@natalie-perlin when using the C5 hpc-stack and spack-stack you pointed to, compilation failed. But when using /lustre/f2/dev/wpo/role.epic/contrib/spack-stack/c5/spack-stack-dev-20230717/envs/ufs-pio-2.5.10/install/modulefiles/Core, I was able to compile UFSp8. However, I had problems running the executables, even with the darshan-runtime/3.4.0 module loaded at runtime. Here are my two running directories with errors:

Can you please give me some suggestions about it? Thanks in advance.

jieshunzhu commented 1 year ago

@jkbk2004 @natalie-perlin I dug further into my problem. It looks like the problem occurred during netcdf writing. The following error messages are from PET144.ESMF_LogFile. Is this problem related to PIO? Have you seen this problem before?

========================================
20231018 133633.031 INFO PET144 In ioCompRun() before writing to: atmf000.tile1.nc
20231018 133633.058 INFO PET144 In ioCompRun() after writing vectical and time dimensions.
20231018 133633.207 ERROR PET144 ESMCI_PIO_Handler.C:1184 ESMCI::PIO_Handler::arrayWriteOn The NetCDF Library returned an error - Attempting to end definition of variable: clwmr, (PIO/NetCDF error = NetCDF: Not a valid data type or _FillValue type mismatch)
20231018 133633.207 ERROR PET144 ESMCI_IO_Handler.C:455 ESMCI::IO_Handler::arrayWrite() The NetCDF Library returned an error - Internal subroutine call returned Error
20231018 133633.207 ERROR PET144 ESMCI_IO.C:723 ESMCI::IO::write() The NetCDF Library returned an error - Internal subroutine call returned Error
20231018 133633.213 ERROR PET144 ESMCI_IO.C:494 ESMCI::IO::write() The NetCDF Library returned an error - Internal subroutine call returned Error
20231018 133633.213 ERROR PET144 ESMCI_IO_F.C:171 c_esmc_iowrite() Unable to write to file - Internal subroutine call returned Error
20231018 133633.213 ERROR PET144 ESMF_IO.F90:523 ESMF_IOAddArray() Unable to write to file - Internal subroutine call returned Error
20231018 133633.213 ERROR PET144 ESMF_FieldBundle.F90:17682 ESMF_FieldBundleWrite() Unable to write to file - Internal subroutine call returned Error
20231018 133633.213 ERROR PET144 module_wrt_grid_comp.F90:3442 Unable to write to file - Passing error in return code
20231018 133633.213 ERROR PET144 module_wrt_grid_comp.F90:3005 Unable to write to file - Passing error in return code
20231018 133633.213 ERROR PET144 module_wrt_grid_comp.F90:2162 Unable to write to file - Passing error in return code
20231018 133633.213 ERROR PET144 fv3_cap.F90:1085 Unable to write to file - Passing error in return code
20231018 133633.213 ERROR PET144 fv3_cap.F90:942 Unable to write to file - Passing error in return code
20231018 133633.213 ERROR PET144 ATM:src/addon/NUOPC/src/NUOPC_ModelBase.F90:2218 Unable to write to file - Passing error in return code
20231018 133633.213 ERROR PET144 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3642 Unable to write to file - Phase 'RunPhase1' Run for modelComp 2 did not return ESMF_SUCCESS
20231018 133633.213 ERROR PET144 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3880 Unable to write to file - Passing error in return code
20231018 133633.213 ERROR PET144 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3557 Unable to write to file - Passing error in return code
20231018 133633.213 ERROR PET144 UFS.F90:403 Unable to write to file - Aborting UFS
20231018 133633.213 INFO PET144 Finalizing ESMF
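In a log like this, most ERROR lines are just the same failure being passed up the call chain; the first one usually names the underlying cause (here the `_FillValue` type mismatch for `clwmr`). A minimal sketch for pulling that first entry out of a PET log, assuming the severity is the third whitespace-separated field as in the excerpt above (`first_esmf_error` is a made-up helper name):

```shell
# Print the first ERROR entry in an ESMF PET log file.
# Assumes the layout "<date> <time> <severity> <PET> <message...>".
first_esmf_error() {
    awk '$3 == "ERROR" { print; exit }' "$1"
}

# Example, run from the model run directory:
#   for f in PET*.ESMF_LogFile; do
#       first_esmf_error "$f"
#   done
```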

jieshunzhu commented 1 year ago

@natalie-perlin @jkbk2004 To solve my problem of running P8 on C5, I tried to build hpc-stack on C5 using the old library versions that work on C4. For the build, I used intel-classic/2023.1.0 and cray-mpich/8.1.25. But I got errors in building esmf/8.3.0b09 (others look ok). Here is my build.log file: /lustre/f2/dev/ncep/JieShun.Zhu/util/hpc-stack/c5/src-intel-classic-2023.1.0/outZJS

Could you please give me some suggestions about it?

jieshunzhu commented 1 year ago

@natalie-perlin @jkbk2004 Just to let you know that my esmf/8.3.0b09 problem is fixed. I am able to build hpc-stack using the old library versions. It works for P8 now. Next, I will work on the libraries for P5. I may need to bother you about it. Thanks in advance.

natalie-perlin commented 1 year ago

@jieshunzhu - Please note that it is not a sound approach to mix software installed on C4 and C5! They rely on different modules and dependencies. This may sort of "work", but there is no guarantee. Is there a chance to set up a quick GoogleMeet meeting (via @noaa calendar)? It may noticeably speed things up, whether that means building the stack you need or using an existing one.

jieshunzhu commented 1 year ago

configure files:
/lustre/f2/dev/ncep/JieShun.Zhu/ufsp5/ufs-s2s-model_zbotC5/NEMS/src/conf/module-setup.sh.inc
/lustre/f2/dev/ncep/JieShun.Zhu/ufsp5/ufs-s2s-model_zbotC5/conf/configure.fv3.gaea.intel

(my old work with p5 is all in csh, but any shell is fine with me as long as it works)

Model-p5 compile directory: /lustre/f2/dev/ncep/JieShun.Zhu/ufsp5/ufs-s2s-model_zbotC5/tests

Running directory (with linkage to proper input files/data): EXP25e1 (rt.sh -l rt_3moRST_noww3_cmeps.conf)

natalie-perlin commented 1 year ago

@jieshunzhu - getting to this issue now, and planning to build the stack with the modules for the P5 experiment