ufs-community / ufs-weather-model

UFS Weather Model

ufs_gaea.intel.lua not loading cmake/3.20.1 #1789

Closed JustinPerket closed 1 year ago

JustinPerket commented 1 year ago

Description

To Reproduce:

On Gaea, with a fresh checkout, I tried running, for example, the datm_cdeps_lnd_gswp3 test:

./rt.sh  -k -n datm_cdeps_lnd_gswp3 intel

Resulting compile err file reads:

+ echo -n ' 1686059873,'
++ date
+ echo 'Compile started:  ' Tue 06 Jun 2023 09:57:53 AM EDT
+ /lustre/f2/dev/gfdl/Justin.Perket/UFSmodels/ufs-weather-model/tests/compile.sh gaea -DAPP=LND 030 intel
+ SECONDS=0
++ uname -s
+ [[ Linux == Darwin ]]
++++ readlink -f -n /lustre/f2/dev/gfdl/Justin.Perket/UFSmodels/ufs-weather-model/tests/compile.sh
+++ dirname /lustre/f2/dev/gfdl/Justin.Perket/UFSmodels/ufs-weather-model/tests/compile.sh
++ cd /lustre/f2/dev/gfdl/Justin.Perket/UFSmodels/ufs-weather-model/tests
++ pwd -P
+ readonly MYDIR=/lustre/f2/dev/gfdl/Justin.Perket/UFSmodels/ufs-weather-model/tests
+ MYDIR=/lustre/f2/dev/gfdl/Justin.Perket/UFSmodels/ufs-weather-model/tests
+ readonly ARGC=4
+ ARGC=4
+ [[ 4 -lt 2 ]]
+ MACHINE_ID=gaea
+ MAKE_OPT=-DAPP=LND
+ COMPILE_NR=_030
+ RT_COMPILER=intel
+ clean_before=YES
+ clean_after=YES
+ BUILD_NAME=fv3_030
+ PATHTR=/lustre/f2/dev/gfdl/Justin.Perket/UFSmodels/ufs-weather-model
++ pwd
+ BUILD_DIR=/lustre/f2/scratch/Justin.Perket/FV3_RT/rt_11551/compile_030/build_fv3_030
+ [[ gaea == cheyenne ]]
+ BUILD_JOBS=8
+ hostname
+ set +x
Lmod has detected the following error: The load_any function failed because it
could not find any of the following modules : cmake/3.20.1 cmake

Please check the spelling or version number. Also try "module spider ..."

Also make sure that all modulefiles written in TCL start with the string
#%Module

While processing the following module(s):
    Module fullname  Module Filename
    ---------------  ---------------
    ufs_gaea.intel   /lustre/f2/dev/gfdl/Justin.Perket/UFSmodels/ufs-weather-model/modulefiles/ufs_gaea.intel.lua
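
When triaging an err file like the one above, it can help to pull out just the module names that Lmod reports as missing, so each can be checked with `module spider` as the message suggests. The following is a minimal sketch, not part of ufs-weather-model: `missing_modules` is a hypothetical helper name, and it assumes the error text has the layout shown above (the list follows "following modules :" and ends at the next blank line).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ufs-weather-model): read an Lmod error
# message on stdin and print each module name it could not find, one per line.
missing_modules() {
  awk '
    # The list starts right after the "following modules :" marker.
    /following modules[ ]*:/ {
      sub(/.*following modules[ ]*:/, "")
      collecting = 1
    }
    # A blank line ends the list.
    collecting && /^[ \t]*$/ { exit }
    # Print every token that looks like a module name.
    collecting {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^[A-Za-z0-9]/) print $i
    }
  '
}
```

On a system with Lmod, the output could then be fed to `module spider <name>` one entry at a time to see which versions, if any, are visible in the current environment.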

Additional context

Possibly related to #1772

jieshunzhu commented 1 year ago

I got the same problem. Are there any modifications in hpc-stack? @jkbk2004

liuxiao37k commented 1 year ago

@JustinPerket @jieshunzhu It seems the default and loaded cmake version is now 3.23.1 for stack-intel/2022.0.2. I was able to compile the latest develop branch -DAPP=S2SWA by force-loading the ecbuild supported cmake version in ./modulefiles/ufs_gaea.intel.lua.

--load_any(pathJoin("cmake", os.getenv("cmake_ver") or "3.20.1"),"cmake")
load("cmake/3.20.1")
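
For readers unfamiliar with `load_any`: it loads the first of its arguments that can be found and raises an error only if none can. The commented-out line therefore asks for `cmake/<cmake_ver>` (or `cmake/3.20.1` when `cmake_ver` is unset) and falls back to any `cmake`. The bash sketch below only imitates that selection rule for illustration. `first_available` is a hypothetical name, the list of available modules is passed in by hand rather than queried from Lmod, and treating a bare name as "any version" is a simplification of how Lmod picks a default.

```shell
#!/usr/bin/env bash
# Illustrative only: imitate the "first candidate found wins" rule of load_any.
# $1   : whitespace-separated list of available modules (supplied by hand,
#        a stand-in for what the module system can see)
# $2.. : candidates in priority order; a bare name such as "cmake" is treated
#        as matching any version of that module (a simplification)
first_available() {
  local available="$1"; shift
  local candidate entry
  for candidate in "$@"; do
    for entry in $available; do
      if [[ "$entry" == "$candidate" || "$entry" == "$candidate"/* ]]; then
        printf '%s\n' "$entry"
        return 0
      fi
    done
  done
  return 1  # no candidate found: this is where load_any reports its error
}
```

The candidate list built by the modulefile corresponds to `"cmake/${cmake_ver-3.20.1}" "cmake"` in shell terms. Given an available list containing only `cmake/3.23.1`, the sketch selects `cmake/3.23.1` through the bare-name fallback.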

jieshunzhu commented 1 year ago

@liuxiao37k Thanks for sharing your experience. After adopting your cmake change, I now get an error about cray-mpich/7.7.11:

Lmod has detected the following error: The load_any function failed because it could not find any of the following modules : cray-mpich/7.7.11 cray-mpich

BTW, I am testing an executable generated a week ago, which ran fine last week. There must have been some changes to the stack since then.

@jkbk2004 can you give me some advice on it?

liuxiao37k commented 1 year ago

@jieshunzhu I encountered a crash in another attempt just minutes ago (after a successful build an hour ago). Clearly, there are changes in action in the background...

jkbk2004 commented 1 year ago

Sorry! Stuck in a meeting today. Let me test with the develop branch. It was running OK for me yesterday. @natalie-perlin FYI

zach1221 commented 1 year ago

I can confirm it happens with develop as well. @jkbk2004 I'm trying to see if some change to the gaea modulefile will allow a workaround.

natalie-perlin commented 1 year ago

@jieshunzhu - no cray-mpich/7.7.11 is available on C3/C4 anymore.

https://github.com/ufs-community/ufs-weather-model/issues/1772 shows usage of the hpc-stack that was updated after the upgrades. That stack lives in a separate directory, so it does not interfere with the stack listed in the current develop branch modulefile ufs_gaea.intel.lua.

I'm looking into the issue.

natalie-perlin commented 1 year ago

@JustinPerket @jieshunzhu @zach1221 @jkbk2004 - fixed the issue. It was related to the Lmod initialization, which had been changed in an attempt to adapt it to C5; I reverted it to yesterday's version. The model compiled and ran successfully; log file attached.

Please note that with the current modulefile, a meta-module hpc-cray-mpich/7.7.11 is loaded, which then loads whichever cray-mpich is available, i.e., cray-mpich/7.7.20 on the C3 and C4 Gaea partitions.

@liuxiao37k - nothing has been changed today, except for reverting back to a previous version of Lmod initialization, as stated above. Please let us know if you still experience any issues!

RegressionTests_gaea.log

jkbk2004 commented 1 year ago

Thanks @natalie-perlin looks like it is running ok again.

liuxiao37k commented 1 year ago

Thanks @natalie-perlin, I am now able to compile the develop branch successfully and have run a few regression tests.

jieshunzhu commented 1 year ago

@natalie-perlin Thanks for the help. I can run my tests now.