ufs-community / ufs-weather-model

UFS Weather Model

ufs_gaea.intel.lua not loading cmake/3.20.1 #1789

Closed JustinPerket closed 1 year ago

JustinPerket commented 1 year ago


To Reproduce:

On Gaea, with a fresh checkout, I tried running a regression test, for example datm_cdeps_lnd_gswp3:

./rt.sh  -k -n datm_cdeps_lnd_gswp3 intel

The resulting compile err file reads:

+ echo -n ' 1686059873,'
++ date
+ echo 'Compile started:  ' Tue 06 Jun 2023 09:57:53 AM EDT
+ /lustre/f2/dev/gfdl/Justin.Perket/UFSmodels/ufs-weather-model/tests/compile.sh gaea -DAPP=LND 030 intel
++ uname -s
+ [[ Linux == Darwin ]]
++++ readlink -f -n /lustre/f2/dev/gfdl/Justin.Perket/UFSmodels/ufs-weather-model/tests/compile.sh
+++ dirname /lustre/f2/dev/gfdl/Justin.Perket/UFSmodels/ufs-weather-model/tests/compile.sh
++ cd /lustre/f2/dev/gfdl/Justin.Perket/UFSmodels/ufs-weather-model/tests
++ pwd -P
+ readonly MYDIR=/lustre/f2/dev/gfdl/Justin.Perket/UFSmodels/ufs-weather-model/tests
+ MYDIR=/lustre/f2/dev/gfdl/Justin.Perket/UFSmodels/ufs-weather-model/tests
+ readonly ARGC=4
+ ARGC=4
+ [[ 4 -lt 2 ]]
+ clean_before=YES
+ clean_after=YES
+ BUILD_NAME=fv3_030
+ PATHTR=/lustre/f2/dev/gfdl/Justin.Perket/UFSmodels/ufs-weather-model
++ pwd
+ BUILD_DIR=/lustre/f2/scratch/Justin.Perket/FV3_RT/rt_11551/compile_030/build_fv3_030
+ [[ gaea == cheyenne ]]
+ hostname
+ set +x
Lmod has detected the following error: The load_any function failed because it
could not find any of the following modules : cmake/3.20.1 cmake

Please check the spelling or version number. Also try "module spider ..."

Also make sure that all modulefiles written in TCL start with the string #%Module

While processing the following module(s):
    Module fullname  Module Filename
    ---------------  ---------------
    ufs_gaea.intel   /lustre/f2/dev/gfdl/Justin.Perket/UFSmodels/ufs-weather-model/modulefiles/ufs_gaea.intel.lua
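
When triaging a message like the one above, it can help to pull the candidate module names out of the error text and ask Lmod about each one. A minimal sketch, assuming bash with standard `sed` and `tr`; the helper name `failed_modules` is hypothetical and is not part of the ufs-weather-model scripts:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ufs-weather-model or Lmod):
# print the module names listed in an Lmod
# "could not find any of the following modules" error, one per line.
# Reads the error text from stdin.
failed_modules() {
  # Keep only the text after "following modules :" on the matching line,
  # then split the space-separated names onto separate lines.
  sed -n 's/.*following modules *: *//p' | tr -s ' ' '\n' | sed '/^$/d'
}
```

Each name it prints can then be checked by hand with the standard Lmod commands `module avail <name>` or `module spider <name>`, which show whether Lmod can currently see that module at all.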

Additional context

Possibly related to #1772

jieshunzhu commented 1 year ago

I got the same problem. Have there been any changes to hpc-stack? @jkbk2004

liuxiao37k commented 1 year ago

@JustinPerket @jieshunzhu It seems that the default (and currently loaded) cmake version for stack-intel/2022.0.2 is now 3.23.1. I was able to compile the latest develop branch with -DAPP=S2SWA by force-loading the ecbuild-supported cmake version in ./modulefiles/ufs_gaea.intel.lua.

--load_any(pathJoin("cmake", os.getenv("cmake_ver") or "3.20.1"),"cmake")

jieshunzhu commented 1 year ago

@liuxiao37k Thanks for sharing your experience. After adopting your cmake change, I now get the same kind of error for cray-mpich/7.7.11:

++++++++++++++++++
Lmod has detected the following error: The load_any function failed because it
could not find any of the following modules : cray-mpich/7.7.11 cray-mpich
++++++++++++++++++

BTW, I am testing an executable that was generated a week ago and ran fine last week. Something in the stack must have changed since then.

@jkbk2004 can you give me some advice on it?

liuxiao37k commented 1 year ago

@jieshunzhu I encountered a crash in another attempt just minutes ago (after a successful build an hour ago). Clearly, there are changes in action in the background...

jkbk2004 commented 1 year ago

Sorry! I'm stuck in meetings today. Let me test with the develop branch. It was running OK for me yesterday. @natalie-perlin FYI

zach1221 commented 1 year ago

I can confirm it happens with develop as well. @jkbk2004 I'm trying to see if some change to the gaea modulefile will allow a workaround.

natalie-perlin commented 1 year ago

@jieshunzhu - cray-mpich/7.7.11 is no longer available on C3/C4.

https://github.com/ufs-community/ufs-weather-model/issues/1772 shows how to use the hpc-stack that was updated after the upgrades. That stack is in a separate directory, so it does not interfere with the stack listed in the current develop branch modulefile, ufs_gaea.intel.lua.

I'm looking into the issue.

natalie-perlin commented 1 year ago

@JustinPerket @jieshunzhu @zach1221 @jkbk2004 - fixed the issue. It was related to the Lmod initialization, which had been changed in an attempt to adapt it to C5; I reverted it to yesterday's version. The model compiled and ran successfully, log file attached.

Please note that with the current modulefile, a meta-module hpc-cray-mpich/7.7.11 is loaded, which then loads whichever cray-mpich is available, i.e., cray-mpich/7.7.20 on the C3 and C4 Gaea partitions.
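
This fallback is consistent with how Lmod documents `load_any`: it tries the listed modules in order, loads the first one it finds, and reports an error only when none is found. A rough shell emulation of that selection order, using the candidate pair from the error message quoted earlier in this thread. The function `pick_first_available` and the list of available modules are hypothetical, and this is a sketch of the idea, not Lmod's implementation:

```shell
#!/usr/bin/env bash
# Hypothetical emulation of load_any's selection order; not Lmod code.
# $1: whitespace-separated list of available "name/version" modules
#     (an assumed, illustrative list, not queried from a real system).
# Remaining args: candidates, tried in order. A candidate with a version
# must match exactly; a bare name matches any version of that name.
pick_first_available() {
  local available="$1" cand mod
  shift
  for cand in "$@"; do
    for mod in $available; do
      if [ "$mod" = "$cand" ] || [ "${mod%%/*}" = "$cand" ]; then
        printf '%s\n' "$mod"
        return 0
      fi
    done
  done
  # Nothing matched: mimic the spirit of the Lmod error and fail.
  echo "could not find any of the following modules: $*" >&2
  return 1
}
```

For example, `pick_first_available 'cray-mpich/7.7.20' cray-mpich/7.7.11 cray-mpich` skips the missing 7.7.11 and selects `cray-mpich/7.7.20` through the bare-name candidate. With an empty list of available modules, every candidate fails, which mirrors the situation in this issue where even the bare `cmake` and `cray-mpich` names could not be resolved. Real Lmod resolves a bare name to the default version rather than the first entry in a list, so the emulation only illustrates the ordering.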

@liuxiao37k - nothing has been changed today, except for reverting back to a previous version of Lmod initialization, as stated above. Please let us know if you still experience any issues!


jkbk2004 commented 1 year ago

Thanks @natalie-perlin looks like it is running ok again.

liuxiao37k commented 1 year ago

Thanks @natalie-perlin, I was able to successfully compile the develop branch and run a few regression tests.

jieshunzhu commented 1 year ago

@natalie-perlin Thanks for the help. I can run my tests now.