ufs-community / ufs-weather-model

UFS Weather Model

APP=S2SW does not build on Gaea #511

Closed DeniseWorthen closed 3 years ago

DeniseWorthen commented 3 years ago

Description

The APP=S2SW build fails on Gaea.

To Reproduce:

Check out the develop branch of ufs-weather-model. Edit rt.conf to contain a single S2SW compile and save it as rt.test:

COMPILE | APP=S2SW SUITES=FV3_GFS_2017_coupled,FV3_GFS_2017_satmedmf_coupled,FV3_GFS_v15p2_coupled,FV3_GFS_v16_coupled            | - wcoss_cray  jet.intel       | fv3 |
RUN     | cpld_control_wave                                                                                                       | - wcoss_cray  jet.intel       | fv3 |

Be sure to remove gaea.intel from the next-to-last column; otherwise rt.sh will just exit.

Run the test:

./rt.sh -l rt.test >output 2>&1 &

Look in the RT directory that the job is compiling in, at compile_001/build_fv3_001/ww3_make.log:

gmake[3]: Entering directory '/lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model/esmf'
gmake[3]: warning: jobserver unavailable: using -j1.  Add '+' to parent make rule.

                *****************************
              ***   WAVEWATCH III setup     ***
                *****************************

[INFO] local env file wwatch3.env found in /lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model/bin/wwatch3.env
   Setup file /lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model/bin/wwatch3.env found
      Printer (listings)          :
      auxiliary FORTRAN compiler  : gfortran
      auxiliary C compiler        : gcc
      Source directory            : /lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model
      Scratch directory           : /lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model/tmp
      Save source code            : yes
      Save listings               : yes

   Setup makefile for auxiliary programs

   Compile auxiliary programs
make[4]: Entering directory '/lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model/aux'
gfortran -o /lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model/bin/w3adc w3adc.f
make[4]: gfortran: Command not found
make[4]: *** [makefile:10: /lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model/bin/w3adc] Error 127
make[4]: Leaving directory '/lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model/aux'

ERROR: Error occured during compile of auxiliary programs

gmake[3]: *** [Makefile:152: setup] Error 1
gmake[3]: Leaving directory '/lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model/esmf'
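
The log above shows the WW3 setup calling gfortran for its auxiliary programs and make stopping with "Command not found". One way to catch this before a long build is to check that the auxiliary compilers are on PATH. The following is a minimal sketch, assuming the compiler names reported in the log (gfortran, gcc); the function name check_tools is hypothetical and not part of the ufs-weather-model build scripts.

```shell
# Hypothetical helper: report which of the given tools are missing from PATH.
# Returns 0 if every tool is found, 1 if at least one is missing.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool -> $(command -v "$tool")"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# gfortran and gcc are the auxiliary compilers named in ww3_make.log.
check_tools gfortran gcc || echo "load a module that provides the missing compilers before building"
```

Run in the same environment the build job uses (after the modules are loaded), this would presumably have reported gfortran as missing on Gaea.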

JessicaMeixner-NOAA commented 3 years ago

So, when I submit a build job on Gaea I'm getting this error:

+ set +x
Lmod has detected the following error: The following module(s) are unknown:
"eproxy/2.0.24-7.0.2.1_2.20__g8e04b33.ari"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore-cache load "eproxy/2.0.24-7.0.2.1_2.20__g8e04b33.ari"

Also make sure that all modulefiles written in TCL start with the string
#%Module

This appears to come from this line in compile.sh (https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/compile.sh#L66): source /lustre/f2/pdata/esrl/gsd/contrib/lua-5.1.4.9/init/init_lmod.sh

Anyone else ever gotten this error? I can't get to the error @DeniseWorthen mentioned because of this right now. I've tried from a fresh clone to make sure I didn't do anything and I don't have much in my .cshrc file.

climbfuji commented 3 years ago

The last person reporting this error was using tcsh; I recommended switching to bash and never heard back. Either it worked or that person gave up.

JessicaMeixner-NOAA commented 3 years ago

@climbfuji I am using tcsh, so that's at least consistent.

climbfuji commented 3 years ago

Let me see if I can get this to work (remind me tomorrow, please) ... I use bash and the (t)csh version is not as well tested, obviously.

JessicaMeixner-NOAA commented 3 years ago

@climbfuji I made a separate issue, https://github.com/ufs-community/ufs-weather-model/issues/536, so this issue can get back to being about S2SW not building.

@DeniseWorthen I'll try from the command line again, but I might need you to help test until I can get the other sorted out.

climbfuji commented 3 years ago

> Let me see if I can get this to work (remind me tomorrow, please) ... I use bash and the (t)csh version is not as well tested, obviously.

The trouble is that I cannot reproduce the problem, because the following works:

Dom.Heinzeller@gaea14:~> export | grep SHELL
declare -x SHELL="/bin/bash"
Dom.Heinzeller@gaea14:~> tcsh
Directory: /ncrc/home2/Dom.Heinzeller
home2/Dom.Heinzeller> env | grep SHELL
SHELL=/bin/bash
home2/Dom.Heinzeller> source /lustre/f2/pdata/esrl/gsd/contrib/lua-5.1.4.9/init/init_lmod.sh
Illegal variable name.
home2/Dom.Heinzeller> source /lustre/f2/pdata/esrl/gsd/contrib/lua-5.1.4.9/init/init_lmod.csh
Activating lua module environment
Reloading modules ... (sit back and relax)
home2/Dom.Heinzeller>

Note that the environment variable SHELL still says bash, even though I am in a tcsh shell. Somehow it remembers aspects of my original bash login shell.
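
SHELL is set at login and inherited unchanged by child processes, so it does not reveal which shell is actually interpreting a script. A minimal sketch of one way to check, written for bash/sh, is to ask ps for the command name of the current shell process; the /proc fallback is an assumption that only holds on Linux.

```shell
# SHELL names the login shell and is inherited as-is by child shells,
# so it does not identify the shell that is running this code.
echo "SHELL variable: ${SHELL:-unset}"

# Ask ps for the command name of the current shell process ($$ is its PID).
# Fall back to /proc (Linux only) if ps is unavailable.
running_shell=$(ps -p "$$" -o comm= 2>/dev/null || cat "/proc/$$/comm")
echo "running shell:  ${running_shell}"
```

At an interactive tcsh prompt, typing ps -p $$ -o comm= directly should give the same information, since the $(...) syntax used above is not available there.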

JessicaMeixner-NOAA commented 3 years ago

@climbfuji I also can load on the login node:

> source /lustre/f2/pdata/esrl/gsd/contrib/lua-5.1.4.9/init/init_lmod.csh
Activating lua module environment
Reloading modules ... (sit back and relax)

But when the job is submitted, the modules do not load, which makes this hard to pin down. I'd be happy to do what I can to help test/reproduce the issue. I made another issue for this (#536); should I close it?

DeniseWorthen commented 3 years ago

I was able to build @JessicaMeixner-NOAA's gaea_ww3 branch using:

source /lustre/f2/pdata/esrl/gsd/contrib/lua-5.1.4.9/init/init_lmod.sh
module use modulefiles/
module load ufs_gaea.intel
CMAKE_FLAGS="-DAPP=S2SW" CCPP_SUITES="FV3_GFS_2017_coupled,FV3_GFS_2017_satmedmf_coupled,FV3_GFS_v15p2_coupled" BUILD_VERBOSE=1 BUILD_JOBS=1 ./build.sh > output 2>&1 &

climbfuji commented 3 years ago

@JessicaMeixner-NOAA I told init_lmod.sh (and init_lmod.csh) to ignore errors while loading modules. With that I could switch to tcsh and submit a job card from the ufs-weather-model, which uses something like this:

#!/bin/bash -l
#SBATCH -e err.bash
#SBATCH -o out.bash
#SBATCH --job-name="init_lmod_bash_test"
#SBATCH --account=esrl_bmcs
#SBATCH --qos=normal
#SBATCH --clusters=c4
#SBATCH --ntasks=1
#SBATCH --time=5

set -eux

source ./module-setup.sh
source /lustre/f2/pdata/esrl/gsd/contrib/lua-5.1.4.9/init/init_lmod.sh
module use $( pwd -P )
module load modules.fv3
module list

echo "Model started:  " `date`

sync && sleep 1
# here would be the call to srun ... fv3.exe

echo "Model ended:    " `date`

Can you check if this works for you? It's not an ideal solution, because if something changes with the module environment that breaks the init_lmod scripts we'll find out only when we compile the model / run the tests, but it's better than nothing (if it works). Thanks!
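
Since the init scripts now ignore module load errors, one way to surface a broken setup earlier is a fail-fast check right after the init script is sourced. The following is only a sketch: the function name require_module_command is hypothetical, and it only detects the case where the module command was never defined, not the case where an individual module is unknown.

```shell
# Hypothetical guard: succeed only if a `module` command (function, alias,
# or executable) is defined in the current shell.
require_module_command() {
  if ! type module >/dev/null 2>&1; then
    echo "ERROR: 'module' is not defined; the Lmod init script did not take effect" >&2
    return 1
  fi
}

# Intended use in a job card, right after the init script is sourced:
#   source /lustre/f2/pdata/esrl/gsd/contrib/lua-5.1.4.9/init/init_lmod.sh
#   require_module_command || exit 1
```

Catching unknown modules as well would need a further check, for example inspecting the output of module list for the modules the job expects.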

JessicaMeixner-NOAA commented 3 years ago

@climbfuji Where can I find the module-setup.sh file? I copied the /modulefiles/ufs_gaea.intel to modules.fv3, but it fails because there is no module-setup.sh.

climbfuji commented 3 years ago

Can you copy it from here for now? It's some file under NEMS with a different name.

/lustre/f2/scratch/Dom.Heinzeller/FV3_RT/init_lmod_test

JessicaMeixner-NOAA commented 3 years ago

I think it worked. My directory is /lustre/f2/scratch/ncep/Jessica.Meixner/init_lmod_test and I ran tryfix.sub.

climbfuji commented 3 years ago

Yes, looks good. Can you try building the APP S2SW?

JessicaMeixner-NOAA commented 3 years ago

I submitted rt.sh -e and that still failed... I guess I'll have to switch from tcsh to bash?

I think Denise can now build with the fix I suggested, so we can hopefully at least have that fixed, even if I can't do it myself.

DusanJovic-NOAA commented 3 years ago

> I guess I'll have to switch from tcsh to bash?

Good idea. Regardless of this issue.

DeniseWorthen commented 3 years ago

The Gaea build was added in PR #533.