ufs-community / ufs-weather-model

UFS Weather Model

CDEPS regression test on WCOSS2 #2493

Closed sanAkel closed 2 weeks ago

sanAkel commented 2 weeks ago

Description

All CDEPS Data Atmosphere regression tests seem not to compile or run on WCOSS2.

Solution

Alternatives

N/A

Related to

RTOFS development.

cc: @aerorahul, @junwang-noaa

DeniseWorthen commented 2 weeks ago

@sanAkel The -wcoss2 in the RUN line means that these tests don't store baselines on WCOSS2. If you're trying to just run the tests, they'll fail w/ an error that there is no baseline to compare against. If that is the issue, you can generate your own baseline and then use the run directories as a sandbox.

EDIT: Actually these tests don't even compile on WCOSS2 (see the compile line).
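To see which tests a platform is excluded from, the machine column of the test list can be scanned directly. The sketch below is illustrative only: the sample lines and their `- wcoss2` entries are made up for the example, `list_excluded` is a hypothetical helper, and it assumes the machine list sits in the fifth `|`-separated field of COMPILE lines and the third field of RUN lines.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample of an rt.conf section; the "- wcoss2" entries are
# illustrative and not copied from the real file.
conf=$(mktemp)
cat > "${conf}" <<'EOF'
### CDEPS Data Atmosphere tests ###
COMPILE | datm_cdeps | intel | -DAPP=NG-GODAS | - wcoss2 | fv3 |
RUN | datm_cdeps_control_cfsr | - wcoss2 | baseline |
RUN | datm_cdeps_restart_cfsr | - wcoss2 noaacloud | | datm_cdeps_control_cfsr
RUN | datm_cdeps_mx025_gefs | | baseline |
EOF

# Print every COMPILE/RUN line whose machine field excludes the given machine.
# Assumption: machines are field 5 on COMPILE lines and field 3 on RUN lines,
# and an exclusion list starts with "-" (written "- wcoss2" or "-wcoss2").
list_excluded() {
  local machine=$1 file=$2
  awk -F'|' -v m="${machine}" '
    /^(COMPILE|RUN)/ {
      field = ($1 ~ /^COMPILE/) ? $5 : $3
      n = split(field, tok, " ")          # split on runs of whitespace
      if (n > 0 && tok[1] ~ /^-/) {
        for (i = 1; i <= n; i++) {
          t = tok[i]
          sub(/^-/, "", t)                # drop a leading "-" glued to the name
          if (t == m) { print; break }
        }
      }
    }
  ' "${file}"
}

excluded=$(list_excluded wcoss2 "${conf}")
echo "${excluded}"
rm -f "${conf}"
```

Pointing `list_excluded` at the real test list in the tests directory would show which lines need the exclusion removed before they can be built and run on that platform.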

sanAkel commented 2 weeks ago

@DeniseWorthen This ⬇️

EDIT: Actually these tests don't even compile on WCOSS2 (see the compile line).

Is exactly what I thought.

To be more precise, I would like to:

  1. Build,
  2. Run

on WCOSS2.

To reiterate, what should I try in order to accomplish 1 and 2 above?

DeniseWorthen commented 2 weeks ago

@sanAkel You need to remove the -wcoss2 where appropriate. The input data directories are the same across all platforms, so if it compiles, it should run. You can try compiling on its own first with the following (run it in the tests directory and substitute your own names, etc.):

./compile.sh wcoss2 "-DAPP=NG-GODAS" test.exe intel NO NO 2>&1 | tee test.compile.log

sanAkel commented 2 weeks ago

Thanks a lot @DeniseWorthen, will try that when Cactus lets me log in.

Cactus let me log in; updating.

sanAkel commented 2 weeks ago

@DeniseWorthen, since I have a build, I would like to use run_test.sh so I can run a test case. Not sure what's wrong in ⬇️

./run_test.sh ./ ./tests/ datm_cdeps_mx025_gefs datm_cdeps NG-GODAS

It throws an error with MACHINE_ID:

+ set -o pipefail
+ echo PID=1066443
PID=1066443
+ SECONDS=0
+ trap '[ "$?" -eq 0 ] || write_fail_test' EXIT
+ trap 'echo "run_test.sh interrupted PID=$$"; cleanup' INT
+ trap 'echo "run_test.sh terminated PID=$$";  cleanup' TERM
+ [[ 5 != 5 ]]
+ export PATHRT=./
+ PATHRT=./
+ export RUNDIR_ROOT=./tests/
+ RUNDIR_ROOT=./tests/
+ export TEST_NAME=datm_cdeps_mx025_gefs
+ TEST_NAME=datm_cdeps_mx025_gefs
+ export TEST_ID=datm_cdeps
+ TEST_ID=datm_cdeps
+ export COMPILE_ID=NG-GODAS
+ COMPILE_ID=NG-GODAS
+ echo 'PATHRT: ./'
PATHRT: ./
+ echo 'RUNDIR_ROOT: ./tests/'
RUNDIR_ROOT: ./tests/
+ echo 'TEST_NAME: datm_cdeps_mx025_gefs'
TEST_NAME: datm_cdeps_mx025_gefs
+ echo 'TEST_ID: datm_cdeps'
TEST_ID: datm_cdeps
+ echo 'COMPILE_ID: NG-GODAS'
COMPILE_ID: NG-GODAS
+ cd ./
+ unset MODEL_CONFIGURE
+ unset UFS_CONFIGURE
+ [[ -e ./tests//run_test_datm_cdeps.env ]]
+ source default_vars.sh
++ THRD=1
++ export INPES_atmaero=4
++ INPES_atmaero=4
++ export JNPES_atmaero=8
++ JNPES_atmaero=8
++ export WPG_atmaero=6
++ WPG_atmaero=6
++ export THRD_cpl_atmw=1
++ THRD_cpl_atmw=1
++ export INPES_cpl_atmw=3
++ INPES_cpl_atmw=3
++ export JNPES_cpl_atmw=8
++ JNPES_cpl_atmw=8
++ export WPG_cpl_atmw=6
++ WPG_cpl_atmw=6
++ export WAV_tasks_cpl_atmw=30
++ WAV_tasks_cpl_atmw=30
++ export WAV_thrds_cpl_atmw=1
++ WAV_thrds_cpl_atmw=1
++ export THRD_cpl_c48=1
++ THRD_cpl_c48=1
++ export INPES_cpl_c48=1
++ INPES_cpl_c48=1
++ export JNPES_cpl_c48=1
++ JNPES_cpl_c48=1
++ export WPG_cpl_c48=6
++ WPG_cpl_c48=6
++ export OCN_tasks_cpl_c48=4
++ OCN_tasks_cpl_c48=4
++ export ICE_tasks_cpl_c48=4
++ ICE_tasks_cpl_c48=4
++ export THRD_cpl_dflt=1
++ THRD_cpl_dflt=1
++ export INPES_cpl_dflt=3
++ INPES_cpl_dflt=3
++ export JNPES_cpl_dflt=8
++ JNPES_cpl_dflt=8
++ export WPG_cpl_dflt=6
++ WPG_cpl_dflt=6
++ export OCN_tasks_cpl_dflt=20
++ OCN_tasks_cpl_dflt=20
++ export ICE_tasks_cpl_dflt=10
++ ICE_tasks_cpl_dflt=10
++ export WAV_tasks_cpl_dflt=20
++ WAV_tasks_cpl_dflt=20
++ export THRD_cpl_thrd=2
++ THRD_cpl_thrd=2
++ export INPES_cpl_thrd=3
++ INPES_cpl_thrd=3
++ export JNPES_cpl_thrd=4
++ JNPES_cpl_thrd=4
++ export WPG_cpl_thrd=6
++ WPG_cpl_thrd=6
++ export OCN_tasks_cpl_thrd=20
++ OCN_tasks_cpl_thrd=20
++ export OCN_thrds_cpl_thrd=1
++ OCN_thrds_cpl_thrd=1
++ export ICE_tasks_cpl_thrd=10
++ ICE_tasks_cpl_thrd=10
++ export ICE_thrds_cpl_thrd=1
++ ICE_thrds_cpl_thrd=1
++ export WAV_tasks_cpl_thrd=12
++ WAV_tasks_cpl_thrd=12
++ export WAV_thrds_cpl_thrd=2
++ WAV_thrds_cpl_thrd=2
++ export THRD_cpl_dcmp=1
++ THRD_cpl_dcmp=1
++ export INPES_cpl_dcmp=4
++ INPES_cpl_dcmp=4
++ export JNPES_cpl_dcmp=6
++ JNPES_cpl_dcmp=6
++ export WPG_cpl_dcmp=6
++ WPG_cpl_dcmp=6
++ export OCN_tasks_cpl_dcmp=20
++ OCN_tasks_cpl_dcmp=20
++ export ICE_tasks_cpl_dcmp=10
++ ICE_tasks_cpl_dcmp=10
++ export WAV_tasks_cpl_dcmp=20
++ WAV_tasks_cpl_dcmp=20
++ export THRD_cpl_mpi=1
++ THRD_cpl_mpi=1
++ export INPES_cpl_mpi=4
++ INPES_cpl_mpi=4
++ export JNPES_cpl_mpi=8
++ JNPES_cpl_mpi=8
++ export WPG_cpl_mpi=6
++ WPG_cpl_mpi=6
++ export OCN_tasks_cpl_mpi=34
++ OCN_tasks_cpl_mpi=34
++ export ICE_tasks_cpl_mpi=20
++ ICE_tasks_cpl_mpi=20
++ export WAV_tasks_cpl_mpi=28
++ WAV_tasks_cpl_mpi=28
++ export THRD_cpl_bmrk=2
++ THRD_cpl_bmrk=2
++ export INPES_cpl_bmrk=8
++ INPES_cpl_bmrk=8
++ export JNPES_cpl_bmrk=8
++ JNPES_cpl_bmrk=8
++ export WPG_cpl_bmrk=48
++ WPG_cpl_bmrk=48
++ export OCN_tasks_cpl_bmrk=120
++ OCN_tasks_cpl_bmrk=120
++ export OCN_thrds_cpl_bmrk=1
++ OCN_thrds_cpl_bmrk=1
++ export ICE_tasks_cpl_bmrk=48
++ ICE_tasks_cpl_bmrk=48
++ export ICE_thrds_cpl_bmrk=1
++ ICE_thrds_cpl_bmrk=1
++ export WAV_tasks_cpl_bmrk=80
++ WAV_tasks_cpl_bmrk=80
++ export WAV_thrds_cpl_bmrk=2
++ WAV_thrds_cpl_bmrk=2
++ export THRD_cpl_c192=2
++ THRD_cpl_c192=2
++ export INPES_cpl_c192=6
++ INPES_cpl_c192=6
++ export JNPES_cpl_c192=8
++ JNPES_cpl_c192=8
++ export WPG_cpl_c192=12
++ WPG_cpl_c192=12
++ export OCN_tasks_cpl_c192=60
++ OCN_tasks_cpl_c192=60
++ export ICE_tasks_cpl_c192=24
++ ICE_tasks_cpl_c192=24
++ export WAV_tasks_cpl_c192=80
++ WAV_tasks_cpl_c192=80
++ export ATM_compute_tasks_cdeps_100=12
++ ATM_compute_tasks_cdeps_100=12
++ export OCN_tasks_cdeps_100=16
++ OCN_tasks_cdeps_100=16
++ export ICE_tasks_cdeps_100=12
++ ICE_tasks_cdeps_100=12
++ export ATM_compute_tasks_cdeps_025=40
++ ATM_compute_tasks_cdeps_025=40
++ export OCN_tasks_cdeps_025=120
++ OCN_tasks_cdeps_025=120
++ export ICE_tasks_cdeps_025=48
++ ICE_tasks_cdeps_025=48
++ export INPES_aqm=33
++ INPES_aqm=33
++ export JNPES_aqm=8
++ JNPES_aqm=8
++ export THRD_cpl_unstr=1
++ THRD_cpl_unstr=1
++ export INPES_cpl_unstr=3
++ INPES_cpl_unstr=3
++ export JNPES_cpl_unstr=8
++ JNPES_cpl_unstr=8
++ export WPG_cpl_unstr=6
++ WPG_cpl_unstr=6
++ export OCN_tasks_cpl_unstr=20
++ OCN_tasks_cpl_unstr=20
++ export ICE_tasks_cpl_unstr=10
++ ICE_tasks_cpl_unstr=10
++ export WAV_tasks_cpl_unstr=60
++ WAV_tasks_cpl_unstr=60
++ export THRD_cpl_unstr_mpi=1
++ THRD_cpl_unstr_mpi=1
++ export INPES_cpl_unstr_mpi=4
++ INPES_cpl_unstr_mpi=4
++ export JNPES_cpl_unstr_mpi=8
++ JNPES_cpl_unstr_mpi=8
++ export WPG_cpl_unstr_mpi=6
++ WPG_cpl_unstr_mpi=6
++ export OCN_tasks_cpl_unstr_mpi=34
++ OCN_tasks_cpl_unstr_mpi=34
++ export ICE_tasks_cpl_unstr_mpi=20
++ ICE_tasks_cpl_unstr_mpi=20
++ export WAV_tasks_cpl_unstr_mpi=50
++ WAV_tasks_cpl_unstr_mpi=50
++ export aqm_omp_num_threads=1
++ aqm_omp_num_threads=1
++ export atm_omp_num_threads=1
++ atm_omp_num_threads=1
++ export chm_omp_num_threads=1
++ chm_omp_num_threads=1
++ export ice_omp_num_threads=1
++ ice_omp_num_threads=1
++ export lnd_omp_num_threads=1
++ lnd_omp_num_threads=1
++ export med_omp_num_threads=1
++ med_omp_num_threads=1
++ export ocn_omp_num_threads=1
++ ocn_omp_num_threads=1
++ export wav_omp_num_threads=1
++ wav_omp_num_threads=1
++ export fbh_omp_num_threads=1
++ fbh_omp_num_threads=1
default_vars.sh: line 121: MACHINE_ID: unbound variable
+++ '[' 1 -eq 0 ']'
+++ write_fail_test
+++ echo 'datm_cdeps failed in run_test'
+++ [[ false == true ]]
+++ [[ false == true ]]
+++ exit 0

DeniseWorthen commented 2 weeks ago

I've never tried to use run_test that way, and I'm not sure it would work. What I would do is create an rt.test file and put whichever datm configuration you're interested in there, but remove the -wcoss2:

### CDEPS Data Atmosphere tests ###
COMPILE | datm_cdeps | intel | -DAPP=NG-GODAS |     | fv3 |
RUN     | datm_cdeps_control_cfsr |             | baseline |
RUN     | datm_cdeps_restart_cfsr | - noaacloud |          | datm_cdeps_control_cfsr
....

Then, run rt.sh to create a baseline (it will fail otherwise, since no baseline is on wcoss2) and keep the run directories (to use as a sandbox):

./rt.sh -cek -l rt.test -a nems >output 2>&1 &

Change nems to whatever account you have access to.
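The two steps above can be sketched as one small script. This is a sketch under stated assumptions, not a tested recipe: it assumes it is run from the tests directory of a clone, `ACCOUNT` and `CONF` are hypothetical environment variables introduced here for convenience, and the rt.sh flags are copied as given in this comment. If rt.sh is not found in the current directory, it only prints the command it would have run.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholders (assumptions): override ACCOUNT with a project account you can
# charge to, and CONF with the name you want for the test list.
ACCOUNT=${ACCOUNT:-nems}
conf=${CONF:-rt.test}

# Write a minimal test list with no machine exclusions on either line.
cat > "${conf}" <<'EOF'
### CDEPS Data Atmosphere tests ###
COMPILE | datm_cdeps | intel | -DAPP=NG-GODAS |     | fv3 |
RUN     | datm_cdeps_control_cfsr |     | baseline |
EOF

# Same flags as in the comment: create a baseline and keep the run directories.
cmd=(./rt.sh -cek -l "${conf}" -a "${ACCOUNT}")

if [[ -x ./rt.sh ]]; then
  # Run in the background and capture stdout and stderr in one file.
  "${cmd[@]}" > output 2>&1 &
  echo "started rt.sh in the background, PID $!"
else
  echo "rt.sh not found here; would run: ${cmd[*]}"
fi
```

Keeping the command in an array makes it easy to print exactly what will be submitted before anything is queued.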

sanAkel commented 2 weeks ago

@DeniseWorthen Thanks for the advice.

I tried this test case ⬇️:

### CDEPS Data Atmosphere tests ###
COMPILE | datm_cdeps | intel | -DAPP=NG-GODAS |     | fv3 |
RUN     | datm_cdeps_mx025_gefs               |     | baseline |

I used rt.sh as you suggested:

./rt.sh -a couple -cekv -l my_rt.conf3 |& tee test3.log

But it failed! (Path to my clone: /u/santha.akella/ufs-wm-08Nov2024/) Thanks again for helping!

tail -n 100 test3.log

+ grep -q quota /u/santha.akella/ufs-wm-08Nov2024/tests/logs/log_wcoss2/compile_datm_cdeps_intel.log
+ grep -q 'TIME LIMIT' /lfs/h2/emc/ptmp/santha.akella/FV3_RT/rt_3750146/compile_datm_cdeps_intel/err
grep: /lfs/h2/emc/ptmp/santha.akella/FV3_RT/rt_3750146/compile_datm_cdeps_intel/err: No such file or directory
+ echo
+ echo 'FAILED: UNABLE TO FINISH COMPILE -- COMPILE '\''datm_cdeps_intel'\'' [, ]'
+ [[ -n /u/santha.akella/ufs-wm-08Nov2024/tests/logs/log_wcoss2/compile_datm_cdeps_intel.log ]]
+ FAILED_COMPILES+=("COMPILE ${COMPILE_ID}: ${COMPILE_RESULT}")
+ [[ -n /u/santha.akella/ufs-wm-08Nov2024/tests/logs/log_wcoss2/compile_datm_cdeps_intel.log ]]
+ FAILED_COMPILE_LOGS+=("${FAIL_LOG}")
+ read -r line
+ line='RUN     | datm_cdeps_mx025_gefs               |     | baseline |'
+ [[ -n RUN     | datm_cdeps_mx025_gefs               |     | baseline | ]]
+ [[ 64 == 0 ]]
+ [[ RUN     | datm_cdeps_mx025_gefs               |     | baseline | == \#* ]]
+ local valid_compile=false
+ local valid_test=false
+ [[ RUN     | datm_cdeps_mx025_gefs               |     | baseline | == COMPILE* ]]
+ [[ RUN     | datm_cdeps_mx025_gefs               |     | baseline | =~ RUN ]]
+ [[ false == true ]]
++ cut -d '|' -f3
+ RMACHINES='     '
++ sed -e 's/^ *//' -e 's/ *$//'
+ RMACHINES=
++ cut -d '|' -f2
+ TEST_NAME=' datm_cdeps_mx025_gefs               '
++ sed -e 's/^ *//' -e 's/ *$//'
+ TEST_NAME=datm_cdeps_mx025_gefs
++ cut -d '|' -f4
+ GEN_BASELINE=' baseline '
++ sed -e 's/^ *//' -e 's/ *$//'
+ GEN_BASELINE=baseline
+ [[ '' == '' ]]
+ valid_test=true
+ [[ true == true ]]
+ TEST_COUNTER=1
+ GETMEMFROMLOG=
+ FAIL_LOG=
+ TEST_RESULT=
+ TIME_FILE=
+ TEST_TIME=
+ RT_TEST_TIME=
+ RT_TEST_MEM=
+ [[ true == true ]]
+ [[ baseline != \b\a\s\e\l\i\n\e ]]
+ [[ FAILED: UNABLE TO FINISH COMPILE =~ FAILED ]]
+ TEST_RESULT='SKIPPED: ASSOCIATED COMPILE FAILED'
+ SKIPPED_TESTS+=("TEST ${TEST_NAME}_${COMPILER}: ${TEST_RESULT}")
+ [[ SKIPPED: ASSOCIATED COMPILE FAILED == \P\A\S\S ]]
+ echo 'SKIPPED: ASSOCIATED COMPILE FAILED -- TEST '\''datm_cdeps_mx025_gefs_intel'\'' [, ]( MB)'
+ [[ -n '' ]]
+ [[ -n '' ]]
+ [[ -n '' ]]
+ read -r line
++ printf '%02dh:%02dm:%02ds\n' 0 1 54
+ elapsed_time=00h:01m:54s
+ cat
+ [[ 1 -ne 0 ]]
+ echo 'Failed Compiles:'
+ for i in "${!FAILED_COMPILES[@]}"
+ echo '* COMPILE datm_cdeps_intel: FAILED: UNABLE TO FINISH COMPILE'
+ echo '-- LOG: /u/santha.akella/ufs-wm-08Nov2024/tests/logs/log_wcoss2/compile_datm_cdeps_intel.log'
+ [[ 0 -ne 0 ]]
+ [[ 0 -ne 0 ]]
+ [[ 1 -eq 0 ]]
+ cat
+ echo 'REGRESSION TEST RESULT: FAILURE'
REGRESSION TEST RESULT: FAILURE
+ echo '******Regression Testing Script Completed******'
******Regression Testing Script Completed******
+ echo 'rt.sh finished'
rt.sh finished
+ cleanup
+ echo 'rt.sh: Cleaning up...'
rt.sh: Cleaning up...
++ awk '{print $2}'
+ awk_info=3750146
+ [[ 3750146 == \3\7\5\0\1\4\6 ]]
+ rm -rf /u/santha.akella/ufs-wm-08Nov2024/tests/lock
+ [[ true == true ]]
+ ecflow_stop
+ [[ true == true ]]
+ echo 'rt_utils.sh: Checking whether to stop ecflow_server...'
rt_utils.sh: Checking whether to stop ecflow_server...
+ set +e
++ ecflow_client --get
+ SUITES='#5.6.0
# enddef'
++ grep '^suite'
+ SUITES=
+ [[ -z '' ]]
+ echo 'rt_utils.sh: No other suites running, stopping ecflow_server'
rt_utils.sh: No other suites running, stopping ecflow_server
+ ecflow_client --halt=yes
+ ecflow_client --check_pt
+ ecflow_client --terminate=yes
+ set -e
+ trap 0
+ echo 'rt.sh: Exiting.'
rt.sh: Exiting.
+ exit

jiandewang commented 2 weeks ago

I just repeated what you did, and it worked fine for me. I think you are not using the right account name in:

./rt.sh -a couple -cekv -l my_rt.conf3

On WCOSS my account is GFS-DEV, so yours must be something similar (xxx-DEV). There is no way to find this out on the machine with a Unix command; the only way is to ask people in the same group as you, or ask the WCOSS SA.

jiandewang commented 2 weeks ago

Also, remove -e, since you are only running one job.

sanAkel commented 2 weeks ago

Thanks @jiandewang

I was assuming that the account is one of the groups listed by the groups command, as in:

santha.akella@clogin02:~/ufs-wm-08Nov2024/tests> groups
emc couple backupsys

That was an error on my part!

I went back to the email that notified me of my account creation, found a link to the projects in the documentation pages, and tried... RTOFS-DEV worked for me. It compiled and ran. Yay! Thanks again!

sanAkel commented 2 weeks ago

Solution/ Alternatives

Use the regression test script: ./rt.sh -a <ACCOUNT> -ckv -l my_rt.conf |& tee test.log

where my_rt.conf contains:

cat my_rt.conf

### CDEPS Data Atmosphere tests ###
COMPILE | datm_cdeps | intel | -DAPP=NG-GODAS |     | fv3 |
RUN     | datm_cdeps_mx025_gefs               |     | baseline |
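When a run fails, as in the trace earlier in this thread, the relevant lines are easy to lose in the traced output. A minimal sketch for pulling them out of a log captured with tee follows. `summarize_rt_log` is a hypothetical helper, and the only strings it matches are ones visible in that trace (`REGRESSION TEST RESULT: FAILURE` and `-- LOG: <path>`). What a passing run prints is not shown in the thread, so the sketch does not try to detect it.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Summarize an rt.sh log: report a FAILURE result and list compile log paths.
summarize_rt_log() {
  local log=$1
  if grep -q '^REGRESSION TEST RESULT: FAILURE' "${log}"; then
    echo "result: FAILURE"
    # Paths appear as "-- LOG: <path>", possibly inside a traced echo command,
    # so capture everything after the marker up to a closing quote or end of line.
    sed -n "s/.*-- LOG: \([^']*\).*/compile log: \1/p" "${log}" | sort -u
  else
    echo "result: no FAILURE line found"
  fi
}

# Sample input: lines copied from the failing trace in this thread.
log=$(mktemp)
cat > "${log}" <<'EOF'
+ echo '* COMPILE datm_cdeps_intel: FAILED: UNABLE TO FINISH COMPILE'
+ echo '-- LOG: /u/santha.akella/ufs-wm-08Nov2024/tests/logs/log_wcoss2/compile_datm_cdeps_intel.log'
+ echo 'REGRESSION TEST RESULT: FAILURE'
REGRESSION TEST RESULT: FAILURE
EOF

summary=$(summarize_rt_log "${log}")
echo "${summary}"
rm -f "${log}"
```

On a real run the function would be called on test.log; the compile log it points to is the place to look for the actual build or submission error.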