ufs-community / ufs-weather-model


RT atm_ds2s_docn_dice failing on S4 and Jet #2385

Closed InnocentSouopgui-NOAA closed 2 months ago

InnocentSouopgui-NOAA commented 3 months ago

Description

The Regression Test atm_ds2s_docn_dice is failing on S4 and Jet. It fails to create the baseline.

To Reproduce:

This is happening with Intel. I tried on S4 and Jet and it failed on both clusters. To reproduce the bug on one of those clusters (a consolidated command sketch follows these steps):

  1. clone the UFS weather model and move to the tests directory
  2. run the test to create its baseline: ./rt.sh -c -e -n "atm_ds2s_docn_dice intel" -a "ACCOUNT_NAME"
  3. or create all baselines: ./rt.sh -c -e -a "ACCOUNT_NAME"
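
Putting the steps together, a minimal end-to-end sketch (assuming a fresh clone, with ACCOUNT_NAME as a placeholder):

 git clone https://github.com/ufs-community/ufs-weather-model.git
 cd ufs-weather-model
 git submodule update --init --recursive
 cd tests
 ./rt.sh -c -e -n "atm_ds2s_docn_dice intel" -a "ACCOUNT_NAME"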

Additional context

Output

output logs

ECFLOW Tasks Remaining: 0/3 rt_utils.sh: ECFLOW tasks completed, cleaning up suite
rt.sh: Generating Regression Testing Log...

grep: <...>/FV3_RT/rt_3364114/atm_ds2s_docn_dice_intel/err: No such file or directory
REGRESSION TEST RESULT: FAILURE
******Regression Testing Script Completed******
rt.sh finished
rt.sh: Cleaning up...
rt_utils.sh: Checking whether to stop ecflow_server...
rt_utils.sh: No other suites running, stopping ecflow_server
rt.sh: Exiting.
NickSzapiro-NOAA commented 3 months ago

Thanks for this issue, @InnocentSouopgui-NOAA. This test depends on cpld_control_nowave_noaero_p8, and that test fails first (<...>/FV3_RT/rt_3364114/cpld_control_nowave_noaero_p8_intel/err). The PET000.ESMF_LogFile shows the error "UFSDriver.F90:543 Not valid - No component mom6 found".

I don't see where you have your ufs-weather-model code. May I check that you ran (something like):

git clone https://github.com/ufs-community/ufs-weather-model.git
cd ufs-weather-model/
git submodule update --init --recursive

This last command may have been skipped.
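
If it helps, a quick way to verify the submodules (standard git, nothing repo-specific) is:

 git submodule status

Entries whose line starts with a "-" are submodules that have not been initialized.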

InnocentSouopgui-NOAA commented 3 months ago

My working copy is at /mnt/lfs5/NESDIS/nesdis-rdo2/Innocent.Souopgui/RT/weekly_20240729

I used --recursive in the git clone command.

NickSzapiro-NOAA commented 3 months ago

I'm struggling to reproduce your error. Can you check if these steps also work for you on Jet

 git clone https://github.com/ufs-community/ufs-weather-model.git
 cd ufs-weather-model/
 git submodule update --init --recursive
 cd tests
 nohup ./rt.sh -a ACCOUNT_NAME -e -k -l my_rt.conf &

where my_rt.conf is

COMPILE | s2sw | intel | -DAPP=S2SW -DCCPP_SUITES=FV3_GFS_v17_coupled_p8 |                                 | fv3 |
RUN | cpld_control_nowave_noaero_p8                     | - noaacloud                          | baseline |
COMPILE | atm_ds2s_docn_dice | intel | -DAPP=ATM_DS2S  -DCCPP_SUITES=FV3_GFS_v17_coupled_p8    | - wcoss2 acorn  | fv3 |
RUN | atm_ds2s_docn_dice                                | - noaacloud wcoss2 acorn             | baseline | cpld_control_nowave_noaero_p8

Tests pass for me, with baseline dir = /mnt/lfs4/HFIP/hfv3gfs/role.epic/RT/NEMSfv3gfs/develop-20240730/atm_ds2s_docn_dice_intel working dir = /lfs4/HFIP/h-nems/Nick.Szapiro/RT_RUNDIRS/Nick.Szapiro/FV3_RT/rt_3491414/atm_ds2s_docn_dice_intel

NickSzapiro-NOAA commented 3 months ago

It's possible that git clone --recurse-submodules may work instead of git clone --recursive, but the steps in the docs are here: https://github.com/ufs-community/ufs-weather-model/wiki/Making-code-changes-in-the-UFS-weather-model-and-its-subcomponents#checkout-and-update
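
For reference, these two forms should be equivalent (--recursive is documented as an alias of --recurse-submodules for git clone, but worth double-checking with your git version):

 git clone --recurse-submodules https://github.com/ufs-community/ufs-weather-model.git
 git clone --recursive https://github.com/ufs-community/ufs-weather-model.git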

I don't think this is related to the atm_ds2s_docn_dice test though

NickSzapiro-NOAA commented 3 months ago

@InnocentSouopgui-NOAA I think I can close this. Please feel free to re-open or reach out if I can help, especially for this test.

For adding this to global-workflow, I'll mention that there are several options to generate the cplhist files for input via CDEPS. Happy to discuss if you would like

InnocentSouopgui-NOAA commented 3 months ago

I'm struggling to reproduce your error. Can you check if these steps also work for you on Jet

 git clone https://github.com/ufs-community/ufs-weather-model.git
 cd ufs-weather-model/
 git submodule update --init --recursive
 cd tests
 nohup ./rt.sh -a ACCOUNT_NAME -e -k -l my_rt.conf &

where my_rt.conf is

COMPILE | s2sw | intel | -DAPP=S2SW -DCCPP_SUITES=FV3_GFS_v17_coupled_p8 |                                 | fv3 |
RUN | cpld_control_nowave_noaero_p8                     | - noaacloud                          | baseline |
COMPILE | atm_ds2s_docn_dice | intel | -DAPP=ATM_DS2S  -DCCPP_SUITES=FV3_GFS_v17_coupled_p8    | - wcoss2 acorn  | fv3 |
RUN | atm_ds2s_docn_dice                                | - noaacloud wcoss2 acorn             | baseline | cpld_control_nowave_noaero_p8

Tests pass for me, with baseline dir = /mnt/lfs4/HFIP/hfv3gfs/role.epic/RT/NEMSfv3gfs/develop-20240730/atm_ds2s_docn_dice_intel working dir = /lfs4/HFIP/h-nems/Nick.Szapiro/RT_RUNDIRS/Nick.Szapiro/FV3_RT/rt_3491414/atm_ds2s_docn_dice_intel

I followed those steps on both Jet and S4. The run on Jet is still waiting in the queue. On S4 it still fails. In particular, on S4 I am the one who runs the regular regression tests, so I need to create the baselines first: with -c it fails with the same error message as before, while cpld_control_nowave_noaero_p8_intel ran without any error message.

NickSzapiro-NOAA commented 3 months ago

Can you post any error information from S4? Were you able to run without the "-c" flag?
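
For reference, running without "-c" would just omit that flag from the earlier command, something like:

 ./rt.sh -e -n "atm_ds2s_docn_dice intel" -a "ACCOUNT_NAME"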

InnocentSouopgui-NOAA commented 3 months ago

Below are the error messages (output of the rt.sh command, and run_atm_ds2s_docn_dice_intel.log) from S4. It looks like S4 does not have the correct version of nco.

ECFLOW Tasks Remaining: 4/4 ................................................................
ECFLOW Tasks Remaining: 3/4 .....
ECFLOW Tasks Remaining: 2/4 ......................................
ECFLOW Tasks Remaining: 1/4 ....
ECFLOW Tasks Remaining: 0/4 rt_utils.sh: ECFLOW tasks completed, cleaning up suite
rt.sh: Generating Regression Testing Log...

grep: /scratch/users/isouopgui/FV3_RT/rt_271984/atm_ds2s_docn_dice_intel/err: No such file or directory
REGRESSION TEST RESULT: FAILURE
******Regression Testing Script Completed******
rt.sh finished
rt.sh: Cleaning up...
rt_utils.sh: Checking whether to stop ecflow_server...
rt_utils.sh: No other suites running, stopping ecflow_server
rt.sh: Exiting.

Looking into run_atm_ds2s_docn_dice_intel.log, I see the following error.

++ module load nco
+++ /usr/share/lmod/lmod/libexec/lmod bash load nco
Lmod has detected the following error: Cannot load module "nco/5.0.4" without
these module(s) loaded:
   udunits2 netcdf4

While processing the following module(s):
    Module fullname  Module Filename
    ---------------  ---------------
    nco/5.0.4        /opt/apps/modulefiles/Compiler/intel/22/nco/5.0.4.lua

If you can point me to where to adapt this for S4, I will work on it.

NickSzapiro-NOAA commented 3 months ago

Ah, ok. I don't have access to S4, but this line: https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/fv3_conf/cpld_docn_dice.IN#L10 would need to look something like module load udunits2 netcdf4 nco on S4 instead, for the ncrcat call just below it. You may want to check on the command line first.
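
A minimal sketch of the change, assuming the exact module names/versions on S4 still need to be confirmed:

 # current line in tests/fv3_conf/cpld_docn_dice.IN (per the log above)
 module load nco
 # possible S4 variant: load the prerequisites first
 module load udunits2 netcdf4 nco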

This may not work; on WCOSS2, for example, something similar failed because of conflicting modules. Note that this is only a temporary implementation until we bring back a fix for histaux output from ESCOMP/CMEPS, which should work on all platforms.

InnocentSouopgui-NOAA commented 3 months ago

Please keep the issue open; I will work on it next week and submit a pull request.

Thanks again.

NickSzapiro-NOAA commented 3 months ago

Sure. Sorry for the trouble. FYI, we expect to remove the nco/ncrcat used there soon anyway.

InnocentSouopgui-NOAA commented 3 months ago

On S4, cpld_control_nowave_noaero_p8 is failing as well, and the RT driver is not noticing. Below is the content of FV3_RT/rt_334524/cpld_control_nowave_noaero_p8_intel/err

++ date +%s
+ echo -n ' 1722876036,'
+ set +x

Currently Loaded Modules:
  1) license_intel/S4                 22) bacio/2.4.1
  2) intel/2022.1                     23) crtm-fix/2.4.0.1_emc
  3) stack-intel/2021.5.0             24) git-lfs/2.10.0
  4) stack-intel-oneapi-mpi/2021.5.0  25) crtm/2.4.0.1
  5) libjpeg/2.1.0                    26) g2/3.4.5
  6) jasper/2.0.32                    27) g2tmpl/1.10.2
  7) zlib/1.2.13                      28) ip/4.3.0
  8) libpng/1.6.37                    29) sp/2.5.0
  9) pkg-config/0.27.1                30) w3emc/2.10.0
 10) hdf5/1.14.0                      31) gftl/1.10.0
 11) snappy/1.1.10                    32) gftl-shared/1.6.1
 12) zstd/1.5.2                       33) fargparse/1.5.0
 13) c-blosc/1.21.5                   34) gettext/0.19.8.1
 14) nghttp2/1.57.0                   35) libxcrypt/4.4.35
 15) curl/8.4.0                       36) sqlite/3.43.2
 16) netcdf-c/4.9.2                   37) util-linux-uuid/2.38.1
 17) netcdf-fortran/4.6.1             38) python/3.10.13
 18) parallel-netcdf/1.12.2           39) mapl/2.40.3-esmf-8.6.0
 19) parallelio/2.5.10                40) scotch/7.0.4
 20) esmf/8.6.0                       41) ufs_common
 21) fms/2023.04                      42) modules.fv3

+ ulimit -s unlimited
++ date
+ echo 'Model started:  ' Mon Aug 5 16:40:38 GMT 2024
+ export MPI_TYPE_DEPTH=20
+ MPI_TYPE_DEPTH=20
+ export OMP_STACKSIZE=512M
+ OMP_STACKSIZE=512M
+ export OMP_NUM_THREADS=1
+ OMP_NUM_THREADS=1
+ export ESMF_RUNTIME_COMPLIANCECHECK=OFF:depth=4
+ ESMF_RUNTIME_COMPLIANCECHECK=OFF:depth=4
+ export PSM_RANKS_PER_CONTEXT=4
+ PSM_RANKS_PER_CONTEXT=4
+ export PSM_SHAREDCONTEXTS=1
+ PSM_SHAREDCONTEXTS=1
+ sync
+ sleep 1
+ '[' NO = WHEN_RUNNING ']'
+ srun --label -n 192 ./fv3.exe
150: Abort(1) on node 150 (rank 150 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 150
165: Abort(1) on node 165 (rank 165 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 165
177: Abort(1) on node 177 (rank 177 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 177
179: Abort(1) on node 179 (rank 179 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 179
182: Abort(1) on node 182 (rank 182 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 182
185: Abort(1) on node 185 (rank 185 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 185
187: Abort(1) on node 187 (rank 187 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 187
191: Abort(1) on node 191 (rank 191 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 191
160: Abort(1) on node 160 (rank 160 in comm 496): application called MPI_Abort(comm=0x84000003, 1) - process 160
161: Abort(1) on node 161 (rank 161 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 161
162: Abort(1) on node 162 (rank 162 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 162
163: Abort(1) on node 163 (rank 163 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 163
164: Abort(1) on node 164 (rank 164 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 164
166: Abort(1) on node 166 (rank 166 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 166
167: Abort(1) on node 167 (rank 167 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 167
168: Abort(1) on node 168 (rank 168 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 168
169: Abort(1) on node 169 (rank 169 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 169
170: Abort(1) on node 170 (rank 170 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 170
171: Abort(1) on node 171 (rank 171 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 171
172: Abort(1) on node 172 (rank 172 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 172
151: Abort(1) on node 151 (rank 151 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 151
152: Abort(1) on node 152 (rank 152 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 152
153: Abort(1) on node 153 (rank 153 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 153
154: Abort(1) on node 154 (rank 154 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 154
155: Abort(1) on node 155 (rank 155 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 155
157: Abort(1) on node 157 (rank 157 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 157
158: Abort(1) on node 158 (rank 158 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 158
159: Abort(1) on node 159 (rank 159 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 159
156: Abort(1) on node 156 (rank 156 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 156
173: Abort(1) on node 173 (rank 173 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 173
174: Abort(1) on node 174 (rank 174 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 174
175: Abort(1) on node 175 (rank 175 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 175
176: Abort(1) on node 176 (rank 176 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 176
178: Abort(1) on node 178 (rank 178 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 178
180: Abort(1) on node 180 (rank 180 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 180
181: Abort(1) on node 181 (rank 181 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 181
183: Abort(1) on node 183 (rank 183 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 183
184: Abort(1) on node 184 (rank 184 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 184
186: Abort(1) on node 186 (rank 186 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 186
188: Abort(1) on node 188 (rank 188 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 188
189: Abort(1) on node 189 (rank 189 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 189
190: Abort(1) on node 190 (rank 190 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 190
srun: error: s4-203-c15: tasks 128-159: Exited with exit code 1
srun: Terminating job step 29110319.0
  0: slurmstepd: error: *** STEP 29110319.0 ON s4-203-c11 CANCELLED AT 2024-08-05T16:41:11 ***
srun: error: s4-203-c13: tasks 64-95: Exited with exit code 1
srun: error: s4-203-c14: tasks 96-127: Exited with exit code 1
srun: error: s4-203-c11: tasks 0-11,13-31: Exited with exit code 1
srun: error: s4-203-c11: task 12: Segmentation fault
srun: error: s4-203-c12: tasks 32-63: Exited with exit code 1
srun: error: s4-203-c16: tasks 160-165,167-191: Exited with exit code 1
srun: error: s4-203-c16: task 166: Segmentation fault
NickSzapiro-NOAA commented 3 months ago

Was cpld_control_nowave_noaero_p8 running on S4 before? Can you check if there is any more error information in FV3_RT/rt_334524/cpld_control_nowave_noaero_p8_intel/PET*.ESMF_LogFile or ufs-weather-model/tests/logs/logs_S4/ ?

InnocentSouopgui-NOAA commented 3 months ago

Yes, cpld_control_nowave_noaero_p8 was running on S4 before. I have a baseline created in June that includes cpld_control_nowave_noaero_p8, and it looks fine. The out file shows a 24-hour run, and the err file shows no error.

As for the one I am trying now, there is no error in PET*.ESMF_LogFile or ufs-weather-model/tests/logs/logs_S4. The out file is just a couple of lines; its content is included below.

FV3_RT/rt_334524/cpld_control_nowave_noaero_p8_intel/out

Model started:   Mon Aug 5 16:40:38 GMT 2024
  0: [0] MPI startup(): I_MPI_DAPL_UD variable has been removed from the product, its value is ignored
  0:
  0:
  0:
  0: * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
  0:      PROGRAM ufs-weather-model HAS BEGUN. COMPILED       0.00     ORG: np23
  0:      STARTING DATE-TIME  AUG 05,2024  16:41:06.669  218  MON   2460528
  0:
  0:
  0: MPI Library = Intel(R) MPI Library 2021.5 for Linux* OS
  0:
  0: MPI Version = 3.1
TOTCPU=01:58:24 ELAP=00:00:37 REQMEM=5875Mc REQCPUS=192 ALLOCCPUS=192 TIMELIMIT=00:30:00 PART=s4 ACCT=star
MAXRSS=100284K MAXVMSIZE=5576760K
________________________________________________________________
Job Resource Usage Summary for 29110319

  CPU Time Used : 01:58:24
  Memory Used : 100284K
  Virtual Memory Used : 5576760K
  Walltime Used : 00:00:37

  Memory Requested : 5875Mc (n=per node; c=per core)
  CPUs Requested / Allocated : 192 / 192
  Walltime Requested : 00:30:00

  Execution Queue : s4
  Head Node :
  Charged to : star

  Job Stopped : Mon Aug  5 16:41:13 GMT 2024
_____________________________________________________________________
NickSzapiro-NOAA commented 3 months ago

I'm confused. From 4 days ago, I understood that cpld_control_nowave_noaero_p8_intel runs on S4 but atm_ds2s_docn_dice failed because nco/ncrcat didn't module load properly.

Now, cpld_control_nowave_noaero_p8_intel compiles but does not run, and you cannot find any errors. I would expect some error logged in a PET{150-190}.ESMF_LogFile.

Am I following correctly? It is difficult to help without error/log information.

DusanJovic-NOAA commented 3 months ago

It might be worth trying to run DEBUG build first. Add -DDEBUG=ON to the preceding COMPILE job.
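
Applied to the my_rt.conf above, that COMPILE line would look something like this (column spacing is not significant):

 COMPILE | s2sw | intel | -DAPP=S2SW -DCCPP_SUITES=FV3_GFS_v17_coupled_p8 -DDEBUG=ON | | fv3 |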

InnocentSouopgui-NOAA commented 3 months ago

I'm confused. From 4 days ago, I understood that cpld_control_nowave_noaero_p8_intel runs on S4 but atm_ds2s_docn_dice failed because nco/ncrcat didn't module load properly.

Now, cpld_control_nowave_noaero_p8_intel compiles but does not run, and you cannot find any errors. I would expect some error logged in a PET{150-190}.ESMF_LogFile.

Am I following correctly? It is difficult to help without error/log information.

Yes, you are right, there is some confusion. Four days ago, when I reported that cpld_control_nowave_noaero_p8_intel ran successfully, I was looking only at the output from ecflow. At the time, atm_ds2s_docn_dice failed on loading nco. After solving the NCO problem, atm_ds2s_docn_dice failed on missing files from cpld_control_nowave_noaero_p8_intel; that prompted me to look deeper.

I hope that clears up the confusion a bit.

NickSzapiro-NOAA commented 3 months ago

Thanks for looking deeper, @InnocentSouopgui-NOAA. Then, without more error information, would it be possible for you to run cpld_control_nowave_noaero_p8_intel on S4 with DEBUG as Dusan suggested from current ufs-weather-model/develop?

InnocentSouopgui-NOAA commented 2 months ago

Thanks for looking deeper, @InnocentSouopgui-NOAA. Then, without more error information, would it be possible for you to run cpld_control_nowave_noaero_p8_intel on S4 with DEBUG as Dusan suggested from current ufs-weather-model/develop?

I ran cpld_control_nowave_noaero_p8_intel on S4 with DEBUG, and it succeeded; atm_ds2s_docn_dice succeeded after that as well. cpld_control_nowave_noaero_p8_intel took much longer, and I had to increase the requested wall clock time.
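
For reference, increasing the requested wall clock for a test usually means editing its definition; if the common pattern of exporting WLCLK in tests/tests/cpld_control_nowave_noaero_p8 applies here (worth verifying), the change might look like:

 # hypothetical wall clock bump in the test definition (minutes); 60 is illustrative
 export WLCLK=60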

Possibly optimization is not working well on S4 for this case. What can we do about that?

NickSzapiro-NOAA commented 2 months ago

Great that both ran! Can I ask two questions first:

  1. Did you use the same ufs-weather-model hash for the successful DEBUG and previously failing tests?
  2. Do all of the other tests in rt.conf run on S4?
InnocentSouopgui-NOAA commented 2 months ago

  1. Did you use the same ufs-weather-model hash for the successful DEBUG and previously failing tests?

Yes. With the exact same clone, compiling s2sw with DEBUG ON, cpld_control_nowave_noaero_p8_intel is successful; without DEBUG, it fails, leading to the failure of atm_ds2s_docn_dice.

  2. Do all of the other tests in rt.conf run on S4?

Yes, I created baselines for all other tests in rt.conf last week. The failure of atm_ds2s_docn_dice (the only one reported as failed by ecflow) is what made me open the issue.

NickSzapiro-NOAA commented 2 months ago

Can we try to dig into why cpld_control_nowave_noaero_p8_intel (without DEBUG) is failing? Sorry that I don't have access to S4 to run tests myself.

The MPI_Abort (on tasks 150-190) is usually done by ESMF. If so, there must be some log info printed in the PET files. Can you grep -i error PET*?

We should identify which component is failing. In run_dir/cpld_control_nowave_noaero_p8_intel/ufs.configure, you can find the component(s) for these 150-190 tasks/PETs.
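
Hedged command sketches for both checks, run from the test's run directory (the *_petlist_bounds attribute name follows the usual ufs.configure convention and is worth confirming in your file):

 cd run_dir/cpld_control_nowave_noaero_p8_intel
 grep -i error PET*.ESMF_LogFile
 grep -i petlist_bounds ufs.configure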

DeniseWorthen commented 2 months ago

I think it would also be useful to know exactly which component runs on the tasks which are showing the MPI abort. I'm guessing it is MOM6, but can you post your ufs.configure for this test?

InnocentSouopgui-NOAA commented 2 months ago

Attached is the ufs.configure for this test: ufs.configure.txt

NickSzapiro-NOAA commented 2 months ago

Thanks for the ufs.configure. It looks like the abort is from a CICE or MOM6 PET. Is there any error information in the ice_diag log?

FYI, I'm "unassigning" myself from this issue and adding a request for EPIC support, as they are responsible for platform support. I can help as needed.

InnocentSouopgui-NOAA commented 2 months ago

@NickSzapiro-NOAA, there is no ice_diag.d from the failed run, so I guess it did not get to the point where that file is created.

InnocentSouopgui-NOAA commented 2 months ago

I cloned the UFS model today and all tests ran successfully on S4. I was able to create all baselines in rt.conf. Whatever the problem was, the latest development fixed it.

So this issue can be closed.