Closed jiandewang closed 1 year ago
@jiandewang Thanks for this report. @FernandoAndrade-NOAA @zach1221 What happened here? I can see in the Hera log that control_CubedSphereGrid_parallel_intel initially failed due to missing baseline files:
baseline dir = /scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/develop-20230728/control_CubedSphereGrid_parallel_intel
working dir = /scratch1/NCEPDEV/stmp2/Fernando.Andrade-maldonado/FV3_RT/rt_19500/control_CubedSphereGrid_parallel_intel
Checking test 028 control_CubedSphereGrid_parallel_intel results ....
Comparing sfcf000.nc ............ALT CHECK......NOT OK
Comparing sfcf024.nc ............ALT CHECK......NOT OK
Comparing atmf000.nc ............ALT CHECK......NOT OK
Comparing atmf024.nc ............ALT CHECK......NOT OK
Comparing cubed_sphere_grid_sfcf000.nc ............MISSING baseline
Comparing cubed_sphere_grid_sfcf024.nc ............MISSING baseline
Comparing cubed_sphere_grid_atmf000.nc ............MISSING baseline
Comparing cubed_sphere_grid_atmf024.nc ............MISSING baseline
Comparing GFSFLX.GrbF00 ............MISSING baseline
Comparing GFSFLX.GrbF24 ............MISSING baseline
Comparing GFSPRS.GrbF00 ............MISSING baseline
Comparing GFSPRS.GrbF24 ............MISSING baseline
0: The total amount of wall time = 137.435939
0: The maximum resident set size (KB) = 632588
Test 028 control_CubedSphereGrid_parallel_intel FAIL Tries: 2
It appears it was run a second time, with alt-checks passing, but the needed files are not in the baseline directory?
Testing UFSWM Hash: 7f27783c178094d9795d055a4ff20abfc450066a
Testing With Submodule Hashes:
37cbb7d6840ae7515a9a8f0dfd4d89461b3396d1 ../AQM (v0.2.0-37-g37cbb7d)
2aa6bfbb62ebeecd7da964b8074f6c3c41c7d1eb ../CDEPS-interface/CDEPS (cdeps0.4.17-38-g2aa6bfb)
5840cd1931e2e32b9dfded0c19049d0f1ec3d04c ../CICE-interface/CICE (CICE6.0.0-440-g5840cd1)
9923d6d17700daf502d9a016138bf8eb8aad7f09 ../CMEPS-interface/CMEPS (cmeps_v0.4.1-1402-g9923d6d)
cabd7753ae17f7bfcc6dad56daf10868aa51c3f4 ../CMakeModules (v1.0.0-28-gcabd775)
4b88c9e37c3f93baf7b17ec2512e51b4cac7a3c8 ../FV3 (remotes/origin/cubed_sphere_history_output)
b94145fca46169bbc53ec6b8d4ed849715dc5130 ../GOCART (rt-v5_29_1_BPL91_1-exRT4-514-gb94145f)
24437531dcf8580aadaf6ebeb9de544ccfc674f9 ../HYCOM-interface/HYCOM (2.3.00-120-g2443753)
fdbfa2523650b81a0771f3fb1791ea3e3dce66db ../MOM6-interface/MOM6 (dev/master/repository_split_2014.10.10-9713-gfdbfa2523)
569e354ababbde7a7cd68647533769a5c966468d ../NOAHMP-interface/noahmp (v3.7.1-303-g569e354)
59c554a12df3a04e0402ce5f17bb32cbbac193b2 ../WW3 (6.07.1-341-g59c554a1)
3bfa4468d85e5b63980c28434f494967f38b10a3 ../stochastic_physics (ufs-v2.0.0-171-g3bfa446)
Compile atm_dyn32_intel elapsed time 621 seconds. -DAPP=ATM -DCCPP_SUITES=FV3_GFS_v16,FV3_GFS_v16_flake,FV3_GFS_v17_p8,FV3_GFS_v17_p8_rrtmgp,FV3_GFS_v15_thompson_mynn_lam3km,FV3_WoFS_v0,FV3_GFS_v17_p8_mynn -D32BIT=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Release
baseline dir = /scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/develop-20230728/control_CubedSphereGrid_parallel_intel
working dir = /scratch1/NCEPDEV/stmp2/Zachary.Shrader/FV3_RT/rt_230889/control_CubedSphereGrid_parallel_intel
Checking test 001 control_CubedSphereGrid_parallel_intel results ....
Comparing sfcf000.nc ............ALT CHECK......OK
Comparing sfcf024.nc ............ALT CHECK......OK
Comparing atmf000.nc .........OK
Comparing atmf024.nc .........OK
Comparing cubed_sphere_grid_sfcf000.nc ............ALT CHECK......OK
Comparing cubed_sphere_grid_sfcf024.nc ............ALT CHECK......OK
Comparing cubed_sphere_grid_atmf000.nc .........OK
Comparing cubed_sphere_grid_atmf024.nc ............ALT CHECK......OK
Comparing GFSFLX.GrbF00 .........OK
Comparing GFSFLX.GrbF24 .........OK
Comparing GFSPRS.GrbF00 .........OK
Comparing GFSPRS.GrbF24 .........OK
0: The total amount of wall time = 140.281668
0: The maximum resident set size (KB) = 636936
Test 001 control_CubedSphereGrid_parallel_intel PASS
How did this test find the files to compare against, since as @jiandewang reports, only 4 files are in the baseline? I can't make sense of this.
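To make the difference between the two log excerpts above easier to see, the "Comparing ..." lines can be tallied by status. This is only a minimal sketch, assuming the line format is exactly what is pasted in this thread; `parse_compare_log` and `summarize` are hypothetical helper names, not part of rt.sh.

```python
import re

# Matches lines such as
#   "Comparing sfcf000.nc ............ALT CHECK......NOT OK"
#   "Comparing GFSFLX.GrbF00 ............MISSING baseline"
#   "Comparing atmf000.nc .........OK"
# The format is assumed from the log excerpts pasted in this thread.
_LINE = re.compile(
    r"^\s*Comparing\s+(\S+)\s+\.+(?:ALT CHECK\.+)?(NOT OK|OK|MISSING baseline)\s*$"
)


def parse_compare_log(text):
    """Map each compared file name to 'OK', 'NOT OK', or 'MISSING baseline'."""
    results = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if match:
            results[match.group(1)] = match.group(2)
    return results


def summarize(results):
    """Group file names by their comparison status."""
    summary = {}
    for name, status in results.items():
        summary.setdefault(status, []).append(name)
    return summary
```

Run against the first excerpt, this would put the four .nc history files under NOT OK and the other eight files under MISSING baseline; run against the second excerpt, all twelve files land under OK.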
We need to avoid this kind of thing happening again. When I test something new, I have to assume that the current develop branch and its baseline are perfectly correct.
I see this directory in the baselines:
$ ls -l /scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/_develop-20230728/control_CubedSphereGrid_parallel_intel
total 1961212
-rw-r--r-- 1 emc.nemspara nems 2735393 Jul 31 12:09 GFSFLX.GrbF00
-rw-r--r-- 1 emc.nemspara nems 6705960 Jul 31 12:09 GFSFLX.GrbF24
-rw-r--r-- 1 emc.nemspara nems 67300619 Jul 31 12:09 GFSPRS.GrbF00
-rw-r--r-- 1 emc.nemspara nems 66822536 Jul 31 12:09 GFSPRS.GrbF24
-rw-r--r-- 1 emc.nemspara nems 483651896 Jul 31 12:09 atmf000.nc
-rw-r--r-- 1 emc.nemspara nems 483651896 Jul 31 12:09 atmf024.nc
-rw-r--r-- 1 emc.nemspara nems 366572654 Jul 31 12:09 cubed_sphere_grid_atmf000.nc
-rw-r--r-- 1 emc.nemspara nems 366572654 Jul 31 12:09 cubed_sphere_grid_atmf024.nc
-rw-r--r-- 1 emc.nemspara nems 35461854 Jul 31 12:09 cubed_sphere_grid_sfcf000.nc
-rw-r--r-- 1 emc.nemspara nems 35461854 Jul 31 12:09 cubed_sphere_grid_sfcf024.nc
-rw-r--r-- 1 emc.nemspara nems 46632697 Jul 31 12:09 sfcf000.nc
-rw-r--r-- 1 emc.nemspara nems 46632697 Jul 31 12:09 sfcf024.nc
which contains the new baseline files needed for the control_CubedSphereGrid_parallel test. But notice the underscore in front of develop-20230728.
There is also this directory:
$ ll /scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/control_CubedSphereGrid_parallel_intel
total 1961224
-rw-r--r-- 1 emc.nemspara nems 2735393 Jul 29 01:02 GFSFLX.GrbF00
-rw-r--r-- 1 emc.nemspara nems 6705960 Jul 29 01:02 GFSFLX.GrbF24
-rw-r--r-- 1 emc.nemspara nems 67300619 Jul 29 01:02 GFSPRS.GrbF00
-rw-r--r-- 1 emc.nemspara nems 66822536 Jul 29 01:02 GFSPRS.GrbF24
-rw-r--r-- 1 emc.nemspara nems 483651896 Jul 29 01:02 atmf000.nc
-rw-r--r-- 1 emc.nemspara nems 483651896 Jul 29 01:02 atmf024.nc
-rw-r--r-- 1 emc.nemspara nems 366572654 Jul 29 01:02 cubed_sphere_grid_atmf000.nc
-rw-r--r-- 1 emc.nemspara nems 366572654 Jul 29 01:02 cubed_sphere_grid_atmf024.nc
-rw-r--r-- 1 emc.nemspara nems 35461854 Jul 29 01:02 cubed_sphere_grid_sfcf000.nc
-rw-r--r-- 1 emc.nemspara nems 35461854 Jul 29 01:02 cubed_sphere_grid_sfcf024.nc
-rw-r--r-- 1 emc.nemspara nems 46632697 Jul 29 01:02 sfcf000.nc
-rw-r--r-- 1 emc.nemspara nems 46632697 Jul 29 01:02 sfcf024.nc
which should not be there.
Something is wrong here.
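Both oddities above (a dated directory with a leading underscore, and a test directory sitting one level up) could be flagged with a small scan of the baseline root. A rough sketch only, assuming the layout is root/dated-directory/test-directory as in the paths above; `find_suspect_dirs` is a hypothetical name and nothing here is part of the RT scripts.

```python
from pathlib import Path


def find_suspect_dirs(rt_root, dated_name):
    """Return (underscore_variants, stray_tests) found directly under rt_root.

    underscore_variants: directories whose name equals dated_name once
        leading underscores are stripped (e.g. "_develop-20230728").
    stray_tests: directories directly under rt_root that share a name with
        a test directory inside the dated baseline or one of its variants.
    """
    root = Path(rt_root)

    variants = sorted(
        p.name
        for p in root.iterdir()
        if p.is_dir() and p.name != dated_name and p.name.lstrip("_") == dated_name
    )

    # Collect the test directory names known to the dated baseline(s).
    test_names = set()
    for name in [dated_name, *variants]:
        parent = root / name
        if parent.is_dir():
            test_names.update(c.name for c in parent.iterdir() if c.is_dir())

    # A test directory with one of those names at the root level is misplaced.
    strays = sorted(
        p.name for p in root.iterdir() if p.is_dir() and p.name in test_names
    )
    return variants, strays
```

For example, `find_suspect_dirs("/scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs", "develop-20230728")` would be the call matching the paths listed above.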
@zach1221 Sounds like I copied the new baseline one directory up. I have moved it to the correct location. Can you double-check how the case runs with the develop branch?
@zach1221 @FernandoAndrade-NOAA BTW, are we seeing a timeout issue when building atm_dyn32_intel on hera?
@zach1221 @FernandoAndrade-NOAA We need to move the hera baseline location to /scratch2/NAGAPE/epic/UFS-WM_RT.
@jkbk2004 Sure, I would be happy to double-check on develop. @DeniseWorthen I was covering for Fernando and noticed his first attempt failed to match, so I recreated the control_CubedSphereGrid_parallel_intel baselines and requested they be recopied. The baseline files then matched on the second attempt.
@zach1221 Thanks, but I'm still confused. When I looked this morning, the baseline contained 4 files, as Jiande reported. But your part of the hera log shows you compared against that baseline:
baseline dir = /scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/develop-20230728/control_CubedSphereGrid_parallel_intel
How were the files present when you compared, but absent when Jiande ran his test?
@DeniseWorthen I'm sorry, I don't know. @jkbk2004 control_CubedSphereGrid_parallel_intel passed develop.
@jiandewang control_CubedSphereGrid_parallel_intel develop-20230728 is recovered ok on hera now. Sorry about the interruption. Let me know if you continue to see the issue.
@jkbk2004 I lack confidence this baseline was correctly created and tested against on all RDHPCS platforms.
@DeniseWorthen sorry but @zach1221 confirms. I don't see any issue.
@jkbk2004 @zach1221 cannot explain how he compared against a baseline and got a pass on hera when the baseline at the time contained only 4 files. The baseline also shows that it was created 2 days before anything else:
drwxr-sr-x 2 emc.nemspara nems 4096 Jul 29 01:02 control_CubedSphereGrid_parallel_intel
@jkbk2004 @zach1221 cannot explain how he compared against a baseline and got a pass on hera when the baseline at the time contained only 4 files.
That's why @zach1221 re-confirmed that develop passed OK with the July 29 files. It could be that the directory was somehow moved after the test, but I don't remember. Is there still an issue?
I just re-ran that test on HERA and it passed cmp. I ran rt.sh on ORION last night and had no issue with this test case, but I haven't run anything on other platforms. Since my ticket is for HERA, I am going to close this one, but I think it would be safer for someone to re-confirm this case on the other platforms.
@jiandewang I can re-confirm on Jet, and Gaea.
I can confirm Jet and Gaea are good. @jiandewang
Description
On HERA, the baseline directory /scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/develop-20230728/control_CubedSphereGrid_parallel_intel has only the following 4 files:
-rw-r--r-- 1 emc.nemspara nems 366572475 Jul 31 09:18 atmf000.nc
-rw-r--r-- 1 emc.nemspara nems 366572475 Jul 31 09:18 atmf024.nc
-rw-r--r-- 1 emc.nemspara nems 35455046 Jul 31 09:18 sfcf000.nc
-rw-r--r-- 1 emc.nemspara nems 35455046 Jul 31 09:18 sfcf024.nc
It doesn't have cubed_sphere_grid_sfcf.nc and GFSPRS.GrbF, which are needed for the final cmp when the job is done.
Also, running rt.sh for this test fails even for the comparison of the 4 existing files:
Comparing sfcf000.nc ............ALT CHECK......NOT OK
Comparing sfcf024.nc ............ALT CHECK......NOT OK
Comparing atmf000.nc ............ALT CHECK......NOT OK
Comparing atmf024.nc ............ALT CHECK......NOT OK
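An incomplete baseline like this could be spotted before running the test by checking the directory against the files the comparison step looks for. A minimal sketch, assuming the expected list is the twelve file names that appear in the Hera log for this test; `EXPECTED_FILES` and `missing_baseline_files` are hypothetical names, not part of rt.sh.

```python
from pathlib import Path

# File names taken from the "Comparing ..." lines in the Hera log for
# control_CubedSphereGrid_parallel_intel (an assumption about what the
# test compares, based only on that log).
EXPECTED_FILES = [
    "sfcf000.nc",
    "sfcf024.nc",
    "atmf000.nc",
    "atmf024.nc",
    "cubed_sphere_grid_sfcf000.nc",
    "cubed_sphere_grid_sfcf024.nc",
    "cubed_sphere_grid_atmf000.nc",
    "cubed_sphere_grid_atmf024.nc",
    "GFSFLX.GrbF00",
    "GFSFLX.GrbF24",
    "GFSPRS.GrbF00",
    "GFSPRS.GrbF24",
]


def missing_baseline_files(baseline_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not regular files in baseline_dir."""
    base = Path(baseline_dir)
    return [name for name in expected if not (base / name).is_file()]
```

With only the four files listed above present, this would report the other eight names as missing. It does not check file contents, so it would not catch the NOT OK comparisons on the files that do exist.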
To Reproduce:
Check out the latest UWM, run rt.sh for that test, and check the run log.