ufs-community / ufs-weather-model

UFS Weather Model

missing baseline data for control_CubedSphereGrid_parallel_intel on HERA #1847

Closed jiandewang closed 1 year ago

jiandewang commented 1 year ago

Description

In the HERA baseline directory /scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/develop-20230728/control_CubedSphereGrid_parallel_intel there are only the following 4 files:

-rw-r--r-- 1 emc.nemspara nems 366572475 Jul 31 09:18 atmf000.nc
-rw-r--r-- 1 emc.nemspara nems 366572475 Jul 31 09:18 atmf024.nc
-rw-r--r-- 1 emc.nemspara nems 35455046 Jul 31 09:18 sfcf000.nc
-rw-r--r-- 1 emc.nemspara nems 35455046 Jul 31 09:18 sfcf024.nc

It does not have cubed_sphere_grid_sfcf.nc and GFSPRS.GrbF, which are needed for the final cmp when the job is done.

Also, running rt.sh for this test fails even on the comparison of the 4 existing files:

 Comparing sfcf000.nc ............ALT CHECK......NOT OK
 Comparing sfcf024.nc ............ALT CHECK......NOT OK
 Comparing atmf000.nc ............ALT CHECK......NOT OK
 Comparing atmf024.nc ............ALT CHECK......NOT OK
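A check along these lines could flag an incomplete baseline directory before a comparison is attempted. This is only a minimal sketch: the function name `missing_baseline_files` is hypothetical and not part of rt.sh, and the expected file list is assumed from the comparison log quoted later in this thread.

```python
from pathlib import Path

# Expected outputs for control_CubedSphereGrid_parallel_intel, assumed from
# the file names that appear in the regression-test comparison log.
EXPECTED_FILES = [
    "sfcf000.nc", "sfcf024.nc", "atmf000.nc", "atmf024.nc",
    "cubed_sphere_grid_sfcf000.nc", "cubed_sphere_grid_sfcf024.nc",
    "cubed_sphere_grid_atmf000.nc", "cubed_sphere_grid_atmf024.nc",
    "GFSFLX.GrbF00", "GFSFLX.GrbF24", "GFSPRS.GrbF00", "GFSPRS.GrbF24",
]


def missing_baseline_files(baseline_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from baseline_dir.

    Hypothetical helper for illustration; it only checks that each file
    exists, not that its contents are correct.
    """
    base = Path(baseline_dir)
    return [name for name in expected if not (base / name).is_file()]
```

Calling `missing_baseline_files(...)` on the baseline path above would return the names that still need to be copied; an empty list means every expected file is present.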

To Reproduce:

Check out the latest UWM, run rt.sh for that test, and see the run log.

Additional context

Output

DeniseWorthen commented 1 year ago

@jiandewang Thanks for this report. @FernandoAndrade-NOAA @zach1221 What happened here? I can see in the Hera Log that the control_CubedSphereGrid_parallel_intel initially failed due to missing baseline files

baseline dir = /scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/develop-20230728/control_CubedSphereGrid_parallel_intel
working dir  = /scratch1/NCEPDEV/stmp2/Fernando.Andrade-maldonado/FV3_RT/rt_19500/control_CubedSphereGrid_parallel_intel
Checking test 028 control_CubedSphereGrid_parallel_intel results ....
 Comparing sfcf000.nc ............ALT CHECK......NOT OK
 Comparing sfcf024.nc ............ALT CHECK......NOT OK
 Comparing atmf000.nc ............ALT CHECK......NOT OK
 Comparing atmf024.nc ............ALT CHECK......NOT OK
 Comparing cubed_sphere_grid_sfcf000.nc ............MISSING baseline
 Comparing cubed_sphere_grid_sfcf024.nc ............MISSING baseline
 Comparing cubed_sphere_grid_atmf000.nc ............MISSING baseline
 Comparing cubed_sphere_grid_atmf024.nc ............MISSING baseline
 Comparing GFSFLX.GrbF00 ............MISSING baseline
 Comparing GFSFLX.GrbF24 ............MISSING baseline
 Comparing GFSPRS.GrbF00 ............MISSING baseline
 Comparing GFSPRS.GrbF24 ............MISSING baseline

  0: The total amount of wall time                        = 137.435939
  0: The maximum resident set size (KB)                   = 632588

Test 028 control_CubedSphereGrid_parallel_intel FAIL Tries: 2
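The comparison lines in a log excerpt like the one above can be tallied mechanically, which makes it easier to separate "NOT OK" results from "MISSING baseline" ones. The following is a rough sketch that assumes the line format shown in this thread; `parse_comparisons` and `failing` are hypothetical names, not functions from the UFS regression-test scripts.

```python
import re

# Matches lines such as
#   " Comparing sfcf000.nc ............ALT CHECK......NOT OK"
# Group 1 is the file name, group 2 is everything after the first dot leader.
_COMPARE_LINE = re.compile(r"^\s*Comparing\s+(\S+)\s+\.+(.*)$")


def parse_comparisons(log_text):
    """Map each compared file name to its final status string."""
    results = {}
    for line in log_text.splitlines():
        match = _COMPARE_LINE.match(line)
        if match:
            # Keep only the text after the last dot leader, so that
            # "ALT CHECK......NOT OK" becomes "NOT OK".
            status = re.split(r"\.{2,}", match.group(2))[-1].strip()
            results[match.group(1)] = status
    return results


def failing(results):
    """Return the file names whose final status is anything other than OK."""
    return [name for name, status in results.items() if status != "OK"]
```

Note that this sketch discards whether the alternate check was used; it records only the final status of each file.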

It appears it was run a second time, with alt-checks passing, but the needed files are not in the baseline directory?

Testing UFSWM Hash: 7f27783c178094d9795d055a4ff20abfc450066a
Testing With Submodule Hashes:
 37cbb7d6840ae7515a9a8f0dfd4d89461b3396d1 ../AQM (v0.2.0-37-g37cbb7d)
 2aa6bfbb62ebeecd7da964b8074f6c3c41c7d1eb ../CDEPS-interface/CDEPS (cdeps0.4.17-38-g2aa6bfb)
 5840cd1931e2e32b9dfded0c19049d0f1ec3d04c ../CICE-interface/CICE (CICE6.0.0-440-g5840cd1)
 9923d6d17700daf502d9a016138bf8eb8aad7f09 ../CMEPS-interface/CMEPS (cmeps_v0.4.1-1402-g9923d6d)
 cabd7753ae17f7bfcc6dad56daf10868aa51c3f4 ../CMakeModules (v1.0.0-28-gcabd775)
 4b88c9e37c3f93baf7b17ec2512e51b4cac7a3c8 ../FV3 (remotes/origin/cubed_sphere_history_output)
 b94145fca46169bbc53ec6b8d4ed849715dc5130 ../GOCART (rt-v5_29_1_BPL91_1-exRT4-514-gb94145f)
 24437531dcf8580aadaf6ebeb9de544ccfc674f9 ../HYCOM-interface/HYCOM (2.3.00-120-g2443753)
 fdbfa2523650b81a0771f3fb1791ea3e3dce66db ../MOM6-interface/MOM6 (dev/master/repository_split_2014.10.10-9713-gfdbfa2523)
 569e354ababbde7a7cd68647533769a5c966468d ../NOAHMP-interface/noahmp (v3.7.1-303-g569e354)
 59c554a12df3a04e0402ce5f17bb32cbbac193b2 ../WW3 (6.07.1-341-g59c554a1)
 3bfa4468d85e5b63980c28434f494967f38b10a3 ../stochastic_physics (ufs-v2.0.0-171-g3bfa446)
Compile atm_dyn32_intel elapsed time 621 seconds. -DAPP=ATM -DCCPP_SUITES=FV3_GFS_v16,FV3_GFS_v16_flake,FV3_GFS_v17_p8,FV3_GFS_v17_p8_rrtmgp,FV3_GFS_v15_thompson_mynn_lam3km,FV3_WoFS_v0,FV3_GFS_v17_p8_mynn -D32BIT=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Release

baseline dir = /scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/develop-20230728/control_CubedSphereGrid_parallel_intel
working dir  = /scratch1/NCEPDEV/stmp2/Zachary.Shrader/FV3_RT/rt_230889/control_CubedSphereGrid_parallel_intel
Checking test 001 control_CubedSphereGrid_parallel_intel results ....
 Comparing sfcf000.nc ............ALT CHECK......OK
 Comparing sfcf024.nc ............ALT CHECK......OK
 Comparing atmf000.nc .........OK
 Comparing atmf024.nc .........OK
 Comparing cubed_sphere_grid_sfcf000.nc ............ALT CHECK......OK
 Comparing cubed_sphere_grid_sfcf024.nc ............ALT CHECK......OK
 Comparing cubed_sphere_grid_atmf000.nc .........OK
 Comparing cubed_sphere_grid_atmf024.nc ............ALT CHECK......OK
 Comparing GFSFLX.GrbF00 .........OK
 Comparing GFSFLX.GrbF24 .........OK
 Comparing GFSPRS.GrbF00 .........OK
 Comparing GFSPRS.GrbF24 .........OK

  0: The total amount of wall time                        = 140.281668
  0: The maximum resident set size (KB)                   = 636936

Test 001 control_CubedSphereGrid_parallel_intel PASS

How did this test find the files to compare against, since as @jiandewang reports, only 4 files are in the baseline? I can't make sense of this.

jiandewang commented 1 year ago

We need to avoid this kind of thing happening again. When I test something new, I have to assume that the current develop branch and its baseline are perfectly correct.

DusanJovic-NOAA commented 1 year ago

I see this directory in the baselines:

$ ls -l /scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/_develop-20230728/control_CubedSphereGrid_parallel_intel
total 1961212
-rw-r--r-- 1 emc.nemspara nems   2735393 Jul 31 12:09 GFSFLX.GrbF00
-rw-r--r-- 1 emc.nemspara nems   6705960 Jul 31 12:09 GFSFLX.GrbF24
-rw-r--r-- 1 emc.nemspara nems  67300619 Jul 31 12:09 GFSPRS.GrbF00
-rw-r--r-- 1 emc.nemspara nems  66822536 Jul 31 12:09 GFSPRS.GrbF24
-rw-r--r-- 1 emc.nemspara nems 483651896 Jul 31 12:09 atmf000.nc
-rw-r--r-- 1 emc.nemspara nems 483651896 Jul 31 12:09 atmf024.nc
-rw-r--r-- 1 emc.nemspara nems 366572654 Jul 31 12:09 cubed_sphere_grid_atmf000.nc
-rw-r--r-- 1 emc.nemspara nems 366572654 Jul 31 12:09 cubed_sphere_grid_atmf024.nc
-rw-r--r-- 1 emc.nemspara nems  35461854 Jul 31 12:09 cubed_sphere_grid_sfcf000.nc
-rw-r--r-- 1 emc.nemspara nems  35461854 Jul 31 12:09 cubed_sphere_grid_sfcf024.nc
-rw-r--r-- 1 emc.nemspara nems  46632697 Jul 31 12:09 sfcf000.nc
-rw-r--r-- 1 emc.nemspara nems  46632697 Jul 31 12:09 sfcf024.nc

which contains new baseline files needed for control_CubedSphereGrid_parallel test. But notice the underscore in front of develop-20230728.

There is also this directory:

$ ll /scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/control_CubedSphereGrid_parallel_intel
total 1961224
-rw-r--r-- 1 emc.nemspara nems   2735393 Jul 29 01:02 GFSFLX.GrbF00
-rw-r--r-- 1 emc.nemspara nems   6705960 Jul 29 01:02 GFSFLX.GrbF24
-rw-r--r-- 1 emc.nemspara nems  67300619 Jul 29 01:02 GFSPRS.GrbF00
-rw-r--r-- 1 emc.nemspara nems  66822536 Jul 29 01:02 GFSPRS.GrbF24
-rw-r--r-- 1 emc.nemspara nems 483651896 Jul 29 01:02 atmf000.nc
-rw-r--r-- 1 emc.nemspara nems 483651896 Jul 29 01:02 atmf024.nc
-rw-r--r-- 1 emc.nemspara nems 366572654 Jul 29 01:02 cubed_sphere_grid_atmf000.nc
-rw-r--r-- 1 emc.nemspara nems 366572654 Jul 29 01:02 cubed_sphere_grid_atmf024.nc
-rw-r--r-- 1 emc.nemspara nems  35461854 Jul 29 01:02 cubed_sphere_grid_sfcf000.nc
-rw-r--r-- 1 emc.nemspara nems  35461854 Jul 29 01:02 cubed_sphere_grid_sfcf024.nc
-rw-r--r-- 1 emc.nemspara nems  46632697 Jul 29 01:02 sfcf000.nc
-rw-r--r-- 1 emc.nemspara nems  46632697 Jul 29 01:02 sfcf024.nc

which should not be there.

Something is wrong here.
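To compare two candidate baseline directories like the ones listed above by file name and size, something like the following could be used. It is a sketch only: `listing` and `diff_listings` are hypothetical helpers, and matching sizes do not prove that the contents are identical.

```python
from pathlib import Path


def listing(directory):
    """Map file name -> size in bytes for regular files directly in directory."""
    return {
        path.name: path.stat().st_size
        for path in Path(directory).iterdir()
        if path.is_file()
    }


def diff_listings(dir_a, dir_b):
    """Summarize how two directories differ by file name and file size."""
    a, b = listing(dir_a), listing(dir_b)
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        # Files present in both directories whose sizes disagree.
        "size_differs": sorted(n for n in a.keys() & b.keys() if a[n] != b[n]),
    }
```

Run against the dated baseline directory and the directory with the leading underscore, this would list the files missing from one side and any files whose sizes disagree.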

jkbk2004 commented 1 year ago

@zach1221 Sounds like I copied the new baseline one directory up. I moved it to the correct location. Can you double-check how the case runs with the develop branch?

jkbk2004 commented 1 year ago

@zach1221 @FernandoAndrade-NOAA BTW, do we see a timeout issue when building atm_dyn32_intel on hera?

jkbk2004 commented 1 year ago

@zach1221 @FernandoAndrade-NOAA we need to move hera baseline location to /scratch2/NAGAPE/epic/UFS-WM_RT.

zach1221 commented 1 year ago

@jkbk2004 Sure, I would be happy to double-check on develop. @DeniseWorthen I was covering for Fernando and noticed his first attempt failed to match, so I recreated the control_CubedSphereGrid_parallel_intel baselines and requested they be recopied. The baseline files then matched on the second attempt.

DeniseWorthen commented 1 year ago

@zach1221 Thanks, but I'm still confused. When I looked this morning, the baseline contained 4 files, as Jiande reported. But your part of the hera log shows you compared against that baseline

baseline dir = /scratch1/NCEPDEV/nems/emc.nemspara/RT/NEMSfv3gfs/develop-20230728/control_CubedSphereGrid_parallel_intel

How were the files present when you compared, but absent when Jiande ran his test?

zach1221 commented 1 year ago

@DeniseWorthen I'm sorry, I don't know. @jkbk2004 control_CubedSphereGrid_parallel_intel passed develop.

jkbk2004 commented 1 year ago

@jiandewang control_CubedSphereGrid_parallel_intel develop-20230728 is recovered ok on hera now. Sorry about the interruption. Let me know if you continue to see the issue.

DeniseWorthen commented 1 year ago

@jkbk2004 I lack confidence this baseline was correctly created and tested against on all RDHPCS platforms.

jkbk2004 commented 1 year ago

@DeniseWorthen sorry but @zach1221 confirms. I don't see any issue.

DeniseWorthen commented 1 year ago

@jkbk2004 @zach1221 cannot explain how he compared against a baseline and got a pass on hera when the baseline at the time contained only 4 files. The baseline also shows that it was created 2 days before anything else

drwxr-sr-x 2 emc.nemspara nems 4096 Jul 29 01:02 control_CubedSphereGrid_parallel_intel

jkbk2004 commented 1 year ago

> @jkbk2004 @zach1221 cannot explain how he compared against a baseline and got a pass on hera when the baseline at the time contained only 4 files.

That's why @zach1221 re-confirmed that develop passed OK with the July 29 files. The directory could have been moved after the test somehow, but I don't remember. Any issue?

jiandewang commented 1 year ago

I just re-ran that test on HERA and it passed cmp. I ran rt.sh on ORION last night and had no issue with this test case, but I haven't run anything on other platforms. Since my ticket is for HERA, I am going to close this one, but I think it would be better for someone to re-confirm this case on other platforms to be safe.

zach1221 commented 1 year ago

@jiandewang I can re-confirm on Jet and Gaea.

zach1221 commented 1 year ago

I can confirm Jet and Gaea are good. @jiandewang