ufs-community / ufs-weather-model

baseline check is not working as expected on Hercules #2245

Closed uturuncoglu closed 3 weeks ago

uturuncoglu commented 3 weeks ago

Description

I am trying to compare a set of NetCDF files for a regression test defined in ufs-coastal. The test uses a configuration that couples the CDEPS data atmosphere to ROMS and produces three NetCDF files. On Hercules, the regression test output looks like the following:

baseline dir = /work2/noaa/nems/tufuk/RT/NEMSfv3gfs/develop-20240417/coastal_irene_atm2roms_intel
working dir  = /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_2701396/coastal_irene_atm2roms_intel
Checking test coastal_irene_atm2roms_intel results ....
 Comparing irene_avg.nc .....USING NCCMP......NOT IDENTICAL
 Comparing irene_his.nc .....USING NCCMP......NOT IDENTICAL
 Comparing irene_rst.nc .....USING NCCMP......NOT IDENTICAL

 0: The total amount of wall time                        = 246.928419
 0: The maximum resident set size (KB)                   = 268644

Test coastal_irene_atm2roms_intel FAIL Tries: 2

This indicates that the test failed at the baseline check step. When I run the nccmp command manually, the log file is empty but $d has a value of 1, so the regression test treats the test as failed:

nccmp -d -S -q -f -g -B --Attribute=checksum --warn=format /work2/noaa/nems/tufuk/RT/NEMSfv3gfs/develop-20240417/coastal_irene_atm2roms_intel/irene_his.nc /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_2701396/coastal_irene_atm2roms_intel/irene_his.nc > log 2>&1 && d=$? || d=$?; echo $d

I also compared the files with NCAR's cprnc tool, and it reports that the files are identical.
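The `&& d=$? || d=$?` idiom in the command above records the exit status without aborting a script that runs under `set -e`. A minimal sketch of the same pattern follows. The helper name `run_and_capture` is hypothetical, and `false` stands in for a failing nccmp call so the snippet runs anywhere:

```shell
#!/usr/bin/env bash
set -e

# Hypothetical helper: run a command, send its output to a log file,
# and store the exit status in the global variable `status`.
run_and_capture() {
  local log=$1
  shift
  # The AND-OR list keeps `set -e` from aborting when the command fails.
  "$@" > "$log" 2>&1 && status=$? || status=$?
}

log=$(mktemp)
run_and_capture "$log" false   # `false` stands in for a failing nccmp call
echo "exit status: $status, log size: $(wc -c < "$log") bytes"
rm -f "$log"
```

An empty log together with a non-zero status is exactly the combination described above: the command reported a failure but printed nothing.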

SUMMARY of cprnc:
 A total number of    307 fields were compared
          of which      0 had non-zero differences
               and      0 had differences in fill patterns
               and      0 had different dimension sizes
               and      0 had different data types
 A total number of      0 fields could not be analyzed
 A total number of      0 time-varying fields on file 1 were not found on file 2.
 A total number of      0 time-constant fields on file 1 were not found on file 2.
 A total number of      0 time-varying fields on file 2 were not found on file 1.
 A total number of      0 time-constant fields on file 2 were not found on file 1.
  diff_test: the two files seem to be IDENTICAL 

So I am not sure why rt_utils.sh thinks that the files are not identical. Any suggestions? Is this a bug? The script is used by multiple tests and seems robust, but there could still be an issue with the RT baseline check step.

I also tested this on Frontera and got similar results (https://github.com/oceanmodeling/roms/issues/3), but of course that is not an officially supported Tier 1 platform, and the test there used a slightly older version of the model (which may not use nccmp).

To Reproduce:

This can be reproduced on Hercules using ufs-coastal.

  1. Check out ufs-coastal: git clone -b feature/coastal_app --recursive https://github.com/oceanmodeling/ufs-coastal.git
  2. cd ufs-coastal/tests
  3. Run the RTs: ./rt.sh -l rt_coastal.conf -a nems -e. Because of a bug in rt.sh (https://github.com/ufs-community/ufs-weather-model/issues/2244), there is no way to run a single test such as coastal_irene_atm2roms, but rt_coastal.conf can be edited to keep only coastal_irene_atm2roms.

Additional context

None

Output

None

uturuncoglu commented 3 weeks ago

Let me test this on Derecho. I'll update you about it.

uturuncoglu commented 3 weeks ago

On Hercules, we need to check the permissions of the baseline files.

uturuncoglu commented 3 weeks ago

@DusanJovic-NOAA I double-checked, and I think the permissions are fine. Can you try to read the files in /work2/noaa/nems/tufuk/RT/NEMSfv3gfs/develop-20240126/coastal_irene_atm2roms_intel or /work2/noaa/nems/tufuk/RT/NEMSfv3gfs/develop-20240417/coastal_irene_atm2roms_intel and let me know? If you cannot, please tell me down to which directory level you have access.

uturuncoglu commented 3 weeks ago

This might also be related to the following closed issue: https://github.com/ufs-community/ufs-weather-model/issues/2015

DusanJovic-NOAA commented 3 weeks ago

I see the differences in the compiler_flags global attribute:

$ nccmp -g /work2/noaa/stmp/tufuk/stmp/tufuk/FV3_RT/rt_2770508/coastal_irene_atm2roms_intel/irene_avg.nc /work2/noaa/nems/tufuk/RT/NEMSfv3gfs/develop-20240126/coastal_irene_atm2roms_intel/irene_avg.nc
DIFFER : LENGTHS OF GLOBAL ATTRIBUTE : compiler_flags : 223 <> 193 : VALUES :  -g -traceback -fpp -fno-alias -auto -safe-cray-ptr -ftz -assume byterecl -sox -align array64byte -qno-opt-dynamic-align -diag-disable 5462 -diag-disable 7712 -real-size 64 -fp-model precise -ip -O3 -traceback -check uninit <>  -g -traceback -fpp -fno-alias -auto -safe-cray-ptr -ftz -assume byterecl -nowarn -sox -align array64byte -qno-opt-dynamic-align -real-size 64 -fp-model precise -ip -O3 -traceback -check uninit

If I'm looking at the correct output files.
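To see which flags account for the 223 vs. 193 length difference, the two attribute values can be compared token by token. This is a rough sketch, not part of the regression test scripts: the helper `tokens` is hypothetical, and splitting on spaces separates option arguments such as `5462` from their option `-diag-disable`.

```shell
#!/usr/bin/env bash
export LC_ALL=C   # keep sort and comm on the same collation order

# Attribute values copied from the nccmp message above.
new_flags=' -g -traceback -fpp -fno-alias -auto -safe-cray-ptr -ftz -assume byterecl -sox -align array64byte -qno-opt-dynamic-align -diag-disable 5462 -diag-disable 7712 -real-size 64 -fp-model precise -ip -O3 -traceback -check uninit'
old_flags=' -g -traceback -fpp -fno-alias -auto -safe-cray-ptr -ftz -assume byterecl -nowarn -sox -align array64byte -qno-opt-dynamic-align -real-size 64 -fp-model precise -ip -O3 -traceback -check uninit'

# Hypothetical helper: one token per line, sorted and de-duplicated.
tokens() {
  tr -s ' ' '\n' <<< "$1" | sed '/^$/d' | sort -u
}

# Tokens present only in the new run, then only in the baseline.
only_new=$(comm -23 <(tokens "$new_flags") <(tokens "$old_flags"))
only_old=$(comm -13 <(tokens "$new_flags") <(tokens "$old_flags"))

echo "only in new run :" $only_new
echo "only in baseline:" $only_old
```

The new run adds `-diag-disable 5462 -diag-disable 7712` and drops `-nowarn`, which matches the 30-character difference between the two attribute lengths.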

uturuncoglu commented 3 weeks ago

@DusanJovic-NOAA Thanks for checking. That is really helpful. I am not sure why I am not seeing this in the nccmp output. If this is the case, then since these are ROMS global attributes that record the compile flags, the baseline needs to be created again even though the data itself is fine. I think there is also a way to check only the data and not the attributes, but I am not sure that is the way we need to go. Let me recreate the baseline and check again on Hercules. Thanks again for your help.

DusanJovic-NOAA commented 3 weeks ago

You are not seeing the differences because of the -q (quiet) flag. Without -q, stdout would be huge when the files actually differ, so we rely on the exit code to determine whether the files differ.
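One way to keep the quiet check and still see why a comparison failed is to rerun without `-q` only on failure and truncate the report. The sketch below is an illustration, not code from rt_utils.sh: `compare_with_report` and the `NCCMP` override variable are hypothetical names, and the flags are the ones quoted earlier in this issue.

```shell
#!/usr/bin/env bash

# Assumption: NCCMP names the comparison tool; override it for testing.
NCCMP=${NCCMP:-nccmp}

# Hypothetical helper: quiet comparison first; on failure, rerun without -q
# and print only the first lines of the report.
compare_with_report() {
  local baseline=$1 candidate=$2 max_lines=${3:-20}
  if "$NCCMP" -d -S -q -f -g -B --Attribute=checksum --warn=format \
       "$baseline" "$candidate" > /dev/null 2>&1; then
    echo "IDENTICAL"
    return 0
  fi
  echo "NOT IDENTICAL, first ${max_lines} lines of the report:"
  "$NCCMP" -d -S -f -g -B --Attribute=checksum --warn=format \
    "$baseline" "$candidate" 2>&1 | head -n "$max_lines"
  return 1
}
```

A call such as `compare_with_report baseline/irene_avg.nc run/irene_avg.nc 5` (placeholder paths) would then show the `DIFFER : LENGTHS OF GLOBAL ATTRIBUTE` line directly instead of only a non-zero exit code.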

uturuncoglu commented 3 weeks ago

@DusanJovic-NOAA Thanks, that is good to know. I added export CMP_DATAONLY=true to the test file and ran it again, and it is passing now. I think I can close this issue. Thanks again for your help.
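For readers wondering what a data-only switch could look like, here is a minimal sketch. It assumes that CMP_DATAONLY=true simply leaves out the global attribute comparison (`-g`); the actual logic in rt_utils.sh may differ, and `build_nccmp_flags` is a hypothetical helper.

```shell
#!/usr/bin/env bash

# Assumption: CMP_DATAONLY=true means "skip the -g global attribute check".
# This is an illustration, not the code in rt_utils.sh.
build_nccmp_flags() {
  local flags=(-d -S -q -f -B --Attribute=checksum --warn=format)
  if [[ ${CMP_DATAONLY:-false} != true ]]; then
    flags+=(-g)   # also compare global attributes such as compiler_flags
  fi
  printf '%s\n' "${flags[*]}"
}

CMP_DATAONLY=true
echo "data only : $(build_nccmp_flags)"
CMP_DATAONLY=false
echo "full check: $(build_nccmp_flags)"
```

Under this assumption, a change in build-related attributes such as compiler_flags no longer fails the test, while differences in the data variables still do.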