ufs-community / ufs-weather-model


multiple netcdf_parallel tests fail on hercules #2015

Closed. DeniseWorthen closed this issue 8 months ago.

DeniseWorthen commented 9 months ago

Description

The control_wrtGauss_netcdf_parallel_intel test fails on Hercules (Intel) with the following error:

2023-11-29 11:42:48.544357 +0000 ERROR /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/cache/build_stage/spack-stage-nccmp-1.9.0.1-4n5sfwacmwzksu4hkop5vwvjpqowwa3o/spack-src/src/nccmp_data.c:3675 NetCDF: HDF error

which results in

 Comparing atmf024.nc ............ALT CHECK......ERROR

To Reproduce:

Run this test on Hercules and check the atmf024.nc_nccmp.log file in the run directory for the error message.

Additional context

The issue was first noted in PR #1990.


climbfuji commented 9 months ago

One question I have is whether the file in question was created with a "faulty" version of hdf5 and, if so, whether that file needs to be recreated as well. Can we try to open a known "good" file that was created on Orion, for example?

DeniseWorthen commented 9 months ago

From what I can tell, the last time that the file was compared ok during testing was against the develop-20231117 baseline. That PR (#1967) did not change baselines for this test, and when the RTs were run, the file compared OK.

Edit: sorry, I probably misunderstood Dom's suggestion for a 'good' comparison.

climbfuji commented 9 months ago

> From what I can tell, the last time that the file was compared ok during testing was against the develop-20231117 baseline. That PR (#1967) did not change baselines for this test, and when the RTs were run, the file compared OK.

https://github.com/ufs-community/ufs-weather-model/pull/1990#issuecomment-1826824914

https://github.com/ufs-community/ufs-weather-model/pull/1990#issuecomment-1827004466

https://github.com/ufs-community/ufs-weather-model/pull/1990#issuecomment-1827876841

?

DeniseWorthen commented 9 months ago

Yes, #1967 was the previous commit. In #1990, it was noted for the first time. My assumption is that the changes in #1990 are triggering the error, and that we're not seeing it on other platforms because of Hercules-specific hdf5/nccmp issues?

It is, I agree, intermittent, which adds another layer of complexity. You'll note above I got the same error on multiple tests, with different files not comparing each time.

climbfuji commented 9 months ago

I rebuilt both spack-stack-1.5.0 and spack-stack-1.5.1 on Hercules over the weekend. I created a baseline with the current develop branch of ufs-weather-model using spack-stack-1.5.0 here:

/work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/REGRESSION_TEST/control_wrtGauss_netcdf_parallel_debug_intel

When I verify against this baseline, on my first try I get:

$ cat fail_test
control_wrtGauss_netcdf_parallel_intel 034 failed in run_test
control_wrtGauss_netcdf_parallel_debug_intel 085 failed in run_test
130 hafs_regional_atm_intel failed in check_result
hafs_regional_atm_intel 130 failed in run_test

with:

$ cat logs/log_hercules/rt_034_control_wrtGauss_netcdf_parallel_intel.log

baseline dir = /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/REGRESSION_TEST/control_wrtGauss_netcdf_parallel_intel
working dir  = /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/control_wrtGauss_netcdf_parallel_intel
Checking test 034 control_wrtGauss_netcdf_parallel_intel results ....
 Comparing sfcf000.nc .........OK
 Comparing sfcf024.nc ............ALT CHECK......OK
 Comparing atmf000.nc ............ALT CHECK......ERROR

and

$ cat logs/log_hercules/rt_085_control_wrtGauss_netcdf_parallel_debug_intel.log

baseline dir = /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/REGRESSION_TEST/control_wrtGauss_netcdf_parallel_debug_intel
working dir  = /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/control_wrtGauss_netcdf_parallel_debug_intel
Checking test 085 control_wrtGauss_netcdf_parallel_debug_intel results ....
 Comparing sfcf000.nc .........OK
 Comparing sfcf001.nc .........OK
 Comparing atmf000.nc ............ALT CHECK......OK
 Comparing atmf001.nc ............ALT CHECK......ERROR

and

$ cat logs/log_hercules/rt_130_hafs_regional_atm_intel.log

baseline dir = /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/REGRESSION_TEST/hafs_regional_atm_intel
working dir  = /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/hafs_regional_atm_intel
Checking test 130 hafs_regional_atm_intel results ....
 Comparing atmf006.nc .........OK
 Comparing sfcf006.nc .........OK
 Comparing HURPRS.GrbF06 .........NOT OK

  0: The total amount of wall time                        = 289.983108
  0: The maximum resident set size (KB)                   = 873280

Test 130 hafs_regional_atm_intel FAIL Tries: 2

So all of these fail to verify a file, but each of them has a different "exit code" reporting a different failure in rt.sh. That's quite confusing. (As a side note, I also noted that when one runs compile.sh, something in the scripts pauses it somewhere after loading the modules. One has to hit enter to get a prompt and then issue fg to wake up the process.)

@DusanJovic-NOAA @DeniseWorthen - the failure for test 130 is for a GrbF06 file, not for a netCDF file? The other two are netCDF files.

climbfuji commented 9 months ago
$ nccmp -m /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/REGRESSION_TEST/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc
[dheinzel@hercules-login-4 tests]$ nccmp -d /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/REGRESSION_TEST/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc
2023-12-11 08:11:07.315813 -0600 ERROR /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rebuild/cache/build_stage/spack-stage-nccmp-1.9.0.1-4n5sfwacmwzksu4hkop5vwvjpqowwa3o/spack-src/src/nccmp_data.c:3449 NetCDF: HDF error
DeniseWorthen commented 9 months ago

@climbfuji Yes, the GrbF06 is a grib file from running post. I haven't been seeing that particular test fail in cases where I did see the parallel-netcdf tests fail.

DusanJovic-NOAA commented 9 months ago

> $ nccmp -m /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/REGRESSION_TEST/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc
> [dheinzel@hercules-login-4 tests]$ nccmp -d /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/REGRESSION_TEST/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc
> 2023-12-11 08:11:07.315813 -0600 ERROR /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rebuild/cache/build_stage/spack-stage-nccmp-1.9.0.1-4n5sfwacmwzksu4hkop5vwvjpqowwa3o/spack-src/src/nccmp_data.c:3449 NetCDF: HDF error

Can you run h5dump to see if we get the same internal error: "HDF5: infinite loop closing library"?

climbfuji commented 9 months ago

For the file in the baseline directory, it's ok:

> h5dump /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/REGRESSION_TEST/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc
...
      ATTRIBUTE "units" {
         DATATYPE  H5T_STRING {
            STRSIZE 5;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "m/sec"
         }
      }
   }
}
}

For the newly created file, it seems to be the same case, but it's still printing to stdout so I'll let it run (did the infinite loop error happen immediately when h5dump was launched?)

DusanJovic-NOAA commented 9 months ago

Try to redirect stdout to /dev/null to see if there are any errors

climbfuji commented 9 months ago

It ran to completion:

      ATTRIBUTE "units" {
         DATATYPE  H5T_STRING {
            STRSIZE 5;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "m/sec"
         }
      }
   }
}
}
climbfuji commented 9 months ago

I wonder if there's a bug somewhere in the underlying libraries that has to do with quantization. I know, so far we've only seen this on Hercules and only with Intel, but it's a fairly new Intel compiler (2021.9.0) with a fairly new Intel MPI (2021.9.0). We have Intel 2021.10.0 on Derecho, but with cray-mpich.

climbfuji commented 9 months ago

So, I can run h5dump successfully on both files. I can also run ncdump successfully on the newly created baseline file, but if I run ncdump on the file in the rt_* run directory that I want to verify against the baseline:

>ncdump /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc
...
    89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
    89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
    89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
    89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
    89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
    89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
    89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
    89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
    89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
    89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
    89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
    89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
    89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
    89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058 ;

 clwmr =
NetCDF: HDF error
Location: file vardata.c; fcn print_rows line 478
DusanJovic-NOAA commented 9 months ago

That's what we observed. We do not get a 'corrupted' file every time; sometimes the files are written correctly.
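
Since the corruption is intermittent, one way to see which history files in a run directory are affected is to try reading every variable back; below is a minimal sketch, assuming Python with the netCDF4 package is available, and using a hypothetical run-directory path. The read failure it catches is the same condition ncdump and h5dump report above.

# Sketch: flag history files whose variable data cannot be read back.
# The run-directory path and file pattern are hypothetical examples.
from pathlib import Path
from netCDF4 import Dataset

run_dir = Path("/path/to/FV3_RT/rt_XXXXXXX/control_wrtGauss_netcdf_parallel_intel")

for path in sorted(run_dir.glob("atmf*.nc")):
    try:
        with Dataset(str(path), "r") as ds:
            bad = []
            for name, var in ds.variables.items():
                try:
                    var[...]  # force a full read; raises on the HDF error
                except Exception as err:
                    bad.append(f"{name}: {err}")
        status = "OK" if not bad else "UNREADABLE: " + "; ".join(bad)
    except Exception as err:
        status = f"CANNOT OPEN: {err}"
    print(f"{path.name}: {status}")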

DusanJovic-NOAA commented 9 months ago

h5dump also prints an error "h5dump error: unable to print data" when trying to dump the actual data for the clwmr variable:

$ /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rebuild/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/bin/h5dump -d /clwmr /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc 
HDF5 "/work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc" {
DATASET "/clwmr" {
   DATATYPE  H5T_IEEE_F32LE
   DATASPACE  SIMPLE { ( 1, 127, 190, 384 ) / ( 1, 127, 190, 384 ) }
   DATA {h5dump error: unable to print data

   }
   ATTRIBUTE "DIMENSION_LIST" {
      DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      DATA {
      (0): (DATASET 0 "/time"), (DATASET 0 "/pfull"), (DATASET 0 "/grid_yt"),
      (3): (DATASET 0 "/grid_xt")
      }
   }
   ATTRIBUTE "_FillValue" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): 9.99e+20
      }
   }
   ATTRIBUTE "_Netcdf4Coordinates" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      DATA {
      (0): 5, 3, 1, 0
      }
   }
   ATTRIBUTE "_Netcdf4Dimid" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SCALAR
      DATA {
      (0): 5
      }
   }
   ATTRIBUTE "_QuantizeBitRoundNumberOfSignificantBits" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): 14
      }
   }
   ATTRIBUTE "cell_methods" {
      DATATYPE  H5T_STRING {
         STRSIZE 11;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "time: point"
      }
   }
   ATTRIBUTE "long_name" {
      DATATYPE  H5T_STRING {
         STRSIZE 24;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "cloud water mixing ratio"
      }
   }
   ATTRIBUTE "missing_value" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): 9.99e+20
      }
   }
   ATTRIBUTE "output_file" {
      DATATYPE  H5T_STRING {
         STRSIZE 3;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "atm"
      }
   }
   ATTRIBUTE "units" {
      DATATYPE  H5T_STRING {
         STRSIZE 5;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "kg/kg"
      }
   }
}
}
climbfuji commented 9 months ago

> h5dump also prints an error "h5dump error: unable to print data" when trying to dump the actual data for the clwmr variable: [...]

What do you suggest we do? From what we know so far, this seems to be isolated to Hercules with Intel; it doesn't happen with GNU or on any other platform we test.

DusanJovic-NOAA commented 9 months ago

At this moment these few tests that occasionally create unreadable files are disabled in rt.conf so that we can run rt.sh reliably on Hercules.

Is it too much trouble to try HDF5 1.14.3?

https://raw.githubusercontent.com/HDFGroup/hdf5/hdf5_1_14_3/release_docs/RELEASE.txt

DeniseWorthen commented 9 months ago

@DusanJovic-NOAA I did turn off the non-debug version of control_wrtGauss_netcdf_parallel_intel, but it seems that the debug version is also unreliable? Perhaps we should also disable that one in #2009?

climbfuji commented 9 months ago

> At this moment these few tests that occasionally create unreadable files are disabled in rt.conf so that we can run rt.sh reliably on Hercules.
>
> Is it too much trouble to try HDF5 1.14.3?
>
> https://raw.githubusercontent.com/HDFGroup/hdf5/hdf5_1_14_3/release_docs/RELEASE.txt

That would be just in time for spack-stack-1.6.0. Let me try this!

DusanJovic-NOAA commented 9 months ago

> At this moment these few tests that occasionally create unreadable files are disabled in rt.conf so that we can run rt.sh reliably on Hercules. Is it too much trouble to try HDF5 1.14.3? https://raw.githubusercontent.com/HDFGroup/hdf5/hdf5_1_14_3/release_docs/RELEASE.txt
>
> That would be just in time for spack-stack-1.6.0. Let me try this!

I don't know if you are going to rebuild everything or just hdf5, but since these tests also use deflate (zlib), maybe you can also update zlib to 1.3 from 1.2.13, although I do not think zlib is the issue here.

DusanJovic-NOAA commented 9 months ago

> @DusanJovic-NOAA I did turn off the non-debug version of control_wrtGauss_netcdf_parallel_intel, but it seems that the debug version is also unreliable? Perhaps we should also disable that one in #2009?

Sure.

junwang-noaa commented 9 months ago

@DusanJovic-NOAA These two tests (control_wrtGauss_netcdf_parallel and control_wrtGauss_netcdf_parallel_debug) are the only ones in the RT suite using "QUANTIZE_NSD". Could that be the cause?
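
For context, QUANTIZE_NSD enables netCDF's lossy quantization, which is what produces the _QuantizeBitRoundNumberOfSignificantBits = 14 attribute in the h5dump output above. Below is a minimal sketch of that same library feature using the netCDF4 Python bindings, assuming netcdf4-python 1.6 or newer built against netcdf-c 4.9 or newer; the file and variable names are illustrative only and this is not the write component's actual code.

# Sketch only: demonstrates the netCDF quantization feature that QUANTIZE_NSD
# turns on (assumes netcdf4-python >= 1.6 over netcdf-c >= 4.9.0).
import numpy as np
from netCDF4 import Dataset

with Dataset("quantize_demo.nc", "w", format="NETCDF4") as ds:
    ds.createDimension("x", 8)
    # With quantize_mode="BitRound", significant_digits is interpreted as the
    # number of significant bits to keep, matching the
    # _QuantizeBitRoundNumberOfSignificantBits attribute (14) seen above.
    v = ds.createVariable("demo", "f4", ("x",),
                          zlib=True, complevel=1,
                          significant_digits=14, quantize_mode="BitRound")
    v[:] = np.linspace(0.0, 1.0e-3, 8, dtype="f4")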

junwang-noaa commented 9 months ago

@climbfuji @DusanJovic-NOAA @DeniseWorthen Is the plan to turn off the two tests on Hercules so that we can move forward with PR #2013, and then have a follow-up PR to fix this?

DeniseWorthen commented 9 months ago

@junwang-noaa I had already turned off the non-debug test in my template PR. I just asked Nick to turn off the debug test in his s2sa PR.

DusanJovic-NOAA commented 9 months ago

> @DusanJovic-NOAA These two tests (control_wrtGauss_netcdf_parallel and control_wrtGauss_netcdf_parallel_debug) are the only ones in the RT suite using "QUANTIZE_NSD". Could that be the cause?

It could be. Are you suggesting we turn off quantization on Hercules?

junwang-noaa commented 9 months ago

Maybe we can first try that, to see if setting "QUANTIZE_NSD: 0" will resolve the issue with the current library stack.

BinLiu-NOAA commented 9 months ago

@junwang-noaa, from the HAFS side (which uses a QUANTIZE_NSD of 0), we also experienced the similar issue on Hercules described in this thread (sometimes generating corrupted netCDF files, especially when using the netcdf_parallel option for the FV3ATM history files). When using netcdf (instead of netcdf_parallel), the FV3ATM history output seems to be fine. Hope this information is useful.

climbfuji commented 9 months ago

@BinLiu-NOAA That is indeed very useful information. What does netcdf vs netcdf_parallel mean here? Reading/writing in serial or parallel mode, but each time through netcdf4 --> hdf5 (I assume)?

BinLiu-NOAA commented 9 months ago

> @BinLiu-NOAA That is indeed very useful information. What does netcdf vs netcdf_parallel mean here? Reading/writing in serial or parallel mode, but each time through netcdf4 --> hdf5 (I assume)?

@climbfuji, I meant the output_file: @[OUTPUT_FILE] item in the model_configure file: with OUTPUT_FILE="'netcdf' 'netcdf'", the FV3ATM history output files on Hercules are more stable than with OUTPUT_FILE="'netcdf_parallel' 'netcdf'".

P.S., the HAFS tests were using the latest version of https://github.com/ufs-community/ufs-weather-model/blob/develop/modulefiles/ufs_hercules.intel.lua

climbfuji commented 9 months ago

@DusanJovic-NOAA I built an entirely new environment with hdf5@1.14.3 and zlib@1.3. I am going to short-circuit the testing and only create baselines for control_wrtGauss_netcdf_parallel and control_wrtGauss_netcdf_parallel_debug, and then verify against those.

climbfuji commented 9 months ago

Have any of you come across this? I've seen it a few times in the past when compiling on Hercules using rt.sh:

Found Python: /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rebuild/envs/ue-hdf5-1143/install/intel/2021.9.0/python-3.10.8-omzg5gb/bin/python3.10
Calling CCPP code generator (ccpp_prebuild.py) for suites --suites=FV3_GFS_v16,FV3_GFS_v16_flake,FV3_GFS_v17_p8,FV3_GFS_v17_p8_rrtmgp,FV3_GFS_v15_thompson_mynn_lam3km,FV3_WoFS_v0,FV3_GFS_v17_p8_mynn,FV3_GFS_v17_p8_ugwpv1 ...
+ OMP_NUM_THREADS=1
+ make -j 8 VERBOSE=1
+ mv /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_158674/compile_atm_dyn32_intel/build_fv3_atm_dyn32_intel/ufs_model /work2/noaa/jcsda/dheinzel/spst-rebuild/ufs-weather-model-spst150/tests/fv3_atm_dyn32_intel.exe
mv: cannot move '/work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_158674/compile_atm_dyn32_intel/build_fv3_atm_dyn32_intel/ufs_model' to a subdirectory of itself, '/work2/noaa/jcsda/dheinzel/spst-rebuild/ufs-weather-model-spst150/tests/fv3_atm_dyn32_intel.exe'
DusanJovic-NOAA commented 9 months ago

> @DusanJovic-NOAA I built an entirely new environment with hdf5@1.14.3 and zlib@1.3. I am going to short-circuit the testing and only create baselines for control_wrtGauss_netcdf_parallel and control_wrtGauss_netcdf_parallel_debug, and then verify against those.

I created the new baseline for control_wrtGauss_netcdf_parallel and verified against it. The test passed. I'll run it a few more times to see if it reliably passes.

DusanJovic-NOAA commented 9 months ago

The second time I ran control_wrtGauss_netcdf_parallel it also passed, but the third time it failed.

$ h5dump /work2/noaa/stmp/djovic/stmp/djovic/FV3_RT/rt_1182887/control_wrtGauss_netcdf_parallel_intel/atmf000.nc > /dev/null
h5dump error: unable to print data
climbfuji commented 9 months ago

> The second time I ran control_wrtGauss_netcdf_parallel it also passed, but the third time it failed.
>
> $ h5dump /work2/noaa/stmp/djovic/stmp/djovic/FV3_RT/rt_1182887/control_wrtGauss_netcdf_parallel_intel/atmf000.nc > /dev/null
> h5dump error: unable to print data

Hmpf. I think we need to take this back to the netCDF developers. Maybe there's still a bug somewhere in that code. After all, quantization is a fairly new feature that, despite best efforts, isn't tested as much as older netCDF/HDF5 features.

BrianCurtis-NOAA commented 9 months ago

I don't know if @edwardhartnett would have any ideas on how to debug this further?

DusanJovic-NOAA commented 9 months ago

I do not think quantization is what's causing this issue; see @BinLiu-NOAA's comments above about similar issues with HAFS, and they do not use quantization.

climbfuji commented 9 months ago

> I do not think quantization is what's causing this issue; see @BinLiu-NOAA's comments above about similar issues with HAFS, and they do not use quantization.

Good point. So it's the parallel read/write? But didn't someone say earlier that these issues didn't show up until the quantization PR was merged?

junwang-noaa commented 9 months ago

> @BinLiu-NOAA That is indeed very useful information. What does netcdf vs netcdf_parallel mean here? Reading/writing in serial or parallel mode, but each time through netcdf4 --> hdf5 (I assume)?
>
> @climbfuji, I meant the output_file: @[OUTPUT_FILE] item in the model_configure file: with OUTPUT_FILE="'netcdf' 'netcdf'", the FV3ATM history output files on Hercules are more stable than with OUTPUT_FILE="'netcdf_parallel' 'netcdf'".
>
> P.S., the HAFS tests were using the latest version of https://github.com/ufs-community/ufs-weather-model/blob/develop/modulefiles/ufs_hercules.intel.lua

@BinLiu-NOAA what are the ideflate and nbits settings in your configuration?

DusanJovic-NOAA commented 9 months ago

Probably. Hercules support was added on Sep 20 (#1733). The regional_netcdf_parallel_intel test in the very next PR (#1902) on Sep 21 failed and had to be rerun, based on the log file:

https://github.com/DeniseWorthen/ufs-weather-model/blob/a0969cba9b7182ebace58bc765936131b13439a0/tests/logs/RegressionTests_hercules.log#L2359

jkbk2004 commented 9 months ago

@zach1221 can you re-run the case on Hercules with the a0969 commit?

zach1221 commented 9 months ago

> @zach1221 can you re-run the case on Hercules with the a0969 commit?

Sure

zach1221 commented 9 months ago

@jkbk2004 looks like the regional_netcdf_parallel_intel case is passing on Hercules against a0969cba9b7182ebace58bc765936131b13439a0: /work/noaa/nems/zshrader/hercules/rt-1902/tests/logs/RegressionTests_hercules.log

climbfuji commented 9 months ago

> @jkbk2004 looks like the regional_netcdf_parallel_intel case is passing on Hercules against a0969cb: /work/noaa/nems/zshrader/hercules/rt-1902/tests/logs/RegressionTests_hercules.log

Can you retry a few more times, please (I know, it sounds like a waste of time)? We've seen those errors intermittently, not all the time.

zach1221 commented 9 months ago

> @jkbk2004 looks like the regional_netcdf_parallel_intel case is passing on Hercules against a0969cb: /work/noaa/nems/zshrader/hercules/rt-1902/tests/logs/RegressionTests_hercules.log
>
> Can you retry a few more times, please (I know, it sounds like a waste of time)? We've seen those errors intermittently, not all the time.

Ok, I've run it 5 times and kept the logs in the same directory. All were successful.

climbfuji commented 9 months ago

Thanks for that. Seems to be a pretty good indicator that something happened after a0969cb that triggered the problem on Hercules?

DeniseWorthen commented 9 months ago

The https://github.com/ufs-community/ufs-weather-model/commit/a0969cba9b7182ebace58bc765936131b13439a0 hash is from a PR that updates the CICE component. The failing tests are all standalone ATM tests.

climbfuji commented 9 months ago

> The a0969cb hash is from a PR that updates the CICE component. The failing tests are all standalone ATM tests.

Just to avoid any misunderstanding, I wrote "something happened after https://github.com/ufs-community/ufs-weather-model/commit/a0969cba9b7182ebace58bc765936131b13439a0 that triggered the problem on Hercules"?

DeniseWorthen commented 9 months ago

@climbfuji Actually, I'm coming to the conclusion that Hercules has had these issues from the get-go. The a0969cb PR itself had a failure in one test (regional_netcdf_parallel), right? Then after the quantization PR we started seeing the wrtGauss tests fail more often than not. And HAFS apparently has seen regular issues. So I think the only thing to say for sure is that it is a) intermittent, b) seemingly related to netcdf_parallel, and c) has been present since Hercules was added.

BinLiu-NOAA commented 9 months ago

@climbfuji and @DeniseWorthen, just a clarification, we were only able to test HAFS on Hercules very recently.

Meanwhile, since this netcdf_parallel issue only happens on Hercules (but not on other platforms), could it be related to the Hercules system itself? I recall there was once an issue with the Orion file system which affected reproducibility; the Orion system admins eventually isolated it and figured out a solution.

DusanJovic-NOAA commented 9 months ago

Ok. I cloned commit e053209. This is the first commit that added Hercules support (Sep 20), well before we added zstd compression and netCDF quantization.

My working copy is here: /work/noaa/fv3-cam/djovic/ufs/hdf_error/ufs-weather-model/tests

Then I just ran: ./rt.sh -n regional_netcdf_parallel intel

Test failed due to missing baselines, but when I go to the run directory: /work2/noaa/stmp/djovic/stmp/djovic/FV3_RT/rt_1508348/regional_netcdf_parallel_intel

and try to dump the content of history output files I see:

$ h5dump dynf000.nc > /dev/null
h5dump error: unable to print data
h5dump error: unable to print data

The file is corrupted. I even tried to compare dynf000.nc vs. dynf006.nc with nccmp (I know they are not identical, but I just wanted to see if nccmp can at least read the data and report the differences):

$ nccmp -df dynf000.nc dynf006.nc
DIFFER : VARIABLE : time : POSITION : [0] : VALUES : 0.01 <> 6
DIFFER : VARIABLE : time_iso : POSITION : [0,12] : VALUES : 0 <> 6
DIFFER : VARIABLE : time_iso : POSITION : [0,17] : VALUES : 3 <> 0
DIFFER : VARIABLE : time_iso : POSITION : [0,18] : VALUES : 6 <> 0
2023-12-12 19:54:06.852195 -0600 ERROR /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/cache/build_stage/spack-stage-nccmp-1.9.0.1-4n5sfwacmwzksu4hkop5vwvjpqowwa3o/spack-src/src/nccmp_data.c:3449 NetCDF: HDF error

Same 'HDF error'.

This is not a reproducibility issue; it's much worse. The files are unreadable; they are corrupted.