Closed. DeniseWorthen closed this issue 8 months ago.
One question I have is whether that file in question was created with a "faulty" version of hdf5 and if so whether that file needs to be recreated as well. Can we try to open a known "good" file that was created on Orion, for example?
From what I can tell, the last time that the file was compared ok during testing was against the develop-20231117 baseline. That PR (#1967) did not change baselines for this test, and when the RTs were run, the file compared OK.
Edit: sorry, I probably misunderstood Dom's suggestion for a 'good' comparison.
> From what I can tell, the last time that the file was compared ok during testing was against the develop-20231117 baseline. That PR (#1967) did not change baselines for this test, and when the RTs were run, the file compared OK.
https://github.com/ufs-community/ufs-weather-model/pull/1990#issuecomment-1826824914
https://github.com/ufs-community/ufs-weather-model/pull/1990#issuecomment-1827004466
https://github.com/ufs-community/ufs-weather-model/pull/1990#issuecomment-1827876841
?
Yes, #1967 was the previous commit. In #1990, it was noted for the first time. My assumption is that the changes in #1990 are triggering the error, which we're not seeing on other platforms because of Hercules-specific hdf5/nccmp issues?
It is, I agree, intermittent, which adds another layer of complexity. You'll note above that I got the same error on multiple tests, with different files failing to compare each time.
I rebuilt both spack-stack-1.5.0 and spack-stack-1.5.1 on Hercules over the weekend. I created a baseline with the current develop branch of ufs-weather-model using spack-stack-1.5.0 here:
/work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/REGRESSION_TEST/control_wrtGauss_netcdf_parallel_debug_intel
When I verify against this baseline, on my first try I get:
$ cat fail_test
control_wrtGauss_netcdf_parallel_intel 034 failed in run_test
control_wrtGauss_netcdf_parallel_debug_intel 085 failed in run_test
130 hafs_regional_atm_intel failed in check_result
hafs_regional_atm_intel 130 failed in run_test
with:
$ cat logs/log_hercules/rt_034_control_wrtGauss_netcdf_parallel_intel.log
baseline dir = /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/REGRESSION_TEST/control_wrtGauss_netcdf_parallel_intel
working dir = /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/control_wrtGauss_netcdf_parallel_intel
Checking test 034 control_wrtGauss_netcdf_parallel_intel results ....
Comparing sfcf000.nc .........OK
Comparing sfcf024.nc ............ALT CHECK......OK
Comparing atmf000.nc ............ALT CHECK......ERROR
and
$ cat logs/log_hercules/rt_085_control_wrtGauss_netcdf_parallel_debug_intel.log
baseline dir = /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/REGRESSION_TEST/control_wrtGauss_netcdf_parallel_debug_intel
working dir = /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/control_wrtGauss_netcdf_parallel_debug_intel
Checking test 085 control_wrtGauss_netcdf_parallel_debug_intel results ....
Comparing sfcf000.nc .........OK
Comparing sfcf001.nc .........OK
Comparing atmf000.nc ............ALT CHECK......OK
Comparing atmf001.nc ............ALT CHECK......ERROR
and
$ cat logs/log_hercules/rt_130_hafs_regional_atm_intel.log
baseline dir = /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/REGRESSION_TEST/hafs_regional_atm_intel
working dir = /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/hafs_regional_atm_intel
Checking test 130 hafs_regional_atm_intel results ....
Comparing atmf006.nc .........OK
Comparing sfcf006.nc .........OK
Comparing HURPRS.GrbF06 .........NOT OK
0: The total amount of wall time = 289.983108
0: The maximum resident set size (KB) = 873280
Test 130 hafs_regional_atm_intel FAIL Tries: 2
So all of these fail to verify a file, but each of them has a different "exit code" reporting a different failure in rt.sh. That's quite confusing. (As a side note, I also noticed that when one runs compile.sh, something in the scripts pauses it somewhere after loading the modules. One has to hit enter to get a prompt and then issue fg to wake up the process.)
@DusanJovic-NOAA @DeniseWorthen - the failure for test 130 is for a GrbF06 file, not for a netCDF file? The other two are netCDF files.
$ nccmp -m /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/REGRESSION_TEST/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc
[dheinzel@hercules-login-4 tests]$ nccmp -d /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/REGRESSION_TEST/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc
2023-12-11 08:11:07.315813 -0600 ERROR /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rebuild/cache/build_stage/spack-stage-nccmp-1.9.0.1-4n5sfwacmwzksu4hkop5vwvjpqowwa3o/spack-src/src/nccmp_data.c:3449 NetCDF: HDF error
@climbfuji Yes, the GrbF06 is a grib file from running post. I haven't been seeing that particular test fail in cases where I did see the parallel-netcdf tests fail.
Can you run h5dump to see if we get the same internal error, "HDF5: infinite loop closing library"?
For the file in the baseline directory, it's ok:
> h5dump /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/REGRESSION_TEST/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc
...
ATTRIBUTE "units" {
DATATYPE H5T_STRING {
STRSIZE 5;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "m/sec"
}
}
}
}
}
For the newly created file, it seems to be the same case, but it's still printing to stdout so I'll let it run. (Did the infinite loop error happen immediately when h5dump was launched?)
Try redirecting stdout to /dev/null to see if there are any errors.
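For example (hypothetical path; h5dump writes its error messages to stderr, so they stay visible when stdout is redirected):
$ h5dump /path/to/atmf001.nc > /dev/null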
It ran to completion:
ATTRIBUTE "units" {
DATATYPE H5T_STRING {
STRSIZE 5;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "m/sec"
}
}
}
}
}
I wonder if there's a bug somewhere in the underlying libraries that has to do with quantization. I know, so far we've only seen this on Hercules and only with Intel, but it's a fairly new Intel compiler (2021.9.0) with a fairly new Intel MPI (2021.9.0). We have Intel 2021.10.0 on Derecho, but with cray-mpich.
So, I can run h5dump successfully on both files. I can also run ncdump successfully on the newly created baseline file, but if I run ncdump on the rt_ file that I want to verify against the baseline:
>ncdump /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc
...
89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058,
89.2767128781058, 89.2767128781058, 89.2767128781058, 89.2767128781058 ;
clwmr =
NetCDF: HDF error
Location: file vardata.c; fcn print_rows line 478
That's what we observed. We do not get a 'corrupted' file every time; sometimes the files are written correctly.
h5dump also prints the error "h5dump error: unable to print data" when trying to dump the actual data for the clwmr variable:
$ /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rebuild/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/bin/h5dump -d /clwmr /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc
HDF5 "/work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc" {
DATASET "/clwmr" {
DATATYPE H5T_IEEE_F32LE
DATASPACE SIMPLE { ( 1, 127, 190, 384 ) / ( 1, 127, 190, 384 ) }
DATA {h5dump error: unable to print data
}
ATTRIBUTE "DIMENSION_LIST" {
DATATYPE H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
DATASPACE SIMPLE { ( 4 ) / ( 4 ) }
DATA {
(0): (DATASET 0 "/time"), (DATASET 0 "/pfull"), (DATASET 0 "/grid_yt"),
(3): (DATASET 0 "/grid_xt")
}
}
ATTRIBUTE "_FillValue" {
DATATYPE H5T_IEEE_F32LE
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
DATA {
(0): 9.99e+20
}
}
ATTRIBUTE "_Netcdf4Coordinates" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 4 ) / ( 4 ) }
DATA {
(0): 5, 3, 1, 0
}
}
ATTRIBUTE "_Netcdf4Dimid" {
DATATYPE H5T_STD_I32LE
DATASPACE SCALAR
DATA {
(0): 5
}
}
ATTRIBUTE "_QuantizeBitRoundNumberOfSignificantBits" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
DATA {
(0): 14
}
}
ATTRIBUTE "cell_methods" {
DATATYPE H5T_STRING {
STRSIZE 11;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "time: point"
}
}
ATTRIBUTE "long_name" {
DATATYPE H5T_STRING {
STRSIZE 24;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "cloud water mixing ratio"
}
}
ATTRIBUTE "missing_value" {
DATATYPE H5T_IEEE_F32LE
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
DATA {
(0): 9.99e+20
}
}
ATTRIBUTE "output_file" {
DATATYPE H5T_STRING {
STRSIZE 3;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "atm"
}
}
ATTRIBUTE "units" {
DATATYPE H5T_STRING {
STRSIZE 5;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "kg/kg"
}
}
}
}
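A quick way to check whether clwmr is the only affected dataset is to try each variable individually. A sketch, assuming bash and the same ncdump/h5dump binaries as above; the grep keys off the exact error text shown here:
$ f=/work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_4127626/control_wrtGauss_netcdf_parallel_debug_intel/atmf001.nc
$ # list the float/double variables from the file header, then try to dump each one's data
$ for var in $(ncdump -h "$f" | awk '$1=="float"||$1=="double"{sub(/\(.*/,"",$2); print $2}'); do
>   h5dump -d "/$var" "$f" 2>&1 > /dev/null | grep -q "h5dump error" && echo "unreadable: $var"
> done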
What do you suggest we should do? From what we know so far this seems to be isolated to Hercules with Intel; it doesn't happen with GNU or anywhere else we test.
At this moment these few tests that occasionally create unreadable files are disabled in rt.conf so that we can run rt.sh reliably on Hercules.
Is it too much trouble to try HDF5-1.14.3? https://raw.githubusercontent.com/HDFGroup/hdf5/hdf5_1_14_3/release_docs/RELEASE.txt
@DusanJovic-NOAA I did turn off the non-debug version of control_wrtGauss_netcdf_parallel_intel but it seems that the debug version is also unreliable? Perhaps we should disable that one also in #2009?
> At this moment these few tests that occasionally create unreadable files are disabled in rt.conf so that we can run rt.sh reliably on Hercules.
> Is it too much trouble to try HDF5-1.14.3? https://raw.githubusercontent.com/HDFGroup/hdf5/hdf5_1_14_3/release_docs/RELEASE.txt
That would be just in time for spack-stack-1.6.0. Let me try this!
I don't know if you are going to rebuild everything or just hdf5, but since these tests also use deflate (zlib), maybe you can also update zlib from 1.2.13 to 1.3, although I do not think zlib is the issue here.
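(If it helps, a rough sketch of pinning just those two packages in a throwaway spack environment; this is an illustration, not the actual spack-stack workflow:)
$ spack add hdf5@1.14.3 +mpi
$ spack add zlib@1.3
$ spack concretize --force
$ spack install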
> @DusanJovic-NOAA I did turn off the non-debug version of control_wrtGauss_netcdf_parallel_intel but it seems that the debug version is also unreliable? Perhaps we should disable that one also in #2009?
Sure.
@DusanJovic-NOAA These two tests (control_wrtGauss_netcdf_parallel and control_wrtGauss_netcdf_parallel_debug) are the only ones in RT using "QUANTIZE_NSD", could that be the cause?
@climbfuji @DusanJovic-NOAA @DeniseWorthen Is the plan to turn off the two tests on Hercules so that we can move forward with PR #2013, and then have a follow-up PR to fix this?
@junwang-noaa I had already turned off the non-debug test in my template PR. I just asked Nick to turn off the debug test in his s2sa PR.
> @DusanJovic-NOAA These two tests (control_wrtGauss_netcdf_parallel and control_wrtGauss_netcdf_parallel_debug) are the only ones in RT using "QUANTIZE_NSD", could that be the cause?
It could be. Are you suggesting to turn off quantization on Hercules?
Maybe we can first try that, to see if setting "QUANTIZE_NSD: 0" resolves the issue with the current library stack.
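As a quick sanity check: the _QuantizeBitRoundNumberOfSignificantBits attribute in the h5dump output above should only be present when a variable was written with quantization enabled, so something like this (h5dump -A prints headers and attribute values only) should confirm whether a given history file used it:
$ h5dump -A atmf001.nc | grep -i quantize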
@junwang-noaa, from the HAFS side (which uses a QUANTIZE_NSD of 0), we also experienced a similar issue on Hercules to the one described in this thread (sometimes generating corrupted netcdf files, especially when using the netcdf_parallel option for FV3ATM history files). When using netcdf (instead of netcdf_parallel), the FV3ATM history output seems to be fine. Hope this information is useful.
@BinLiu-NOAA That is indeed very useful information. What does netcdf vs netcdf_parallel mean here? Reading/writing in serial or parallel mode, but each time through netcdf4 --> hdf5 (I assume)?
> @BinLiu-NOAA That is indeed very useful information. What does netcdf vs netcdf_parallel mean here? Reading/writing in serial or parallel mode, but each time through netcdf4 --> hdf5 (I assume)?
@climbfuji, I meant the output_file: @[OUTPUT_FILE] item in the model_configure file: with OUTPUT_FILE="'netcdf' 'netcdf'", the FV3ATM history output files on Hercules are more stable than with OUTPUT_FILE="'netcdf_parallel' 'netcdf'".
P.S., the HAFS tests were using the latest version of https://github.com/ufs-community/ufs-weather-model/blob/develop/modulefiles/ufs_hercules.intel.lua
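For reference, a minimal sketch of how the two settings appear in model_configure (values exactly as quoted above; presumably one entry per output stream):
# serial netCDF writes, which seem stable on Hercules
output_file: 'netcdf' 'netcdf'
# parallel netCDF writes, which intermittently produce corrupted files
output_file: 'netcdf_parallel' 'netcdf'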
@DusanJovic-NOAA I built an entire new environment with hdf5@1.14.3 and zlib@1.3. I am going to short-circuit the testing and only create baselines for control_wrtGauss_netcdf_parallel and control_wrtGauss_netcdf_parallel_debug, and then verify against those.
Has any of you come across this? I've seen this a few times in the past when compiling on Hercules using rt.sh:
Found Python: /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rebuild/envs/ue-hdf5-1143/install/intel/2021.9.0/python-3.10.8-omzg5gb/bin/python3.10
Calling CCPP code generator (ccpp_prebuild.py) for suites --suites=FV3_GFS_v16,FV3_GFS_v16_flake,FV3_GFS_v17_p8,FV3_GFS_v17_p8_rrtmgp,FV3_GFS_v15_thompson_mynn_lam3km,FV3_WoFS_v0,FV3_GFS_v17_p8_mynn,FV3_GFS_v17_p8_ugwpv1 ...
+ OMP_NUM_THREADS=1
+ make -j 8 VERBOSE=1
+ mv /work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_158674/compile_atm_dyn32_intel/build_fv3_atm_dyn32_intel/ufs_model /work2/noaa/jcsda/dheinzel/spst-rebuild/ufs-weather-model-spst150/tests/fv3_atm_dyn32_intel.exe
mv: cannot move '/work2/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_158674/compile_atm_dyn32_intel/build_fv3_atm_dyn32_intel/ufs_model' to a subdirectory of itself, '/work2/noaa/jcsda/dheinzel/spst-rebuild/ufs-weather-model-spst150/tests/fv3_atm_dyn32_intel.exe'
> @DusanJovic-NOAA I built an entire new environment with hdf5@1.14.3 and zlib@1.3. I am going to short-circuit the testing and only create baselines for control_wrtGauss_netcdf_parallel and control_wrtGauss_netcdf_parallel_debug, and then verify against those.
I created the new baselines for control_wrtGauss_netcdf_parallel and verified against them. The test passed. I'll run it a few more times to see if it passes reliably.
The second time I ran control_wrtGauss_netcdf_parallel it also passed, but the third time it failed.
$ h5dump /work2/noaa/stmp/djovic/stmp/djovic/FV3_RT/rt_1182887/control_wrtGauss_netcdf_parallel_intel/atmf000.nc > /dev/null
h5dump error: unable to print data
Hmpf. I think we need to take this back to the netCDF developers. Maybe there's still a bug somewhere in that code. After all, quantization is a fairly new feature that, despite best efforts, isn't tested as much as older netCDF/hdf5 features?
I don't know if @edwardhartnett would have any ideas on how to debug this further?
I do not think the quantization is what's causing this issue, see @BinLiu-NOAA 's comments above about similar issues with HAFS, and they do not use quantization.
> I do not think the quantization is what's causing this issue, see @BinLiu-NOAA's comments above about similar issues with HAFS, and they do not use quantization.
Good point - so it's the parallel read/write? But didn't someone say earlier that these issues didn't show up until the quantization PR was merged?
> @BinLiu-NOAA That is indeed very useful information. What does netcdf vs netcdf_parallel mean here? Reading/writing in serial or parallel mode, but each time through netcdf4 --> hdf5 (I assume)?
> @climbfuji, I meant the output_file: @[OUTPUT_FILE] item in the model_configure file: with OUTPUT_FILE="'netcdf' 'netcdf'", the FV3ATM history output files on Hercules are more stable than with OUTPUT_FILE="'netcdf_parallel' 'netcdf'".
> P.S., the HAFS tests were using the latest version of https://github.com/ufs-community/ufs-weather-model/blob/develop/modulefiles/ufs_hercules.intel.lua
@BinLiu-NOAA what are the ideflate and nbits in your configuration?
Probably. Hercules support was added on Sep 20 (#1733). The regional_netcdf_parallel_intel test in the very next PR (#1902) on Sep 21 failed and had to be rerun, based on the log file:
@zach1221 can you re-run the case on hercules with the a0969 commit?
> @zach1221 can you re-run the case on hercules with the a0969 commit?
Sure
@jkbk2004 looks like regional_netcdf_parallel_intel case is passing on hercules against a0969cba9b7182ebace58bc765936131b13439a0 /work/noaa/nems/zshrader/hercules/rt-1902/tests/logs/RegressionTests_hercules.log
> @jkbk2004 looks like regional_netcdf_parallel_intel case is passing on hercules against a0969cb /work/noaa/nems/zshrader/hercules/rt-1902/tests/logs/RegressionTests_hercules.log
Can you retry a few more times please (I know, sounds like a waste of time), but we've seen those errors intermittently, not all the time.
> @jkbk2004 looks like regional_netcdf_parallel_intel case is passing on hercules against a0969cb /work/noaa/nems/zshrader/hercules/rt-1902/tests/logs/RegressionTests_hercules.log
> Can you retry a few more times please (I know, sounds like a waste of time), but we've seen those errors intermittently, not all the time.
Ok, I've run it 5 times and kept the logs in the same directory. All were successful.
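(For the record, a retry loop along these lines automates the reruns plus a readability check on each history file; hypothetical paths, and the rt_* glob should be narrowed to the run directory reported in the log:)
$ for i in 1 2 3 4 5; do
>   ./rt.sh -n regional_netcdf_parallel intel
>   for f in /work2/noaa/stmp/$USER/stmp/$USER/FV3_RT/rt_*/regional_netcdf_parallel_intel/dynf*.nc; do
>     h5dump "$f" 2>&1 > /dev/null | grep -q "h5dump error" && echo "corrupted: $f"
>   done
> done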
Thanks for that. Seems to be a pretty good indicator that something happened after a0969cb that triggered the problem on Hercules?
The https://github.com/ufs-community/ufs-weather-model/commit/a0969cba9b7182ebace58bc765936131b13439a0 hash is a PR that updates the CICE component. The failing tests are all in standalone ATM tests.
> The a0969cb hash is a PR that updates the CICE component. The failing tests are all in standalone ATM tests.
Just to avoid any misunderstanding, I wrote "something happened after https://github.com/ufs-community/ufs-weather-model/commit/a0969cba9b7182ebace58bc765936131b13439a0 that triggered the problem on Hercules"?
@climbfuji Actually, I'm coming to the conclusion that Hercules has had these issues from the get-go. The a0969cb commit itself had a failure in one test (regional_netcdf_parallel), right? Then, after the quantization PR, we started seeing the wrtGauss tests fail more often than not. And HAFS apparently has seen regular issues. So I think the only things to say for sure are that the problem is (a) intermittent, (b) seemingly related to netcdf-parallel, and (c) present since Hercules was added.
@climbfuji and @DeniseWorthen, just a clarification, we were only able to test HAFS on Hercules very recently.
Meanwhile, since this netcdf_parallel issue only happens on Hercules (and not on other platforms), could it be related to the Hercules system itself? I recall there was once an issue with the Orion file system that affected reproducibility. The Orion system admins eventually isolated it and found a solution.
Ok. I cloned commit e053209, the first commit that added Hercules support (Sep 20), well before we added zstd compression and netCDF quantization.
My working copy is here: /work/noaa/fv3-cam/djovic/ufs/hdf_error/ufs-weather-model/tests
Then I just ran: ./rt.sh -n regional_netcdf_parallel intel
The test failed due to missing baselines, but when I go to the run directory /work2/noaa/stmp/djovic/stmp/djovic/FV3_RT/rt_1508348/regional_netcdf_parallel_intel and try to dump the contents of the history output files, I see:
$ h5dump dynf000.nc > /dev/null
h5dump error: unable to print data
h5dump error: unable to print data
The file is corrupted. I even tried nccmp on dynf000.nc vs. dynf006.nc (I know they are not identical; I just wanted to see if nccmp could at least read the data and report the differences):
$ nccmp -df dynf000.nc dynf006.nc
DIFFER : VARIABLE : time : POSITION : [0] : VALUES : 0.01 <> 6
DIFFER : VARIABLE : time_iso : POSITION : [0,12] : VALUES : 0 <> 6
DIFFER : VARIABLE : time_iso : POSITION : [0,17] : VALUES : 3 <> 0
DIFFER : VARIABLE : time_iso : POSITION : [0,18] : VALUES : 6 <> 0
2023-12-12 19:54:06.852195 -0600 ERROR /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/cache/build_stage/spack-stage-nccmp-1.9.0.1-4n5sfwacmwzksu4hkop5vwvjpqowwa3o/spack-src/src/nccmp_data.c:3449 NetCDF: HDF error
Same 'HDF error'.
This is not a reproducibility issue; it's much worse: the files are unreadable, they are corrupted.
Description
The control_wrtGauss_netcdf_parallel_intel test fails on Hercules (intel) with the following error:
which results in
To Reproduce:
Attempt to run this test on Hercules; check the atmf024.nc_nccmp.log file in the run directory for the error message.
Additional context
The issue was first noted in PR #1990.
Output