@DeniseWorthen Do the files actually differ, or is NCCMP getting hung up with one of the options we use in the call to NCCMP?
Unknown. It is failing comparison of that single file with "HDF error", as posted.
Can we compare the two files manually using nccmp? I am wondering whether it is a file difference or nccmp itself that causes the issue.
Transferring the baseline atmf024.nc file and the output from a failed RT case to Hera and comparing them with
nccmp -d -S -q -f -B --Attribute=checksum --warn=format
also produces an error:
2023-11-29 15:13:22.378912 +0000 ERROR nccmp_data.c:3677 NetCDF: HDF error
On Hera, trying to simply dump each of these files to a cdl file produces an error for the baseline file, but not the RT test file:
ncdump atmf024.nc > atmf024.cdl
NetCDF: HDF error
Location: file vardata.c; fcn print_rows line 478
I suspect it is the baseline file which is bad.
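For anyone reproducing this, a minimal triage sketch may help separate a bad header from a bad data payload (BASE and TEST are placeholders for the two files above, not paths from this thread): a header-only dump reads metadata only and never touches the variable data, so it usually succeeds even when a data chunk is corrupt.
# Hedged triage sketch; BASE/TEST are placeholder paths.
BASE=/path/to/baseline/atmf024.nc
TEST=/path/to/rt_run/atmf024.nc
# Header-only dumps: these read metadata only, not the data chunks.
ncdump -h "$BASE" > /dev/null && echo "baseline header OK"
ncdump -h "$TEST" > /dev/null && echo "test header OK"
# Recursively list the HDF5 object tree of the suspect file.
h5ls -r "$BASE" > /dev/null && echo "baseline object tree OK"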
@jkbk2004 Are you using cp or rsync for the RDHPCS machines to copy baselines to the baseline storage area?
A mix of both. In this case, the files are identical to the experiment output; it is an nccmp issue.
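Given the mix, one hedged way to make the baseline copy self-verifying would be something like the following sketch (SRC and DST are placeholders, and the verification pass is a standard-tool suggestion, not part of the current scripts):
# Sketch of a verified baseline copy; SRC and DST are placeholder paths.
SRC=/path/to/rt_run/baseline_dir
DST=/path/to/baseline/storage_dir
# --checksum compares full file contents rather than size+mtime.
rsync -av --checksum "$SRC/" "$DST/"
# Independent integrity pass after the copy.
(cd "$SRC" && find . -type f -exec sha256sum {} + | sort -k2) > /tmp/src.sha
(cd "$DST" && find . -type f -exec sha256sum {} + | sort -k2) > /tmp/dst.sha
diff /tmp/src.sha /tmp/dst.sha && echo "copy verified"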
@jkbk2004 Please see the results from Denise.
On Hera, trying to simply dump each of these files to a cdl file produces an error for the baseline file, but not the RT test file:
ncdump atmf024.nc >atmf024.cdl
NetCDF: HDF error
Location: file vardata.c; fcn print_rows line 478
It looks to me like the issue is in the baseline file, not nccmp.
I will set up cases to confirm on both Hera and Hercules.
Just to note: I copied the files from Hercules to Hera, and the report above is for using nccmp or ncdump on Hera. This was in case there was a problem with Hercules' nccmp version.
Thanks, Denise. That is what we want to confirm: whether the comparison fails because of nccmp or because of a file issue. @jkbk2004 I think you only need to check the baseline on Hercules.
Which files are we talking about? These two:
/work2/noaa/stmp/zshrader/stmp/zshrader/FV3_RT/rt_3114426/control_wrtGauss_netcdf_parallel_intel/atmf000.nc
and
/work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20231122/control_wrtGauss_netcdf_parallel_intel/atmf000.nc
@DusanJovic-NOAA I am checking with /work2/noaa/stmp/zshrader/stmp/zshrader/FV3_RT/rt_175880/control_wrtGauss_netcdf_parallel_intel/atmf024.nc
@DusanJovic-NOAA The file that fails w/ the hdf error is for the atmf024.nc file.
@zach1221 nccmp on Hercules compares OK between /work2/noaa/stmp/zshrader/stmp/zshrader/FV3_RT/rt_166684/control_wrtGauss_netcdf_parallel_intel/atmf024.nc and /work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20231122/control_wrtGauss_netcdf_parallel_intel/atmf024.nc, but fails with /work2/noaa/stmp/zshrader/stmp/zshrader/FV3_RT/rt_175880/control_wrtGauss_netcdf_parallel_intel/atmf024.nc. @DeniseWorthen @junwang-noaa ncdump conversion to cdl works OK with both the baseline file and the rt_166684 file, but not with rt_175880. It looks like a Hercules system issue.
@DeniseWorthen I finally got around to testing your idea regarding disk space. I cleaned out my experiment directories in stmp and re-ran the control_wrtGauss cases again with ecflow. With an emptied space they pass consistently.
@zach1221 Really interesting, thanks. I wonder why this works!? I did check the file sizes for the atmf files and they weren't that large. Hm.
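If disk space or quota is the trigger, a quick check before each run would be something like the sketch below (assuming /work2 is a Lustre file system on Hercules; adjust the path for the actual stmp area):
# Hedged check of free space and user quota before running the RTs.
df -h /work2
# On Lustre, per-user quota (assumption: /work2 is Lustre on Hercules):
lfs quota -h -u "$USER" /work2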
I've updated the title of this Issue. I just ran on Hercules with the CICE PR branch and got multiple failures with "ALT CHECK ERROR" for various tests:
Checking test 034 control_wrtGauss_netcdf_parallel_intel results .... Comparing atmf024.nc ............ALT CHECK......ERROR
Checking test 060 regional_netcdf_parallel_intel results .... Comparing phyf000.nc ............ALT CHECK......ERROR
Checking test 085 control_wrtGauss_netcdf_parallel_debug_intel results Comparing atmf000.nc ............ALT CHECK......ERROR
Checking test 126 conus13km_debug_qr_intel results .... Comparing RESTART/20210512.170000.fv_core.res.tile1.nc ............ALT CHECK......ERROR
Running these four tests again gave a pass for the conus13 and the netcdf_parallel_debug. The other two again had ERRORS, but on different .nc files. These tests appear to be unstable on Hercules.
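A quick way to tell a flaky comparison from a genuinely corrupt file is to repeat the identical nccmp call on one failing pair; a file that is actually bad should fail every attempt. A sketch with placeholder paths:
# Repeat the same comparison on one failing pair; BASE/TEST are placeholders.
BASE=/path/to/baseline/atmf024.nc
TEST=/path/to/rt_run/atmf024.nc
for i in 1 2 3 4 5; do
  if nccmp -d -S -q -f -B --Attribute=checksum --warn=format "$BASE" "$TEST"; then
    echo "attempt $i: pass"
  else
    echo "attempt $i: fail"
  fi
done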
In those cases that fail, do you (always) see 'HDF error'?
@DusanJovic-NOAA Yes, the failed cases seem to always show
ERROR /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/cache/build_stage/spack-stage-nccmp-1.9.0.1-4n5sfwacmwzksu4hkop5vwvjpqowwa3o/spack-src/src/nccmp_data.c:3675 NetCDF: HDF error
My run directory is
/work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/
It seems the file /work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc is either corrupted, or there's a bug in the HDF5 library, or both.
When I run ncdump I see:
$ ncdump /work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc
ncdump: /work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc: NetCDF: HDF error
Same error message we see when we run nccmp. However when I run h5dump, I see:
$ h5dump /work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc
h5dump error: internal error (file /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/cache/build_stage/spack-stage-hdf5-1.14.0-4qmsxztujdfpvzjzay4dyr2d2vxd352n/spack-src/tools/src/h5dump/h5dump.c:line 1525)
HDF5: infinite loop closing library
L,T_top,P,P,Z,FD,VL,VL,PL,E,SL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL
I'm not sure how to interpret this error message, HDF5 library bug, MPI/compiler bug, system/filesystem issue or something else.
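One hedged way to narrow this down: try to rewrite the suspect file with nccopy, which reads every variable end-to-end. If the data chunks themselves are corrupt, the rewrite should fail the same way; if it succeeds, the reading stack (library mix, plugin path) becomes the prime suspect.
# Triage sketch on the suspect file from the run directory above.
F=/work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc
# nccopy reads all variable data while rewriting the file.
nccopy "$F" /tmp/atmf000_rewrite.nc && echo "file readable end-to-end"
# Record a checksum so copies on other machines can be compared byte-for-byte.
sha256sum "$F"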
@DeniseWorthen @DusanJovic-NOAA So far I have noticed this error only on Hercules. We have had hdf5 1.14.0 installed on other platforms for a while and I have never seen this error. I think we need to report the problem to the Hercules system admins.
@climbfuji @ulmononian FYI: some issues with HDF/nccmp on hercules. Random failures when it writes nc files.
@ulmononian Can EPIC look into this?
@DusanJovic-NOAA I went back and looked at all the Hercules logs. The first time that control_wrtGauss_netcdf_parallel_intel has the alt-check error is in your PR #1990.
That's the PR in which we added netcdf quantization. Maybe that is somehow triggering this HDF error, but why only on Hercules?
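One hedged way to confirm whether quantization is actually active in these files: netcdf-c records it as a hidden per-variable attribute that ncdump -s exposes (the exact attribute name, e.g. _QuantizeBitGroomNumberOfSignificantDigits, depends on the quantize mode chosen).
# Check for the hidden quantization attribute in an output file (sketch).
ncdump -h -s atmf024.nc | grep -i quantize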
Maybe the HDF version that @junwang-noaa pointed out?
@ulmononian Would you please check whether netcdf is built with the zstandard library on Hercules, or whether there is any difference in the netcdf build between Hercules and Orion?
@junwang-noaa let me take a look.
Can I ask why the stack being used is /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1 and not /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0?
That is a good question. We have this module file in the develop branch:
@DeniseWorthen which version of ufs wm develop branch are you using in the test you showed in the issue description?
The initial issue was created using the develop branch.
The latest results I posted were from my CICE PR branch, but none of the failed tests contain CICE so I don't think that is the issue. The CICE PR would be using the standard hercules modules at develop.
Was nccmp built w/ the "rc1" stack? Isn't that the only possibility?
If I load the version of spack-stack used in WM develop (i.e. /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/modulefiles/Core) via:
module use /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/modulefiles/Core
module load stack-intel/2021.9.0
module load stack-intel-oneapi-mpi/2021.9.0
module load stack-python/3.10.8
module spider nccmp
gives nccmp/1.9.0.1
module show nccmp/1.9.0.1 shows:
/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/modulefiles/intel-oneapi-mpi/2021.9.0/intel/2021.9.0/nccmp/1.9.0.1.lua:
----------------------------------------------------------------------------
whatis("Name : nccmp")
whatis("Version : 1.9.0.1")
whatis("Target : icelake")
whatis("Short description : Compare NetCDF Files")
whatis("Configure options : -DCMAKE_C_FLAGS:STRING=-std=c99")
help([[Name : nccmp]])
help([[Version: 1.9.0.1]])
help([[Target : icelake]])
help()
help([[Compare NetCDF Files]])
depends_on("netcdf-c/4.9.2")
prepend_path("PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/nccmp-1.9.0.1-4n5sfwa/bin")
prepend_path("CMAKE_PREFIX_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/nccmp-1.9.0.1-4n5sfwa/.")
prepend_path("PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/nccmp-1.9.0.1-4n5sfwa/bin")
prepend_path("CMAKE_PREFIX_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/nccmp-1.9.0.1-4n5sfwa/.")
setenv("nccmp_ROOT","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/nccmp-1.9.0.1-4n5sfwa")
so it does not seem that nccmp should be associated with the "rc1" stack.
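A quick sanity check, independent of what the module files advertise, is to ask the binary itself which libraries it resolves at run time (a sketch, assuming the 1.5.0 stack modules are loaded):
# Confirm what the nccmp binary actually links against.
which nccmp
ldd "$(which nccmp)" | grep -Ei 'netcdf|hdf5'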
@ulmononian I really don't understand what is going on. The error message we get using nccmp in the RT script is
ERROR /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/cache/build_stage/spack-stage-nccmp-1.9.0.1-4n5sfwacmwzksu4hkop5vwvjpqowwa3o/spack-src/src/nccmp_data.c:3675 NetCDF: HDF error
So what does the "1.5.0-rc1" in that path refer to?
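Note that the path in such error messages typically comes from the compile-time __FILE__ macro baked into the binary when it was built, so it records where the source was compiled, not which stack is loaded now. A hedged sketch to list the baked-in paths:
# List build-time source paths embedded in the nccmp binary (sketch).
strings -a "$(which nccmp)" | grep -o '/work/noaa/[^ "]*spack-stack[^ "]*' | sort -u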
That is a different stack entirely. Is there somewhere in your branch or the script where nccmp is being called from another stack?
The rc1 stack was a release candidate stack for acceptance testing / identifying problems.
Please see the error message from h5dump in Dusan's message; it also points to spack-stack 1.5.0-rc1.
Can someone run their file diff checks using the nccmp from /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0? The module file is /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/modulefiles/intel-oneapi-mpi/2021.9.0/intel/2021.9.0/nccmp/1.9.0.1.lua; load it via
module use /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/modulefiles/Core
module load stack-intel/2021.9.0
module load stack-intel-oneapi-mpi/2021.9.0
module load stack-python/3.10.8
module load nccmp/1.9.0.1
Somewhere in the CICE PR branch or develop, the wrong stack is being called.
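With the modules above loaded, a minimal manual re-check of one failing pair would look like this (the file paths are placeholders, substitute a failing case):
# Confirm which nccmp resolves, then rerun the RT comparison flags by hand.
which nccmp   # should resolve under spack-stack-1.5.0, not -rc1
nccmp -d -S -q -f -B --Attribute=checksum --warn=format \
  /path/to/baseline/atmf024.nc /path/to/rt_run/atmf024.nc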
Please see the error message from h5dump in Dusan's message; it also points to spack-stack 1.5.0-rc1. Are the netcdf/hdf5 in spack-stack 1.5.0 pointing to the correct spack-stack version?
With spack-stack/1.5.0 loaded, module show hdf5/1.14.0:
[cbook@hercules-login-1 ~]$ module show hdf5/1.14.0
------------------------------------------------------------------------------------------------------------
/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/modulefiles/intel-oneapi-mpi/2021.9.0/intel/2021.9.0/hdf5/1.14.0.lua:
------------------------------------------------------------------------------------------------------------
whatis("Name : hdf5")
whatis("Version : 1.14.0")
whatis("Target : icelake")
whatis("Short description : HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. ")
whatis("Configure options : -DALLOW_UNSUPPORTED:BOOL=ON -DHDF5_BUILD_EXAMPLES:BOOL=OFF -DBUILD_TESTING:BOOL=OFF -DHDF5_ENABLE_MAP_API:BOOL=OFF -DHDF5_ENABLE_Z_LIB_SUPPORT:BOOL=ON -DHDF5_ENABLE_SZIP_SUPPORT:BOOL=ON -DHDF5_ENABLE_SZIP_ENCODING:BOOL=ON -DBUILD_SHARED_LIBS:BOOL=ON -DONLY_SHARED_LIBS:BOOL=OFF -DHDF5_ENABLE_PARALLEL:BOOL=ON -DHDF5_ENABLE_THREADSAFE:BOOL=OFF -DHDF5_BUILD_HL_LIB:BOOL=ON -DHDF5_BUILD_CPP_LIB:BOOL=OFF -DHDF5_BUILD_FORTRAN:BOOL=ON -DHDF5_BUILD_JAVA:BOOL=OFF -DHDF5_BUILD_TOOLS:BOOL=ON -DMPI_CXX_COMPILER:PATH=/apps/spack-managed/oneapi-2023.1.0/intel-oneapi-mpi-2021.9.0-a66eaipzsnyrdgaqzxmqmqz64qzvhkse/mpi/2021.9.0/bin/mpiicpc -DMPI_C_COMPILER:PATH=/apps/spack-managed/oneapi-2023.1.0/intel-oneapi-mpi-2021.9.0-a66eaipzsnyrdgaqzxmqmqz64qzvhkse/mpi/2021.9.0/bin/mpiicc -DCMAKE_CXX_COMPILER:PATH=/apps/spack-managed/oneapi-2023.1.0/intel-oneapi-mpi-2021.9.0-a66eaipzsnyrdgaqzxmqmqz64qzvhkse/mpi/2021.9.0/bin/mpiicpc -DCMAKE_C_COMPILER:PATH=/apps/spack-managed/oneapi-2023.1.0/intel-oneapi-mpi-2021.9.0-a66eaipzsnyrdgaqzxmqmqz64qzvhkse/mpi/2021.9.0/bin/mpiicc -DMPI_Fortran_COMPILER:PATH=/apps/spack-managed/oneapi-2023.1.0/intel-oneapi-mpi-2021.9.0-a66eaipzsnyrdgaqzxmqmqz64qzvhkse/mpi/2021.9.0/bin/mpiifort -DCMAKE_Fortran_COMPILER:PATH=/apps/spack-managed/oneapi-2023.1.0/intel-oneapi-mpi-2021.9.0-a66eaipzsnyrdgaqzxmqmqz64qzvhkse/mpi/2021.9.0/bin/mpiifort")
help([[Name : hdf5]])
help([[Version: 1.14.0]])
help([[Target : icelake]])
help()
help([[HDF5 is a data model, library, and file format for storing and managing
data. It supports an unlimited variety of datatypes, and is designed for
flexible and efficient I/O and for high volume and complex data.]])
depends_on("zlib/1.2.13")
prepend_path("PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/bin")
prepend_path("LD_LIBRARY_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/lib")
prepend_path("DYLD_LIBRARY_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/lib")
prepend_path("CPATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/include")
prepend_path("PKG_CONFIG_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/lib/pkgconfig")
prepend_path("CMAKE_PREFIX_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/.")
prepend_path("PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/bin")
prepend_path("PKG_CONFIG_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/lib/pkgconfig")
prepend_path("CMAKE_PREFIX_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/.")
append_path("LD_LIBRARY_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/lib")
setenv("hdf5_ROOT","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt")
setenv("HDF5_DIR","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt")
module show netcdf-c/4.9.2:
------------------------------------------------------------------------------------------------------------
/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/modulefiles/intel-oneapi-mpi/2021.9.0/intel/2021.9.0/netcdf-c/4.9.2.lua:
------------------------------------------------------------------------------------------------------------
whatis("Name : netcdf-c")
whatis("Version : 4.9.2")
whatis("Target : icelake")
whatis("Short description : NetCDF (network Common Data Form) is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. This is the C distribution.")
help([[Name : netcdf-c]])
help([[Version: 4.9.2]])
help([[Target : icelake]])
help()
help([[NetCDF (network Common Data Form) is a set of software libraries and
machine-independent data formats that support the creation, access, and
sharing of array-oriented scientific data. This is the C distribution.]])
depends_on("c-blosc/1.21.4")
depends_on("curl/8.1.2")
depends_on("hdf5/1.14.0")
depends_on("zlib/1.2.13")
depends_on("zstd/1.5.2")
prepend_path("PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/bin")
prepend_path("MANPATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/share/man")
prepend_path("LD_LIBRARY_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/lib")
prepend_path("DYLD_LIBRARY_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/lib")
prepend_path("CPATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/include")
prepend_path("PKG_CONFIG_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/lib/pkgconfig")
prepend_path("CMAKE_PREFIX_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/.")
prepend_path("PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/bin")
prepend_path("MANPATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/share/man")
prepend_path("PKG_CONFIG_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/lib/pkgconfig")
prepend_path("CMAKE_PREFIX_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/.")
append_path("HDF5_PLUGIN_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/plugins")
setenv("netcdf_c_ROOT","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/instal
l/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx")
append_path("MANPATH","")
no reference to "rc1"...not sure what is going on yet.
netcdf-c/4.9.2 does depend on zstd for spack-stack/1.5.0, btw.
$ which h5dump
/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/bin/h5dump
$ /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/bin/h5dump /work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc
h5dump error: internal error (file /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/cache/build_stage/spack-stage-hdf5-1.14.0-4qmsxztujdfpvzjzay4dyr2d2vxd352n/spack-src/tools/src/h5dump/h5dump.c:line 1525)
HDF5: infinite loop closing library
L,T_top,P,P,Z,FD,VL,VL,PL,E,SL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL
Running the h5dump executable from spack-stack-1.5.0 points to source code from spack-stack-1.5.0-rc1/cache/build_stage while printing the error message. Is this correct?
I think that might be what happened.
I know what's going on here, and it may or may not be the issue. When we installed spack-stack-1.5.0, we created a binary cache based on 1.5.0-rc1 and installed from that. This is "usually" safe, unless changes are made in spack itself that the spack hash algorithm doesn't see and therefore still thinks the packages are the same.
@DusanJovic-NOAA Is it possible to test with spack-stack-1.5.1, just to see if the problem goes away? I know for sure we built 1.5.1 from source.
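For reference, a from-source rebuild that bypasses any stale binary cache would look roughly like the sketch below (the spec is illustrative, not the exact spack-stack workflow):
# Sketch: force a source build instead of pulling from a binary mirror.
spack install --no-cache hdf5@1.14.0 +mpi
# Inspect what the binary cache would have provided instead:
spack buildcache list hdf5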
@climbfuji It probably won't go away. I tried spack-stack 1.5.1 with fms 2023.03 several days ago and was also puzzled why control_wrtGauss_netcdf_parallel_intel failed.
baseline dir = /work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20231122/control_wrtGauss_netcdf_parallel_intel
working dir = /work2/noaa/stmp/junwang/stmp/junwang/FV3_RT/rt_3585232/control_wrtGauss_netcdf_parallel_intel
Checking test 034 control_wrtGauss_netcdf_parallel_intel results ....
Comparing sfcf000.nc ............ALT CHECK......OK
Comparing sfcf024.nc .........OK
Comparing atmf000.nc ............ALT CHECK......ERROR
I have this in my module file:
prepend_path("MODULEPATH", "/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.1/envs/unified-env/install/modulefiles/Core")
stack_intel_ver=os.getenv("stack_intel_ver") or "2021.9.0"
load(pathJoin("stack-intel", stack_intel_ver))
stack_impi_ver=os.getenv("stack_impi_ver") or "2021.9.0"
load(pathJoin("stack-intel-oneapi-mpi", stack_impi_ver))
cmake_ver=os.getenv("cmake_ver") or "3.23.1"
load(pathJoin("cmake", cmake_ver))
load("ufs_common")
nccmp_ver=os.getenv("nccmp_ver") or "1.9.0.1"
load(pathJoin("nccmp", nccmp_ver))
My code directory is at: /work/noaa/nems/junwang/ufs-weather/20231106/new/ufs-weather-model
@DusanJovic-NOAA if you can test w/ 1.5.1 as @climbfuji suggests, you can load directly via
module use /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.1/envs/unified-env/install/modulefiles/Core
module load stack-intel/2021.9.0
module load stack-intel-oneapi-mpi/2021.9.0
module load stack-python/3.10.8
or just replace the key paths in the hercules intel lua file.
edit: sorry, just saw @junwang-noaa's comment...
$ module purge
$ module use /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.1/envs/unified-env/install/modulefiles/Core
$ module load stack-intel/2021.9.0 stack-intel-oneapi-mpi/2021.9.0 hdf5/1.14.0
$ ml
Currently Loaded Modules:
1) intel-oneapi-compilers/2023.1.0 2) stack-intel/2021.9.0 3) intel-oneapi-mpi/2021.9.0 4) stack-intel-oneapi-mpi/2021.9.0 5) zlib/1.2.13 6) hdf5/1.14.0
$ which h5dump
/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.1/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/bin/h5dump
$ /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.1/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/bin/h5dump /work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc
h5dump error: internal error (file /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/cache/build_stage/spack-stage-hdf5-1.14.0-4qmsxztujdfpvzjzay4dyr2d2vxd352n/spack-src/tools/src/h5dump/h5dump.c:line 1525)
HDF5: infinite loop closing library
L,T_top,P,P,Z,FD,VL,VL,PL,E,SL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL
It looks like hdf5 in stack 1.5.1 was also built 'using cache' from 1.5.0-rc1.
I just replaced '1.5.0' with '1.5.1' in ufs_hercules.intel.lua, and:
$ module load ufs_hercules.intel
Lmod has detected the following error: These module(s) or extension(s) exist but cannot be loaded as requested: "gftl-shared/1.5.0", "esmf/8.4.2", "mapl/2.35.2-esmf-8.4.2"
Try: "module spider gftl-shared/1.5.0 esmf/8.4.2 mapl/2.35.2-esmf-8.4.2" to see how to load the module(s).
So unless we also update esmf and friends, we cannot quickly try 1.5.1, which I'm skeptical will work anyway based on @junwang-noaa's comment above.
Is it worth trying a full stack 1.5.0 rebuild, but making sure no 'old caches are used'?
Yes, absolutely. I will do that over the weekend for both 1.5.0 and 1.5.1. I do have another suspicion that I will check while I am at it. Apologies for these issues - hopefully I can give you an answer on Monday as to why both 1.5.0 and 1.5.1 have the rc1 in the source file paths.
Description
The control_wrtGauss_netcdf_parallel_intel test fails on Hercules (intel) with the following error:
which results in
To Reproduce:
Attempt to run this test on Hercules and check the atmf024.nc_nccmp.log file in the run directory for the error message.
Additional context
The issue was first noted in PR #1990.
Output