ufs-community / ufs-weather-model

UFS Weather Model

multiple netcdf_parallel tests fail on hercules #2015

Closed DeniseWorthen closed 5 months ago

DeniseWorthen commented 6 months ago

Description

The control_wrtGauss_netcdf_parallel_intel fails on hercules (intel) with the following error:

2023-11-29 11:42:48.544357 +0000 ERROR /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/cache/build_stage/spack-stage-nccmp-1.9.0.1-4n5sfwacmwzksu4hkop5vwvjpqowwa3o/spack-src/src/nccmp_data.c:3675 NetCDF: HDF error

which results in

 Comparing atmf024.nc ............ALT CHECK......ERROR

To Reproduce:

Attempt to run this test on Hercules and check the atmf024.nc_nccmp.log file in the run directory for the error message.
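
For example, a minimal check might look like this (the rt_XXXXXX run id is a placeholder; substitute your own FV3_RT run directory):

# placeholder run id; substitute your own FV3_RT run directory
cd /work2/noaa/stmp/$USER/stmp/$USER/FV3_RT/rt_XXXXXX/control_wrtGauss_netcdf_parallel_intel
grep -i "HDF error" atmf024.nc_nccmp.log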

Additional context

The issue was first noted in PR #1990.

Output

BrianCurtis-NOAA commented 6 months ago

@DeniseWorthen Do the files actually differ, or is nccmp getting hung up on one of the options we use in the call to nccmp?

DeniseWorthen commented 6 months ago

Unknown. It is failing comparison of that single file with "HDF error", as posted.

junwang-noaa commented 6 months ago

Can we compare the two files manually using nccmp? I am wondering whether it is a file difference or nccmp itself that causes the issue.

DeniseWorthen commented 6 months ago

Transferring the baseline atmf024.nc file and the output from a failed RT case to Hera and comparing them with

nccmp -d -S -q -f -B --Attribute=checksum --warn=format

also produces an error:

2023-11-29 15:13:22.378912 +0000 ERROR nccmp_data.c:3677 NetCDF: HDF error

On Hera, trying to simply dump each of these files to a cdl file produces an error for the baseline file, but not the RT test file:

ncdump atmf024.nc >atmf024.cdl
NetCDF: HDF error
Location: file vardata.c; fcn print_rows line 478

I suspect it is the baseline file which is bad.
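
For reference, the isolation steps above amount to the following sketch (baseline/ and rt_run/ are placeholder locations for the two copies of atmf024.nc):

# compare with the same options the regression test uses
nccmp -d -S -q -f -B --Attribute=checksum --warn=format baseline/atmf024.nc rt_run/atmf024.nc

# dump each file separately; whichever one errors is the suspect
ncdump baseline/atmf024.nc > baseline_atmf024.cdl
ncdump rt_run/atmf024.nc > rt_atmf024.cdl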

BrianCurtis-NOAA commented 6 months ago

@jkbk2004 Are you using cp or rsync for the RDHPCS machines to copy baselines to the baseline storage area?

jkbk2004 commented 6 months ago

@jkbk2004 Are you using cp or rsync for the RDHPCS machines to copy baselines to the baseline storage area?

A mix of both. In this case the files are identical to the experiment output, so it looks like an nccmp issue.

junwang-noaa commented 6 months ago

@jkbk2004 Please see the results from Denise.

On Hera, trying to simply dump each of these files to a cdl file produces an error for the baseline file, but not the RT test file:

ncdump atmf024.nc >atmf024.cdl
NetCDF: HDF error
Location: file vardata.c; fcn print_rows line 478

It looks to me like the issue is in the baseline file, not nccmp.

jkbk2004 commented 6 months ago

I will set up cases to confirm on both Hera and Hercules.

DeniseWorthen commented 6 months ago

Just to note--I copied the files from Hercules to Hera, and the report above is from using nccmp or ncdump on Hera. This was in case there was a problem with Hercules' nccmp version.

junwang-noaa commented 6 months ago

Thanks, Denise. That is what we want to confirm: whether the comparison fails because of nccmp or because of a problem with the file. @jkbk2004 I think you only need to check the baseline on Hercules.

DusanJovic-NOAA commented 6 months ago

Which files are we talking about? These two:

/work2/noaa/stmp/zshrader/stmp/zshrader/FV3_RT/rt_3114426/control_wrtGauss_netcdf_parallel_intel/atmf000.nc

and

/work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20231122/control_wrtGauss_netcdf_parallel_intel/atmf000.nc

jkbk2004 commented 6 months ago

@DusanJovic-NOAA I am checking with /work2/noaa/stmp/zshrader/stmp/zshrader/FV3_RT/rt_175880/control_wrtGauss_netcdf_parallel_intel/atmf024.nc

DeniseWorthen commented 6 months ago

@DusanJovic-NOAA The file that fails with the HDF error is atmf024.nc.

jkbk2004 commented 6 months ago

@zach1221 On Hercules, nccmp compares /work2/noaa/stmp/zshrader/stmp/zshrader/FV3_RT/rt_166684/control_wrtGauss_netcdf_parallel_intel/atmf024.nc against /work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20231122/control_wrtGauss_netcdf_parallel_intel/atmf024.nc OK, but fails with /work2/noaa/stmp/zshrader/stmp/zshrader/FV3_RT/rt_175880/control_wrtGauss_netcdf_parallel_intel/atmf024.nc. @DeniseWorthen @junwang-noaa The ncdump conversion to cdl works OK with both the baseline and the rt_166684 file, but not with rt_175880. It looks like a Hercules system issue.

zach1221 commented 6 months ago

@DeniseWorthen I finally got around to testing your idea regarding disk space. I cleaned out my experiment directories in stmp and re-ran the control_wrtGauss cases again with ecflow. With the space emptied, they pass consistently.
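
For reference, a quick way to check the space before re-running (a sketch; it assumes the /work2 stmp area is the filesystem in question):

df -h /work2/noaa/stmp
du -sh /work2/noaa/stmp/$USER/stmp/$USER/FV3_RT/rt_* | sort -h | tail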

DeniseWorthen commented 6 months ago

@zach1221 Really interesting, thanks. I wonder why this works!? I did check the file sizes for the atmf files and they weren't that large. Hm.

DeniseWorthen commented 6 months ago

I've updated the title of this Issue. I just ran on Hercules with the CICE PR branch and got multiple failures with "ALT CHECK ERROR" for various tests:

Checking test 034 control_wrtGauss_netcdf_parallel_intel results .... Comparing atmf024.nc ............ALT CHECK......ERROR

Checking test 060 regional_netcdf_parallel_intel results .... Comparing phyf000.nc ............ALT CHECK......ERROR

Checking test 085 control_wrtGauss_netcdf_parallel_debug_intel results Comparing atmf000.nc ............ALT CHECK......ERROR

Checking test 126 conus13km_debug_qr_intel results .... Comparing RESTART/20210512.170000.fv_core.res.tile1.nc ............ALT CHECK......ERROR

DeniseWorthen commented 6 months ago

Running these four tests again gave a pass for the conus13 and the netcdf_parallel_debug. The other two again had ERRORS, but on different .nc files. These tests appear to be unstable on hercules.

DusanJovic-NOAA commented 6 months ago

Running these four tests again gave a pass for the conus13 and the netcdf_parallel_debug. The other two again had ERRORS, but on different .nc files. These tests appear to be unstable on hercules.

In those cases that fail, do you (always) see 'HDF error'?

DeniseWorthen commented 6 months ago

@DusanJovic-NOAA Yes, the failed cases seem to always show

ERROR /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/cache/build_stage/spack-stage-nccmp-1.9.0.1-4n5sfwacmwzksu4hkop5vwvjpqowwa3o/spack-src/src/nccmp_data.c:3675 NetCDF: HDF error

My run directory is

/work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/

DusanJovic-NOAA commented 6 months ago

It seems the file /work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc is either corrupted, or there's a bug in HDF5 library, or both.

When I run ncdump I see:

$ ncdump /work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc
ncdump: /work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc: NetCDF: HDF error

Same error message we see when we run nccmp. However when I run h5dump, I see:

$ h5dump /work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc
h5dump error: internal error (file /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/cache/build_stage/spack-stage-hdf5-1.14.0-4qmsxztujdfpvzjzay4dyr2d2vxd352n/spack-src/tools/src/h5dump/h5dump.c:line 1525)
HDF5: infinite loop closing library
      L,T_top,P,P,Z,FD,VL,VL,PL,E,SL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL

I'm not sure how to interpret this error message: an HDF5 library bug, an MPI/compiler bug, a system/filesystem issue, or something else.
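
One way to narrow it down might be header-only dumps, on the assumption that they succeed when only the variable data, not the file metadata, is damaged:

ncdump -h atmf000.nc   # netCDF metadata only, no variable data
h5dump -H atmf000.nc   # HDF5 structure only, no dataset data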

junwang-noaa commented 6 months ago

@DeniseWorthen @DusanJovic-NOAA So far I have noticed this error only on Hercules. We have had hdf5 1.14.0 installed on other platforms for a while and I have never seen this error. I think we need to report the problem to the Hercules system admins.

jkbk2004 commented 6 months ago

@climbfuji @ulmononian FYI: there are some issues with HDF5/nccmp on Hercules: random failures when it writes netCDF files.

climbfuji commented 6 months ago

@ulmononian Can EPIC look into this?

DeniseWorthen commented 6 months ago

@DusanJovic-NOAA I went back and looked at the all the hercules logs. The first time that the control_wrtGauss_netcdf_parallel_intel has the alt-check error is in your PR #1990

https://github.com/ufs-community/ufs-weather-model/blob/f6918a10f5d16465d7e523e9741eb541a7a3379f/tests/logs/RegressionTests_hercules.log#L1976

DusanJovic-NOAA commented 6 months ago

@DusanJovic-NOAA I went back and looked at the all the hercules logs. The first time that the control_wrtGauss_netcdf_parallel_intel has the alt-check error is in your PR #1990

https://github.com/ufs-community/ufs-weather-model/blob/f6918a10f5d16465d7e523e9741eb541a7a3379f/tests/logs/RegressionTests_hercules.log#L1976

That's the PR in which we added netcdf quantization. Maybe that is somehow triggering this HDF error, but why only on Hercules?
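
One quick sanity check (a sketch; it assumes the quantization settings show up as the per-variable _Quantize* attributes that netcdf-c adds) would be to confirm the affected files actually have quantization enabled:

ncdump -h atmf024.nc | grep -i quantize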

DeniseWorthen commented 6 months ago

Maybe the HDF version that @junwang-noaa pointed out?

junwang-noaa commented 6 months ago

@ulmononian Would you please check if netcdf is built with the zstandard library on Hercules, or if there is any difference in the netcdf build between Hercules and Orion?

ulmononian commented 6 months ago

@junwang-noaa let me take a look.

ulmononian commented 6 months ago

Can I ask why the stack being used is /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1 and not /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0?

junwang-noaa commented 6 months ago

That is a good question. We have this module file in develop branch:

https://github.com/ufs-community/ufs-weather-model/blob/develop/modulefiles/ufs_hercules.intel.lua#L5

@DeniseWorthen which version of the UFS WM develop branch are you using in the test you showed in the issue description?

DeniseWorthen commented 6 months ago

The initial issue was created using the develop branch.

The latest results I posted were from my CICE PR branch, but none of the failed tests contain CICE, so I don't think that is the issue. The CICE PR would be using the standard Hercules modules from develop.

DeniseWorthen commented 6 months ago

Was nccmp built with the "rc1" stack? Isn't that the only possibility?

ulmononian commented 6 months ago

If I load the version of spack-stack used in wm develop (i.e. /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/modulefiles/Core) via:

module use /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/modulefiles/Core
module load stack-intel/2021.9.0
module load stack-intel-oneapi-mpi/2021.9.0
module load stack-python/3.10.8

module spider nccmp gives nccmp/1.9.0.1. module show nccmp/1.9.0.1 shows:

   /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/modulefiles/intel-oneapi-mpi/2021.9.0/intel/2021.9.0/nccmp/1.9.0.1.lua:
----------------------------------------------------------------------------
whatis("Name : nccmp")
whatis("Version : 1.9.0.1")
whatis("Target : icelake")
whatis("Short description : Compare NetCDF Files")
whatis("Configure options : -DCMAKE_C_FLAGS:STRING=-std=c99")
help([[Name   : nccmp]])
help([[Version: 1.9.0.1]])
help([[Target : icelake]])
help()
help([[Compare NetCDF Files]])
depends_on("netcdf-c/4.9.2")
prepend_path("PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/nccmp-1.9.0.1-4n5sfwa/bin")
prepend_path("CMAKE_PREFIX_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/nccmp-1.9.0.1-4n5sfwa/.")
prepend_path("PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/nccmp-1.9.0.1-4n5sfwa/bin")
prepend_path("CMAKE_PREFIX_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/nccmp-1.9.0.1-4n5sfwa/.")
setenv("nccmp_ROOT","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/nccmp-1.9.0.1-4n5sfwa")

So it does not seem that nccmp should be associated with the "rc1" stack.
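
One more check that might help, with the 1.5.0 modules loaded as above: see what the loaded nccmp resolves to and which netcdf/hdf5 libraries it actually links at run time (a sketch, assuming nccmp is dynamically linked):

which nccmp
ldd $(which nccmp) | grep -Ei 'netcdf|hdf5'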

DeniseWorthen commented 6 months ago

@ulmononian I really don't understand what is going on. The error message we get using nccmp in the RT script is

ERROR /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/cache/build_stage/spack-stage-nccmp-1.9.0.1-4n5sfwacmwzksu4hkop5vwvjpqowwa3o/spack-src/src/nccmp_data.c:3675 NetCDF: HDF error

So what does the "1.5.0-rc1" in that path refer to?

ulmononian commented 6 months ago

That is a different stack entirely. Is there somewhere in your branch or in the script where nccmp is being called from another stack?

climbfuji commented 6 months ago

The rc1 stack was a release candidate stack for acceptance testing / identifying problems.

junwang-noaa commented 6 months ago

Please see the error message from h5dump in Dusan's message; it also points to spack-stack 1.5.0-rc1.

ulmononian commented 6 months ago

Can someone run their file diff checks using the nccmp from /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0? The module file is /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/modulefiles/intel-oneapi-mpi/2021.9.0/intel/2021.9.0/nccmp/1.9.0.1.lua.

load it via

module use /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/modulefiles/Core
module load stack-intel/2021.9.0
module load stack-intel-oneapi-mpi/2021.9.0
module load stack-python/3.10.8

module load nccmp/1.9.0.1

Somewhere in the CICE PR branch or in develop, the wrong stack is being called.
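
Putting that together, a minimal end-to-end check might look like this (the two atmf024.nc arguments are placeholders for the baseline and run-directory copies; the nccmp options are the ones the RT script uses):

module purge
module use /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/modulefiles/Core
module load stack-intel/2021.9.0 stack-intel-oneapi-mpi/2021.9.0 stack-python/3.10.8
module load nccmp/1.9.0.1
which nccmp
nccmp -d -S -q -f -B --Attribute=checksum --warn=format baseline/atmf024.nc rt_run/atmf024.nc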

junwang-noaa commented 6 months ago

Please see the error message from h5dump in Dusan's message; it also points to spack-stack 1.5.0-rc1. Are the netcdf/hdf5 modules in spack-stack 1.5.0 pointing to the correct spack-stack version?

ulmononian commented 6 months ago

with spack-stack/1.5.0 loaded,

module show hdf5/1.14.0:

[cbook@hercules-login-1 ~]$ module show hdf5/1.14.0
------------------------------------------------------------------------------------------------------------
   /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/modulefiles/intel-oneapi-mpi/2021.9.0/intel/2021.9.0/hdf5/1.14.0.lua:
------------------------------------------------------------------------------------------------------------
whatis("Name : hdf5")
whatis("Version : 1.14.0")
whatis("Target : icelake")
whatis("Short description : HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. ")
whatis("Configure options : -DALLOW_UNSUPPORTED:BOOL=ON -DHDF5_BUILD_EXAMPLES:BOOL=OFF -DBUILD_TESTING:BOOL=OFF -DHDF5_ENABLE_MAP_API:BOOL=OFF -DHDF5_ENABLE_Z_LIB_SUPPORT:BOOL=ON -DHDF5_ENABLE_SZIP_SUPPORT:BOOL=ON -DHDF5_ENABLE_SZIP_ENCODING:BOOL=ON -DBUILD_SHARED_LIBS:BOOL=ON -DONLY_SHARED_LIBS:BOOL=OFF -DHDF5_ENABLE_PARALLEL:BOOL=ON -DHDF5_ENABLE_THREADSAFE:BOOL=OFF -DHDF5_BUILD_HL_LIB:BOOL=ON -DHDF5_BUILD_CPP_LIB:BOOL=OFF -DHDF5_BUILD_FORTRAN:BOOL=ON -DHDF5_BUILD_JAVA:BOOL=OFF -DHDF5_BUILD_TOOLS:BOOL=ON -DMPI_CXX_COMPILER:PATH=/apps/spack-managed/oneapi-2023.1.0/intel-oneapi-mpi-2021.9.0-a66eaipzsnyrdgaqzxmqmqz64qzvhkse/mpi/2021.9.0/bin/mpiicpc -DMPI_C_COMPILER:PATH=/apps/spack-managed/oneapi-2023.1.0/intel-oneapi-mpi-2021.9.0-a66eaipzsnyrdgaqzxmqmqz64qzvhkse/mpi/2021.9.0/bin/mpiicc -DCMAKE_CXX_COMPILER:PATH=/apps/spack-managed/oneapi-2023.1.0/intel-oneapi-mpi-2021.9.0-a66eaipzsnyrdgaqzxmqmqz64qzvhkse/mpi/2021.9.0/bin/mpiicpc -DCMAKE_C_COMPILER:PATH=/apps/spack-managed/oneapi-2023.1.0/intel-oneapi-mpi-2021.9.0-a66eaipzsnyrdgaqzxmqmqz64qzvhkse/mpi/2021.9.0/bin/mpiicc -DMPI_Fortran_COMPILER:PATH=/apps/spack-managed/oneapi-2023.1.0/intel-oneapi-mpi-2021.9.0-a66eaipzsnyrdgaqzxmqmqz64qzvhkse/mpi/2021.9.0/bin/mpiifort -DCMAKE_Fortran_COMPILER:PATH=/apps/spack-managed/oneapi-2023.1.0/intel-oneapi-mpi-2021.9.0-a66eaipzsnyrdgaqzxmqmqz64qzvhkse/mpi/2021.9.0/bin/mpiifort")
help([[Name   : hdf5]])
help([[Version: 1.14.0]])
help([[Target : icelake]])
help()
help([[HDF5 is a data model, library, and file format for storing and managing
data. It supports an unlimited variety of datatypes, and is designed for
flexible and efficient I/O and for high volume and complex data.]])
depends_on("zlib/1.2.13")
prepend_path("PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/bin")
prepend_path("LD_LIBRARY_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/lib")
prepend_path("DYLD_LIBRARY_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/lib")
prepend_path("CPATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/include")
prepend_path("PKG_CONFIG_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/lib/pkgconfig")
prepend_path("CMAKE_PREFIX_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/.")
prepend_path("PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/bin")
prepend_path("PKG_CONFIG_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/lib/pkgconfig")
prepend_path("CMAKE_PREFIX_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/.")
append_path("LD_LIBRARY_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/lib")
setenv("hdf5_ROOT","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt")
setenv("HDF5_DIR","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt")

module show netcdf-c/4.9.2:

------------------------------------------------------------------------------------------------------------
   /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/modulefiles/intel-oneapi-mpi/2021.9.0/intel/2021.9.0/netcdf-c/4.9.2.lua:
------------------------------------------------------------------------------------------------------------
whatis("Name : netcdf-c")
whatis("Version : 4.9.2")
whatis("Target : icelake")
whatis("Short description : NetCDF (network Common Data Form) is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. This is the C distribution.")
help([[Name   : netcdf-c]])
help([[Version: 4.9.2]])
help([[Target : icelake]])
help()
help([[NetCDF (network Common Data Form) is a set of software libraries and
machine-independent data formats that support the creation, access, and
sharing of array-oriented scientific data. This is the C distribution.]])
depends_on("c-blosc/1.21.4")
depends_on("curl/8.1.2")
depends_on("hdf5/1.14.0")
depends_on("zlib/1.2.13")
depends_on("zstd/1.5.2")
prepend_path("PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/bin")
prepend_path("MANPATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/share/man")
prepend_path("LD_LIBRARY_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/lib")
prepend_path("DYLD_LIBRARY_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/lib")
prepend_path("CPATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/include")
prepend_path("PKG_CONFIG_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/lib/pkgconfig")
prepend_path("CMAKE_PREFIX_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/.")
prepend_path("PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/bin")
prepend_path("MANPATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/share/man")
prepend_path("PKG_CONFIG_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/lib/pkgconfig")
prepend_path("CMAKE_PREFIX_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/.")
append_path("HDF5_PLUGIN_PATH","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx/plugins")
setenv("netcdf_c_ROOT","/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/instal
l/intel/2021.9.0/netcdf-c-4.9.2-blbiwxx")
append_path("MANPATH","")

No reference to "rc1"... not sure what is going on yet.

ulmononian commented 6 months ago

netcdf-c/4.9.2 does depend on zstd for spack-stack/1.5.0, btw.

DusanJovic-NOAA commented 6 months ago
$ which  h5dump
/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/bin/h5dump

$ /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/bin/h5dump /work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc
h5dump error: internal error (file /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/cache/build_stage/spack-stage-hdf5-1.14.0-4qmsxztujdfpvzjzay4dyr2d2vxd352n/spack-src/tools/src/h5dump/h5dump.c:line 1525)
HDF5: infinite loop closing library
      L,T_top,P,P,Z,FD,VL,VL,PL,E,SL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL

Running the h5dump executable from spack-stack-1.5.0 points to source code under spack-stack-1.5.0-rc1/cache/build_stage when printing the error message. Is this correct?

junwang-noaa commented 6 months ago

I think that might be what happened.

climbfuji commented 6 months ago
$ which  h5dump
/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/bin/h5dump

$ /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/bin/h5dump /work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc
h5dump error: internal error (file /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/cache/build_stage/spack-stage-hdf5-1.14.0-4qmsxztujdfpvzjzay4dyr2d2vxd352n/spack-src/tools/src/h5dump/h5dump.c:line 1525)
HDF5: infinite loop closing library
      L,T_top,P,P,Z,FD,VL,VL,PL,E,SL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL

Running the h5dump executable from spack-stack-1.5.0 points to source code under spack-stack-1.5.0-rc1/cache/build_stage when printing the error message. Is this correct?

I know what's going on here, and it may or may not be the issue. When we installed spack-stack-1.5.0, we created a binary cache based on 1.5.0-rc1 and installed from that. This is "usually" safe, unless changes are made in spack itself that the spack hash algorithm doesn't see and therefore still thinks the packages are the same.
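
One way to confirm that from the installed stack (a sketch; the assumption is that the rc1 build-stage path was baked into the binaries as compile-time source-file strings) is to look for it directly:

strings /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/bin/h5dump | grep -m1 'spack-stack-1.5.0-rc1'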

@DusanJovic-NOAA Is it possible to test with spack-stack-1.5.1, just to see if the problem goes away? I know for sure we built 1.5.1 from source.

junwang-noaa commented 6 months ago

@climbfuji Probably not. I tried spack-stack 1.5.1 with fms 2023.03 several days ago, and I was also puzzled why the control_wrtGauss_netcdf_parallel_intel test failed.

baseline dir = /work/noaa/epic/hercules/UFS-WM_RT/NEMSfv3gfs/develop-20231122/control_wrtGauss_netcdf_parallel_intel
working dir  = /work2/noaa/stmp/junwang/stmp/junwang/FV3_RT/rt_3585232/control_wrtGauss_netcdf_parallel_intel
Checking test 034 control_wrtGauss_netcdf_parallel_intel results ....
 Comparing sfcf000.nc ............ALT CHECK......OK
 Comparing sfcf024.nc .........OK
 Comparing atmf000.nc ............ALT CHECK......ERROR

I have this in my module file:

prepend_path("MODULEPATH", "/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.1/envs/unified-env/install/modulefiles/Core")

stack_intel_ver=os.getenv("stack_intel_ver") or "2021.9.0"
load(pathJoin("stack-intel", stack_intel_ver))

stack_impi_ver=os.getenv("stack_impi_ver") or "2021.9.0"
load(pathJoin("stack-intel-oneapi-mpi", stack_impi_ver))

cmake_ver=os.getenv("cmake_ver") or "3.23.1"
load(pathJoin("cmake", cmake_ver))

load("ufs_common")

nccmp_ver=os.getenv("nccmp_ver") or "1.9.0.1"
load(pathJoin("nccmp", nccmp_ver))

My code directory is at: /work/noaa/nems/junwang/ufs-weather/20231106/new/ufs-weather-model

ulmononian commented 6 months ago

@DusanJovic-NOAA if you can test with 1.5.1 as @climbfuji suggests, you can load it directly via

module use /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.1/envs/unified-env/install/modulefiles/Core
module load stack-intel/2021.9.0
module load stack-intel-oneapi-mpi/2021.9.0
module load stack-python/3.10.8

or just replace the key paths in the hercules intel lua file.

edit: sorry, just saw @junwang-noaa's comment...

DusanJovic-NOAA commented 6 months ago
$ module purge

$ module use /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.1/envs/unified-env/install/modulefiles/Core

$ module load stack-intel/2021.9.0 stack-intel-oneapi-mpi/2021.9.0 hdf5/1.14.0

$ ml

Currently Loaded Modules:
  1) intel-oneapi-compilers/2023.1.0   2) stack-intel/2021.9.0   3) intel-oneapi-mpi/2021.9.0   4) stack-intel-oneapi-mpi/2021.9.0   5) zlib/1.2.13   6) hdf5/1.14.0

$ which h5dump
/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.1/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/bin/h5dump

$ /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.1/envs/unified-env/install/intel/2021.9.0/hdf5-1.14.0-4qmsxzt/bin/h5dump /work2/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/rt_3821674/control_wrtGauss_netcdf_parallel_intel/atmf000.nc
h5dump error: internal error (file /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.5.0-rc1/cache/build_stage/spack-stage-hdf5-1.14.0-4qmsxztujdfpvzjzay4dyr2d2vxd352n/spack-src/tools/src/h5dump/h5dump.c:line 1525)
HDF5: infinite loop closing library
      L,T_top,P,P,Z,FD,VL,VL,PL,E,SL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL,FL

Looks like hdf5 in stack 1.5.1 was also built 'using cache' from 1.5.0-rc1.

DusanJovic-NOAA commented 6 months ago

I just replaced '1.5.0' with '1.5.1' in ufs_hercules.intel.lua, and:

$ module load ufs_hercules.intel 
Lmod has detected the following error:  These module(s) or extension(s) exist but cannot be loaded as requested: "gftl-shared/1.5.0", "esmf/8.4.2", "mapl/2.35.2-esmf-8.4.2"
   Try: "module spider gftl-shared/1.5.0 esmf/8.4.2 mapl/2.35.2-esmf-8.4.2" to see how to load the module(s).

So unless we also update esmf and friends we cannot quickly just try 1.5.1, which I'm skeptical will work anyway based on @junwang-noaa's comment above.

Is it worth trying a full stack 1.5.0 rebuild, but making sure no 'old caches are used'?

climbfuji commented 6 months ago

I just replaced '1.5.0' with '1.5.1' in ufs_hercules.intel.lua, and:

$ module load ufs_hercules.intel 
Lmod has detected the following error:  These module(s) or extension(s) exist but cannot be loaded as requested: "gftl-shared/1.5.0", "esmf/8.4.2", "mapl/2.35.2-esmf-8.4.2"
   Try: "module spider gftl-shared/1.5.0 esmf/8.4.2 mapl/2.35.2-esmf-8.4.2" to see how to load the module(s).

So unless we also update esmf and friends we cannot quickly just try 1.5.1, which I'm skeptical will work anyway based on @junwang-noaa's comment above.

Is it worth trying a full stack 1.5.0 rebuild, but making sure no 'old caches are used'?

Yes, absolutely. I will do that over the weekend for both 1.5.0 and 1.5.1. I do have another suspicion that I will check while I am at it. Apologies for these issues - hopefully I can give you an answer on Monday as to why both 1.5.0 and 1.5.1 have the rc1 in the source file paths.