SamuelTrahanNOAA closed this issue 2 months ago.
I'm pinging @DusanJovic-NOAA and @junwang-noaa hoping they have some guesses.
Do we know which MPI rank returns from nf90_enddef routine early?
In my last run, it was different. Some of them exited, and others got stuck. It wasn't only 1.
In the collapsed details, the ranks with:
- ENTER PROBLEMATIC ENDDEF entered the enddef but never exited
- EXIT PROBLEMATIC ENDDEF entered the enddef and exited while other ranks were waiting forever

Ok, thanks. I do not see any pattern in this rank sequence between ranks that got stuck and those that successfully returned from nf90_enddef.
In your description I see you mentioned that compression had no effect on how often this happens, but the number of variables written does have an effect. It also seems that in configurations with smaller domain sizes this does not happen, or not as frequently. So maybe it's worth trying different (smaller) chunk sizes.
I personally haven't run those tests, and I know little about the model_configure options for chunking and compression. Can you suggest combinations of options to try in model_configure?
Here are the relevant lines in my last run. The zstandard_level 4
was my change; that option is absent in the real-time RRFS parallels (which have the same bug). I added compression to speed up testing.
zstandard_level: 4
ideflate: 0
quantize_mode: quantize_bitround
quantize_nsd: 0
ichunk2d: -1
jchunk2d: -1
ichunk3d: -1
jchunk3d: -1
kchunk3d: -1
ichunk2d = -1 (and likewise for all the other chunk options) means the model will set the chunk size equal to the output grid size in the corresponding direction. Try setting ichunk2d/jchunk2d to half of the output grid size, for example, and similarly for ichunk3d/jchunk3d/kchunk3d. kchunk3d can be, for example, half the number of vertical layers.
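For example, assuming a hypothetical output grid of 4500 x 2700 points with 65 vertical layers (the actual RRFS grid dimensions may differ), halving each dimension would give settings like these in model_configure:

```
ichunk2d: 2250
jchunk2d: 1350
ichunk3d: 2250
jchunk3d: 1350
kchunk3d: 32
```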
To be honest, I do not see how or why this would make any difference in whether nf90_enddef hangs, but who knows.
I found that the model always hangs while writing the physics history file(s) (phyf???.nc). These files have about 260 variables. As you suggested, reducing the number of the output variables in physics seems to help avoid the hangs in nf90_enddef.
Instead of commenting some variables in diag_table, I made this change:
diff --git a/io/module_write_netcdf.F90 b/io/module_write_netcdf.F90
index d9d8ff9..3c3f5e0 100644
--- a/io/module_write_netcdf.F90
+++ b/io/module_write_netcdf.F90
@@ -477,6 +477,11 @@ contains
ncerr = nf90_put_att(ncid, varids(i), 'grid_mapping', 'cubed_sphere'); NC_ERR_STOP(ncerr)
end if
+ if (modulo(i,200) == 0) then
+ ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
+ ncerr = nf90_redef(ncid); NC_ERR_STOP(ncerr)
+ endif
+
end do ! i=1,fieldCount
ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
This change ends the define mode after 200 variables, and immediately reenters the define mode and continues adding the rest of the variables. It seems to work (no hangs) in several test runs I made (on wcoss2). There is nothing special about the number 200; I just chose it randomly to avoid ending/reentering the define mode for files which have fewer variables.
Can you please try this change with your code/setup on both wcoss2 and jet?
And here are the timings of all history/restart writes from one of my test runs on wcoss2:
dynf000.nc write time is 26.45372 at fcst 00:00
phyf000.nc write time is 34.18413 at fcst 00:00
------- total write time is 60.79570 at Fcst 00:00
dynf001.nc write time is 27.38813 at fcst 01:00
phyf001.nc write time is 36.25545 at fcst 01:00
RESTART/20240304.160000.fv_core.res.tile1.nc write time is 11.98606 at fcst 01:00
RESTART/20240304.160000.fv_srf_wnd.res.tile1.nc write time is 1.12703 at fcst 01:00
RESTART/20240304.160000.fv_tracer.res.tile1.nc write time is 24.19673 at fcst 01:00
RESTART/20240304.160000.phy_data.nc write time is 37.15952 at fcst 01:00
RESTART/20240304.160000.sfc_data.nc write time is 16.17145 at fcst 01:00
------- total write time is 154.44860 at Fcst 01:00
dynf002.nc write time is 29.14509 at fcst 02:00
phyf002.nc write time is 36.68917 at fcst 02:00
RESTART/20240304.170000.fv_core.res.tile1.nc write time is 12.03668 at fcst 02:00
RESTART/20240304.170000.fv_srf_wnd.res.tile1.nc write time is 1.70183 at fcst 02:00
RESTART/20240304.170000.fv_tracer.res.tile1.nc write time is 25.06961 at fcst 02:00
RESTART/20240304.170000.phy_data.nc write time is 35.79864 at fcst 02:00
RESTART/20240304.170000.sfc_data.nc write time is 15.21344 at fcst 02:00
------- total write time is 155.85170 at Fcst 02:00
dynf003.nc write time is 27.02799 at fcst 03:00
phyf003.nc write time is 36.10061 at fcst 03:00
------- total write time is 63.29045 at Fcst 03:00
dynf004.nc write time is 26.55296 at fcst 04:00
phyf004.nc write time is 36.55510 at fcst 04:00
------- total write time is 63.26967 at Fcst 04:00
dynf005.nc write time is 26.85602 at fcst 05:00
phyf005.nc write time is 36.89835 at fcst 05:00
------- total write time is 63.91559 at Fcst 05:00
dynf006.nc write time is 27.17454 at fcst 06:00
phyf006.nc write time is 38.85850 at fcst 06:00
------- total write time is 66.19458 at Fcst 06:00
dynf007.nc write time is 26.85234 at fcst 07:00
phyf007.nc write time is 36.73923 at fcst 07:00
------- total write time is 63.75226 at Fcst 07:00
dynf008.nc write time is 28.33648 at fcst 08:00
phyf008.nc write time is 39.37756 at fcst 08:00
------- total write time is 68.01713 at Fcst 08:00
dynf009.nc write time is 26.56586 at fcst 09:00
phyf009.nc write time is 37.22793 at fcst 09:00
------- total write time is 63.95545 at Fcst 09:00
dynf010.nc write time is 27.55396 at fcst 10:00
phyf010.nc write time is 37.40796 at fcst 10:00
------- total write time is 65.12306 at Fcst 10:00
dynf011.nc write time is 28.12703 at fcst 11:00
phyf011.nc write time is 38.63406 at fcst 11:00
------- total write time is 66.92263 at Fcst 11:00
dynf012.nc write time is 26.92893 at fcst 12:00
phyf012.nc write time is 35.51953 at fcst 12:00
------- total write time is 62.60945 at Fcst 12:00
dynf013.nc write time is 27.23213 at fcst 13:00
phyf013.nc write time is 39.34664 at fcst 13:00
------- total write time is 66.74036 at Fcst 13:00
dynf014.nc write time is 30.29397 at fcst 14:00
phyf014.nc write time is 40.22186 at fcst 14:00
------- total write time is 70.67712 at Fcst 14:00
dynf015.nc write time is 26.69101 at fcst 15:00
phyf015.nc write time is 36.06051 at fcst 15:00
------- total write time is 62.91315 at Fcst 15:00
dynf016.nc write time is 27.40320 at fcst 16:00
phyf016.nc write time is 36.25180 at fcst 16:00
------- total write time is 63.81565 at Fcst 16:00
dynf017.nc write time is 26.70780 at fcst 17:00
phyf017.nc write time is 34.18888 at fcst 17:00
------- total write time is 61.05879 at Fcst 17:00
dynf018.nc write time is 27.22682 at fcst 18:00
phyf018.nc write time is 35.03558 at fcst 18:00
------- total write time is 62.42384 at Fcst 18:00
This did not fix my test case on Jet. Some of the ranks still froze in the nf90_enddef. They froze in the same enddef as before, not the new one you added.
I have a test case on hera now. The PR description has been updated with the path.
Hera: /scratch2/BMC/wrfruc/Samuel.Trahan/rrfs/sudheer-case
Thanks. I'm running that test case on Hera right now with this change (diff is against current head of develop branch):
diff --git a/io/module_write_netcdf.F90 b/io/module_write_netcdf.F90
index d9d8ff9..d3a3433 100644
--- a/io/module_write_netcdf.F90
+++ b/io/module_write_netcdf.F90
@@ -341,7 +341,12 @@ contains
if (lsoil > 1) dimids_soil = [im_dimid,jm_dimid,lsoil_dimid, time_dimid]
end if
+ ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
+
do i=1, fieldCount
+
+ ncerr = nf90_redef(ncid); NC_ERR_STOP(ncerr)
+
call ESMF_FieldGet(fcstField(i), name=fldName, rank=rank, typekind=typekind, rc=rc); ESMF_ERR_RETURN(rc)
par_access = NF90_INDEPENDENT
@@ -477,11 +482,11 @@ contains
ncerr = nf90_put_att(ncid, varids(i), 'grid_mapping', 'cubed_sphere'); NC_ERR_STOP(ncerr)
end if
+ ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
+
end do ! i=1,fieldCount
- ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
end if
- ! end of define mode
!
! write dimension variables and lon,lat variables
Here, for every variable, we enter and leave define mode. So far the first 4 files (phyf000, 001, 002 and 003) were written without hangs in nf90_enddef.
My run directory is: /scratch1/NCEPDEV/stmp2/Dusan.Jovic/sudheer-case
According to the nc_enddef documentation here, specifically:
It's not necessary to call nc_enddef() for netCDF-4 files. With netCDF-4 files, nc_enddef() is called when needed by the netcdf-4 library.
which means we do not need to call nf90_redef/nf90_enddef at all, since the history files are netCDF-4 files, created with NF90_NETCDF4 mode. @edwardhartnett can you confirm this?
I'll try to remove all nf90_redef/nf90_enddef calls and see what happens.
@DusanJovic-NOAA you are correct, a file created with NC_NETCDF4 does not need to call enddef(), but I believe redef() must still be called.
For example, if you define some metadata, and then call nc_put_vara_float() (or some other data-writing function), then netCDF-4 will notice that you have not called nc_enddef(), and will call it for you.
But does that work for nc_redef()? I don't think so.
However, whether called explicitly by the programmer, or internally by the netCDF library, enddef()/redef() is an expensive operation. All buffers are flushed to disk. So try to write all your metadata (including all attributes), then write data. Don't switch back and forth.
In the case of the fragment of the code I see here, it seems like there's a loop:
for some cases
    redef()
    write attribute
    enddef()
    write data
end
What would be better would be two loops, the first to write all the attributes, the second to do all the data writes.
redef()
for some cases
    write attribute
end
enddef()
for some cases
    write data
end
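Applied to the loop in module_write_netcdf, that restructuring would look roughly like the following. This is a sketch only: ncid, varids, fieldCount, and the NC_ERR_STOP macro come from the module, while fieldData(i)%ptr is a hypothetical stand-in for however the field's data array is actually obtained.

```fortran
! Pass 1: stay in define mode -- write all attributes, no data
do i = 1, fieldCount
   ncerr = nf90_put_att(ncid, varids(i), 'grid_mapping', 'cubed_sphere'); NC_ERR_STOP(ncerr)
end do

! Leave define mode exactly once; metadata buffers flush to disk here
ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)

! Pass 2: data mode only -- write all variable data
do i = 1, fieldCount
   ncerr = nf90_put_var(ncid, varids(i), fieldData(i)%ptr); NC_ERR_STOP(ncerr)
end do
```

The point of the design is that the expensive enddef()/redef() flush happens once per file instead of once per variable.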
All of the variable data is written in a later loop except the dimension variables. Those are written in calls to subroutine add_dim inside the metadata-defining loop. It does have the required call to nf90_redef.
if (lm > 1) then
call add_dim(ncid, "pfull", pfull_dimid, wrtgrid, mype, rc)
call add_dim(ncid, "phalf", phalf_dimid, wrtgrid, mype, rc)
... more of the same ...
subroutine add_dim(ncid, dim_name, dimid, grid, mype, rc)
...
ncerr = nf90_def_var(ncid, dim_name, NF90_REAL8, dimids=[dimid], varid=dim_varid); NC_ERR_STOP(ncerr)
...
ncerr = nf90_enddef(ncid=ncid); NC_ERR_STOP(ncerr)
ncerr = nf90_put_var(ncid, dim_varid, values=valueListR8); NC_ERR_STOP(ncerr)
ncerr = nf90_redef(ncid=ncid); NC_ERR_STOP(ncerr)
@edwardhartnett Thanks for the confirmation.
@SamuelTrahanNOAA Yes, all variables are written in the second loop over all fields, after all dimensions and attributes are defined and written. The only exceptions are the 4 'dimension variables', or coordinates (pfull, phalf, zsoil and time), in which case we define them, end define mode, write the coordinate values, and reenter define mode. But those are small variables, and I do not think it costs a lot to exit/reenter define mode, since there are just 4 of them and no other large variables have been written yet, if that has any impact on performance at all.
I'll run the test now with all enddef/redef calls removed to see if that works.
Documentation of nc_redef says:
For netCDF-4 files (i.e. files created with NC_NETCDF4 in the cmode in their call to nc_create()), it is not necessary to call nc_redef() unless the file was also created with NC_STRICT_NC3. For straight-up netCDF-4 files, nc_redef() is called automatically, as needed.
OK, so you could take out the redef() and enddef().
Usually when netCDF hangs on a parallel operation it's because a collective operation is done, but not all tasks participated. Are all programs running this metadata code?
A way to test that is to put an MPI_Barrier before each NetCDF call.
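A minimal sketch of that instrumentation, assuming the write component's existing mpi_comm communicator, ncid, and NC_ERR_STOP macro (mpierr would be a local integer):

```fortran
! Debugging aid: if any rank skips this collective netCDF call, the hang
! will move from inside the library to the barrier itself, which pinpoints
! exactly where the ranks diverge.
call MPI_Barrier(mpi_comm, mpierr)
ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
```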
Without any explicit call to nf90_redef/nf90_enddef, the model works fine for about 5 hours but then hangs while writing a physics history file. The last file (forecast hour 6) is only partially written (~30 MB) before the model hangs:
-rw-r--r-- 1 Dusan.Jovic h-nems 1685751925 Mar 11 18:25 phyf000.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 1865073247 Mar 11 18:29 phyf001.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 1878394918 Mar 11 18:33 phyf002.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 1881375125 Mar 11 18:37 phyf003.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 1876109574 Mar 11 18:41 phyf004.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 1879258803 Mar 11 18:46 phyf005.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 30817232 Mar 11 18:49 phyf006.nc
ncdump -h of phyf006.nc prints all metadata and exits without any error. Also, comparing metadata and global attributes with nccmp reports no difference between the 005 and 006 files:
nccmp -mg phyf005.nc phyf006.nc
Have we reached the point where we should involve NetCDF and HDF5 developers in this conversation?
Let me try your suggestion to insert an MPI_Barrier before each NetCDF call.
Now it hangs on the second history file (phyf001.nc):
-rw-r--r-- 1 Dusan.Jovic h-nems 1685751925 Mar 11 19:29 phyf000.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 30817232 Mar 11 19:33 phyf001.nc
Interestingly the file size is exactly the same (30817232 bytes) as in the previous run where model hangs at phyf006.nc. It also never hangs while writing dynf???.nc files, always at phyf???.nc.
Do you know where it is hanging?
You can find out by sshing to one of the compute nodes running your job. Then start gdb on a running process. It may take a few tries to figure out which ranks are associated with the frozen quilt server.
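A sketch of that procedure (node name and PID here are hypothetical placeholders; the executable name and job-query commands vary by site and batch system):

```
ssh <compute-node>                               # a node listed for your job, e.g. from squeue
pgrep -u $USER fv3.exe                           # find candidate MPI rank processes on that node
gdb -p <pid> -batch -ex 'thread apply all bt'    # dump every thread's stack for one rank
```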
Interestingly the file size is exactly the same (30817232 bytes) as in the previous run where model hangs at phyf006.nc.
I suspect this is the size of the file's metadata.
The only thing that seems to help avoid the hangs is reducing the number of fields written out in the history file. At this moment, writing out all the fields specified in 'diag_table' creates 260 variables. What is special about 260? It is just slightly larger than 256. Could it be that 256 is, for whatever reason, some kind of limit?
I'm running now with just 4 fields commented out in diag_table, the last 4, just to see what happens.
# Aerosols emission for smoke
"gfs_sfc", "emdust", "emdust", "fv3_history2d", "all", .false., "none", 2
"gfs_sfc", "coef_bb_dc", "coef_bb_dc", "fv3_history2d", "all", .false., "none", 2
"gfs_sfc", "min_fplume", "min_fplume", "fv3_history2d", "all", .false., "none", 2
"gfs_sfc", "max_fplume", "max_fplume", "fv3_history2d", "all", .false., "none", 2
"gfs_sfc", "hwp", "hwp", "fv3_history2d", "all", .false., "none", 2
#"gfs_sfc", "hwp_ave", "hwp_ave", "fv3_history2d", "all", .false., "none", 2
#"gfs_sfc", "frp_output", "frp_output", "fv3_history2d", "all", .false., "none", 2
#"gfs_phys", "ebu_smoke", "ebu_smoke", "fv3_history", "all", .false., "none", 2
#"gfs_phys", "ext550", "ext550", "fv3_history", "all", .false., "none", 2
This should create a file with 256 variables.
Disabling only the last two variables (ebu_smoke and ext550) is enough to get it to run reliably. There are other sets of variables one can remove to get it to run reliably. That's just the one I can remember off the top of my head.
Ok, so that means there is nothing special about 256 limit, which is good. That should also mean that there are no issues in nf90_* calls, since in that case (two variables less) everything works fine.
There must be an issue somewhere in there. The model freezes at an MPI_Allreduce deep within the HDF5 library.
Can we try the gnu compiler with an alternative MPI implementation and still enable NetCDF parallel?
That would eliminate the compiler and MPI implementation as sources of the problem.
We can try that on Hera or Hercules. I was running these tests on Jet.
I recompiled the model on Hera with the gnu compiler, and submitted a job, in:
$ pwd
/scratch1/NCEPDEV/stmp2/Dusan.Jovic/sudheer-case
$ ls -l phyf00*
-rw-r--r-- 1 Dusan.Jovic stmp 30785571 Mar 11 20:27 phyf000.nc
-rw-r--r-- 1 Dusan.Jovic stmp 8 Mar 11 20:27 phyf000.nc-3181117440-28572.lock
-rw-r--r-- 1 Dusan.Jovic stmp 2694034582 Mar 11 20:34 phyf001.nc
Looks like the model hangs while writing the 0-hour physics file, but I also see this new 'lock' file. Any idea what it is?
You could try:
export HDF5_USE_FILE_LOCKING=OFF
and see if that fixes it.
Is it possible that the model is trying to open the same file twice at the same time?
Is it possible that the model is trying to open the same file twice at the same time?
Do you mean at the same time from different mpi tasks? In parallel mode, all tasks open (create) a file.
No, I mean:
It would explain the problems we're seeing.
This should not happen. If it does it's a bug.
If more than one write group is used (in this case 2 groups are used), different groups will run on separate (non-overlapping) communicators.
So, in this example with 2 write groups:
phyf000.nc will be created by MPI communicator X (Ranks A, B, and C)
phyf001.nc will be created by MPI communicator Y (Ranks D, E, and F)
phyf002.nc will be created by MPI communicator X (Ranks A, B, and C); if communicator X is still busy writing the previous file, the code waits
phyf003.nc will be created by MPI communicator Y (Ranks D, E, and F); if communicator Y is still busy writing the previous file, the code waits
... etc
Perhaps you could try one write group to see if that fixes the problem? It is unlikely, but maybe we'll get lucky.
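For reference, the write-group count is set in model_configure; a single-group configuration would look something like this (the write_tasks_per_group value here is illustrative, not taken from this case):

```
write_groups: 1
write_tasks_per_group: 60
```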
Hangs with one write group as well.
I also made a change in module_write_netcdf by splitting add_dim routine into two, such that in the first routine (add_dim) dimensions are only defined, and in the second (write_dim) dimension variable data are written. The add_dim is called first, then we explicitly call nf90_enddef (even though it's not strictly necessary for netcdf-4), and then we call write_dim after we exit the define mode. This way we do not leave and reenter the define mode multiple times.
My updated module_write_netcdf is here:
https://github.com/DusanJovic-NOAA/fv3atm/blob/rrfs_write_netcdf_hangs/io/module_write_netcdf.F90
I also changed the access pattern for all variables to NF90_COLLECTIVE. I'm not sure if this is necessary or even desirable, but just for testing.
Unfortunately even with these two changes the model still hangs at random while saving physics history files.
I built the latest versions of hdf5(1.14.3), netcdf-c (4.9.2) and netcdf-fortran (4.6.1) on Hera with GNU and ran the test. I now see this error:
603: HDF5-DIAG: Error detected in HDF5 (1.14.3) MPI-process 603:
603: #000: H5A.c line 2397 in H5Aexists(): can't synchronously check if attribute exists
603: major: Attribute
603: minor: Can't get value
603: #001: H5A.c line 2368 in H5A__exists_api_common(): unable to determine if attribute exists
603: major: Attribute
603: minor: Can't get value
603: #002: H5A.c line 2328 in H5A__exists_common(): unable to determine if attribute exists
603: major: Attribute
603: minor: Can't get value
603: #003: H5VLcallback.c line 1536 in H5VL_attr_specific(): unable to execute attribute 'specific' callback
603: major: Virtual Object Layer
603: minor: Can't operate on object
603: #004: H5VLcallback.c line 1502 in H5VL__attr_specific(): unable to execute attribute 'specific' callback
603: major: Virtual Object Layer
603: minor: Can't operate on object
603: #005: H5VLnative_attr.c line 473 in H5VL__native_attr_specific(): unable to determine if attribute exists
603: major: Attribute
603: minor: Can't get value
603: #006: H5Oattribute.c line 1732 in H5O__attr_exists(): error checking for existence of attribute
603: major: Attribute
603: minor: Iteration failed
603: #007: H5Adense.c line 1679 in H5A__dense_exists(): can't search for attribute in name index
603: major: Attribute
603: minor: Object not found
603: #008: H5B2.c line 609 in H5B2_find(): can't compare btree2 records
603: major: B-Tree node
603: minor: Can't compare objects
603: #009: H5B2int.c line 104 in H5B2__locate_record(): can't compare btree2 records
603: major: B-Tree node
603: minor: Can't compare objects
603: #010: H5Abtree2.c line 264 in H5A__dense_btree2_name_compare(): can't compare btree2 records
603: major: Heap
603: minor: Can't compare objects
603: #011: H5HF.c line 662 in H5HF_op(): can't operate on 'huge' object from fractal heap
603: major: Heap
603: minor: Can't operate on object
603: #012: H5HFhuge.c line 918 in H5HF__huge_op(): unable to operate on heap object
603: major: Heap
603: minor: Can't operate on object
603: #013: H5HFhuge.c line 770 in H5HF__huge_op_real(): application's callback failed
603: major: Heap
603: minor: Can't operate on object
603: #014: H5Abtree2.c line 154 in H5A__dense_fh_name_cmp(): can't decode attribute
603: major: Object header
603: minor: Unable to decode value
603: #015: H5Omessage.c line 1636 in H5O_msg_decode(): unable to decode message
603: major: Object header
603: minor: Unable to decode value
603: #016: H5Oshared.h line 74 in H5O__attr_shared_decode(): unable to decode native message
603: major: Object header
603: minor: Unable to decode value
603: #017: H5Oattr.c line 277 in H5O__attr_decode(): ran off end of input buffer while decoding
603: major: Object header
603: minor: Address overflowed
I also repeated a test in which I commented out the last two fields in 'diag_table', and it worked for a few output hours without hanging. But we already know that this is expected to work.
When you comment out those fields, does it eliminate the error messages?
Yes.
This sounds like something specific enough to do a bug report for the NetCDF library developers.
In the err file I also see:
627: file: /scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/ufs/rrfs_netcdf_hangs/ufs-weather-model/FV3/io/module_write_netcdf.F90 line: 499 NetCDF: Problem with HDF5 dimscales.
After a little bit of grepping and printf debugging I found that the error (Problem with HDF5 dimscales) is returned from:
This function (attach_dimscales) is conditionally called from:
Looks like this attaching can be disabled based on no_dimscale_attach, which can be set to true based on NC_NODIMSCALE_ATTACH flag:
Unfortunately, netcdf-fortran does not provide a Fortran version of this constant, so I defined it locally in the write_netcdf routine as:
integer, parameter :: NF90_NODIMSCALE_ATTACH = int(Z'40000')
and when I create netcdf file as:
ncerr = nf90_create(trim(filename),&
cmode=IOR(IOR(NF90_CLOBBER,netcdf_file_type),NF90_NODIMSCALE_ATTACH),&
comm=mpi_comm, info = MPI_INFO_NULL, ncid=ncid); NC_ERR_STOP(ncerr)
and run the test, it seems to work. Obviously we need to run a full 24h forecast, many times, to verify that this indeed avoids the problem, at least temporarily, and then we need to find the correct, more permanent solution. But let's see if this is the real cause of the hangs.
I ran this test with GNU on Hera, I'll also test it with Intel.
My code updates are here:
https://github.com/NOAA-EMC/fv3atm/compare/develop...DusanJovic-NOAA:fv3atm:rrfs_write_netcdf_hangs
@DusanJovic-NOAA Thanks for debugging the issue! Once it is confirmed the issue is fixed, maybe we can ask @edwardhartnett to update the netcdf-c.
Mostly when users think they have found a bug in netCDF, they are mistaken. NetCDF code is well-tested.
If you think you have found a netCDF bug, we need a (one-file) test program which demonstrates it. This should be a unit test of the write component, and should remain a unit test once we get all this sorted out. If a future release of netCDF/HDF-5 breaks the test, you know that something important has changed and your code won't work. (And the most likely scenario is that while constructing such a test, you will find that netCDF is actually working the way it is supposed to.)
Here's an example from 2020 concerning the fv3 code: https://github.com/Unidata/netcdf-c/blob/main/nc_perf/tst_compress_par.c
In this case, there was a belief that there were bugs in netCDF relating to compression and parallel writes. I took a bunch of fv3 IO code, munged it into a one-file test, and demonstrated that netCDF was working just fine with parallel compression. I put this test in netcdf-c because there was no way at the time to put it into FV3. Lack of unit testing at this granularity for FV3 made debugging I/O a slow and painful process. Lack of unit testing costs the organization.
This write component is all about writing netCDF data. Is there a test for it? If not, now's the time to add the first test.
@DusanJovic-NOAA @SamuelTrahanNOAA we need a one-file test program which demonstrates how the write components uses netCDF parallel I/O. This will either cause you to find a bug in the write component, or demonstrate a netCDF bug. Can you produce that test?
Mostly when users think they have found a bug in netCDF, they are mistaken
No, not a bug. This is a missing feature: the Fortran library doesn't expose the NC_NODIMSCALE_ATTACH constant to Fortran.
integer, parameter :: NF90_NODIMSCALE_ATTACH = int(Z'40000')
We can make a feature request to get that added. No further proof is needed.
The lack of NODIMSCALE_ATTACH should not cause a problem.
Dimscales are a HDF5 feature used to keep track of dimensions. However, it performs poorly at scale, so I added a way to ignore dimscales. But that should already be happening for you. (That is, you do not have to turn this optimization on, it will be used for all new files automatically.)
Also, the dimscales should not get out of sync. Turning off the dimscales should improve performance opening files with many (i.e. hundreds or more) variables, but should never show the error that you found.
I suspect there is some problem in your metadata code. Or, perhaps, you have really found a problem in netCDF.
Description
The head of develop hangs while writing NetCDF output files in the write component when running the version of RRFS planned for operations. This happens regardless of the compression settings or lack thereof. The behavior is like so:
Commenting out some of the variables in the diag_table will prevent this problem. There isn't one specific set of variables that seems to cause it. Turning off the lake model or smoke model prevents the hang, but note that this disables writing of many variables.
Using one thread (no OpenMP) appears to reduce the frequency of the hangs. Increasing the write component ranks by enormous amounts appears to increase the frequency of hangs. This conclusion is uncertain since we haven't run enough tests to get a statistically representative sample set.
I have been unable to reproduce the problem when the model is compiled in debug mode.
This problem has been confirmed on Jet, Hera, and WCOSS2, but hasn't been tested on other machines.
From lots of forum searching, this problem has been identified in the distant past when the model sends different metadata on different ranks. For example, 13 variables on one rank but 14 on the others, or one rank sends three attributes and the others send five. I haven't investigated that possibility, but I don't see how it is possible in this code.
To Reproduce:
1. Executables were compiled like so:
2. Copy one of these test directories:
Jet: /lfs4/BMC/nrtrr/Samuel.Trahan/smoke/sudheer-case
Hera: /scratch2/BMC/wrfruc/Samuel.Trahan/rrfs/sudheer-case
Cactus: /lfs/h2/oar/esrl/noscrub/samuel.trahan/ming-io-hang
3. Edit the job script
Each machine's test directory contains a job.sh script. Edit it as needed to point to your code.
4. Run the job script.
Send the script to sbatch on Jet or qsub on Cactus. Do not run it on a login node.
Additional context
This problem exists in the version of RRFS planned to go operational.
Output
This stack trace comes from gdb analyzing a running write component MPI rank while it is hanging waiting for an MPI_Allreduce. The arguments in the stack trace may be meaningless because gdb has trouble interpreting Intel-compiled code. However, the line numbers and function calls should be correct. Some may have been optimized out.
stack trace of stuck MPI process
``` #0 0x00002b6eab22803a in MPIDI_SHMGR_release_generic (opcode=2893772520, mpir_comm=0x7ffca32a54c8, root=27, localbuf=0x1ec, count=-1405432336, datatype=1329139008, errflag=0x7ffca32b2548, knomial_factor=4, algo_type=MPIDI_SHMGR_ALGO_FLAT) at ../../src/mpid/ch4/src/intel/ch4_shm_coll_templates.h:206 #1 0x00002b6eab21bf85 in MPIDI_SHMGR_Release_bcast (comm=0x2b6eac7b76e8