SamuelTrahanNOAA closed this issue 2 months ago.
I'm pinging @DusanJovic-NOAA and @junwang-noaa hoping they have some guesses.
Do we know which MPI rank returns from nf90_enddef routine early?
In my last run, it was different. Some of them exited, and others got stuck. It wasn't only 1.
In the collapsed details, the ranks with:
- ENTER PROBLEMATIC ENDDEF entered the enddef but never exited
- EXIT PROBLEMATIC ENDDEF entered the enddef and exited while other ranks were waiting forever

Ok, thanks. I do not see any pattern in this rank sequence between ranks that got stuck and those that successfully returned from nf90_enddef.
In your description I see you mentioned that compression had no effect on how often this happens, but the number of variables written does have an effect. It also seems that in configurations with smaller domain sizes this does not happen, or not as frequently. So maybe it's worth trying different (smaller) chunk sizes.
I personally haven't run those tests, and I know little about the model_configure options for chunking and compression. Can you suggest combinations of options to try in model_configure?
Here are the relevant lines in my last run. The zstandard_level 4
was my change; that option is absent in the real-time RRFS parallels (which have the same bug). I added compression to speed up testing.
zstandard_level: 4
ideflate: 0
quantize_mode: quantize_bitround
quantize_nsd: 0
ichunk2d: -1
jchunk2d: -1
ichunk3d: -1
jchunk3d: -1
kchunk3d: -1
ichunk2d = -1 (and likewise for all the other chunk options) means the model will set the chunk size equal to the output grid size in the corresponding direction. Try setting ichunk2d/jchunk2d to half of the output grid size, for example, and similarly for ichunk3d/jchunk3d/kchunk3d. kchunk3d can be, for example, half the number of vertical layers.
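For example, assuming a hypothetical output grid of 4500 x 2700 points with 65 vertical layers (the actual RRFS grid dimensions may differ), halving each dimension would give settings like these in model_configure:

```
ichunk2d: 2250
jchunk2d: 1350
ichunk3d: 2250
jchunk3d: 1350
kchunk3d: 32
```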
To be honest, I do not see how or why this would make any difference in whether nf90_enddef hangs, but who knows.
I found that the model always hangs while writing the physics history file(s) (phyf???.nc). These files have about 260 variables. As you suggested, reducing the number of the output variables in physics seems to help avoid the hangs in nf90_enddef.
Instead of commenting some variables in diag_table, I made this change:
diff --git a/io/module_write_netcdf.F90 b/io/module_write_netcdf.F90
index d9d8ff9..3c3f5e0 100644
--- a/io/module_write_netcdf.F90
+++ b/io/module_write_netcdf.F90
@@ -477,6 +477,11 @@ contains
ncerr = nf90_put_att(ncid, varids(i), 'grid_mapping', 'cubed_sphere'); NC_ERR_STOP(ncerr)
end if
+ if (modulo(i,200) == 0) then
+ ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
+ ncerr = nf90_redef(ncid); NC_ERR_STOP(ncerr)
+ endif
+
end do ! i=1,fieldCount
ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
This change ends the define mode after 200 variables, and immediately reenters the define mode and continues adding the rest of the variables. It seems to work (no hangs) in several test runs I made (on wcoss2). There is nothing special about the number 200; I just chose it randomly to avoid ending/reentering the define mode for files which have fewer variables.
Can you please try this change with your code/setup on both wcoss2 and jet?
And here are the timings of all history/restart writes from one of my test runs on wcoss2:
dynf000.nc write time is 26.45372 at fcst 00:00
phyf000.nc write time is 34.18413 at fcst 00:00
------- total write time is 60.79570 at Fcst 00:00
dynf001.nc write time is 27.38813 at fcst 01:00
phyf001.nc write time is 36.25545 at fcst 01:00
RESTART/20240304.160000.fv_core.res.tile1.nc write time is 11.98606 at fcst 01:00
RESTART/20240304.160000.fv_srf_wnd.res.tile1.nc write time is 1.12703 at fcst 01:00
RESTART/20240304.160000.fv_tracer.res.tile1.nc write time is 24.19673 at fcst 01:00
RESTART/20240304.160000.phy_data.nc write time is 37.15952 at fcst 01:00
RESTART/20240304.160000.sfc_data.nc write time is 16.17145 at fcst 01:00
------- total write time is 154.44860 at Fcst 01:00
dynf002.nc write time is 29.14509 at fcst 02:00
phyf002.nc write time is 36.68917 at fcst 02:00
RESTART/20240304.170000.fv_core.res.tile1.nc write time is 12.03668 at fcst 02:00
RESTART/20240304.170000.fv_srf_wnd.res.tile1.nc write time is 1.70183 at fcst 02:00
RESTART/20240304.170000.fv_tracer.res.tile1.nc write time is 25.06961 at fcst 02:00
RESTART/20240304.170000.phy_data.nc write time is 35.79864 at fcst 02:00
RESTART/20240304.170000.sfc_data.nc write time is 15.21344 at fcst 02:00
------- total write time is 155.85170 at Fcst 02:00
dynf003.nc write time is 27.02799 at fcst 03:00
phyf003.nc write time is 36.10061 at fcst 03:00
------- total write time is 63.29045 at Fcst 03:00
dynf004.nc write time is 26.55296 at fcst 04:00
phyf004.nc write time is 36.55510 at fcst 04:00
------- total write time is 63.26967 at Fcst 04:00
dynf005.nc write time is 26.85602 at fcst 05:00
phyf005.nc write time is 36.89835 at fcst 05:00
------- total write time is 63.91559 at Fcst 05:00
dynf006.nc write time is 27.17454 at fcst 06:00
phyf006.nc write time is 38.85850 at fcst 06:00
------- total write time is 66.19458 at Fcst 06:00
dynf007.nc write time is 26.85234 at fcst 07:00
phyf007.nc write time is 36.73923 at fcst 07:00
------- total write time is 63.75226 at Fcst 07:00
dynf008.nc write time is 28.33648 at fcst 08:00
phyf008.nc write time is 39.37756 at fcst 08:00
------- total write time is 68.01713 at Fcst 08:00
dynf009.nc write time is 26.56586 at fcst 09:00
phyf009.nc write time is 37.22793 at fcst 09:00
------- total write time is 63.95545 at Fcst 09:00
dynf010.nc write time is 27.55396 at fcst 10:00
phyf010.nc write time is 37.40796 at fcst 10:00
------- total write time is 65.12306 at Fcst 10:00
dynf011.nc write time is 28.12703 at fcst 11:00
phyf011.nc write time is 38.63406 at fcst 11:00
------- total write time is 66.92263 at Fcst 11:00
dynf012.nc write time is 26.92893 at fcst 12:00
phyf012.nc write time is 35.51953 at fcst 12:00
------- total write time is 62.60945 at Fcst 12:00
dynf013.nc write time is 27.23213 at fcst 13:00
phyf013.nc write time is 39.34664 at fcst 13:00
------- total write time is 66.74036 at Fcst 13:00
dynf014.nc write time is 30.29397 at fcst 14:00
phyf014.nc write time is 40.22186 at fcst 14:00
------- total write time is 70.67712 at Fcst 14:00
dynf015.nc write time is 26.69101 at fcst 15:00
phyf015.nc write time is 36.06051 at fcst 15:00
------- total write time is 62.91315 at Fcst 15:00
dynf016.nc write time is 27.40320 at fcst 16:00
phyf016.nc write time is 36.25180 at fcst 16:00
------- total write time is 63.81565 at Fcst 16:00
dynf017.nc write time is 26.70780 at fcst 17:00
phyf017.nc write time is 34.18888 at fcst 17:00
------- total write time is 61.05879 at Fcst 17:00
dynf018.nc write time is 27.22682 at fcst 18:00
phyf018.nc write time is 35.03558 at fcst 18:00
------- total write time is 62.42384 at Fcst 18:00
This did not fix my test case on Jet. Some of the ranks still froze in the nf90_enddef. They froze in the same enddef as before, not the new one you added.
I have a test case on hera now. The PR description has been updated with the path.
Hera: /scratch2/BMC/wrfruc/Samuel.Trahan/rrfs/sudheer-case
Thanks. I'm running that test case on Hera right now with this change (diff is against current head of develop branch):
diff --git a/io/module_write_netcdf.F90 b/io/module_write_netcdf.F90
index d9d8ff9..d3a3433 100644
--- a/io/module_write_netcdf.F90
+++ b/io/module_write_netcdf.F90
@@ -341,7 +341,12 @@ contains
if (lsoil > 1) dimids_soil = [im_dimid,jm_dimid,lsoil_dimid, time_dimid]
end if
+ ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
+
do i=1, fieldCount
+
+ ncerr = nf90_redef(ncid); NC_ERR_STOP(ncerr)
+
call ESMF_FieldGet(fcstField(i), name=fldName, rank=rank, typekind=typekind, rc=rc); ESMF_ERR_RETURN(rc)
par_access = NF90_INDEPENDENT
@@ -477,11 +482,11 @@ contains
ncerr = nf90_put_att(ncid, varids(i), 'grid_mapping', 'cubed_sphere'); NC_ERR_STOP(ncerr)
end if
+ ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
+
end do ! i=1,fieldCount
- ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
end if
- ! end of define mode
!
! write dimension variables and lon,lat variables
Here, for every variable, we enter and leave define mode. So far the first 4 files (phyf000, 001, 002 and 003) were written without hangs in nf90_enddef.
My run directory is: /scratch1/NCEPDEV/stmp2/Dusan.Jovic/sudheer-case
According to the nc_enddef documentation here, specifically:
It's not necessary to call nc_enddef() for netCDF-4 files. With netCDF-4 files, nc_enddef() is called when needed by the netcdf-4 library.
which means we do not need to call nf90_redef/nf90_enddef at all, since the history files are netCDF-4 files, created with NF90_NETCDF4 mode. @edwardhartnett can you confirm this?
I'll try to remove all nf90_redef/nf90_enddef calls and see what happens.
@DusanJovic-NOAA you are correct, a file created with NC_NETCDF4 does not need to call enddef(), but I believe redef() must still be called.
For example, if you define some metadata, and then call nc_put_vara_float() (or some other data-writing function), then netCDF-4 will notice that you have not called nc_enddef(), and will call it for you.
But does that work for nc_redef()? I don't think so.
However, whether called explicitly by the programmer, or internally by the netCDF library, enddef()/redef() is an expensive operation. All buffers are flushed to disk. So try to write all your metadata (including all attributes), then write data. Don't switch back and forth.
In the case of the fragment of the code I see here, it seems like there's a loop:
for some cases
    redef()
    write attribute
    enddef()
    write data
end
What would be better would be two loops, the first to write all the attributes, the second to do all the data writes.
redef()
for some cases
    write attribute
end
enddef()
for some cases
    write data
end
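Applied to the loop in module_write_netcdf, that restructuring would look roughly like the following. This is a sketch only: ncid, varids, fieldCount, and the NC_ERR_STOP macro come from the module, while fieldData(i)%ptr is a hypothetical stand-in for however the field's data array is actually obtained.

```fortran
! Pass 1: stay in define mode -- write all attributes, no data
do i = 1, fieldCount
   ncerr = nf90_put_att(ncid, varids(i), 'grid_mapping', 'cubed_sphere'); NC_ERR_STOP(ncerr)
end do

! Leave define mode exactly once; metadata buffers flush to disk here
ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)

! Pass 2: data mode only -- write all variable data
do i = 1, fieldCount
   ncerr = nf90_put_var(ncid, varids(i), fieldData(i)%ptr); NC_ERR_STOP(ncerr)
end do
```

The point of the design is that the expensive enddef()/redef() flush happens once per file instead of once per variable.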
All of the variable data is written in a later loop except the dimension variables. Those are written in calls to subroutine add_dim inside the metadata-defining loop. It does have the required call to nf90_redef.
if (lm > 1) then
call add_dim(ncid, "pfull", pfull_dimid, wrtgrid, mype, rc)
call add_dim(ncid, "phalf", phalf_dimid, wrtgrid, mype, rc)
... more of the same ...
subroutine add_dim(ncid, dim_name, dimid, grid, mype, rc)
...
ncerr = nf90_def_var(ncid, dim_name, NF90_REAL8, dimids=[dimid], varid=dim_varid); NC_ERR_STOP(ncerr)
...
ncerr = nf90_enddef(ncid=ncid); NC_ERR_STOP(ncerr)
ncerr = nf90_put_var(ncid, dim_varid, values=valueListR8); NC_ERR_STOP(ncerr)
ncerr = nf90_redef(ncid=ncid); NC_ERR_STOP(ncerr)
@edwardhartnett Thanks for the confirmation.
@SamuelTrahanNOAA Yes, all variables are written in the second loop over all fields, after all dimensions and attributes are defined and written. The only exceptions are the 4 'dimension variables', or coordinates (pfull, phalf, zsoil and time), in which case we define them, end define mode, write the coordinate values, and reenter define mode. But those are small variables, and I do not think it costs a lot to exit/reenter define mode, since there are just 4 of them and no other large variables have been written yet, if that has any impact on performance at all.
I'll run the test now with all enddef/redef calls removed to see if that works.
Documentation of nc_redef says:
For netCDF-4 files (i.e. files created with NC_NETCDF4 in the cmode in their call to nc_create()), it is not necessary to call nc_redef() unless the file was also created with NC_STRICT_NC3. For straight-up netCDF-4 files, nc_redef() is called automatically, as needed.
OK, so you could take out the redef() and enddef().
Usually when netCDF hangs on a parallel operation it's because a collective operation is done, but not all tasks participated. Are all programs running this metadata code?
A way to test that is to put an MPI_Barrier before each NetCDF call.
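A minimal sketch of that instrumentation, assuming the write component's existing mpi_comm communicator, ncid, and NC_ERR_STOP macro (mpierr would be a local integer):

```fortran
! Debugging aid: if any rank skips this collective netCDF call, the hang
! will move from inside the library to the barrier itself, which pinpoints
! exactly where the ranks diverge.
call MPI_Barrier(mpi_comm, mpierr)
ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
```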
Without any explicit call to nf90_redef/nf90_enddef, the model works fine for about 5 hours but then hangs while writing a physics history file. The last file (forecast hour 6) is only partially written (~30 MB) before the model hangs:
-rw-r--r-- 1 Dusan.Jovic h-nems 1685751925 Mar 11 18:25 phyf000.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 1865073247 Mar 11 18:29 phyf001.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 1878394918 Mar 11 18:33 phyf002.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 1881375125 Mar 11 18:37 phyf003.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 1876109574 Mar 11 18:41 phyf004.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 1879258803 Mar 11 18:46 phyf005.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 30817232 Mar 11 18:49 phyf006.nc
ncdump -h of phyf006.nc prints all metadata and exits without any error. Also, comparing metadata and global attributes with nccmp reports no difference between the 005 and 006 files:
nccmp -mg phyf005.nc phyf006.nc
Have we reached the point where we should involve NetCDF and HDF5 developers in this conversation?
Let me try your suggestion to insert an MPI_Barrier before each NetCDF call.
Now it hangs on the second history file (phyf001.nc):
-rw-r--r-- 1 Dusan.Jovic h-nems 1685751925 Mar 11 19:29 phyf000.nc
-rw-r--r-- 1 Dusan.Jovic h-nems 30817232 Mar 11 19:33 phyf001.nc
Interestingly the file size is exactly the same (30817232 bytes) as in the previous run where model hangs at phyf006.nc. It also never hangs while writing dynf???.nc files, always at phyf???.nc.
Do you know where it is hanging?
You can find out by sshing to one of the compute nodes running your job. Then start gdb on a running process. It may take a few tries to figure out which ranks are associated with the frozen quilt server.
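A sketch of that procedure (node name and PID here are hypothetical placeholders; the executable name and job-query commands vary by site and batch system):

```
ssh <compute-node>                               # a node listed for your job, e.g. from squeue
pgrep -u $USER fv3.exe                           # find candidate MPI rank processes on that node
gdb -p <pid> -batch -ex 'thread apply all bt'    # dump every thread's stack for one rank
```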
Interestingly the file size is exactly the same (30817232 bytes) as in the previous run where model hangs at phyf006.nc.
I suspect this is the size of the file's metadata.
The only thing that seems to help avoid the hangs is reducing the number of fields written out in the history file. At this moment, writing out all the fields specified in 'diag_table' creates 260 variables. What is special about 260? It is just slightly larger than 256. Could it be that 256 is, for whatever reason, some kind of limit?
I'm running now with just 4 fields commented out in diag_table, the last 4, just to see what happens.
# Aerosols emission for smoke
"gfs_sfc", "emdust", "emdust", "fv3_history2d", "all", .false., "none", 2
"gfs_sfc", "coef_bb_dc", "coef_bb_dc", "fv3_history2d", "all", .false., "none", 2
"gfs_sfc", "min_fplume", "min_fplume", "fv3_history2d", "all", .false., "none", 2
"gfs_sfc", "max_fplume", "max_fplume", "fv3_history2d", "all", .false., "none", 2
"gfs_sfc", "hwp", "hwp", "fv3_history2d", "all", .false., "none", 2
#"gfs_sfc", "hwp_ave", "hwp_ave", "fv3_history2d", "all", .false., "none", 2
#"gfs_sfc", "frp_output", "frp_output", "fv3_history2d", "all", .false., "none", 2
#"gfs_phys", "ebu_smoke", "ebu_smoke", "fv3_history", "all", .false., "none", 2
#"gfs_phys", "ext550", "ext550", "fv3_history", "all", .false., "none", 2
This should create a file with 256 variables.
Disabling only the last two variables (ebu_smoke and ext550) is enough to get it to run reliably. There are other sets of variables one can remove to get it to run reliably. That's just the one I can remember off the top of my head.
Ok, so that means there is nothing special about 256 limit, which is good. That should also mean that there are no issues in nf90_* calls, since in that case (two variables less) everything works fine.
There must be an issue somewhere in there. The model freezes at an MPI_Allreduce deep within the HDF5 library.
Can we try the gnu compiler with an alternative MPI implementation and still enable NetCDF parallel?
That would eliminate the compiler and MPI implementation as sources of the problem.
We can try that on Hera or Hercules. I was running these tests on Jet.
I recompiled the model on Hera with the gnu compiler, and submitted a job, in:
$ pwd
/scratch1/NCEPDEV/stmp2/Dusan.Jovic/sudheer-case
$ ls -l phyf00*
-rw-r--r-- 1 Dusan.Jovic stmp 30785571 Mar 11 20:27 phyf000.nc
-rw-r--r-- 1 Dusan.Jovic stmp 8 Mar 11 20:27 phyf000.nc-3181117440-28572.lock
-rw-r--r-- 1 Dusan.Jovic stmp 2694034582 Mar 11 20:34 phyf001.nc
Looks like the model hangs while writing the 0-hour physics file, but I also see this new 'lock' file. Any idea what it is?
You could try:
export HDF5_USE_FILE_LOCKING=OFF
and see if that fixes it.
Is it possible that the model is trying to open the same file twice at the same time?
Is it possible that the model is trying to open the same file twice at the same time?
Do you mean at the same time from different mpi tasks? In parallel mode, all tasks open (create) a file.
No, I mean:
It would explain the problems we're seeing.
This should not happen. If it does it's a bug.
If more than one write group is used (in this case 2 groups are used), different groups will run on separate (non-overlapping) communicators.
So, in this example with 2 write groups:
phyf000.nc will be created by MPI communicator X (Ranks A, B, and C)
phyf001.nc will be created by MPI communicator Y (Ranks D, E, and F)
phyf002.nc will be created by MPI communicator X (Ranks A, B, and C); if communicator X is still busy writing the previous file, the code waits
phyf003.nc will be created by MPI communicator Y (Ranks D, E, and F); if communicator Y is still busy writing the previous file, the code waits
... etc
Perhaps you could try one write group to see if that fixes the problem? It is unlikely, but maybe we'll get lucky.
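For reference, the write-group count is set in model_configure; a single-group configuration would look something like this (the write_tasks_per_group value here is illustrative, not taken from this case):

```
write_groups: 1
write_tasks_per_group: 60
```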
Hangs with one write group as well.
I also made a change in module_write_netcdf by splitting add_dim routine into two, such that in the first routine (add_dim) dimensions are only defined, and in the second (write_dim) dimension variable data are written. The add_dim is called first, then we explicitly call nf90_enddef (even though it's not strictly necessary for netcdf-4), and then we call write_dim after we exit the define mode. This way we do not leave and reenter the define mode multiple times.
My updated module_write_netcdf is here:
https://github.com/DusanJovic-NOAA/fv3atm/blob/rrfs_write_netcdf_hangs/io/module_write_netcdf.F90
I also changed the access pattern for all variables to NF90_COLLECTIVE. I'm not sure if this is necessary or even desirable, but just for testing.
Unfortunately even with these two changes the model still hangs at random while saving physics history files.
I built the latest versions of hdf5(1.14.3), netcdf-c (4.9.2) and netcdf-fortran (4.6.1) on Hera with GNU and ran the test. I now see this error:
603: HDF5-DIAG: Error detected in HDF5 (1.14.3) MPI-process 603:
603: #000: H5A.c line 2397 in H5Aexists(): can't synchronously check if attribute exists
603: major: Attribute
603: minor: Can't get value
603: #001: H5A.c line 2368 in H5A__exists_api_common(): unable to determine if attribute exists
603: major: Attribute
603: minor: Can't get value
603: #002: H5A.c line 2328 in H5A__exists_common(): unable to determine if attribute exists
603: major: Attribute
603: minor: Can't get value
603: #003: H5VLcallback.c line 1536 in H5VL_attr_specific(): unable to execute attribute 'specific' callback
603: major: Virtual Object Layer
603: minor: Can't operate on object
603: #004: H5VLcallback.c line 1502 in H5VL__attr_specific(): unable to execute attribute 'specific' callback
603: major: Virtual Object Layer
603: minor: Can't operate on object
603: #005: H5VLnative_attr.c line 473 in H5VL__native_attr_specific(): unable to determine if attribute exists
603: major: Attribute
603: minor: Can't get value
603: #006: H5Oattribute.c line 1732 in H5O__attr_exists(): error checking for existence of attribute
603: major: Attribute
603: minor: Iteration failed
603: #007: H5Adense.c line 1679 in H5A__dense_exists(): can't search for attribute in name index
603: major: Attribute
603: minor: Object not found
603: #008: H5B2.c line 609 in H5B2_find(): can't compare btree2 records
603: major: B-Tree node
603: minor: Can't compare objects
603: #009: H5B2int.c line 104 in H5B2__locate_record(): can't compare btree2 records
603: major: B-Tree node
603: minor: Can't compare objects
603: #010: H5Abtree2.c line 264 in H5A__dense_btree2_name_compare(): can't compare btree2 records
603: major: Heap
603: minor: Can't compare objects
603: #011: H5HF.c line 662 in H5HF_op(): can't operate on 'huge' object from fractal heap
603: major: Heap
603: minor: Can't operate on object
603: #012: H5HFhuge.c line 918 in H5HF__huge_op(): unable to operate on heap object
603: major: Heap
603: minor: Can't operate on object
603: #013: H5HFhuge.c line 770 in H5HF__huge_op_real(): application's callback failed
603: major: Heap
603: minor: Can't operate on object
603: #014: H5Abtree2.c line 154 in H5A__dense_fh_name_cmp(): can't decode attribute
603: major: Object header
603: minor: Unable to decode value
603: #015: H5Omessage.c line 1636 in H5O_msg_decode(): unable to decode message
603: major: Object header
603: minor: Unable to decode value
603: #016: H5Oshared.h line 74 in H5O__attr_shared_decode(): unable to decode native message
603: major: Object header
603: minor: Unable to decode value
603: #017: H5Oattr.c line 277 in H5O__attr_decode(): ran off end of input buffer while decoding
603: major: Object header
603: minor: Address overflowed
I also repeated a test in which I commented out the last two fields in 'diag_table', and it worked for a few output hours without hanging. But we already know that this is expected to work.
When you comment out those fields, does it eliminate the error messages?
Yes.
This sounds like something specific enough to do a bug report for the NetCDF library developers.
In the err file I also see:
627: file: /scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/ufs/rrfs_netcdf_hangs/ufs-weather-model/FV3/io/module_write_netcdf.F90 line: 499 NetCDF: Problem with HDF5 dimscales.
After a little bit of grepping and printf debugging I found that the error (Problem with HDF5 dimscales) is returned from:
This function (attach_dimscales) is conditionally called from:
Looks like this attaching can be disabled based on no_dimscale_attach, which can be set to true based on NC_NODIMSCALE_ATTACH flag:
Unfortunately, netcdf-fortran does not provide a Fortran version of this constant, so I defined it locally in the write_netcdf routine as:
integer, parameter :: NF90_NODIMSCALE_ATTACH = int(Z'40000')
and when I create netcdf file as:
ncerr = nf90_create(trim(filename),&
cmode=IOR(IOR(NF90_CLOBBER,netcdf_file_type),NF90_NODIMSCALE_ATTACH),&
comm=mpi_comm, info = MPI_INFO_NULL, ncid=ncid); NC_ERR_STOP(ncerr)
and run the test, it seems to work. Obviously we need to run a full 24h forecast, many times, to verify that this indeed avoids the problem, at least temporarily, and then we need to find the correct, more permanent solution. But let's see if this is the real cause of the hangs.
I ran this test with GNU on Hera, I'll also test it with Intel.
My code updates are here:
https://github.com/NOAA-EMC/fv3atm/compare/develop...DusanJovic-NOAA:fv3atm:rrfs_write_netcdf_hangs
@DusanJovic-NOAA Thanks for debugging the issue! Once it is confirmed the issue is fixed, maybe we can ask @edwardhartnett to update the netcdf-c.
Mostly when users think they have found a bug in netCDF, they are mistaken. NetCDF code is well-tested.
If you think you have found a netCDF bug, we need a (one-file) test program which demonstrates it. This should be a unit test of the write component, and should remain a unit test once we get all this sorted out. If a future release of netCDF/HDF-5 breaks the test, you know that something important has changed and your code won't work. (And the most likely scenario is that while constructing such a test, you will find that netCDF is actually working the way it is supposed to.)
Here's an example from 2020 concerning the fv3 code: https://github.com/Unidata/netcdf-c/blob/main/nc_perf/tst_compress_par.c
In this case, there was a belief that there were bugs in netCDF relating to compression and parallel writes. I took a bunch of fv3 IO code, munged it into a one-file test, and demonstrated that netCDF was working just fine with parallel compression. I put this test in netcdf-c because there was no way at the time to put it into FV3. Lack of unit testing at this granularity for FV3 made debugging I/O a slow and painful process. Lack of unit testing costs the organization.
This write component is all about writing netCDF data. Is there a test for it? If not, now's the time to add the first test.
@DusanJovic-NOAA @SamuelTrahanNOAA we need a one-file test program which demonstrates how the write components uses netCDF parallel I/O. This will either cause you to find a bug in the write component, or demonstrate a netCDF bug. Can you produce that test?
Mostly when users think they have found a bug in netCDF, they are mistaken
No, not a bug. This is a missing feature: the Fortran library doesn't expose the NC_NODIMSCALE_ATTACH constant to Fortran.
integer, parameter :: NF90_NODIMSCALE_ATTACH = int(Z'40000')
We can make a feature request to get that added. No further proof is needed.
The lack of NODIMSCALE_ATTACH should not cause a problem.
Dimscales are a HDF5 feature used to keep track of dimensions. However, it performs poorly at scale, so I added a way to ignore dimscales. But that should already be happening for you. (That is, you do not have to turn this optimization on, it will be used for all new files automatically.)
Also, the dimscales should not get out of sync. Turning off the dimscales should improve performance opening files with many (i.e. hundreds or more) variables, but should never show the error that you found.
I suspect there is some problem in your metadata code. Or, perhaps, you have really found a problem in netCDF.
Description
The head of develop hangs while writing NetCDF output files in the write component when running the version of RRFS planned for operations. This happens regardless of the compression settings or lack thereof. The behavior is like so:
Commenting out some of the variables in the diag_table will prevent this problem. There isn't one specific set of variables that seems to cause it. Turning off the lake model or smoke model prevents the hang, but note that this disables writing of many variables.
Using one thread (no OpenMP) appears to reduce the frequency of the hangs. Increasing the write component ranks by enormous amounts appears to increase the frequency of hangs. This conclusion is uncertain since we haven't run enough tests to get a statistically representative sample set.
I have been unable to reproduce the problem when the model is compiled in debug mode.
This problem has been confirmed on Jet, Hera, and WCOSS2, but hasn't been tested on other machines.
From lots of forum searching, this problem has been identified in the distant past when the model sends different metadata on different ranks. For example, 13 variables on one rank but 14 on the others, or one rank sends three attributes and the others send five. I haven't investigated that possibility, but I don't see how it is possible in this code.
To Reproduce:
1. Executables were compiled like so:
2. Copy one of these test directories:
Jet: /lfs4/BMC/nrtrr/Samuel.Trahan/smoke/sudheer-case
Hera: /scratch2/BMC/wrfruc/Samuel.Trahan/rrfs/sudheer-case
Cactus: /lfs/h2/oar/esrl/noscrub/samuel.trahan/ming-io-hang
3. Edit the job script
Each machine's test directory contains a job.sh script. Edit it as needed to point to your code.
4. Run the job script.
Send the script to sbatch on Jet or qsub on Cactus. Do not run it on a login node.
Additional context
This problem exists in the version of RRFS planned to go operational.
Output
This stack trace comes from gdb analyzing a running write component MPI rank while it is hanging waiting for an MPI_Allreduce. The arguments in the stack trace may be meaningless because gdb has trouble interpreting Intel-compiled code. However, the line numbers and function calls should be correct. Some may have been optimized out.
stack trace of stuck MPI process
``` #0 0x00002b6eab22803a in MPIDI_SHMGR_release_generic (opcode=2893772520, mpir_comm=0x7ffca32a54c8, root=27, localbuf=0x1ec, count=-1405432336, datatype=1329139008, errflag=0x7ffca32b2548, knomial_factor=4, algo_type=MPIDI_SHMGR_ALGO_FLAT) at ../../src/mpid/ch4/src/intel/ch4_shm_coll_templates.h:206 #1 0x00002b6eab21bf85 in MPIDI_SHMGR_Release_bcast (comm=0x2b6eac7b76e8