ufs-community / ufs-weather-model

Default chunksizes for continuous fields are too small in RRFS restart files #1270

Closed. EricRogers-NOAA closed this issue 1 year ago.

EricRogers-NOAA commented 2 years ago

The RRFS North American domain (npx, npy = 3951, 2701) is so large that model restart files must be written in netcdf4 format (i.e., netcdf_default_format = "netcdf4"). The default chunksizes in the fv_core and fv_tracer restart files are, for example:

float sphum(Time, zaxis_1, yaxis_1, xaxis_1) ;
        sphum:checksum = " 93FD4CEFDD0C45B" ;
        sphum:_Storage = "chunked" ;
        sphum:_ChunkSizes = 1, 6, 300, 439 ;
        sphum:_Endianness = "little" ;

The above chunksizes are too small for a grid of this size, and as a result the parallel I/O in the RRFS GSI analysis runs inefficiently. With the above chunksizes, RRFS GSI run times average ~1750-1900 sec on WCOSS Dell (70 nodes, 4 tasks/node). When the RRFS restart files are converted to 64-bit format, or have the chunksizes modified with the nccopy command to these values:

float sphum(Time, zaxis_1, yaxis_1, xaxis_1) ;
        sphum:checksum = " 93FD4CEFDD0C45B" ;
        sphum:_Storage = "chunked" ;
        sphum:_ChunkSizes = 1, 65, 2700, 3950 ;
        sphum:_Endianness = "little" ;
        sphum:_NoFill = "true" ;

the RRFS GSI on WCOSS Dell is about 25-30% faster.

junwang-noaa commented 2 years ago

@edwardhartnett Would you please take a look at the NetCDF4 issue here? We hope to understand how the chunksizes (1, 6, 300, 439) for the data array sphum with dimensions (1, 65, 2700, 3950) are chosen by NetCDF when they are not specified in the netcdf calls. Thank you very much!

edwardhartnett commented 2 years ago

NetCDF chooses chunksizes as best it can, but is not always going to make a good choice.

Fortunately, the netCDF default choice can easily be overridden. If you are using the F90 API, set the optional chunksizes parameter of nf90_def_var to the sizes you want. That is the answer to your performance problems.
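For illustration, here is a minimal, hypothetical sketch (not the actual FMS code path; the file name, dimension lengths, and chunk values are taken from the sphum example above) of passing the optional chunksizes argument to nf90_def_var:

program set_chunks
  use netcdf
  implicit none
  integer :: ncid, varid, dimids(4), ierr

  ierr = nf90_create("restart_example.nc", nf90_netcdf4, ncid)
  ! Fastest-varying dimension first in the F90 API
  ierr = nf90_def_dim(ncid, "xaxis_1", 3950, dimids(1))
  ierr = nf90_def_dim(ncid, "yaxis_1", 2700, dimids(2))
  ierr = nf90_def_dim(ncid, "zaxis_1", 65,   dimids(3))
  ierr = nf90_def_dim(ncid, "Time", nf90_unlimited, dimids(4))

  ! chunksizes overrides the library default; here one chunk holds a whole
  ! 3-D field, matching the rechunked values reported above (1, 65, 2700, 3950).
  ierr = nf90_def_var(ncid, "sphum", nf90_float, dimids, varid, &
                      chunksizes=[3950, 2700, 65, 1])

  ierr = nf90_enddef(ncid)
  ierr = nf90_close(ncid)
end program set_chunks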

jswhit commented 2 years ago

you can set the parameters ichunk2d, jchunk2d, ichunk3d, jchunk3d, kchunk3d in model_configure to control the chunksizes used in the history files produced by the write component.

If all the values are -1 then the netcdf-c library chooses the chunksizes.

If all the values are zero then the chunksizes are set to the grid decomposition in the write component (which depends on the number of write_tasks).

Otherwise, the values given are used directly as the chunksizes. The values chosen can have a big impact on IO performance.
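For example, a hypothetical model_configure fragment (assuming the usual name: value layout; the numbers are purely illustrative) that sets explicit chunksizes for the history fields:

# -1 lets the netcdf-c library choose; 0 uses the write-component decomposition
ichunk2d: 3950
jchunk2d: 2700
ichunk3d: 3950
jchunk3d: 2700
kchunk3d: 1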

jswhit commented 2 years ago

Oops - I see you are talking about restarts here and not history files. Since restart files are written by FMS, not the write component, the chunk values in model_configure don't apply.

I see @TingLei-NOAA asked the FMS devs how to change the default chunksize (https://github.com/NOAA-GFDL/FMS/discussions/892). The answer was there is currently no mechanism to do so.

jswhit commented 2 years ago

The write grid component can output the tiles in history files, and I think the plan is for JEDI to use these instead of the restart files (@CoryMartin-NOAA can correct me if I'm wrong about this). One option would be to have RRFS/GSI switch to using the tiled history files to get faster IO throughput (gaining control of the chunksizes without having to rewrite the files with nccopy).

junwang-noaa commented 2 years ago

@jswhit The post can't process the native grid at this time. Also, if the model is called within JEDI in the 4D-Var case, the model fields can be passed out to DA on the native grid. For offline cases, we can consider having two output grids (native grid for JEDI and a regular grid for the inline post), which needs some additional work.

@edwardhartnett For this issue, FMS io actually does not set chunksizes in the nf90_def_var calls when it writes out data. As Rusty pointed out, according to the NetCDF documentation, if the default data layout is contiguous and no chunksize arguments are provided, the netcdf calls should not write chunked data to the NetCDF4/HDF5 files. But we see these small chunks in the output data, which slows down reading in the downstream GSI job. So my questions are: why are the data fields written out in chunks, and is there a way to turn off chunking and let the netcdf calls write out the whole data array? Thanks

jswhit commented 2 years ago

@junwang-noaa Vars that use compression, checksums, or unlimited dims can't be contiguous. I think that's why chunking is on by default.

junwang-noaa commented 2 years ago

@jswhit Thanks for the information. My understanding is that the restart files are not compressed. Also, according to @bensonr's comments in issue #801: "Don't be confused by the presence of a checksum attribute as that is created and written by FMS and not related to the fletcher32 argument to nf90_def_var." So I think the reason is that the time dimension is "UNLIMITED"; I did see it in the 64-bit restart files.

@EricRogers-NOAA Would you please check your restart files to see if they have the "UNLIMITED" time dimension? Thanks

jswhit commented 2 years ago

global model restart files do have an unlimited time dimension

jswhit commented 2 years ago

you can turn on compression for restart files using

&mpp_io_nml
  shuffle=1,
  deflate_level=1,
/

EricRogers-NOAA commented 2 years ago

@junwang-noaa the LAMDAX restart files do have UNLIMITED for the time dimension:

netcdf fv_tracer.res.tile1 {
dimensions:
        xaxis_1 = 3950 ;
        yaxis_1 = 2700 ;
        zaxis_1 = 65 ;
        Time = UNLIMITED ; // (1 currently)

junwang-noaa commented 2 years ago

@jswhit I can't find any places where the deflate_level is used in the netcdf calls in fms. @bensonr are the restart files written with compression? Thanks

jswhit commented 2 years ago

it's optional - default is no compression.

yangfanglin commented 2 years ago

I recall we did test compression when writing out GFS.v16 restart files with FMS. Using the following reduced the size of the restart files by 50%:

&mpp_io_nml
  shuffle=1,
  deflate_level=5,
/

However, what we are using now in GFS.v16 for compression of the RESTART files written by FMS is:

export shuffle=1
export deflate_level=1

I cannot figure out why we did not use deflate_level=5. Jeff, are restart files still compressed with this setting?

jswhit commented 2 years ago

there is very little difference in file size between deflate_level=1 and 5 for the restarts, but the IO is much faster with deflate_level=1.

edwardhartnett commented 2 years ago

Can you post the ncdump -h -s of one of the data files?

Compression should be set to 1. Larger values are much slower but don't really compress better.

EricRogers-NOAA commented 2 years ago

@edwardhartnett I'd have to run a new RRFS test with the latest code to get the chunksizes for all the restart file records, but here is one variable from one of the RRFS restart files (npx, npy = 3951, 2701):

float sphum(Time, zaxis_1, yaxis_1, xaxis_1) ;
        sphum:checksum = " 93FD4CEFDD0C45B" ;
        sphum:_Storage = "chunked" ;
        sphum:_ChunkSizes = 1, 6, 300, 439 ;
        sphum:_Endianness = "little" ;

This is not directly related to this issue, but you told Ting Lei and me at one point that we could use nccopy to change the chunksizes on the fly, like this:

nccopy -c xaxis_1/3950,xaxis_2/3951,yaxis_1/2701,yaxis_2/2700,zaxis_1/65 fv_core.res.tile1.nc fv_core.res.tile1_new.nc
nccopy -c xaxis_1/3950,yaxis_1/2700,zaxis_1/65 fv_tracer.res.tile1.nc fv_tracer.res.tile1_new.nc

The above worked with netcdf v4.5.0 (and Ting confirmed it also worked with netcdf v4.7.0), but it did not work with netcdf v4.7.4 (my tests were on WCOSS1 Dell, Ting's on Hera).

edwardhartnett commented 2 years ago

OK, that seems like a bug in nccopy. I have added it to the netCDF issues list: https://github.com/Unidata/netcdf-c/issues/2442

@bensonr if you remind me where the netCDF code is in FMS I will take a look.

bensonr commented 2 years ago

There's a lot of confusion above between mpp_io/fms_io and the updated/rewritten fms2_io layer. It's even more chaotic when you factor in which code revision you are using from the UFS repository timeline. Commits in late August 2021 updated FV3 and the physics IO layer (FV3GFS_io.F90) to remove the older mpp_io/fms_io interfaces and use the updated IO layer for native-grid restarts and a small number of native-grid diagnostics files handled directly by the FMS diagnostics layer. Therefore, anything using the updated codebase will ignore any and all hints given by the mpp_io_nml namelist settings.

The fms2_io layer does not currently have any explicit control mechanisms for netcdf4-format chunking. If any were to be coded up, each component using the IO layer would need to set/control the chunking individually, as a one-size-fits-all approach is more than likely not sufficient to ensure proper IO performance.

For restarts written via the fms2_io layer, the unlimited Time dimension is included by default as an extension to allow one to include restart variables requiring more than one time level. As Jeff Whitaker pointed out, an unlimited time dimension removes the contiguous constraint for netcdf4-formatted output, and chunking is automatically activated.

What is unknown, because I haven't bothered to look at the NetCDF/HDF5 source to figure it out, is how the chunking is applied in the absence of any hints or explicit specifications. For a C768 restart file, the chunking factor appears to be 4 in each dimension, creating a total of 64 chunks with each chunk comprising about 850,000 elements. For the original grid in question (3951 x 2701 x 54?), the chunking factor is a more extreme 9, yielding 729 total chunks of about 800,000 elements each. With the data being 'float', each chunk is about 3 MB in size, which is fine if the corresponding open/read of the file uses this embedded information to issue appropriately chunk-sized read requests. I just hope those coding up NetCDF/HDF5 read primitives, as well as the library itself, are using something other than the standard system default of 2 pages of memory (typically 8 KB of data) per read request.

edwardhartnett commented 2 years ago

I can answer some of your questions about netCDF-4 default chunksizes.

When I wrote netCDF-4, I came up with a very primitive algorithm for choosing default chunksizes. The biggest gotcha is that the default chunksize for an unlimited dimension is 1. This is a terrible chunksize, unless you have only one record. But it's impossible to choose a larger chunksize for an unlimited dimension, because we don't know what size it will be.

The code to generate default chunksizes was later refined by Russ Rew. For each dimension, it tries to use:

            suggested_size = (pow((double)DEFAULT_CHUNK_SIZE/(num_values * type_size),
                                  1.0/(double)(var->ndims - num_unlim)) * var->dim[d]->len - .5);

The goal is to get largish, squarish chunks.

Always set your own chunksizes for best performance. Always increase the size of the unlimited dimension chunksize (1 by default) if you are writing more than one record of data. Try to use large chunks, of similar magnitude in each dimension. The best approach is to test with a variety of chunksizes and find what works well for your machine and code.
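As a rough illustration (a sketch only, assuming DEFAULT_CHUNK_SIZE is the netcdf-c default of 4 MiB; the library applies further adjustments, so the result only approximates the observed 1, 6, 300, 439), here is the formula above applied to the sphum dimensions:

program default_chunk_guess
  implicit none
  ! Fixed dimensions of sphum (zaxis_1, yaxis_1, xaxis_1); the unlimited Time
  ! dimension gets a default chunksize of 1, as noted above.
  integer, parameter :: dimlen(3) = [65, 2700, 3950]
  real(8), parameter :: default_chunk_size = 4194304.d0   ! assumed 4 MiB
  real(8), parameter :: type_size = 4.d0                  ! 4-byte float
  real(8) :: num_values, factor
  integer :: d

  num_values = 1.d0 * 65.d0 * 2700.d0 * 3950.d0
  ! ndims - num_unlim = 3 for this variable
  factor = (default_chunk_size / (num_values * type_size)) ** (1.d0 / 3.d0)
  do d = 1, 3
    ! Prints roughly 6, 309, 452 -- in the same ballpark as the 6, 300, 439
    ! chunks seen in the restart files.
    print '(a,i0)', 'suggested chunk length ~ ', int(factor * dimlen(d) - 0.5d0)
  end do
end program default_chunk_guess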

junwang-noaa commented 2 years ago

@JianpingHuang-NOAA FYI.

junwang-noaa commented 1 year ago

Dusan is working with GFDL on adding chunksize and compression options for the restart files; details are in issue #1574. Closing this ticket.