Open · kaiyuan-cheng opened this issue 3 months ago
Which external file is being held in memory on each MPI task?
I was talking about the climatological datasets located in fix/sfc_climo. For example, snowfree_albedo.4comp.0.05.nc is 4.7 GB. If each of 30 MPI tasks held a copy, that file alone would account for roughly 150 GB.
The climo datasets are read in on one MPI task, then a subsection is scattered to all tasks.
Here, the array that holds the climo data is only allocated on task '0': https://github.com/ufs-community/UFS_UTILS/blob/47705d5315013c89841cf3645d549e9bc83ce6e8/sorc/sfc_climo_gen.fd/interp.F90#L77
The climo data is then read in on task '0', chopped up, and scattered to all tasks: https://github.com/ufs-community/UFS_UTILS/blob/47705d5315013c89841cf3645d549e9bc83ce6e8/sorc/sfc_climo_gen.fd/interp.F90#L108
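In case it helps to see the pattern in isolation, here is a minimal sketch of the "read on one task, scatter a slice to every task" approach using plain MPI. It is illustrative only: the actual interp.F90 goes through ESMF rather than raw MPI calls, and the program name, the even-division assumption, and the dummy read are mine. The grid dimensions are the 30-sec values quoted in this thread.

program scatter_sketch
  use mpi
  implicit none
  integer, parameter :: i_src = 21600, j_src = 43200   ! 30-sec global grid
  real, allocatable  :: data_src_global(:,:)           ! full field, task 0 only
  real, allocatable  :: data_src(:,:)                  ! local slice, every task
  integer :: ierr, myrank, npets, j_local

  call mpi_init(ierr)
  call mpi_comm_rank(mpi_comm_world, myrank, ierr)
  call mpi_comm_size(mpi_comm_world, npets, ierr)

  j_local = j_src / npets              ! assume j_src divides evenly, for brevity

  if (myrank == 0) then
     allocate(data_src_global(i_src, j_src))
     data_src_global = 0.0             ! stand-in for the netCDF read on task 0
  else
     allocate(data_src_global(0,0))    ! keep the send-buffer argument defined
  endif

  allocate(data_src(i_src, j_local))

  ! The chunks are contiguous column blocks, so a plain scatter works in
  ! column-major Fortran; each task ends up with j_local columns of the source grid.
  call mpi_scatter(data_src_global, i_src*j_local, mpi_real, &
                   data_src,        i_src*j_local, mpi_real, &
                   0, mpi_comm_world, ierr)

  ! ... each task would then interpolate its slice to the model grid ...

  call mpi_finalize(ierr)
end program scatter_sketch

Only task 0 ever holds the full source field; every other task allocates just its slice, which is why the climo input itself should not multiply with the task count.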
You are right. My initial speculation about the substantial memory usage was wrong. I profiled memory usage for a global C48 grid with different numbers of MPI tasks, ranging from 30 to 60. The memory usage remained around 175 GB regardless of the number of MPI tasks, so the memory issue must be caused by something else.
How are you configuring the run? Can I see the fort.41 namelist?
&config
 input_facsf_file="/autofs/ncrc-svm1_home2/Kai-yuan.Cheng/software/UFS_UTILS/driver_scripts/../fix/sfc_climo/facsf.1.0.nc"
 input_substrate_temperature_file="/autofs/ncrc-svm1_home2/Kai-yuan.Cheng/software/UFS_UTILS/driver_scripts/../fix/sfc_climo/substrate_temperature.gfs.0.5.nc"
 input_maximum_snow_albedo_file="/autofs/ncrc-svm1_home2/Kai-yuan.Cheng/software/UFS_UTILS/driver_scripts/../fix/sfc_climo/maximum_snow_albedo.0.05.nc"
 input_snowfree_albedo_file="/autofs/ncrc-svm1_home2/Kai-yuan.Cheng/software/UFS_UTILS/driver_scripts/../fix/sfc_climo/snowfree_albedo.4comp.0.05.nc"
 input_slope_type_file="/autofs/ncrc-svm1_home2/Kai-yuan.Cheng/software/UFS_UTILS/driver_scripts/../fix/sfc_climo/slope_type.1.0.nc"
 input_soil_type_file="/autofs/ncrc-svm1_home2/Kai-yuan.Cheng/software/UFS_UTILS/driver_scripts/../fix/sfc_climo/soil_type.bnu.v3.30s.nc"
 input_soil_color_file="/autofs/ncrc-svm1_home2/Kai-yuan.Cheng/software/UFS_UTILS/driver_scripts/../fix/sfc_climo/soil_color.clm.0.05.nc"
 input_vegetation_type_file="/autofs/ncrc-svm1_home2/Kai-yuan.Cheng/software/UFS_UTILS/driver_scripts/../fix/sfc_climo/vegetation_type.viirs.v3.igbp.30s.nc"
 input_vegetation_greenness_file="/autofs/ncrc-svm1_home2/Kai-yuan.Cheng/software/UFS_UTILS/driver_scripts/../fix/sfc_climo/vegetation_greenness.0.144.nc"
 mosaic_file_mdl="/gpfs/f5/gfdl_w/world-shared/Kai-yuan.Cheng/my_grids/C48/C48_mosaic.nc"
 orog_dir_mdl="/gpfs/f5/gfdl_w/world-shared/Kai-yuan.Cheng/my_grids/C48"
 orog_files_mdl="C48_oro_data.tile1.nc","C48_oro_data.tile2.nc","C48_oro_data.tile3.nc","C48_oro_data.tile4.nc","C48_oro_data.tile5.nc","C48_oro_data.tile6.nc"
 halo=0
 maximum_snow_albedo_method="bilinear"
 snowfree_albedo_method="bilinear"
 vegetation_greenness_method="bilinear"
 fract_vegsoil_type=.false.
/
I see you are using the 30-sec soil and vegetation type datasets. They are quite large. There are lower-res versions of the soil and veg data. Can you use those?
input_vegetation_type_file="vegetation_type.modis.igbp.0.05.nc"
input_soil_type_file="soil_type.statsgo.0.05.nc"
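For reference, and assuming the same fix/sfc_climo directory as the rest of your namelist, the two fort.41 entries would become:

input_vegetation_type_file="/autofs/ncrc-svm1_home2/Kai-yuan.Cheng/software/UFS_UTILS/driver_scripts/../fix/sfc_climo/vegetation_type.modis.igbp.0.05.nc"
input_soil_type_file="/autofs/ncrc-svm1_home2/Kai-yuan.Cheng/software/UFS_UTILS/driver_scripts/../fix/sfc_climo/soil_type.statsgo.0.05.nc"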
In this case, the memory usage decreases to 33 GB, which is somewhat manageable for non-HPC systems. However, the 30-sec datasets should not have such a large memory footprint. Assuming a single-precision floating-point variable, the array storing an entire 30-sec dataset should only be about 3.5 GB (21600 × 43200 × 4 bytes). The overhead of sfc_climo_gen seems excessively high.
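To make that back-of-the-envelope estimate easy to reproduce, here is a trivial, purely illustrative Fortran check (the program and variable names are made up):

program climo_size
  implicit none
  integer(8), parameter :: ni = 21600, nj = 43200, bytes_per_val = 4
  ! size of one single-precision 30-sec global field
  print '(a,f6.2,a)', 'one 30-sec field: ', &
        real(ni*nj*bytes_per_val) / 1024.0**3, ' GiB'   ! prints ~3.48 GiB
end program climo_size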
I would guess the ESMF regridding is using a lot of memory. I can contact the ESMF team and provide them with your test case. They may have suggestions to reduce the memory requirements.
Sounds good. Thank you for working on this. I also found that, when using the lower-res soil and veg data, sfc_climo_gen can run with just 6 MPI tasks. It appears that higher-resolution datasets require more MPI tasks, which could also be an ESMF-related limitation.
For sfc_climo_gen, 30 MPI processes seem to be the minimum requirement, and the memory footprint is at least 150 GB. The substantial memory usage may be due to each MPI process holding a copy of the external file in memory. This high demand for MPI processes and memory makes running UFS_UTILS on non-HPC systems nearly impossible. Is it possible to improve the computational efficiency of sfc_climo_gen?
P.S. chgres_cube, whose code appears similar to sfc_climo_gen's, can run with just 6 MPI processes.