Open yunfangsun opened 1 week ago
@yunfangsun There are some configuration options on the PIO side that we could use to optimize the I/O and prevent CMEPS from hanging when it writes history and restart files. @jedwards4b and @DeniseWorthen might have some ideas. @jedwards4b Is there any specific I/O option that we could test with this high-resolution case on the CMEPS side?
@yunfangsun and all, I transferred this issue to CMEPS since it seems to be related to it.
@jedwards4b and @DeniseWorthen Since this is UFS, the relevant part that reads the PIO options is in https://github.com/ESCOMP/CMEPS/blob/e84e8a1f4fbe4073e82435c72459352de6077bb2/mediator/med_io_mod.F90#L177. There you can see the default PIO options for UFS/CMEPS.
@yunfangsun The following options are used by default for the rt_324728_atmschww3_06132024 configuration (as seen in the mediator.log file):
(med_io_init) : pio_netcdf_format = 64BIT_OFFSET 512
(med_io_init) : pio_typename = NETCDF 2
(med_io_init) : pio_root = 1
(med_io_init) : pio_stride = -99
(med_io_init) : pio_numiotasks = -99
(med_io_init) : update pio_numiotasks = 4
(med_io_init) : update pio_stride = 400
(med_io_init) : pio_rearranger = SUBSET 2
(med_io_init) calling pio init
(med_io_init) : pio_debug_level = 0
(med_io_init) : pio_rearr_comm_type = P2P 0
(med_io_init) : pio_rearr_comm_fcd = 2DENABLE 0
(med_io_init) : pio_rearr_comm_enable_hs_comp2io = T
(med_io_init) : pio_rearr_comm_enable_isend_comp2io = F
(med_io_init) : pio_rearr_comm_max_pend_req_comp2io = 0
(med_io_init) : pio_rearr_comm_enable_hs_io2comp = F
(med_io_init) : pio_rearr_comm_enable_isend_io2comp = T
(med_io_init) : pio_rearr_comm_max_pend_req_io2comp = 64
(med_io_init) calling pio_set_rearr_opts
It seems that it is not using parallel I/O. You could try parallel I/O by setting pio_typename. Here is an example for your ufs.configure file (you only need to update the MED section).
# MED #
MED_model: cmeps
MED_petlist_bounds: 0 1599
MED_omp_num_threads: 1
MED_attributes::
ATM_model = datm
OCN_model = schism
WAV_model = ww3
history_n = 1
history_option = nhours
history_ymd = -999
coupling_mode = coastal
pio_typename = PNETCDF
::
You could also try a different value for pio_stride (the stride of I/O tasks across the available compute tasks), for example 4 or 8, to see whether performance improves. The default pio_rearranger is subset, which is fine for high processor counts. The default pio_numiotasks is 4, which seems too few. If you set pio_stride to 4, you will have 1600/4 = 400 I/O tasks (pio_numiotasks). Please experiment with those numbers and let me know how it goes. Please also check the mediator.log file to make sure the PIO options change according to your configuration file. Once you have a successful run, you could collect timings to find the best configuration for this case. The numbers could be different for other cases that use more cores.
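As a rough illustration of how pio_stride and pio_numiotasks interact, here is a minimal Python sketch. It assumes a resolution rule that reproduces the values in the log above (unset options print as -99, and with both unset the mediator ends up with 4 I/O tasks and a stride of 400 on 1600 tasks). The function name `resolve_pio_layout` and the fallback rule are assumptions for illustration, not code taken from med_io_mod.F90.

```python
UNSET = -99  # value printed in mediator.log when an option is not set


def resolve_pio_layout(npes, pio_stride=UNSET, pio_numiotasks=UNSET,
                       default_iotasks=4):
    """Return (pio_numiotasks, pio_stride) for npes mediator tasks.

    Assumed rule, chosen to match the log values in this thread;
    it is not copied from med_io_mod.F90.
    """
    if pio_stride > 0 and pio_numiotasks <= 0:
        # Only the stride is given: derive the number of I/O tasks.
        pio_numiotasks = max(1, npes // pio_stride)
    elif pio_numiotasks > 0 and pio_stride <= 0:
        # Only the number of I/O tasks is given: derive the stride.
        pio_stride = max(1, npes // pio_numiotasks)
    elif pio_stride <= 0 and pio_numiotasks <= 0:
        # Neither is given: fall back to a small default number of I/O tasks.
        pio_numiotasks = min(npes, default_iotasks)
        pio_stride = max(1, npes // pio_numiotasks)
    return pio_numiotasks, pio_stride


print(resolve_pio_layout(1600))                     # both unset
print(resolve_pio_layout(1600, pio_stride=4))       # stride only
print(resolve_pio_layout(1600, pio_numiotasks=32))  # I/O tasks only
```

Under these assumptions, the 1600-task case gives 4 I/O tasks by default and 400 I/O tasks with a stride of 4, which matches the numbers quoted above.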
Outside of CESM, you can set the PIO options (number of I/O tasks, etc.) via the configuration file. See https://github.com/oceanmodeling/CMEPS/blob/437d5e6f3507b84cdabaac02b0335e86d3013dc6/mediator/med_io_mod.F90#L184
Also, a question. Why are you having mediator history files written? That is a lot of I/O! Normally history files are used for debugging or diagnosing field exchange issues. They are not used in production runs.
@DeniseWorthen I agree with you. During development, I activate mediator history and restart to check the exchanged fields, but @yunfangsun could disable them in his production runs. On the other hand, it would be nice to figure out the issue with the mediator I/O, so that it is available once we need to debug something with the high-res case. I bet that it is related to the serial I/O (which is the default).
I think that there is an issue with the history write alarm, nothing to do with pio - I am working on that today.
@uturuncoglu Yes, I agree that you need to switch to pnetcdf at a minimum. I also believe there is an issue w/ WW3 restarts when using PDLIB. So another idea might be testing w/o the WW3 restarts (set date%restart2%stride to some value greater than the run length). You need to set the value in seconds.
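Since the stride has to be given in seconds, a small conversion helper may be useful. This is only a sketch: the function name `min_restart_stride_seconds` and the one-second margin are assumptions, and 120 hours is used as an example run length.

```python
def min_restart_stride_seconds(run_length_hours, margin_seconds=1):
    """Smallest stride in seconds that exceeds the given run length."""
    return run_length_hours * 3600 + margin_seconds


# Example: a 120-hour run is 432000 s long, so any stride above that
# keeps the restart from being written during the run.
print(min_restart_stride_seconds(120))
```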
@DeniseWorthen Thanks. I think we need to open an issue about the WW3 restart on our end to track it. We have not tested the restart capability of the coastal-specific ocean models, so we don't know whether they will restart perfectly in a coupled application. I raised this a couple of times in our internal meetings, but at this point we don't have enough resources to check them. @janahaddad maybe you could create an issue for the restart capability, and once we have time we could focus on the restart capability of ROMS and SCHISM.
@uturuncoglu I understand about not testing restart capability at this time. The issue I've heard about second-hand is that restart writing when using PDLIB is very very slow. If you don't need WW3 restarts, I wouldn't write them.
@DeniseWorthen Okay. Thanks for the clarification and your help. I was thinking there was an issue in the restart files themselves. Good to know. Maybe this is more problematic for the high-resolution cases. @yunfangsun you could also try disabling the restart files, as @DeniseWorthen suggested, to see whether performance improves.
Hi @uturuncoglu ,
1) I have added pio_typename = PNETCDF to the 1600-core case. It did work, and the model could continue to run without hanging. The case is located at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_5
2) Using a different pio_stride value didn't change the speed. The case is located at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_5_1
Thank you for the help.
@yunfangsun From your mediator.log file, I see that it is still using the default of 4 iotasks. If you have 1600 cores for CMEPS, I would suggest that you try setting the number of iotasks higher. For a stride of 4 you should be able to get 400 I/O tasks. Set the number of tasks with pio_numiotasks=.
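To check which PIO values the mediator actually picked up, the `(med_io_init)` lines in mediator.log can be scanned with a short script. The sketch below assumes the line layout shown in the log excerpt earlier in this thread (`(med_io_init) : [update] name = value`); the function name `effective_pio_options` is hypothetical.

```python
import re

# Matches lines such as "(med_io_init) : update pio_stride = 400"
LINE_RE = re.compile(r"\(med_io_init\)\s*:\s*(?:update\s+)?(\w+)\s*=\s*(.+)")


def effective_pio_options(log_text):
    """Collect settings from mediator.log text; later lines override earlier ones."""
    options = {}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            key, value = match.groups()
            options[key] = value.strip()
    return options


sample = """\
(med_io_init) : pio_typename = NETCDF 2
(med_io_init) : pio_stride = -99
(med_io_init) : pio_numiotasks = -99
(med_io_init) : update pio_numiotasks = 4
(med_io_init) : update pio_stride = 400
(med_io_init) calling pio init
"""
opts = effective_pio_options(sample)
print(opts["pio_typename"], opts["pio_numiotasks"], opts["pio_stride"])
```

Because the "update" lines come after the initial values in the log, the dictionary ends up holding the values that were finally used.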
@yunfangsun That is great news. Glad that it worked. I agree with @DeniseWorthen about increasing the number of I/O tasks. If you have time and don't mind, could you do a couple of runs (different numbers of I/O tasks, stride, etc.) and collect some timing results? It would be nice to change one parameter at a time to see its effect. I think that would be very helpful for the future, and we could use it as a reference for other cases. In actual runs, we could disable mediator history and restart, or write them only at the end of the simulation, to optimize the I/O further.
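One way to organize such runs is to generate the list of configurations from a baseline, changing a single attribute per run. This is a minimal sketch: the helper name `one_at_a_time` is hypothetical, and the values are just the ones mentioned in this thread.

```python
def one_at_a_time(baseline, variations):
    """Yield (label, settings) pairs that differ from the baseline in one key."""
    yield "baseline", dict(baseline)
    for key, values in variations.items():
        for value in values:
            if value == baseline.get(key):
                continue  # identical to the baseline, nothing to test
            settings = dict(baseline)
            settings[key] = value
            yield f"{key}={value}", settings


baseline = {"pio_typename": "PNETCDF"}
variations = {"pio_numiotasks": [8, 16, 32], "pio_stride": [4, 8]}

for label, settings in one_at_a_time(baseline, variations):
    print(label, settings)
```

Each printed entry corresponds to one run directory, so the timing of every run can be compared directly against the baseline.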
Hi @uturuncoglu ,
Sure, I will do the test when Hercules is back online.
Hi @uturuncoglu and @DeniseWorthen ,
I have tried to use different pio_numiotasks settings:
pio_numiotasks = 8 (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_5_2)
pio_numiotasks = 16 (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_5_3)
pio_numiotasks = 32 (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_5_4)
There is no difference in speed among the three cases.
@yunfangsun Thanks for running the additional tests. The results are a little bit interesting. I wonder how frequently you write history and restart files. I think if you increase that frequency you might start seeing some difference. If you don't mind, could you check your configuration?
Hi @uturuncoglu
The frequency for the history and restart files is the same for all the tests, which is as follows
runSeq::
@3600
MED med_phases_prep_atm
MED med_phases_prep_ocn_accum
MED med_phases_prep_ocn_avg
MED med_phases_prep_wav_accum
MED med_phases_prep_wav_avg
MED -> ATM :remapMethod=redist
MED -> OCN :remapMethod=redist
MED -> WAV :remapMethod=redist
ATM
OCN
WAV
ATM -> MED :remapMethod=redist
OCN -> MED :remapMethod=redist
WAV -> MED :remapMethod=redist
MED med_phases_post_atm
MED med_phases_post_ocn
MED med_phases_post_wav
MED med_phases_history_write
MED med_phases_restart_write
@
::
@yunfangsun The history and restart file intervals are configured via ufs.configure and are not related to the run sequence. Could you share your ufs.configure?
Hi @uturuncoglu
You can check out the case in /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_5_4
The part related to history and restart is as follows:
# MED #
MED_model: cmeps
MED_petlist_bounds: 0 1599
MED_omp_num_threads: 1
MED_attributes::
ATM_model = datm
OCN_model = schism
WAV_model = ww3
history_n = 1
history_option = nhours
history_ymd = -999
coupling_mode = coastal
pio_typename = PNETCDF
pio_numiotasks = 32
::
ALLCOMP_attributes::
ScalarFieldCount = 3
ScalarFieldIdxGridNX = 1
ScalarFieldIdxGridNY = 2
ScalarFieldIdxNextSwCday = 3
ScalarFieldName = cpl_scalars
start_type = startup
restart_dir = RESTART/
case_name = ufs.cpld
restart_n = 12
restart_option = nhours
restart_ymd = -999
orb_eccen = 1.e36
orb_iyear = 2022
orb_iyear_align = 2022
orb_mode = fixed_year
orb_mvelp = 1.e36
orb_obliq = 1.e36
stop_n = 120
stop_option = nhours
stop_ymd = -999
::
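For comparing the attributes across several run directories, the blocks above can be read with a few lines of Python. This is a simplified sketch that assumes each block starts with `NAME::`, ends with a bare `::`, and contains one `key = value` pair per line; it is not a full parser for the configuration format, and `parse_attribute_block` is a hypothetical helper.

```python
def parse_attribute_block(text, block_name):
    """Return the key/value pairs of a 'NAME::' ... '::' block as strings."""
    attrs = {}
    inside = False
    for raw in text.splitlines():
        line = raw.strip()
        if not inside:
            # Skip everything until the requested block header is found.
            inside = line == f"{block_name}::"
            continue
        if line == "::":
            break  # end of the requested block
        if "=" in line and not line.startswith("#"):
            key, value = line.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


sample = """\
MED_attributes::
  ATM_model = datm
  history_n = 1
  history_option = nhours
  pio_typename = PNETCDF
  pio_numiotasks = 32
::
ALLCOMP_attributes::
  restart_n = 12
  restart_option = nhours
  stop_n = 120
::
"""
med = parse_attribute_block(sample, "MED_attributes")
allcomp = parse_attribute_block(sample, "ALLCOMP_attributes")
print(med["pio_typename"], med["pio_numiotasks"], allcomp["restart_n"])
```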
@yunfangsun It seems that you are writing a history file every hour and a restart file every 12 hours. If you don't mind, could you confirm that from your run directory?
history_n = 1
history_option = nhours
history_ymd = -999
restart_n = 12
restart_option = nhours
restart_ymd = -999
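To get a feeling for the amount of I/O these settings imply, the number of write events can be estimated from the run length. The sketch below uses stop_n = 120 from the ufs.configure shared above and assumes that a write happens at every full multiple of the interval, including the last step; `count_writes` is a hypothetical helper.

```python
def count_writes(stop_n_hours, interval_hours):
    """Number of write events if one occurs at every full interval."""
    return stop_n_hours // interval_hours


history_writes = count_writes(120, 1)   # history_n = 1, nhours
restart_writes = count_writes(120, 12)  # restart_n = 12, nhours
print(history_writes, restart_writes)
```

Under this assumption the 120-hour run writes mediator history 120 times and restarts 10 times.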
Hi @uturuncoglu ,
Yes, I can confirm it.
@uturuncoglu @saeed-moghimi-noaa @janahaddad @pvelissariou1
This is to document a bug related to med_phases_history_write and med_phases_restart_write.
The original Run Sequence for ufs_atm2sch2ww3 is as follows:
It works for the coastal_ike_shinnecock_atm2sch2ww3 case. However, when this configuration is applied to a higher-resolution mesh (HSOFS, 1.8 million nodes), the simulation stopped after 1 hour of simulation with 1600 cores. This case is located at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel on Hercules.
Then I tried removing MED med_phases_history_write and using 200 cores for the same case; it stopped after 16 hours of simulation (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test3). With the same configuration but 6000 cores, the case stopped after 12 hours of simulation (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test4).
Since restart_n = 12 is set in this ufs.configure, I then turned the restart off with restart_option = never, and the case with 6000 cores could finish the full 17-day simulation (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test4_1). The problem is that in the 200-core run the restart is also set to restart_n = 12, but the simulation stopped at 16 hours.
It seems med_phases_history_write and med_phases_restart_write cannot function well in this case (both have to be turned off). The question is then how we could correctly configure this ufs.configure.
The details are at https://github.com/oceanmodeling/ufs-weather-model/issues/103