history and restart write are hanging (or running too slow) in the ATM+SCH+WW3 case

yunfangsun commented 1 week ago

@uturuncoglu @saeed-moghimi-noaa @janahaddad @pvelissariou1

This is to document a bug related to med_phases_history_write and med_phases_restart_write.

The original Run Sequence for ufs_atm2sch2ww3 is as follows:

# Run Sequence #
runSeq::
@3600
  MED med_phases_prep_atm
  MED med_phases_prep_ocn_accum
  MED med_phases_prep_ocn_avg
  MED med_phases_prep_wav_accum
  MED med_phases_prep_wav_avg
  MED -> ATM :remapMethod=redist
  MED -> OCN :remapMethod=redist
  MED -> WAV :remapMethod=redist
  ATM
  OCN
  WAV
  ATM -> MED :remapMethod=redist
  OCN -> MED :remapMethod=redist
  WAV -> MED :remapMethod=redist
  MED med_phases_post_atm
  MED med_phases_post_ocn
  MED med_phases_post_wav
  MED med_phases_history_write
  MED med_phases_restart_write
@
::

It works for the coastal_ike_shinnecock_atm2sch2ww3 case. However, when this configuration is applied to a higher-resolution mesh (HSOFS, 1.8 million nodes), the run with 1600 cores stopped after 1 hour of simulation. This case is located at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel on Hercules.

I then removed MED med_phases_history_write and ran the same case with 200 cores; it stopped after 16 hours of simulation (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test3).

With the same configuration as above but with 6000 cores, the case stopped after 12 hours of simulation (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test4).

Because restart_n = 12 is set in this ufs.configure, I then turned restart off with restart_option = never, and the 6000-core case finished the full 17-day simulation (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test4_1).

The problem is that the 200-core run also has restart_n = 12, but that simulation stopped at hour 16.

It seems med_phases_history_write and med_phases_restart_write do not function well in this case (I have to turn off both). The question is then: how should we correctly configure this ufs.configure?

The details are at https://github.com/oceanmodeling/ufs-weather-model/issues/103

uturuncoglu commented 1 week ago

@yunfangsun There are some configuration options on the PIO side that we could try in order to optimize the I/O and prevent CMEPS from hanging when it is writing history and restart files. @jedwards4b and @DeniseWorthen might have some ideas. @jedwards4b I wonder if there is any specific I/O option that we could test with this high-resolution case on the CMEPS side?

uturuncoglu commented 1 week ago

@yunfangsun and all, I transferred this issue to CMEPS since it seems to be related to it.

uturuncoglu commented 1 week ago

@jedwards4b and @DeniseWorthen Since this is UFS, the relevant part that reads the PIO options is at https://github.com/ESCOMP/CMEPS/blob/e84e8a1f4fbe4073e82435c72459352de6077bb2/mediator/med_io_mod.F90#L177. There you can see the default PIO options for UFS/CMEPS.

uturuncoglu commented 1 week ago

@yunfangsun The following options are used by default for the rt_324728_atmschww3_06132024 configuration (as seen in the mediator.log file).

 (med_io_init) : pio_netcdf_format = 64BIT_OFFSET         512
 (med_io_init) : pio_typename = NETCDF           2
 (med_io_init) : pio_root =            1
 (med_io_init) : pio_stride =          -99
 (med_io_init) : pio_numiotasks =          -99
 (med_io_init) : update pio_numiotasks =            4
 (med_io_init) : update pio_stride =          400
 (med_io_init) : pio_rearranger = SUBSET           2
 (med_io_init) calling pio init
 (med_io_init) : pio_debug_level =            0
 (med_io_init) : pio_rearr_comm_type = P2P           0
 (med_io_init) : pio_rearr_comm_fcd = 2DENABLE           0
 (med_io_init) : pio_rearr_comm_enable_hs_comp2io =  T
 (med_io_init) : pio_rearr_comm_enable_isend_comp2io =  F
 (med_io_init) : pio_rearr_comm_max_pend_req_comp2io =            0
 (med_io_init) : pio_rearr_comm_enable_hs_io2comp =  F
 (med_io_init) : pio_rearr_comm_enable_isend_io2comp =  T
 (med_io_init) : pio_rearr_comm_max_pend_req_io2comp =           64
 (med_io_init) calling pio_set_rearr_opts

It seems that it is not using parallel I/O. You could try to use parallel I/O by setting pio_typename. Here is an example for your ufs.configure file (you just need to update the MED section).

# MED #
MED_model:                      cmeps
MED_petlist_bounds:             0 1599
MED_omp_num_threads:            1
MED_attributes::
  ATM_model = datm
  OCN_model = schism
  WAV_model = ww3
  history_n = 1
  history_option = nhours
  history_ymd = -999
  coupling_mode = coastal
  pio_typename = PNETCDF
::

You could also try a different value for pio_stride (the stride of I/O tasks across the available compute tasks), for example 4 or 8, to see whether performance improves. The default pio_rearranger is set to subset, which is fine for high processor counts. The default pio_numiotasks is 4, which does not seem to be enough. If you set pio_stride to 4, you will have 1600/4 = 400 pio_numiotasks. Anyway, please experiment with those numbers and let me know how it goes. Please also check the mediator.log file to be sure that the PIO options change according to your configuration file. If you have a successful run, you might collect timings to find the best configuration for this case. The numbers could be different for other cases that use more cores.
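
To make the stride arithmetic above concrete, here is a small Python sketch. It assumes the number of I/O tasks is simply the mediator task count divided by the stride, rounded down and never less than one; the actual logic in med_io_mod.F90 may apply additional checks, and the function name is only illustrative.

```python
def numiotasks_from_stride(ntasks: int, stride: int) -> int:
    """Approximate number of PIO I/O tasks for a given stride.

    Assumed rule (not copied from CMEPS): one I/O task every `stride`
    compute tasks, with at least one I/O task overall.
    """
    if ntasks < 1 or stride < 1:
        raise ValueError("ntasks and stride must be positive")
    return max(1, ntasks // stride)


# 1600 mediator tasks, as in MED_petlist_bounds: 0 1599
for stride in (4, 8, 400):
    print(stride, numiotasks_from_stride(1600, stride))
```

With 1600 tasks, a stride of 400 gives 4 I/O tasks, which matches the "update" lines in the log above, and a stride of 4 gives 400.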

DeniseWorthen commented 1 week ago

Outside of CESM, you can set the PIO options (number of I/O tasks, etc.) via config. See https://github.com/oceanmodeling/CMEPS/blob/437d5e6f3507b84cdabaac02b0335e86d3013dc6/mediator/med_io_mod.F90#L184

Also, a question. Why are you having mediator history files written? That is a lot of I/O! Normally history files are used for debugging or diagnosing field exchange issues. They are not used in production runs.

uturuncoglu commented 1 week ago

@DeniseWorthen I agree with you. During development, I activate mediator history and restart to check the exchanged fields, but @yunfangsun could disable them in his production runs. On the other hand, it would be nice to figure out the issue with the mediator I/O, so that it is available once we need to debug something with the high-res case. I bet that it is related to the serial I/O (which is the default).

jedwards4b commented 1 week ago

I think that there is an issue with the history write alarm, nothing to do with pio - I am working on that today.

DeniseWorthen commented 1 week ago

@uturuncoglu Yes, I agree that you need to switch to pnetcdf at a minimum. I also believe there is an issue w/ WW3 restarts when using PDLIB. So another idea might be testing w/o the WW3 restarts (set date%restart2%stride to some value greater than the run length; the value must be given in seconds).
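
As a rough illustration of the "in seconds" part, the sketch below converts a run length in hours to a stride that is slightly longer than the run. The one-hour margin and the function name are assumptions made for this example only.

```python
def restart_stride_seconds(run_length_hours: float, margin_hours: float = 1.0) -> int:
    """Return a stride, in seconds, that exceeds the run length.

    The margin is an arbitrary choice; any value larger than the
    run length should keep the restart from being written.
    """
    if run_length_hours <= 0 or margin_hours <= 0:
        raise ValueError("run length and margin must be positive")
    return int((run_length_hours + margin_hours) * 3600)


# 17-day run, as in the case described at the top of this issue
print(restart_stride_seconds(17 * 24))
```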

uturuncoglu commented 1 week ago

@DeniseWorthen Thanks. I think we need to open an issue about the WW3 restart on our end to track it. We have not tested the restart capability of the coastal-specific ocean models, so we don't know whether they will restart perfectly in a coupled application. I raised this a couple of times in our internal meetings, but at this point we don't have enough resources to check them. @janahaddad maybe you could create an issue for the restart capability, and once we have time we could focus on the restart capability of ROMS and SCHISM.

DeniseWorthen commented 1 week ago

@uturuncoglu I understand about not testing restart capability at this time. The issue I've heard about second-hand is that restart writing when using PDLIB is very very slow. If you don't need WW3 restarts, I wouldn't write them.

uturuncoglu commented 1 week ago

@DeniseWorthen Okay. Thanks for the clarification and your help. I was thinking there was an issue in the restart files themselves. Good to know. Maybe this is more problematic for the high-resolution cases. @yunfangsun you could also try to disable writing restart files, as @DeniseWorthen suggested, to see whether performance improves.

yunfangsun commented 1 week ago

Hi @uturuncoglu ,

1) I added pio_typename = PNETCDF to the 1600-core case and it worked: the model continues to run without hanging. The case is located at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_5

2) Using a different pio_stride value didn't change the speed. The case is located at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_5_1

Thank you for the help.

DeniseWorthen commented 1 week ago

@yunfangsun From your mediator.log file, I see that it is still using a default of 4 iotasks. If you have 1600 cores for CMEPS, I would suggest that you try setting the number of iotasks higher. For a stride of 4 you should be able to get 400 IO tasks. Set the number of tasks with pio_numiotasks=.
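
One way to check which PIO options the mediator actually picked up is to scan mediator.log for the (med_io_init) lines. The following is only a sketch: it assumes the log lines keep the "key = value" layout shown earlier in this thread, and the function name is made up for the example.

```python
import re

# Matches e.g. " (med_io_init) : update pio_numiotasks =            4"
PIO_LINE = re.compile(r"\(med_io_init\)\s*:\s*(?:update\s+)?(\w+)\s*=\s*(.+?)\s*$")


def read_pio_options(lines):
    """Collect PIO options from mediator.log lines.

    Later lines (including the "update" ones) overwrite earlier
    values, so the result reflects the final setting in the log.
    """
    options = {}
    for line in lines:
        match = PIO_LINE.search(line)
        if match:
            key, value = match.groups()
            options[key] = value.split()
    return options


sample = [
    " (med_io_init) : pio_typename = NETCDF           2",
    " (med_io_init) : pio_stride =          -99",
    " (med_io_init) : pio_numiotasks =          -99",
    " (med_io_init) : update pio_numiotasks =            4",
    " (med_io_init) : update pio_stride =          400",
    " (med_io_init) calling pio init",
]
opts = read_pio_options(sample)
print(opts["pio_typename"][0], opts["pio_numiotasks"][0], opts["pio_stride"][0])
```

For a real run, the same function can be fed an open file handle, for example `with open("mediator.log") as f: read_pio_options(f)`.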

uturuncoglu commented 1 week ago

@yunfangsun That is great news. Glad that it worked. I agree with @DeniseWorthen about increasing the number of I/O tasks. If you have time and don't mind, could you do a couple of runs (different number of I/O tasks, stride, etc.) and collect some timing results? It would be best to change one parameter at a time to see its effect. I think that would be very helpful for the future, and we could use it as a reference for other cases. In actual runs, we could disable mediator history and restart, or write them only at the end of the simulation, to optimize the I/O further.

yunfangsun commented 1 week ago

Hi @uturuncoglu ,

Sure, I will do the test when Hercules is back online.

yunfangsun commented 6 days ago

Hi @uturuncoglu and @DeniseWorthen ,

I have tried to use different pio_numiotasks settings:

pio_numiotasks = 8 (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_5_2)

pio_numiotasks = 16 (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_5_3)

pio_numiotasks = 32 (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_5_4)

There is no difference in speed among the three cases.

uturuncoglu commented 6 days ago

@yunfangsun Thanks for running the additional tests. The results are a little bit interesting. I wonder how frequently you write history and restart files. I think if you increase that frequency you might start seeing some difference. If you don't mind, could you check your configuration?

yunfangsun commented 6 days ago

Hi @uturuncoglu

The frequency for the history and restart files is the same for all the tests, which is as follows

runSeq::
@3600
  MED med_phases_prep_atm
  MED med_phases_prep_ocn_accum
  MED med_phases_prep_ocn_avg
  MED med_phases_prep_wav_accum
  MED med_phases_prep_wav_avg
  MED -> ATM :remapMethod=redist
  MED -> OCN :remapMethod=redist
  MED -> WAV :remapMethod=redist
  ATM
  OCN
  WAV
  ATM -> MED :remapMethod=redist
  OCN -> MED :remapMethod=redist
  WAV -> MED :remapMethod=redist
  MED med_phases_post_atm
  MED med_phases_post_ocn
  MED med_phases_post_wav
  MED med_phases_history_write
  MED med_phases_restart_write
@
::

uturuncoglu commented 6 days ago

@yunfangsun The history and restart file interval is configured via ufs.configure and is not related to the run sequence. Could you share your ufs.configure?

yunfangsun commented 6 days ago

Hi @uturuncoglu

You can check out the case in /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_5_4

The part related to history and restart is as follows:

# MED #
MED_model:                      cmeps
MED_petlist_bounds:             0 1599
MED_omp_num_threads:            1
MED_attributes::
  ATM_model = datm
  OCN_model = schism
  WAV_model = ww3
  history_n = 1
  history_option = nhours
  history_ymd = -999
  coupling_mode = coastal
  pio_typename = PNETCDF
  pio_numiotasks = 32
::

ALLCOMP_attributes::
  ScalarFieldCount = 3
  ScalarFieldIdxGridNX = 1
  ScalarFieldIdxGridNY = 2
  ScalarFieldIdxNextSwCday = 3
  ScalarFieldName = cpl_scalars
  start_type = startup
  restart_dir = RESTART/
  case_name = ufs.cpld
  restart_n = 12
  restart_option = nhours
  restart_ymd = -999
  orb_eccen = 1.e36
  orb_iyear = 2022
  orb_iyear_align = 2022
  orb_mode = fixed_year
  orb_mvelp = 1.e36
  orb_obliq = 1.e36
  stop_n = 120
  stop_option = nhours
  stop_ymd = -999
::

uturuncoglu commented 6 days ago

@yunfangsun It seems that you are writing a history file every hour and a restart file every 12 hours. If you don't mind, could you confirm it from your run directory?

  history_n = 1
  history_option = nhours
  history_ymd = -999

  restart_n = 12
  restart_option = nhours
  restart_ymd = -999

yunfangsun commented 6 days ago

Hi @uturuncoglu ,

Yes, I can confirm it.
