oceanmodeling / ufs-weather-model

This repo is forked from ufs-weather-model and contains the model code and external links needed to build the UFS Coastal model executable and its model components, including ROMS, FVCOM, ADCIRC, and SCHISM, plus the WAVEWATCH III component.
https://github.com/oceanmodeling/ufs-coastal-app

WW3 Integration #7

Closed uturuncoglu closed 1 week ago

uturuncoglu commented 1 year ago

@pvelissariou1 I am opening this issue for WW3 integration and testing. Since WW3 is an integral part of the ufs-weather-model, I'll try to test the existing configurations under CoastalApp-testsuite here to see which issues we might face.

uturuncoglu commented 11 months ago

@pvelissariou1 @saeed-moghimi-noaa I think we need to get some help for WW3. At this point, the new cap (also used by all the wave configurations under UFS Weather Model, except one single test that uses NUOPC connectors) has an issue with atm2sch2wav. I am able to run atm2sch and atm2wav without any issue. The coupling through wav2sch uses radiation stresses, which were not available through the new cap, but I activated them. At this point, the call that calculates the radiation stresses (call CalcRadstr2D( va, sxxn, sxyn, syyn)) returns all zeros, and when I checked va (the input to the call), it is all zeros too. There could be some configuration issue here that needs to be fixed.
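
For illustration only, a debug check of the kind described above might look like the lines below. This is a hypothetical print wrapped around the quoted call, not code from the WW3 source; the array names simply follow the quoted call signature.

! Hypothetical diagnostic: confirm the input spectra are non-zero before
! the radiation-stress computation, and inspect the result afterwards.
write(*,*) 'va  min/max before CalcRadstr2D: ', minval(va), maxval(va)
call CalcRadstr2D( va, sxxn, sxyn, syyn )
write(*,*) 'sxx min/max after  CalcRadstr2D: ', minval(sxxn), maxval(sxxn)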

uturuncoglu commented 11 months ago

@pvelissariou1 @saeed-moghimi-noaa It seems that the S2S application is able to provide those fields without any issue. At this point we think there is some option in ww3_grid.inp that prevents non-zero radiation stresses. Please see the discussion at https://github.com/NOAA-EMC/WW3/issues/1110. I think if we find that difference and fix the issue in the configuration, we should be able to couple WW3 with SCHISM using the new mesh cap.

yunfangsun commented 9 months ago

Hi @aliabdolali @AliS-Noaa ,

Happy New Year!

I have an issue running WW3 on Hera and Hercules. My Atlantic case has about 5M nodes (a subset of the 120 m ADCIRC mesh). On Hera I was using 800 cores for 8 hours to get 2 days of simulation results, so I would have to use about 5000 cores to finish a 1-month simulation within 8 hours.

Its timestep setting is as follows:

$ Set time steps ----------------------------------------------------- $
$ - Time step information (this information is always read)
$     maximum global time step, maximum CFL time step for x-y and
$     k-theta, minimum source term time step (all in seconds).
$
$
   100. 100. 100. 100.

The test case is at /scratch2/STI/coastal/Yunfang.Sun/ww3_hera/ian_noobc_1.

Do you have any suggestions on how to increase the simulation speed?

Thank you very much!

Best,

Yunfang

saeed-moghimi-noaa commented 9 months ago

Hi @yunfangsun

Do you know if you are doing explicit or implicit runs? Would you please paste the whole main inp file where you define the parameters?

Thanks

yunfangsun commented 9 months ago

Hi Saeed,

I am using the implicit scheme; the scheme selections are as follows:

EXPFSN = F,
  EXPFSPSI = F,
  EXPFSFCT = F,
  IMPFSN = F,
  EXPTOTAL = F,
  IMPTOTAL = T,
  IMPREFRACTION = T,
  IMPFREQSHIFT = T,
  IMPSOURCE = T,

And the whole ww3_grid.inp is as follows:

$ -------------------------------------------------------------------- $
$ WAVEWATCH III Grid preprocessor input file                           $
$ -------------------------------------------------------------------- $
$ Grid name (C*30, in quotes)
$
  'atlantic'
$
$ Frequency increment factor and first frequency (Hz) ---------------- $
$ number of frequencies (wavenumbers) and directions, relative offset
$ of first direction in terms of the directional increment [-0.5,0.5].
$ In versions 1.18 and 2.22 of the model this value was by definiton 0,
$ it is added to mitigate the GSE for a first order scheme. Note that
$ this factor is IGNORED in the print plots in ww3_outp.
$
 1.10 0.05 32 36 0.
$
$ Set model flags ---------------------------------------------------- $
$  - FLDRY         Dry run (input/output only, no calculation).
$  - FLCX, FLCY    Activate X and Y component of propagation.
$  - FLCTH, FLCK   Activate direction and wavenumber shifts.
$  - FLSOU         Activate source terms.
$
   F T T T T T
$
$ Set time steps ----------------------------------------------------- $
$ - Time step information (this information is always read)
$     maximum global time step, maximum CFL time step for x-y and
$     k-theta, minimum source term time step (all in seconds).
$
$
   100. 100. 100. 100.
   101. $ Start of namelist input section ------------------------------------ $
$   Starting with WAVEWATCH III version 2.00, the tunable parameters
$   for source terms, propagation schemes, and numerics are read using
$   namelists. Any namelist found in the folowing sections up to the
$   end-of-section identifier string (see below) is temporarily written
$   to ww3_grid.scratch, and read from there if necessary. Namelists
$   not needed for the given switch settings will be skipped
$   automatically, and the order of the namelists is immaterial.
$
$ This is TEST405
$
&SIN4 BETAMAX = 1.55, ZALP=0.006, ZWND = 5.,
Z0MAX = 0.0020, SINTHP=2.0, SWELLFPAR = 3, SWELLF = 0.80,
TAUWSHELTER = 0.0, SWELLF2=-0.018, SWELLF3= 0.015, Z0RAT = 0.04,
SWELLF4 = 100000, SWELLF5 = 1.2 /
$&SDS4 SDSBCHOICE = 1.0, SDSC2 = -0.2200E-04, SDSCUM = -0.40,
$      SDSC4 =  1.00, SDSC5 =  0.0000E+00, SDSC6 =  0.3000E+00,
$      WNMEANP =0.50, FXPM3 =4.00, FXFM3 = 2.5, FXFMAGE = 0.000,
$      SDSBINT =  0.3000E+00, SDSBCK =  0.0000E+00, SDSABK = 1.500, SDSPBK = 4.000,
$      SDSHCK = 1.50, SDSBR =   0.9000E-03, SDSSTRAIN =  0.0, SDSSTRAINA =15.0, SDSSTRAIN2 =  0.0,
$      SDSBT = 0.00, SDSP = 2.00, SDSISO = 2, SDSCOS =2.0, SDSDTH = 80.0,
$      SDSBRF1 =  0.50, SDSBRFDF = 0,
$      SDSBM0 =  1.00, SDSBM1 = 0.00, SDSBM2 = 0.00, SDSBM3 = 0.00, SDSBM4 = 0.00,
$      SPMSS =  0.50, SDKOF = 3.00, SDSMWD = 0.90, SDSFACMTF =400.0,
$      SDSMWPOW =1.5, SDSNMTF = 1.00, SDSCUMP =2.0, SDSNUW =.000E+00,
$      WHITECAPWIDTH = 0.30 WHITECAPDUR = 0.56 /
$
&OUTS E3D = 1, TH1MF = 1, STH1MF = 1 /
&UNST UGOBCAUTO = F,
  UGOBCDEPTH= -10.,
  EXPFSN = F,
  EXPFSPSI = F,
  EXPFSFCT = F,
  IMPFSN = F,
  EXPTOTAL = F,
  IMPTOTAL = T,
  IMPREFRACTION = T,
  IMPFREQSHIFT = T,
  IMPSOURCE = T,
  SETUP_APPLY_WLV = F,
  SOLVERTHR_SETUP=1E-14,
  CRIT_DEP_SETUP=0.1,
  JGS_USE_JACOBI = T,
  JGS_BLOCK_GAUSS_SEIDEL = T,
  JGS_TERMINATE_MAXITER = T,
  JGS_MAXITER = 1000,
  JGS_TERMINATE_NORM = F,
  JGS_TERMINATE_DIFFERENCE = T,
  JGS_DIFF_THR = 1.E-8,
  JGS_PMIN = 3.0,
  JGS_LIMITER = F,
  JGS_NORM_THR = 1.E-20 /
$
$ Bottom friction  - - - - - - - - - - - - - - - - - - - - - - - - - -

aliabdolali commented 9 months ago

Happy New Year, NOS and NOAA team. First, why did you pick 100 s for your time step? What is the CFL based on the minimum resolution? This is the key here. You need to calculate the resolution of your entire mesh based on physical distance (not in degrees), then calculate the group velocity based on the minimum frequency, and then the time step can be chosen with CFL = 5-10. Very classic. Second, once an optimum time step is chosen, you can change the number of iterations and the relative threshold to speed it up (I do not recommend it, as we spent a considerable amount of time fine-tuning them). Third, what version of WW3 are you using? If it is the most recent one, it should be fast enough: a two-week storm on a 5M-node mesh can be done in 8 hrs or so on ~1000-2000 CPUs. If you are using the old version of WW3 (the one which is now 2-3 years old), I'd recommend not worrying about the speed, as it is temporary and you will gain speed once you switch to the most recent version.

@sbanihash @AliS-Noaa @saeed-moghimi-noaa @pvelissariou1
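
As a rough illustration of the recipe above (an editorial sketch, not part of the thread): the lowest spectral frequency 0.05 Hz comes from the ww3_grid.inp quoted earlier, while the minimum mesh spacing, the deep-water dispersion assumption, and the CFL target are placeholders to be replaced with values from the actual mesh.

! Back-of-the-envelope CFL time step for an implicit unstructured WW3 run.
! Assumes deep-water dispersion; dx_min and cfl are illustrative inputs.
program cfl_timestep
  implicit none
  real, parameter :: g  = 9.81          ! gravity [m/s^2]
  real, parameter :: pi = 3.14159265
  real :: f_min, dx_min, cfl, cg, dt

  f_min  = 0.05     ! lowest spectral frequency [Hz], from ww3_grid.inp above
  dx_min = 120.0    ! smallest element edge [m] (assumed; take from the mesh)
  cfl    = 5.0      ! target CFL for the implicit scheme (5-10 per the advice)

  cg = g / (4.0 * pi * f_min)   ! deep-water group velocity at f_min [m/s]
  dt = cfl * dx_min / cg        ! candidate propagation time step [s]

  print '(a,f8.2,a,f8.2,a)', 'cg = ', cg, ' m/s,  dt = ', dt, ' s'
end program cfl_timestep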

yunfangsun commented 9 months ago

Hi Ali @aliabdolali

Thank you very much! I will use the CFL condition to choose the time step.

Also, could you tell me where I can change the number of iterations and the relative threshold in the namelist? I am not very familiar with it.

The WW3 I am using is the version used in UFS Coastal, which is 02693d837f2cd99d20ed08515878c2b5e9525e64 (modified 3 months ago). Is this version slower than the most recent one?

Thank you very much!

Best,

Yunfang

aliabdolali commented 9 months ago

The definitions are all listed here: https://github.com/erdc/WW3/blob/develop/model/inp/ww3_grid.inp, but as I said, I'd recommend not changing them.

A code from 3 months ago is good enough.

janahaddad commented 8 months ago

@yunfangsun I'd suggest creating a new issue for your initial time-stepping question, assigning it to yourself, and adding it to SurgeTeamCoordinationProject.

yunfangsun commented 8 months ago

@janahaddad I have done it as you suggested

yunfangsun commented 8 months ago

@uturuncoglu @pvelissariou1,

Hi Ufuk,

For the ATM+WW3 case, the run starts but only produces 20220915.000000.out_grd.ww3.nc and 20220915.010000.out_grd.ww3.nc; after that the job keeps hanging and won't move on. The log.ww3 stopped at

      0|     1| 2022/09/15 00:00:00 |   F                   | X                |
        36|     1|            01:00:00 |   X                   | X                |
  --------+------+---------------------+-----------------------+------------------+

And the PET0960.ESMF_LogFile file stopped at the following

20240112 160941.197 INFO             PET0960 (wav_comp_nuopc:wavinit_ufs) call w3init
20240112 161106.748 INFO             PET0960 (wav_import_export:fldlist_realize)(wav_import_export:realize_fields):WW3Export Field = cpl_scalars is connected on root pe
20240112 161106.761 INFO             PET0960 (wav_import_export:fldlist_realize)(wav_import_export:realize_fields):WW3Export Field = Sw_z0 is not connected.
20240112 161106.761 INFO             PET0960 (wav_import_export:fldlist_realize)(wav_import_export:realize_fields):WW3Export Field = Sw_wavsuu is not connected.
20240112 161106.761 INFO             PET0960 (wav_import_export:fldlist_realize)(wav_import_export:realize_fields):WW3Export Field = Sw_wavsuv is not connected.
20240112 161106.761 INFO             PET0960 (wav_import_export:fldlist_realize)(wav_import_export:realize_fields):WW3Export Field = Sw_wavsvv is not connected.
20240112 161106.761 INFO             PET0960 (wav_import_export:fldlist_realize)(wav_import_export:realize_fields):WW3Export Field = Sw_pstokes_x is not connected.
20240112 161106.761 INFO             PET0960 (wav_import_export:fldlist_realize)(wav_import_export:realize_fields):WW3Export Field = Sw_pstokes_y is not connected.
20240112 161106.761 INFO             PET0960 (wav_import_export:fldlist_realize)(wav_import_export:realize_fields):WW3Import Field = Si_ifrac is not connected.
20240112 161106.761 INFO             PET0960 (wav_import_export:fldlist_realize)(wav_import_export:realize_fields):WW3Import Field = So_u is not connected.
20240112 161106.761 INFO             PET0960 (wav_import_export:fldlist_realize)(wav_import_export:realize_fields):WW3Import Field = So_v is not connected.
20240112 161106.761 INFO             PET0960 (wav_import_export:fldlist_realize)(wav_import_export:realize_fields):WW3Import Field = So_t is not connected.
20240112 161106.761 INFO             PET0960 (wav_import_export:fldlist_realize)(wav_import_export:realize_fields):WW3Import Field = Sa_tbot is not connected.
20240112 161106.761 INFO             PET0960 (wav_import_export:fldlist_realize)(wav_import_export:realize_fields):WW3Import Field = Sa_u10m is connected using mesh
20240112 161106.762 INFO             PET0960 (wav_import_export:fldlist_realize)(wav_import_export:realize_fields):WW3Import Field = Sa_v10m is connected using mesh
20240112 161110.554 DEBUG            PET0960 about to destroy Mesh: 0x6118290
20240112 161121.543 INFO             PET0960 (wav_comp_nuopc):(ModelSetRunClock) called
20240112 161121.543 INFO             PET0960 (wav_comp_nuopc):(ModelSetRunClock)setting alarms for WAV

the datm.log stopped at

(shr_strdata_readstrm) opening   : era5/download_inv_fix.nc
(shr_strdata_readstrm) setting pio descriptor : era5/download_inv_fix.nc
(shr_strdata_set_stream_iodesc) setting iodesc for : u10 with dimlens(1), dimlens(2) =     1440       721   variable as time dimension time
(shr_strdata_readstrm) reading file lb: era5/download_inv_fix.nc     337
(shr_strdata_readstrm) reading file ub: era5/download_inv_fix.nc     338
 atm : model date     20220915           0
(shr_strdata_readstrm) reading file ub: era5/download_inv_fix.nc     339
 atm : model date     20220915        3600

The mediator.log stopped at


(med_time_alarmInit): creating alarm alarm_history_inst_all
(med_phases_history_write)  initialized history alarm alarm_history_inst_all  with option nhours and frequency          1

(med_phases_history_write) : history alarmname alarm_history_inst_all is ringing, interval length is     3600
(med_phases_history_write) : mclock currtime = 2022-09-15-00000 mclock nexttime = 2022-09-15-03600

(med_phases_history_set_timeinfo) writing mediator history file ufs.cpld.cpl.hi.2022-09-15-03600.nc
(med_phases_history_set_timeinfo)   currtime = 2022-09-15-00000 nexttime = 2022-09-15-03600
(med_io_wopen) creating file ufs.cpld.cpl.hi.2022-09-15-03600.nc

I have tried a few times; the job never crashes, but it also will not continue, so I have to kill it. My run folder is located at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_15740_atm_ww/coastal_ian_atm2ww3_intel_1 on Hercules.

Do you have any suggestions?

Thank you!

uturuncoglu commented 8 months ago

@yunfangsun I have just run the coastal_ike_shinnecock_atm2ww3 case and it runs without any issue. Since you are using a very high-resolution application, it might take time to calculate the required route handles on the ESMF side. How many PETs are you assigning to each component? (I have no permission to access your folder.) So keep the run in the queue and see what happens. If that does not work, we could try to attach gdb to the processes and collect backtraces to see where the issue is.
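
For reference, attaching gdb to a hung rank usually looks something like the command below (illustrative only; the executable name fv3.exe is taken from the job script quoted later in this thread, and selecting the right node and rank is left to the user):

# On the compute node hosting the suspect rank: attach, dump a backtrace, detach.
gdb --batch -p $(pgrep -u $USER -n fv3.exe) -ex bt -ex detach -ex quit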

uturuncoglu commented 8 months ago

@yunfangsun Please post the _petlist_bounds variables and their values. You could also try removing MED med_phases_history_write and the restart phase from the run sequence to see if that helps. We might need to play with the PIO (parallel I/O library used by the mediator) settings to make it more efficient.
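
As a point of reference, PIO behavior in CMEPS-based configurations is typically tuned through mediator attributes along the lines below; the attribute names and values here are assumptions for illustration only and should be checked against the CMEPS/UFS documentation for the version in use.

# MED (illustrative PIO tuning only; names and values are assumptions) #
MED_attributes::
  pio_typename = pnetcdf
  pio_stride = 8
  pio_root = 1
  pio_rearranger = box
::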

yunfangsun commented 8 months ago

@uturuncoglu I have changed the permissions on my folder; you should be able to access it at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_15740_atm_ww/coastal_ian_atm2ww3_intel_1

For the cores, I am using

# EARTH #
EARTH_component_list: ATM WAV MED
EARTH_attributes::
  Verbosity = 0
::

# MED #
MED_model:                      cmeps
MED_petlist_bounds:             0 100
MED_omp_num_threads:            1
MED_attributes::
  ATM_model = datm
  WAV_model = ww3
  history_n = 1
  history_option = nhours
  history_ymd = -999
  coupling_mode = coastal
::

# ATM #
ATM_model:                      datm
ATM_petlist_bounds:             0 100
ATM_omp_num_threads:            1
ATM_attributes::
  Verbosity = 0
  DumpFields = false
  ProfileMemory = false
  OverwriteSlice = true
::

# WAV #
WAV_model:                      ww3
WAV_petlist_bounds:             101 4999
WAV_omp_num_threads:            1
WAV_attributes::
  Verbosity = 0
  DumpFields = false
  ProfileMemory = false
  merge_import = .false.
  mesh_wav = atlantic_ESMFmesh.nc
  multigrid = false
  gridded_netcdfout = true
  diro = "."
  logfile = wav.log
::

# Run Sequence #
runSeq::
@3600
  MED med_phases_prep_atm
  MED med_phases_prep_wav_accum
  MED med_phases_prep_wav_avg
  MED -> ATM :remapMethod=redist
  MED -> WAV :remapMethod=redist
  ATM
  WAV
  ATM -> MED :remapMethod=redist
  WAV -> MED :remapMethod=redist
  MED med_phases_post_atm
  MED med_phases_post_wav
  MED med_phases_restart_write
  MED med_phases_history_write
@
::

ALLCOMP_attributes::
  ScalarFieldCount = 3
  ScalarFieldIdxGridNX = 1
  ScalarFieldIdxGridNY = 2
  ScalarFieldIdxNextSwCday = 3
  ScalarFieldName = cpl_scalars
  start_type = startup
  restart_dir = RESTART/
  case_name = ufs.cpld
  restart_n = 12
  restart_option = nhours
  restart_ymd = -999
  orb_eccen = 1.e36
  orb_iyear = 2000
  orb_iyear_align = 2000
  orb_mode = fixed_year
  orb_mvelp = 1.e36
  orb_obliq = 1.e36
  stop_n = 36
  stop_option = nhours
  stop_ymd = -999
::

uturuncoglu commented 8 months ago

@yunfangsun Please remove the history and restart writes from the run sequence and then increase the number of cores for the mediator: set it to something like 0 4999 so the mediator runs on all the processors. If this helps the case run, then try adding the mediator history and restart back to the run sequence to see what happens. I think you don't need those in your run sequence, but at least it would be nice to have the restart ones. We might also look at the PIO settings; they might help improve the I/O performance.

yunfangsun commented 8 months ago

@uturuncoglu I have changed it to:

# MED #

MED_model:                      cmeps
MED_petlist_bounds:             0 4999
MED_omp_num_threads:            1
MED_attributes::
  ATM_model = datm
  WAV_model = ww3
  history_n = 1
  history_option = nhours
  history_ymd = -999
  coupling_mode = coastal
::

# ATM #
ATM_model:                      datm
ATM_petlist_bounds:             0 100
ATM_omp_num_threads:            1
ATM_attributes::
  Verbosity = 0
  DumpFields = false
  ProfileMemory = false
  OverwriteSlice = true
::

# WAV #
WAV_model:                      ww3
WAV_petlist_bounds:             101 4999
WAV_omp_num_threads:            1
WAV_attributes::
  Verbosity = 0
  DumpFields = false
  ProfileMemory = false
  merge_import = .false.
  mesh_wav = atlantic_ESMFmesh.nc
  multigrid = false
  gridded_netcdfout = true
  diro = "."
  logfile = wav.log
::
# Run Sequence #
runSeq::
@3600
  MED med_phases_prep_atm
  MED med_phases_prep_wav_accum
  MED med_phases_prep_wav_avg
  MED -> ATM :remapMethod=redist
  MED -> WAV :remapMethod=redist
  ATM
  WAV
  ATM -> MED :remapMethod=redist
  WAV -> MED :remapMethod=redist
  MED med_phases_post_atm
  MED med_phases_post_wav
  MED med_phases_restart_write
@
::

ALLCOMP_attributes::
  ScalarFieldCount = 3
  ScalarFieldIdxGridNX = 1
  ScalarFieldIdxGridNY = 2
  ScalarFieldIdxNextSwCday = 3
  ScalarFieldName = cpl_scalars
  start_type = startup
  restart_dir = RESTART/
  case_name = ufs.cpld
  restart_n = 12
  restart_option = nhours
  restart_ymd = -999
  orb_eccen = 1.e36
  orb_iyear = 2000
  orb_iyear_align = 2000
  orb_mode = fixed_year
  orb_mvelp = 1.e36
  orb_obliq = 1.e36
  stop_n = 36
  stop_option = nhours
  stop_ymd = -999
::

Does this modification match your suggestion?

uturuncoglu commented 8 months ago

@yunfangsun Yes, that is correct. Please also remove the history and restart phases from the run sequence.

yunfangsun commented 8 months ago

Hi @uturuncoglu

::
# Run Sequence #
runSeq::
@3600
  MED -> ATM :remapMethod=redist
  MED -> WAV :remapMethod=redist
  ATM
  WAV
  ATM -> MED :remapMethod=redist
  WAV -> MED :remapMethod=redist
@
::

Is this one correct?

uturuncoglu commented 8 months ago

@yunfangsun You need to use the following:

# Run Sequence # 
runSeq::
@3600
  MED med_phases_prep_atm
  MED med_phases_prep_wav_accum
  MED med_phases_prep_wav_avg
  MED -> ATM :remapMethod=redist
  MED -> WAV :remapMethod=redist
  ATM
  WAV
  ATM -> MED :remapMethod=redist
  WAV -> MED :remapMethod=redist
  MED med_phases_post_atm
  MED med_phases_post_wav
  MED med_phases_restart_write
  MED med_phases_history_write
@
::

and you could remove

  MED med_phases_restart_write
  MED med_phases_history_write

from it and test it. If that works, try to add MED med_phases_restart_write back.

yunfangsun commented 8 months ago

@uturuncoglu

I should first try

# Run Sequence #
runSeq::
@3600
  MED med_phases_prep_atm
  MED med_phases_prep_wav_accum
  MED med_phases_prep_wav_avg
  MED -> ATM :remapMethod=redist
  MED -> WAV :remapMethod=redist
  ATM
  WAV
  ATM -> MED :remapMethod=redist
  WAV -> MED :remapMethod=redist
  MED med_phases_post_atm
  MED med_phases_post_wav
@
::

Is my understanding correct?

uturuncoglu commented 8 months ago

@yunfangsun Yes.

yunfangsun commented 8 months ago

@uturuncoglu Thank you! I have just submitted it

yunfangsun commented 8 months ago

@uturuncoglu

Now it ran for 40 hours; it stopped at 09-16 16:00, and the mediator.log shows:


  Add wevap to budgets with index           20
  Add wrunoff to budgets with index           21
  Add wfrzrof to budgets with index           22
  Add saltf to budgets with index           23
  Add     inst to budgets with index            1
  Add all_time to budgets with index            2

(med.F90:DataInitialize) read_restart =  F

(med_time_alarmInit): creating alarm med_profile_alarm

(med_time_alarmInit): creating alarm alarm_stop

and log.ww3 shows:

     1332|    37|            13:00:00 |   X                   | X                |
  --------+------+---------------------+-----------------------+------------------+
      1368|    38|            14:00:00 |   X                   | X                |
  --------+------+---------------------+-----------------------+------------------+
      1404|    39|            15:00:00 |   X                   | X                |
  --------+------+---------------------+-----------------------+------------------+
      1440|    40|            16:00:00 |   X                   | X                |
  --------+------+---------------------+-----------------------+------------------+
ymd2date currTime wav_comp_nuopc hh,mm,ss,ymd  16   0   0  20220916

Do you have any suggestions?

uturuncoglu commented 8 months ago

@yunfangsun Is there anything in the other log files, such as err, out, and datm.log? If you don't mind, could you submit the job again and see whether it fails in the same place or not? If that does not help, then please send me all the information I need to reproduce the run on my end.

yunfangsun commented 8 months ago

@uturuncoglu The datm.log seems to show no problem:

(shr_strdata_readstrm) reading file ub: era5/download_inv_fix.nc     370
 atm : model date     20220916       28800
(shr_strdata_readstrm) reading file ub: era5/download_inv_fix.nc     371
 atm : model date     20220916       32400
(shr_strdata_readstrm) reading file ub: era5/download_inv_fix.nc     372
 atm : model date     20220916       36000
(shr_strdata_readstrm) reading file ub: era5/download_inv_fix.nc     373
 atm : model date     20220916       39600
(shr_strdata_readstrm) reading file ub: era5/download_inv_fix.nc     374
(dshr_restart_write)  writing ufs.cpld.datm.r.2022-09-16-43200.nc20220916  43200
 atm : model date     20220916       43200
(shr_strdata_readstrm) reading file ub: era5/download_inv_fix.nc     375
 atm : model date     20220916       46800
(shr_strdata_readstrm) reading file ub: era5/download_inv_fix.nc     376
 atm : model date     20220916       50400
(shr_strdata_readstrm) reading file ub: era5/download_inv_fix.nc     377
 atm : model date     20220916       54000
(shr_strdata_readstrm) reading file ub: era5/download_inv_fix.nc     378
 atm : model date     20220916       57600
(shr_strdata_readstrm) reading file ub: era5/download_inv_fix.nc     379
 atm : model date     20220916       61200

The out file is also normal


101:  No. of solver iterations          20     2379724   3.65856522059484
 101:    3.00000000000000
 101:  No. of solver iterations          30     2435127   1.41561414261967
 101:    3.00000000000000
 101:  No. of solver iterations           0     1720715   30.3380762027680
 101:    3.00000000000000
 101:  No. of solver iterations          10     2163503   12.4121187290848
 101:    3.00000000000000
 101:  No. of solver iterations          20     2379713   3.65901054777672
 101:    3.00000000000000
 101:  No. of solver iterations          30     2435206   1.41241588376799
 101:    3.00000000000000
 101:  No. of solver iterations           0     1720715   30.3380762027680
 101:    3.00000000000000
 101:  No. of solver iterations          10     2164108   12.3876257340814
 101:    3.00000000000000
 101:  No. of solver iterations          20     2379921   3.65058981561026
 101:    3.00000000000000
 101:  No. of solver iterations          30     2435584   1.39711282242700
 101:    3.00000000000000
 101:  No. of solver iterations           0     1720715   30.3380762027680
 101:    3.00000000000000
 101:  No. of solver iterations          10     2164606   12.3674645580290
 101:    3.00000000000000
 101:  No. of solver iterations          20     2380067   3.64467910937802
 101:    3.00000000000000
 101:  No. of solver iterations          30     2435578   1.39735572816257
 101:    3.00000000000000
 101:  No. of solver iterations           0     1720715   30.3380762027680
 101:    3.00000000000000
 101:  No. of solver iterations          10     2164854   12.3574244542920
 101:    3.00000000000000
 101:  No. of solver iterations          20     2379836   3.65403098019751
 101:    3.00000000000000

The error file is


+ ESMF_RUNTIME_PROFILE=ON
+ export ESMF_RUNTIME_PROFILE_OUTPUT=SUMMARY
+ ESMF_RUNTIME_PROFILE_OUTPUT=SUMMARY
+ [[ intel == gnu ]]
+ sync
+ sleep 1
+ srun --label -n 5000 ./fv3.exe
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
4196: forrtl: error (78): process killed (SIGTERM)
4196: Image              PC                Routine            Line        Source
4196: libc.so.6          000014E95E53CD90  Unknown               Unknown  Unknown
4196: fv3.exe            0000000001D474CA  pdlib_w3profsmd_m        5922  w3profsmd_pdlib.F90
4196: fv3.exe            0000000001D3D635  pdlib_w3profsmd_m        2796  w3profsmd_pdlib.F90
4196: fv3.exe            0000000001BE4E68  w3wavemd_mp_w3wav        1843  w3wavemd.F90
4196: fv3.exe            00000000019E9D62  wav_comp_nuopc_mp        1126  wav_comp_nuopc.F90
4196: fv3.exe            0000000000C37EB8  Unknown               Unknown  Unknown
4196: fv3.exe            0000000000C37E27  Unknown               Unknown  Unknown
4196: fv3.exe            0000000000C36A03  Unknown               Unknown  Unknown
4196: fv3.exe            0000000000433182  Unknown               Unknown  Unknown
4196: fv3.exe            000000000205FCDD  Unknown               Unknown  Unknown
4196: fv3.exe            0000000000B36B84  Unknown               Unknown  Unknown

The whole run folder is located on Hercules/Orion at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_15740_atm_ww/coastal_ian_atm2ww3_intel_1, and the build I am using is from the coastal_ike_shinnecock_atm2ww3_intel regression test.

I have just resubmitted it and will let you know when it fails.

yunfangsun commented 8 months ago

Hi @uturuncoglu, I have tried twice by submitting job_card, and it stopped at different times, 36 hours and 37 hours. Then I tried the same setting, but with the wind read from wind.ww3 (interpolated from the same ERA5 data using ww3_prnc) and running stand-alone WW3 by submitting xmodel_slurm.job; it didn't break and kept running. All the files are located at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_15740_atm_ww/coastal_ian_atm2ww3_intel_1

uturuncoglu commented 8 months ago

@yunfangsun It seems that the issue is on the WW3 side. Maybe it is not getting the fields from DATM. Let me check.

uturuncoglu commented 8 months ago

@yunfangsun It seems you are using the inp format for the configuration, but in the RT I am using nml, and when you couple WW3 you need to change D to C for the wind; I am not seeing that kind of definition in your configuration file. So, could you use the ww3_shel.nml file from the RT, modify its simulation dates, and run again to see what happens?
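
For context, the coupled wind setting in ww3_shel.nml would look roughly like the excerpt below; this is a sketch only, and the exact namelist layout depends on the WW3 version in use.

&input_nml
  input%forcing%winds = 'C'   ! 'C' = winds received from the coupler (CDEPS/CMEPS) rather than read from file
/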

yunfangsun commented 8 months ago

@uturuncoglu I have removed ww3_shel.inp and replaced it with ww3_shel.nml; the job stopped again after 40 hours.

uturuncoglu commented 8 months ago

@yunfangsun I could run your case on Hercules with Intel. At this point the run is at 55 hours and still going. I just compiled the case with the latest version of ufs-coastal using the following command,

./compile.sh "hercules" "-DAPP=CSTLW -DPDLIB=ON" coastal intel NO NO

Except for using Intel (I am not sure, but maybe you are using GNU), I did not change anything in the configuration. Anyway, I'll let you know if it fails, but you might try with Intel on your end and let me know how it goes.

uturuncoglu commented 8 months ago

@yunfangsun The performance of the model is something like 55 model hours in 44 minutes, and this also includes initialization. So, roughly, you can do 75 model hours per wall-clock hour. It seems that you are trying to run 900 hours, so I think this will not finish in an 8-hour time window. I suggest increasing the number of cores further; maybe you could try doubling the WW3 resources. Anyway, this runs without any issue on my end.
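
(For reference, the arithmetic behind this estimate, using the figures quoted above: 55 model hours in 44 minutes is 55 / (44/60) ≈ 75 model hours per wall-clock hour, so 900 model hours would need roughly 900 / 75 = 12 wall-clock hours, which is beyond an 8-hour queue window.)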

yunfangsun commented 8 months ago

@uturuncoglu I have downloaded the newest version of ufs-coastal at /work2/noaa/nos-surge/yunfangs/ufs-coastal and compiled it using ./compile.sh "hercules" "-DAPP=CSTLW -DPDLIB=ON" coastal intel NO NO. The run folder is /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_15740_atm_ww/coastal_ian_atm2ww3_intel_3, and it stopped after 40 hours again.

I also compiled it using the regression test ./rt.sh -a coast -l rt_coastal.conf -c -k -n coastal_ike_shinnecock_atm2ww3 intel and ran it in the folder /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_1795978_atm_ww3_new/coastal_ian_atm2ww3_intel; it stopped after 39 hours.

uturuncoglu commented 8 months ago

@yunfangsun I am not sure what is wrong in your case, but mine ran until 20221012.000000 (starting from 20220915.000000) without any issue under my account. That is almost 649 hours. So, we still need to increase the resources. BTW, this is my run directory if you want to compare with yours: /work/noaa/nems/tufuk/COASTAL/coastal_ian_atm2ww3_intel_1. I am still not sure whether ww3_shel.inp is correct or not. I'll try to run the same case with the wind file removed from the run directory, using nml, to be sure it is getting the wind from CDEPS.

yunfangsun commented 8 months ago

@uturuncoglu Could you please change the permissions of the folder /work/noaa/nems/tufuk/COASTAL/coastal_ian_atm2ww3_intel_1? I can't get access to it.

uturuncoglu commented 8 months ago

@yunfangsun I did it.

pvelissariou1 commented 8 months ago

@uturuncoglu, @yunfangsun, @saeed-moghimi-noaa Thank you so much for your responses on such short notice and over the weekend. Your help is greatly appreciated. @yunfangsun Thank you for spending time to resolve all the issues. Yunfang, could you please document all the steps you followed to set up the Ian application inside your RT folder and UFS-Coastal (configuration, compilation, and run)? I am doing the same thing with my simulations, but configuring them so they can run outside the RT folder, without running them as a test case within UFS-Coastal. In the end I want to combine all the steps and requirements in one document. I'll report back on the progress and issues from all the simulations, with and without waves.

uturuncoglu commented 8 months ago

@pvelissariou1 Agreed. Documenting each step will help us implement the workflow.

yunfangsun commented 8 months ago

@uturuncoglu my atm2sch2ww3 configuration is at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_11601_atm2sch2ww3/coastal_ian_atm2sch2ww3_intel_1 Thank you for your great help!

uturuncoglu commented 8 months ago

Okay. I'll try to run that one tonight. I resubmitted your first case and it is in /work/noaa/nems/tufuk/COASTAL/coastal_ian_atm2ww3_intel_1. It is still running and has completed almost one day.

—ufuk

yunfangsun commented 8 months ago

@uturuncoglu thank you

uturuncoglu commented 8 months ago

@yunfangsun The DATM+WW3 run is finished. It is in /work/noaa/nems/tufuk/COASTAL/coastal_ian_atm2ww3_intel_1. Please check the results and let me know if you need any changes. I'll run the other job soon.

yunfangsun commented 8 months ago

@uturuncoglu Permission is denied on the result files; can you please change the permissions on them? Thank you.

pvelissariou1 commented 8 months ago

@uturuncoglu Thank you

uturuncoglu commented 8 months ago

@yunfangsun Hercules is down now, but maybe you can reach it from Orion. I fixed the permissions.

yunfangsun commented 8 months ago

@uturuncoglu thank you

yunfangsun commented 7 months ago

New configuration for stand-alone WW3, for the regression test coastal_ike_shinnecock_ww3:

WAV_attributes::
  Verbosity = 0
  DumpFields = false
  ProfileMemory = false
  merge_import = .false.
  mesh_wav = mesh.shinnecock.cdf5.nc
  multigrid = false
  gridded_netcdfout = true
  diro = "."
  logfile = wav.log
  standalone = true
::