janahaddad opened this issue 1 month ago
@pvelissariou1 @janahaddad @saeed-moghimi-noaa @uturuncoglu
Hi Takis, do you have any experience running SCHISM+WW3 without a mediator?
I am trying to configure SCHISM+WW3 on Hercules at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test3_no_med
In the ufs.configure, I am using
# Run Sequence #
runSeq::
@3600
ATM -> OCN :remapMethod=redist
WAV -> OCN :remapMethod=redist
ATM -> WAV :remapMethod=redist
OCN -> WAV :remapMethod=redist
ATM
OCN
WAV
@
::
After I submitted the job, I received the following error messages:
20240617 141652.736 ERROR PET171 UFSDriver.F90:343 Not valid - No model was specified for component: MED
20240617 141652.736 ERROR PET171 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:794 Not valid - Passing error in return code
20240617 141652.736 ERROR PET171 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:483 Not valid - Passing error in return code
20240617 141652.736 ERROR PET171 UFS.F90:394 Not valid - Aborting UFS
Do you have any suggestions?
Thank you!
@yunfangsun What does the complete ufs.configure file look like?
Hi @pvelissariou1 , it looks like the following:
#############################################
#### NEMS Run-Time Configuration File #####
#############################################
# ESMF #
logKindFlag: ESMF_LOGKIND_MULTI
globalResourceControl: true
# EARTH #
EARTH_component_list: ATM OCN WAV
EARTH_attributes::
Verbosity = 0
::
# MED #
# ATM #
ATM_model: datm
ATM_petlist_bounds: 0 10
ATM_omp_num_threads: 1
ATM_attributes::
Verbosity = 0
DumpFields = false
ProfileMemory = false
OverwriteSlice = true
::
# OCN #
OCN_model: schism
OCN_petlist_bounds: 11 99
OCN_omp_num_threads: 1
OCN_attributes::
Verbosity = 0
DumpFields = false
ProfileMemory = false
OverwriteSlice = true
meshloc = element
CouplingConfig = none
::
# WAV #
WAV_model: ww3
WAV_petlist_bounds: 100 199
WAV_omp_num_threads: 1
WAV_attributes::
Verbosity = 0
DumpFields = false
ProfileMemory = false
merge_import = .false.
mesh_wav = hsofs_ESMFmesh.nc
multigrid = false
gridded_netcdfout = true
diro = "."
logfile = wav.log
::
# Run Sequence #
runSeq::
@3600
ATM -> OCN :remapMethod=redist
WAV -> OCN :remapMethod=redist
ATM -> WAV :remapMethod=redist
OCN -> WAV :remapMethod=redist
ATM
OCN
WAV
@
::
ALLCOMP_attributes::
ScalarFieldCount = 3
ScalarFieldIdxGridNX = 1
ScalarFieldIdxGridNY = 2
ScalarFieldIdxNextSwCday = 3
ScalarFieldName = cpl_scalars
start_type = startup
restart_dir = RESTART/
case_name = ufs.cpld
restart_n = 12
restart_option = nhours
restart_ymd = -999
orb_eccen = 1.e36
orb_iyear = 2022
orb_iyear_align = 2022
orb_mode = fixed_year
orb_mvelp = 1.e36
orb_obliq = 1.e36
stop_n = 120
stop_option = nhours
stop_ymd = -999
::
@yunfangsun You have the MED component in ufs.configure
EARTH_component_list: ATM OCN WAV MED
I am looking in:
/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test3_no_med
@yunfangsun Are you using a different location?
Hi @pvelissariou1 , I have removed that MED, and the location is /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test3_no_med. After rerunning, the error message in PET000.ESMF_LogFile becomes:
20240617 145253.019 ERROR PET000 ESMCI_Array.C:6238 ESMCI::Array::tRedistStore() Invalid argument - srcArray and dstArray must provide identical number of exclusive elements
20240617 145253.019 ERROR PET000 ESMCI_Array.C:6018 ESMCI::Array::redistStore() Invalid argument - Internal subroutine call returned Error
20240617 145253.019 ERROR PET000 ESMCI_ArrayBundle.C:1014 ESMCI::ArrayBundle::redistStore( Invalid argument - Internal subroutine call returned Error
20240617 145253.019 ERROR PET000 ESMCI_ArrayBundle_F.C:506 c_esmc_arraybundlerediststorenf( Invalid argument - Internal subroutine call returned Error
20240617 145253.019 ERROR PET000 ESMF_ArrayBundle.F90:2525 ESMF_ArrayBundleRedistStoreNF() Invalid argument - Internal subroutine call returned Error
20240617 145253.019 ERROR PET000 ESMF_FieldBundle.F90:15302 ESMF_FieldBundleRedistStoreNF Invalid argument - Internal subroutine call returned Error
20240617 145253.019 ERROR PET000 ATM-TO-OCN:src/addon/NUOPC/src/NUOPC_Connector.F90:8154 Invalid argument - Passing error in return code
20240617 145253.019 ERROR PET000 ATM-TO-OCN:src/addon/NUOPC/src/NUOPC_Connector.F90:5440 Invalid argument - Passing error in return code
20240617 145253.019 ERROR PET000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3096 Invalid argument - Phase 'IPDv05p6b' Initialize for connectorComp 1 -> 2: ATM-TO-OCN did not return ESMF_SUCCESS
20240617 145253.019 ERROR PET000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:2156 Invalid argument - Passing error in return code
20240617 145253.019 ERROR PET000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:486 Invalid argument - Passing error in return code
20240617 145253.019 ERROR PET000 UFS.F90:394 Invalid argument - Aborting UFS
@yunfangsun You are using the NUOPC_MESH cap (switch: switch_meshcap_pdlib), and NUOPC_MESH requires MED (CMEPS).
Also, there might be a mismatch between the atm mesh and the ocn/wav meshes:
ESMCI_Array.C:6238 ESMCI::Array::tRedistStore() Invalid argument - srcArray and dstArray must provide identical number of exclusive elements
You might need to recompile with the old MULTI_ESMF cap (similar to what we had in CoastalApp) to use the NUOPC connectors. That's all I know.
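For comparison, a mediator-based run sequence (what the NUOPC_MESH cap expects) looks roughly like the sketch below. The CMEPS phase names shown here are taken from the CMEPS mediator, but the exact phases and coupling interval depend on the application, so treat this as illustrative rather than a drop-in replacement:

```
# Illustrative run sequence with the CMEPS mediator (MED);
# phase names and ordering may differ per configuration
runSeq::
@3600
  ATM -> MED :remapMethod=redist
  OCN -> MED :remapMethod=redist
  WAV -> MED :remapMethod=redist
  MED med_phases_prep_atm
  MED med_phases_prep_ocn_accum
  MED med_phases_prep_ocn_avg
  MED med_phases_prep_wav_accum
  MED med_phases_prep_wav_avg
  MED -> ATM :remapMethod=redist
  MED -> OCN :remapMethod=redist
  MED -> WAV :remapMethod=redist
  ATM
  OCN
  WAV
@
::
```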
Hi @pvelissariou1 ,
Could I know if you have any examples of SCHISM+WW3 in UFS-Coastal without CMEPS?
Thank you!
@yunfangsun I don't have an example for ufs-coastal running coupled SCHISM+WW3 without CMEPS (never tried it). Why do you need to run SCHISM+WW3 this way?
@pvelissariou1
I am running the ATM+SCH+WW3 case. When I was using 200 processors, the case could run through 16 hours of simulation time (the case is at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test3). However, when I try to use 6000 processors to run the same case (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test4), the case only runs 12 hours of simulation time and gets stuck at the MED without any error messages. @saeed-moghimi-noaa thought the problem may be caused by CMEPS; that is why I would like to run a coupled SCHISM+WW3 without CMEPS.
@yunfangsun , @saeed-moghimi-noaa If it stopped after 12 hours on CMEPS, it might be a configuration issue. I am checking in:
/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test4
1. In ufs.configure, change MED_petlist_bounds: 0 100 to MED_petlist_bounds: 0 5999. The CMEPS number of cores should equal the total number of cores used by all model components.
2. In ufs.configure, stop_n = 120 instructs the simulation to stop after 120 hrs, but in your model_configure you have nhours_fcst: 900, that is, you want to run the simulation for 900 hrs (these two should match if you intend to run for 900 hrs).
3. In param.nml you have rnday=17.0 (=408 hrs), which does not match the simulation hours defined above.
4. In ww3_shel.nml you have domain%start = '20220915 000000' and domain%stop = '20221002 000000' as the start and end of the simulation time, which don't match the simulation times defined above.
Is it possible to modify all these to match and rerun the simulation? I would do it in the same folder (first delete all previously generated output and log files).
I didn't see anything else at this point. Let me know how it goes as I am interested in this as well.
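As a quick sanity check for this kind of mismatch, the simulation length implied by each file can be compared directly. A minimal sketch, with the values above transcribed by hand rather than parsed from the actual files:

```python
from datetime import datetime

# Values transcribed from this run's config files (see the list above)
stop_n_hours = 120                  # ufs.configure: stop_n (stop_option = nhours)
nhours_fcst = 900                   # model_configure: nhours_fcst
rnday_days = 17.0                   # SCHISM param.nml: rnday (in days)
ww3_start = datetime(2022, 9, 15)   # ww3_shel.nml: domain%start
ww3_stop = datetime(2022, 10, 2)    # ww3_shel.nml: domain%stop

lengths_hours = {
    "ufs.configure stop_n": float(stop_n_hours),
    "model_configure nhours_fcst": float(nhours_fcst),
    "param.nml rnday": rnday_days * 24,
    "ww3_shel.nml domain": (ww3_stop - ww3_start).total_seconds() / 3600,
}

for name, hours in lengths_hours.items():
    print(f"{name}: {hours:g} h")

# All four should agree before submitting the job
consistent = len({round(h) for h in lengths_hours.values()}) == 1
print("consistent:", consistent)   # False for this run
```

Here only param.nml and ww3_shel.nml agree (408 h), which is exactly the inconsistency described above.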
Hi @pvelissariou1 ,
Thank you for your suggestions. I have modified the configuration in the folder /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test4_1:
1. In ufs.configure, changed MED_petlist_bounds: 0 100 to MED_petlist_bounds: 0 5999.
2. In ufs.configure, stop_n = 408; in model_configure, nhours_fcst: 408.
3. In param.nml, rnday=17.0.
4. In ww3_shel.nml, domain%start = '20220915 000000' and domain%stop = '20221002 000000'.
After the modifications, the simulation still stops after 12 hours.
@yunfangsun Let me check. Have a meeting now
@yunfangsun The only thing I found, at this point, is that in your ufs.configure file stop_option is set equal to nhours, which is set to 12. Can you change stop_option = nhours to stop_option = 408 to see if it makes any difference?
Hi @pvelissariou1 ,
I have modified as follows:
ALLCOMP_attributes::
ScalarFieldCount = 3
ScalarFieldIdxGridNX = 1
ScalarFieldIdxGridNY = 2
ScalarFieldIdxNextSwCday = 3
ScalarFieldName = cpl_scalars
start_type = startup
restart_dir = RESTART/
case_name = ufs.cpld
restart_n = 12
restart_option = nhours
restart_ymd = -999
orb_eccen = 1.e36
orb_iyear = 2022
orb_iyear_align = 2022
orb_mode = fixed_year
orb_mvelp = 1.e36
orb_obliq = 1.e36
stop_n = 408
stop_option = 408
stop_ymd = -999
The job dropped right after submission; the error message in PET0000.ESMF_LogFile is as follows:
20240618 133412.726 INFO PET0000 (med_phases_profile): done
20240618 133412.747 ERROR PET0000 (med_time_alarmInit): unknown option 408
20240618 133412.747 ERROR PET0000 med.F90:2327 Failure - Passing error in return code
20240618 133412.747 ERROR PET0000 MED:src/addon/NUOPC/src/NUOPC_ModelBase.F90:2108 Failure - Passing error in return code
20240618 133412.747 ERROR PET0000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3702 Failure - Phase 'med_phases_prep_atm' Run for modelComp 4 did not return ESMF_SUCCESS
20240618 133412.747 ERROR PET0000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3940 Failure - Passing error in return code
20240618 133412.747 ERROR PET0000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3617 Failure - Passing error in return code
20240618 133412.747 ERROR PET0000 UFS.F90:411 Failure - Aborting UFS
The work folder is located at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test4_1
@yunfangsun Ok, checking ...
@yunfangsun Can you replace stop_option = 408 with stop_option = stop_n and re-run?
@yunfangsun Hold on for a while, need to check something
@yunfangsun It seems that there are a few options for nhours, and stop_option. I'll update you later.
@pvelissariou1
Thank you! I have changed it back to stop_option = stop_n and submitted the job; it is waiting for resources.
@yunfangsun Please cancel the job. I found out that stop_option = stop_n is not a valid setting. I'll copy your run folder and try to run the simulation myself. I'll keep you posted.
@pvelissariou1 thank you!
@yunfangsun , @saeed-moghimi-noaa , @janahaddad Yunfang, it seems that your issue is basically a CMEPS configuration issue. There are a few variables that control the length of the simulation (on the CMEPS side), like nhours, stop_n, stop_option, restart_n, ... I changed the value of nhours from 12 to 408 and set restart_n = 24 (instead of 12), and the simulation now stops at 24 hours (restart_n is a CMEPS variable). I didn't pay attention to all these before. Let me check a bit more and I'll get back to you. In any case, the generated ufs.configure and model_configure files need to be cleaned up to contain only variables relevant to ufs-coastal. Also, when Ufuk comes back we need to have a discussion on how to better generate these (and other) configs: UFSpy? We need to document all these and other settings thoroughly.
@yunfangsun In the ufs.configure file the options:
restart_n = 12
restart_option = nhours
are responsible for this issue. They instruct the mediator (CMEPS) to generate restart files for the coupling (not for the model components) every 12 hours. The options:
stop_n = 408
stop_option = nhours
inform the mediator that the simulation will end after 408 hours (17 days). With these settings, the simulation is expected to run for 17 days, creating CMEPS restart files every 12 hours; instead, it stops after 12 hours. I changed restart_n and restart_option to:
restart_n = 24
restart_option = never
that is, no restart files are generated. Now the simulation proceeds as expected (as of this writing it is 3 days in).
Valid values for both *_option variables are:
none,never,nsteps,nseconds,nminutes,nhours,ndays,nmonths,nyears,date,end
I have no idea why this is happening. I checked the CMEPS code but couldn't find anything. Is this a CMEPS bug, or a misconfiguration in ufs.configure?
Anyway the full simulation results are in:
/work2/noaa/nos-surge/pvelissa/coastal_ian_hsofs_atm2sch2ww3_intel_test4_1_T3
on hercules
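For reference, the combination that worked in this run, with the rest of ALLCOMP_attributes left as posted earlier in the thread:

```
# CMEPS mediator restart files disabled; simulation length 408 h (17 days)
restart_n = 24
restart_option = never
stop_n = 408
stop_option = nhours
```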
@janahaddad It is 17 days of output
@pvelissariou1 Yep, I was just making a quick note during tag-up that @yunfangsun's current run with your fix plus Ufuk's prior suggestion of turning off the MED history write sequence seems to be running OK; his run now has 2+ days of output, which he can use (even if it ends up timing out on Hercules) to compare with obs and/or gut-check against the SCHISM-only and WW3-only tests.
Maybe the pairs (history_n, history_option) and (restart_n, restart_option) behave the same way. Switching them off seems to fix the current problem. Maybe they should be removed or commented out completely from ufs.configure. In any case, this is not how it should behave; it may need to be fixed on the ESMF side.
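Assuming the history pair mirrors the restart pair, turning off the mediator history writes would look like the fragment below. The attribute names follow the same pattern as restart_n/restart_option, but I haven't verified the exact block they belong in (MED_attributes vs ALLCOMP_attributes), so treat the placement as an assumption:

```
# Hypothetical: disable CMEPS mediator history writes
history_n = 1
history_option = never
```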
Update as of Friday 6/28:
https://github.com/oceanmodeling/ufs-weather-model/issues/7#issuecomment-1890110575
https://github.com/oceanmodeling/ufs-weather-model/issues/7#issuecomment-1890141078