janahaddad opened this issue 1 month ago
@pvelissariou1 @janahaddad @saeed-moghimi-noaa @uturuncoglu
Hi Takis, do you have any experience running SCHISM+WW3 without a mediator?
I am trying to configure SCHISM+WW3 on Hercules at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test3_no_med
In the ufs.configure, I am using
# Run Sequence #
runSeq::
@3600
ATM -> OCN :remapMethod=redist
WAV -> OCN :remapMethod=redist
ATM -> WAV :remapMethod=redist
OCN -> WAV :remapMethod=redist
ATM
OCN
WAV
@
::
After I submitted the job, I received the following error messages:
20240617 141652.736 ERROR PET171 UFSDriver.F90:343 Not valid - No model was specified for component: MED
20240617 141652.736 ERROR PET171 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:794 Not valid - Passing error in return code
20240617 141652.736 ERROR PET171 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:483 Not valid - Passing error in return code
20240617 141652.736 ERROR PET171 UFS.F90:394 Not valid - Aborting UFS
Do you have any suggestions?
Thank you!
@yunfangsun What does the complete ufs.configure file look like?
Hi @pvelissariou1 , it looks like the following:
#############################################
#### NEMS Run-Time Configuration File #####
#############################################
# ESMF #
logKindFlag: ESMF_LOGKIND_MULTI
globalResourceControl: true
# EARTH #
EARTH_component_list: ATM OCN WAV
EARTH_attributes::
Verbosity = 0
::
# MED #
# ATM #
ATM_model: datm
ATM_petlist_bounds: 0 10
ATM_omp_num_threads: 1
ATM_attributes::
Verbosity = 0
DumpFields = false
ProfileMemory = false
OverwriteSlice = true
::
# OCN #
OCN_model: schism
OCN_petlist_bounds: 11 99
OCN_omp_num_threads: 1
OCN_attributes::
Verbosity = 0
DumpFields = false
ProfileMemory = false
OverwriteSlice = true
meshloc = element
CouplingConfig = none
::
# WAV #
WAV_model: ww3
WAV_petlist_bounds: 100 199
WAV_omp_num_threads: 1
WAV_attributes::
Verbosity = 0
DumpFields = false
ProfileMemory = false
merge_import = .false.
mesh_wav = hsofs_ESMFmesh.nc
multigrid = false
gridded_netcdfout = true
diro = "."
logfile = wav.log
::
# Run Sequence #
runSeq::
@3600
ATM -> OCN :remapMethod=redist
WAV -> OCN :remapMethod=redist
ATM -> WAV :remapMethod=redist
OCN -> WAV :remapMethod=redist
ATM
OCN
WAV
@
::
ALLCOMP_attributes::
ScalarFieldCount = 3
ScalarFieldIdxGridNX = 1
ScalarFieldIdxGridNY = 2
ScalarFieldIdxNextSwCday = 3
ScalarFieldName = cpl_scalars
start_type = startup
restart_dir = RESTART/
case_name = ufs.cpld
restart_n = 12
restart_option = nhours
restart_ymd = -999
orb_eccen = 1.e36
orb_iyear = 2022
orb_iyear_align = 2022
orb_mode = fixed_year
orb_mvelp = 1.e36
orb_obliq = 1.e36
stop_n = 120
stop_option = nhours
stop_ymd = -999
::
@yunfangsun You have the MED component in ufs.configure
EARTH_component_list: ATM OCN WAV MED
I am looking in:
/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test3_no_med
@yunfangsun Are you using a different location?
Hi @pvelissariou1 , I have removed that MED, and the location is /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test3_no_med. After rerunning, the error message in PET000.ESMF_LogFile becomes:
20240617 145253.019 ERROR PET000 ESMCI_Array.C:6238 ESMCI::Array::tRedistStore() Invalid argument - srcArray and dstArray must provide identical number of exclusive elements
20240617 145253.019 ERROR PET000 ESMCI_Array.C:6018 ESMCI::Array::redistStore() Invalid argument - Internal subroutine call returned Error
20240617 145253.019 ERROR PET000 ESMCI_ArrayBundle.C:1014 ESMCI::ArrayBundle::redistStore( Invalid argument - Internal subroutine call returned Error
20240617 145253.019 ERROR PET000 ESMCI_ArrayBundle_F.C:506 c_esmc_arraybundlerediststorenf( Invalid argument - Internal subroutine call returned Error
20240617 145253.019 ERROR PET000 ESMF_ArrayBundle.F90:2525 ESMF_ArrayBundleRedistStoreNF() Invalid argument - Internal subroutine call returned Error
20240617 145253.019 ERROR PET000 ESMF_FieldBundle.F90:15302 ESMF_FieldBundleRedistStoreNF Invalid argument - Internal subroutine call returned Error
20240617 145253.019 ERROR PET000 ATM-TO-OCN:src/addon/NUOPC/src/NUOPC_Connector.F90:8154 Invalid argument - Passing error in return code
20240617 145253.019 ERROR PET000 ATM-TO-OCN:src/addon/NUOPC/src/NUOPC_Connector.F90:5440 Invalid argument - Passing error in return code
20240617 145253.019 ERROR PET000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3096 Invalid argument - Phase 'IPDv05p6b' Initialize for connectorComp 1 -> 2: ATM-TO-OCN did not return ESMF_SUCCESS
20240617 145253.019 ERROR PET000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:2156 Invalid argument - Passing error in return code
20240617 145253.019 ERROR PET000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:486 Invalid argument - Passing error in return code
20240617 145253.019 ERROR PET000 UFS.F90:394 Invalid argument - Aborting UFS
@yunfangsun You are using the NUOPC_MESH cap (switch: switch_meshcap_pdlib), and NUOPC_MESH requires MED (CMEPS).
Also, there might be a mismatch between the atm mesh and the ocn/wav meshes:
ESMCI_Array.C:6238 ESMCI::Array::tRedistStore() Invalid argument - srcArray and dstArray must provide identical number of exclusive elements
You might need to recompile with the old MULTI_ESMF cap (similar to what we had in CoastalApp) to use the NUOPC connectors. That's all I know.
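For comparison, a mediator-based run sequence (what the NUOPC_MESH cap expects) looks roughly like the sketch below. The CMEPS phase names shown here are taken from the CMEPS mediator, but the exact phases and coupling interval depend on the application, so treat this as illustrative rather than a drop-in replacement:

```
# Illustrative run sequence with the CMEPS mediator (MED);
# phase names and ordering may differ per configuration
runSeq::
@3600
  ATM -> MED :remapMethod=redist
  OCN -> MED :remapMethod=redist
  WAV -> MED :remapMethod=redist
  MED med_phases_prep_atm
  MED med_phases_prep_ocn_accum
  MED med_phases_prep_ocn_avg
  MED med_phases_prep_wav_accum
  MED med_phases_prep_wav_avg
  MED -> ATM :remapMethod=redist
  MED -> OCN :remapMethod=redist
  MED -> WAV :remapMethod=redist
  ATM
  OCN
  WAV
@
::
```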
Hi @pvelissariou1 ,
Could I know if you have any examples of SCHISM+WW3 in UFS-Coastal without CMEPS?
Thank you!
@yunfangsun I don't have an example for ufs-coastal running coupled SCHISM+WW3 without CMEPS (never tried it). Why do you need to run SCHISM+WW3 this way?
@pvelissariou1
I am running the ATM+SCH+WW3 case. When I was using 200 processors, the case could run through 16 hours of simulation time (the case is at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test3). However, when I try to use 6000 processors to run the same case (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test4), the case only runs 12 hours of simulation time and gets stuck at the MED without any error messages. @saeed-moghimi-noaa thought the problem may be caused by CMEPS; that is why I would like to run a coupled SCHISM+WW3 without CMEPS.
@yunfangsun , @saeed-moghimi-noaa If it stopped after 12 hours on CMEPS, it might be a configuration issue. I am checking in:
/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test4
1. In ufs.configure, change MED_petlist_bounds: 0 100 to MED_petlist_bounds: 0 5999. The CMEPS number of cores should equal the total number of cores used by all model components.
2. In ufs.configure, stop_n = 120 instructs the simulation to stop after 120 hrs, but in your model_configure you have nhours_fcst: 900, that is, you want to run the simulation for 900 hrs (these two should match if you intend to run for 900 hrs).
3. In param.nml you have rnday=17.0 (=408 hrs), which does not match the simulation hours defined above.
4. In ww3_shel.nml you have domain%start = '20220915 000000' and domain%stop = '20221002 000000' as the start and end of the simulation time, which don't match the simulation times defined above.
Is it possible to modify all these to match and rerun the simulation? I would do it in the same folder (first delete all previously generated output and log files).
I didn't see anything else at this point. Let me know how it goes as I am interested in this as well.
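As a quick sanity check for this kind of mismatch, the simulation length implied by each file can be compared directly. A minimal sketch, with the values above transcribed by hand rather than parsed from the actual files:

```python
from datetime import datetime

# Values transcribed from this run's config files (see the list above)
stop_n_hours = 120                  # ufs.configure: stop_n (stop_option = nhours)
nhours_fcst = 900                   # model_configure: nhours_fcst
rnday_days = 17.0                   # SCHISM param.nml: rnday (in days)
ww3_start = datetime(2022, 9, 15)   # ww3_shel.nml: domain%start
ww3_stop = datetime(2022, 10, 2)    # ww3_shel.nml: domain%stop

lengths_hours = {
    "ufs.configure stop_n": float(stop_n_hours),
    "model_configure nhours_fcst": float(nhours_fcst),
    "param.nml rnday": rnday_days * 24,
    "ww3_shel.nml domain": (ww3_stop - ww3_start).total_seconds() / 3600,
}

for name, hours in lengths_hours.items():
    print(f"{name}: {hours:g} h")

# All four should agree before submitting the job
consistent = len({round(h) for h in lengths_hours.values()}) == 1
print("consistent:", consistent)   # False for this run
```

Here only param.nml and ww3_shel.nml agree (408 h), which is exactly the inconsistency described above.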
Hi @pvelissariou1 ,
Thank you for your suggestions. I have modified the configuration in the folder /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test4_1:
1. In ufs.configure, changed MED_petlist_bounds: 0 100 to MED_petlist_bounds: 0 5999.
2. In ufs.configure, stop_n = 408; in model_configure, nhours_fcst: 408.
3. In param.nml, rnday=17.0.
4. In ww3_shel.nml, domain%start = '20220915 000000' and domain%stop = '20221002 000000'.
After the modifications, the simulation still stops after 12 hours.
@yunfangsun Let me check. Have a meeting now
@yunfangsun The only thing I found, at this point, is that in your ufs.configure file stop_option is set equal to nhours, which is set to 12. Can you change stop_option = nhours to stop_option = 408 to see if it makes any difference?
Hi @pvelissariou1 ,
I have modified as follows:
ALLCOMP_attributes::
ScalarFieldCount = 3
ScalarFieldIdxGridNX = 1
ScalarFieldIdxGridNY = 2
ScalarFieldIdxNextSwCday = 3
ScalarFieldName = cpl_scalars
start_type = startup
restart_dir = RESTART/
case_name = ufs.cpld
restart_n = 12
restart_option = nhours
restart_ymd = -999
orb_eccen = 1.e36
orb_iyear = 2022
orb_iyear_align = 2022
orb_mode = fixed_year
orb_mvelp = 1.e36
orb_obliq = 1.e36
stop_n = 408
stop_option = 408
stop_ymd = -999
The job dropped right after submission; the error message in PET0000.ESMF_LogFile is as follows:
20240618 133412.726 INFO PET0000 (med_phases_profile): done
20240618 133412.747 ERROR PET0000 (med_time_alarmInit): unknown option 408
20240618 133412.747 ERROR PET0000 med.F90:2327 Failure - Passing error in return code
20240618 133412.747 ERROR PET0000 MED:src/addon/NUOPC/src/NUOPC_ModelBase.F90:2108 Failure - Passing error in return code
20240618 133412.747 ERROR PET0000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3702 Failure - Phase 'med_phases_prep_atm' Run for modelComp 4 did not return ESMF_SUCCESS
20240618 133412.747 ERROR PET0000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3940 Failure - Passing error in return code
20240618 133412.747 ERROR PET0000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3617 Failure - Passing error in return code
20240618 133412.747 ERROR PET0000 UFS.F90:411 Failure - Aborting UFS
The work folder is located at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test4_1
@yunfangsun Ok, checking ...
@yunfangsun Can you replace stop_option = 408 with stop_option = stop_n and re-run?
@yunfangsun Hold on for a while, need to check something
@yunfangsun It seems that there are a few options for nhours, and stop_option. I'll update you later.
@pvelissariou1
Thank you! I have changed it back to stop_option = stop_n and submitted the job; it is waiting for resources.
@yunfangsun Please cancel the job. I found out that stop_option = stop_n is not a valid setting. I'll copy your run folder and try to run the simulation myself. I'll keep you posted.
@pvelissariou1 thank you!
@yunfangsun , @saeed-moghimi-noaa , @janahaddad Yunfang, it seems that your issue is basically a CMEPS configuration issue. There are a few variables that control the length of the simulation (on the CMEPS side), like nhours, stop_n, stop_option, restart_n, ... I changed the value of nhours from 12 to 408 and set restart_n = 24 (instead of 12), and the simulation now stops at 24 hours (restart_n is a CMEPS variable). I didn't pay attention to all these before. Let me check a bit more and I'll get back to you. In any case, the generated ufs.configure and model_configure files need to be cleaned up to contain only variables relevant to ufs-coastal. Also, when Ufuk comes back we need to have a discussion on how to better generate these (and other) configs: UFSpy? We need to document all these and other settings thoroughly.
@yunfangsun In the ufs.configure file the options:
restart_n = 12
restart_option = nhours
are responsible for this issue. They instruct the mediator (CMEPS) to generate restart files for the coupling (not for the model components) every 12 hours. The options:
stop_n = 408
stop_option = nhours
inform the mediator that the simulation will end after 408 hours (17 days). With these settings, the simulation is expected to run for 17 days, creating CMEPS restart files every 12 hours; instead, it stops after 12 hours. I changed restart_n and restart_option to:
restart_n = 24
restart_option = never
that is, no restart files are generated. Now the simulation proceeds as expected (as of this writing it is 3 days in).
Valid values for both *_option variables are:
none,never,nsteps,nseconds,nminutes,nhours,ndays,nmonths,nyears,date,end
I have no idea why this is happening. I checked the CMEPS code but couldn't find anything. Is this a CMEPS bug, or a misconfiguration in ufs.configure?
Anyway the full simulation results are in:
/work2/noaa/nos-surge/pvelissa/coastal_ian_hsofs_atm2sch2ww3_intel_test4_1_T3
on hercules
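For reference, the combination that worked in this run, with the rest of ALLCOMP_attributes left as posted earlier in the thread:

```
# CMEPS mediator restart files disabled; simulation length 408 h (17 days)
restart_n = 24
restart_option = never
stop_n = 408
stop_option = nhours
```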
@janahaddad It is 17 days of output
@pvelissariou1 Yep, I was just making a quick note during tag-up that @yunfangsun's current run with your fix plus Ufuk's prior suggestion of turning off the MED history write sequence seems to be running OK; his run now has 2+ days of output, which he can use (even if it ends up timing out on Hercules) to compare with obs and/or gut-check against the SCHISM-only and WW3-only tests.
Maybe the pairs (history_n, history_option) and (restart_n, restart_option) behave the same way. Switching them off seems to fix the current problem. Maybe they should be removed or commented out completely from ufs.configure. In any case, this is not how it should behave; it may need to be fixed on the ESMF side.
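Assuming the history pair mirrors the restart pair, turning off the mediator history writes would look like the fragment below. The attribute names follow the same pattern as restart_n/restart_option, but I haven't verified the exact block they belong in (MED_attributes vs ALLCOMP_attributes), so treat the placement as an assumption:

```
# Hypothetical: disable CMEPS mediator history writes
history_n = 1
history_option = never
```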
Update as of Friday 6/28:
https://github.com/oceanmodeling/ufs-weather-model/issues/7#issuecomment-1890110575
https://github.com/oceanmodeling/ufs-weather-model/issues/7#issuecomment-1890141078