ufs-community / ufs-weather-model

UFS Weather Model

cpld_control_p8 regression test cannot be run for a longer time (multiple days) #2134

Open cenlinhe opened 6 months ago

cenlinhe commented 6 months ago

Description

I tried to run the default ufs-weather-model (https://github.com/ufs-community/ufs-weather-model) in the fully coupled configuration (cpld_control_p8) for multiple days (e.g., 10 days). After the regression test for cpld_control_p8 completed successfully under the /tests/ directory, I went to the generated run folder and changed nhours_fcst from 24 hr to 240 hr in the model_configure file, and also changed stop_n from 24 hr to 240 hr under ALLCOMP_attributes in the ufs.configure file. Then I resubmitted the job_card. However, the model hangs after 42 hours of forecast without crashing or generating any new output, while the walltime keeps counting. I ran the model on the NCAR Derecho HPC.

In my err file, it showed something like:

pe=00042 FAIL at line=02510 ExtDataGridCompMod.F90
pe=00042 FAIL at line=01394 ExtDataGridCompMod.F90
pe=00042 FAIL at line=01901 MAPL_Generic.F90
pe=00042 FAIL at line=01243 MAPL_CapGridComp.F90
pe=00042 FAIL at line=01206 MAPL_CapGridComp.F90
pe=00042 FAIL at line=01166 MAPL_CapGridComp.F90
pe=00042 FAIL at line=00834 MAPL_CapGridComp.F90
pe=00042 FAIL at line=00974 MAPL_CapGridComp.F90

In my PET log file, it showed something like:

20240218 192423.615 INFO PET042 UFS Aerosols: Advancing from 2021-03-24T00:00:00 to 2021-03-24T00:12:00
20240218 192423.619 ERROR PET042 Aerosol_Cap.F90:464 Failure - Passing error in return code
20240218 192423.619 ERROR PET042 CHM:src/addon/NUOPC/src/NUOPC_ModelBase.F90:2218 Failure - Passing error in return code
20240218 192423.619 ERROR PET042 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3702 Failure - Phase 'RunPhase1' Run for modelComp 3 did not return ESMF_SUCCESS
20240218 192423.619 ERROR PET042 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3940 Failure - Passing error in return code
20240218 192423.619 ERROR PET042 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:3617 Failure - Passing error in return code
20240218 192423.619 ERROR PET042 UFS.F90:403 Failure - Aborting UFS
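For anyone triaging a similar hang, a small script can pull the first ERROR entry out of a PET log together with the last INFO entry before it, which shows the model time at which the failure started. This is only a sketch: the line layout (date, time, level, PET id, message) is assumed from the excerpt above, and the name `first_error` is hypothetical, not part of the UFS or ESMF tooling.

```python
import re

# Assumed layout, taken from the log excerpt above:
#   YYYYMMDD HHMMSS.mmm LEVEL PETnnn message
LOG_LINE = re.compile(r"^(\d{8})\s+(\d{6}\.\d+)\s+(\w+)\s+(PET\d+)\s+(.*)$")


def first_error(lines):
    """Return (last INFO message before the error, first ERROR message).

    Returns None when no ERROR entry is found. Lines that do not match
    the assumed layout are skipped.
    """
    last_info = None
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue
        level, message = match.group(3), match.group(5)
        if level == "ERROR":
            return last_info, message
        if level == "INFO":
            last_info = message
    return None


sample = [
    "20240218 192423.615 INFO PET042 UFS Aerosols: Advancing from "
    "2021-03-24T00:00:00 to 2021-03-24T00:12:00",
    "20240218 192423.619 ERROR PET042 Aerosol_Cap.F90:464 Failure - "
    "Passing error in return code",
    "20240218 192423.619 ERROR PET042 UFS.F90:403 Failure - Aborting UFS",
]
print(first_error(sample))
```

Applied to the excerpt above, this points at Aerosol_Cap.F90 as the first failing source location, during the aerosol step starting at 2021-03-24T00:00:00.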

It seems that the chemistry component is failing. I also tested another configuration, cpld_control_noaero_p8, which runs successfully for a longer period (e.g., 10 days).

Is this caused by chemistry input files missing from the cpld_control_p8 setup, which would prevent a longer run, or is there a potential bug? How can I run the cpld_control_p8 test case for a longer period?

Thank you!

To Reproduce:

I am running the ufs-weather-model (https://github.com/ufs-community/ufs-weather-model) on the NCAR Derecho HPC using the Intel compiler, following the default UFS Derecho setup. To reproduce the behavior:

  1. Run the default regression test for cpld_control_p8.
  2. Go to the run directory and change nhours_fcst from 24 hr to 240 hr in the model_configure file, and change stop_n from 24 hr to 240 hr under ALLCOMP_attributes in the ufs.configure file.
  3. Resubmit the job.
  4. The model hangs after 42 hours of forecast without crashing or generating any new output.
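
The edits in step 2 can also be scripted. The sketch below is only an illustration: it assumes the two settings appear as plain `key: value` or `key = value` lines (for example `nhours_fcst: 24` and `stop_n = 24`), and the helper names `set_value` and `extend_forecast` are hypothetical. Note that it rewrites every line starting with the given key, so it does not restrict the stop_n change to the ALLCOMP_attributes block.

```python
import re
from pathlib import Path


def set_value(text, key, value):
    """Replace the value on every 'key: value' or 'key = value' line.

    Raises KeyError if the key does not appear in the text.
    """
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*[:=][ \t]*)\S+", re.MULTILINE
    )
    new_text, count = pattern.subn(lambda m: m.group(1) + str(value), text)
    if count == 0:
        raise KeyError(f"{key} not found")
    return new_text


def extend_forecast(run_dir, hours):
    """Set the forecast length (in hours) in both configuration files."""
    for filename, key in (
        ("model_configure", "nhours_fcst"),
        ("ufs.configure", "stop_n"),
    ):
        path = Path(run_dir) / filename
        path.write_text(set_value(path.read_text(), key, hours))


# Example on in-memory text; the file contents here are assumed, not copied
# from an actual run directory.
print(set_value("nhours_fcst:             24\n", "nhours_fcst", 240))
print(set_value("  stop_n = 24\n  stop_option = nhours\n", "stop_n", 240))
```

Calling `extend_forecast(run_dir, 240)` on the generated run directory would then apply both changes before the job is resubmitted.
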
DeniseWorthen commented 6 months ago

@cenlinhe You've correctly diagnosed the failure as being due to a lack of input data. See also https://github.com/ufs-community/ufs-weather-model/issues/1204.

Do you need the aerosol component active? If yes, I would suggest you use the global workflow.

cenlinhe commented 6 months ago

Thank you for the clarification!

jkbk2004 commented 6 months ago

@cenlinhe Keep me posted if the coupled extended run is needed. On the EPIC side, we are trying to reset a 35-day forecast run as our weekly test case.