hafs-community / HAFS

Hurricane Analysis and Forecast System

Add restart capability in HAFS prep job #238

Open BinLiu-NOAA opened 9 months ago

BinLiu-NOAA commented 9 months ago

Description

Please add restart capability to the HAFS atm_prep_mvnest job in the next upgrade.

The WCOSS standards document requires that any model job running longer than 15 minutes have restart capability. Currently, the HAFS atm_prep_mvnest job averages more than 55 minutes. Please consider adding restart capability to the HAFS atm_prep_mvnest job.

BinLiu-NOAA commented 4 months ago

The atm_prep_mvnest job in the HAFS application/workflow produces high-resolution (at the moving-nest resolution) geographical and surface climatology data for the entire parent domain. The only job that depends on its output is the forecast job. The atm_prep_mvnest job has a time window of ~70 minutes to run (from T+3:10 to T+4:20) before it could delay the forecast job's start time.

Currently in HAFSv1, the atm_prep_mvnest job uses 1 node (with 18 PEs and 6 OMP threads) and takes ~50 minutes of wallclock time. With the latest HAFSv2 package, we optimized the job and reduced the wallclock time to ~40 minutes (still using 1 node, but with 6 PEs and 20 threads).

Based on the discussion with the NCO SPAs at the HAFSv2 EE2 kickoff meeting, it was agreed that we can leave this job as is for the HAFSv2 upgrade. The job uses only 1 node and has a long time window to run before it could affect the HAFS forecast job and product delivery time: if the first try fails partway through, a second try can most likely still complete in time for the forecast job to start on schedule.
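To make the retry argument concrete, here is a rough back-of-the-envelope sketch in Python. It uses the ~70-minute window (T+3:10 to T+4:20) and the ~50 and ~40 minute wallclock times quoted above. The function name is hypothetical, and the calculation assumes the retry starts immediately after the failure (no queue wait) and reruns the whole job from scratch with the same runtime.

```python
def latest_recoverable_failure(window_min: int, runtime_min: int) -> int:
    """Latest elapsed time (minutes) at which the first attempt can fail
    while a full rerun from scratch still finishes inside the window.

    Assumption: the retry starts immediately and takes runtime_min again.
    """
    if runtime_min > window_min:
        raise ValueError("a single run does not fit in the window")
    return window_min - runtime_min


# Window from T+3:10 to T+4:20, expressed in minutes.
WINDOW_MIN = (4 * 60 + 20) - (3 * 60 + 10)  # 70

if __name__ == "__main__":
    print(latest_recoverable_failure(WINDOW_MIN, 50))  # HAFSv1 (~50 min): 20
    print(latest_recoverable_failure(WINDOW_MIN, 40))  # HAFSv2 (~40 min): 30
```

Under these assumptions, a from-scratch retry of the ~40 minute HAFSv2 job still fits if the first attempt fails within roughly its first 30 minutes; a later failure would push the retry past T+4:20.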

Moving forward, we are considering several approaches to speed up the atm_prep_mvnest job. One is to speed up the serial executables by enabling OMP threading. Another is to split the job into a few smaller jobs. We can also consider adding the suggested restart capability for the atm_prep_mvnest job. We will work toward these in the next HAFS upgrade.
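As one possible illustration of what step-level restart could look like, here is a minimal Python sketch that records a marker file after each completed step and skips completed steps on rerun. It is not the HAFS implementation: the function name, the step names, and the `.done` marker convention are all assumptions made for the example.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def run_with_restart(steps: Dict[str, Callable[[], None]], marker_dir: Path) -> List[str]:
    """Run steps in order, skipping any step whose marker file already exists.

    A marker is written only after its step returns without raising, so a
    rerun after a failure resumes at the first step that did not complete.
    Returns the names of the steps executed in this invocation.
    """
    marker_dir.mkdir(parents=True, exist_ok=True)
    executed: List[str] = []
    for name, step in steps.items():
        marker = marker_dir / f"{name}.done"  # hypothetical marker naming
        if marker.exists():
            continue  # completed in an earlier attempt
        step()  # an exception here leaves no marker for this step
        marker.touch()
        executed.append(name)
    return executed


if __name__ == "__main__":
    # Hypothetical step names; the real job's internal stages may differ.
    demo_steps = {
        "make_grid": lambda: None,
        "make_orog": lambda: None,
        "make_sfc_climo": lambda: None,
    }
    with tempfile.TemporaryDirectory() as tmp:
        first = run_with_restart(demo_steps, Path(tmp))
        second = run_with_restart(demo_steps, Path(tmp))
        print(first)   # all three steps run on the first attempt
        print(second)  # nothing left to run on the second attempt
```

The same marker idea would also fit the "few smaller jobs" option, since each smaller job could check for the outputs of the previous one before starting.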