ufs-community / ufs-srweather-app

UFS Short-Range Weather Application

[develop] Feature/cicd metrics adds methods to collect resource usage data from major stages of the SRW pipeline build job #1058

Closed BruceKropp-Raytheon closed 3 months ago

BruceKropp-Raytheon commented 3 months ago

DESCRIPTION OF CHANGES:

Updates the SRW Jenkinsfile with run-time stats collection and adds a final stage that triggers the ufs-srw-metrics stats collection job for reporting metrics.

The SRW pipeline job that uses this Jenkinsfile will now use the 'time' command when executing the major stages: init, build, and test. This collects CPU, memory, and disk usage measurements that can later be used in trend plots on a metrics dashboard.
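For readers who want to try this kind of per-stage measurement outside Jenkins, here is a minimal Python sketch. It is not the code in this PR, which relies on the 'time' command inside the Jenkinsfile. The helper name `run_stage` and the placeholder command are hypothetical, and the `resource` module is only available on Unix-like systems.

```python
import resource
import shutil
import subprocess
import sys
import time


def run_stage(name, cmd, workdir="."):
    """Run one stage command and return a dict of resource measurements.

    Hypothetical helper for illustration; not part of the SRW scripts.
    """
    disk_before = shutil.disk_usage(workdir).used
    usage_before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()

    proc = subprocess.run(cmd, cwd=workdir)

    elapsed = time.monotonic() - start
    usage_after = resource.getrusage(resource.RUSAGE_CHILDREN)
    disk_after = shutil.disk_usage(workdir).used

    return {
        "stage": name,
        "exit_code": proc.returncode,
        "elapsed_s": elapsed,
        # CPU time consumed by child processes during this stage.
        "cpu_user_s": usage_after.ru_utime - usage_before.ru_utime,
        "cpu_sys_s": usage_after.ru_stime - usage_before.ru_stime,
        # Peak resident set size over all waited-for children so far
        # (kilobytes on Linux, bytes on macOS).
        "max_rss_kb": usage_after.ru_maxrss,
        # Change in used space on the filesystem holding workdir.
        "disk_delta_bytes": disk_after - disk_before,
    }


if __name__ == "__main__":
    # Placeholder command; a real stage would call its own script here.
    stats = run_stage("init", [sys.executable, "-c", "print('init stage placeholder')"])
    print(stats)
```

The returned dictionary could be written out as one record per stage (init, build, test) and later fed to whatever dashboard produces the trend plots.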

Additionally, it adds pipeline job options that let the operator select a single test or no test suite (the default is still the 'coverage' suite), and choose the depth of wrapper script tasks to execute during functional testing (the default is still all 9 scripts).

Type of change

TESTS CONDUCTED:

Test in the usual fashion, which can run the 'coverage' or 'comprehensive' test suite, or optionally select 'none' as the test suite.

DEPENDENCIES:

DOCUMENTATION:

ISSUE:

To further the objective described by https://jira.epic.oarcloud.noaa.gov/browse/ECC-1074

CHECKLIST

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

CONTRIBUTORS (optional):

Code walk-through review with Edward Snyder

MichaelLueken commented 3 months ago

@BruceKropp-Raytheon -

Thanks for the explanation on how Jenkins sets default values with respect to the choice command! I was able to manually run both the Functional WorkflowTaskTests script (wrapper_srw_ftest.sh) and the Test script (srw_test.sh) without issue:

# Try hera with the first few simple SRW tasks ...
run_make_grid: COMPLETE
run_get_ics: COMPLETE
run_get_lbcs: COMPLETE
run_make_orog: COMPLETE
run_make_sfc_climo: COMPLETE
run_make_ics: COMPLETE
run_make_lbcs: COMPLETE
run_fcst: COMPLETE
run_post: COMPLETE
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Peru_12km_20240321200003                            COMPLETE              20.38
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200_2024032  COMPLETE               6.18
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             761.50
get_from_HPSS_ics_HRRR_lbcs_RAP_20240321200008                     COMPLETE              14.39
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               6.06
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              12.71
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240321200011  COMPLETE              10.06
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240  COMPLETE               6.35
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202403  COMPLETE             232.65
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240321  COMPLETE             305.64
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202403212  COMPLETE             332.18
pregen_grid_orog_sfc_climo_20240321200016                          COMPLETE               7.04
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1715.14

Will approve this PR now.

EdwardSnyder-NOAA commented 3 months ago

I was able to run the main scripts from this pipeline manually (wrapper_srw_ftest.sh and srw_test.sh) on the PW AWS platform without any trouble. NOAA Cloud doesn't have a specific coverage suite, so I ran the fundamental suite with srw_test.sh:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE             139.33
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              61.57
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE              13.25
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              73.10
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024040  COMPLETE             258.98
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240408211  COMPLETE              78.36
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024040821112  COMPLETE             120.94
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             745.53
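The Total row of a summary like this can be cross-checked by summing the per-experiment core hours. A minimal sketch, assuming each data row ends with the core-hours value as its last whitespace-separated field; `sum_core_hours` is a hypothetical helper, not part of srw_test.sh.

```python
def sum_core_hours(report):
    """Sum the last column of every data row in an experiment summary."""
    total = 0.0
    for line in report.splitlines():
        fields = line.split()
        # Skip blank lines, separator lines, and the Total row itself.
        if len(fields) < 3 or fields[0] == "Total":
            continue
        try:
            total += float(fields[-1])
        except ValueError:
            # Header row: the last field is text, not a number.
            continue
    return round(total, 2)


REPORT = """\
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE             139.33
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              61.57
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE              13.25
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              73.10
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024040  COMPLETE             258.98
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240408211  COMPLETE              78.36
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024040821112  COMPLETE             120.94
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             745.53
"""

if __name__ == "__main__":
    print(sum_core_hours(REPORT))
```

The result can then be compared against the value in the Total row.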

wrapper_srw_ftest.sh:

# Try noaacloud with the first few simple SRW tasks ...
run_make_grid: COMPLETE
run_get_ics: COMPLETE
run_get_lbcs: COMPLETE
run_make_orog: COMPLETE
run_make_sfc_climo: COMPLETE
run_make_ics: COMPLETE
run_make_lbcs: COMPLETE
run_fcst: COMPLETE
run_post: COMPLETE

All the other code checks out. Approving.

BruceKropp-Raytheon commented 3 months ago

Resolved conflicts for nco_dir and Jet-specific customization.

MichaelLueken commented 3 months ago

There had been issues on Orion due to bad nodes, but the latest run this morning has successfully passed. Merging this PR now.