ufs-community / ufs-srweather-app

UFS Short-Range Weather Application

[develop] Update SRW with spack-stack version 1.5.0 (from 1.4.1) #969

Closed RatkoVasic-NOAA closed 7 months ago

RatkoVasic-NOAA commented 10 months ago

DESCRIPTION OF CHANGES:

Update SRW with spack-stack 1.5.0

Machines affected:

Type of change

TESTS CONDUCTED:

Fundamental tests performed.

DEPENDENCIES:

PR #973 and its follow-up PR

ISSUE:

This solves issue #946. Issue to be solved: #991

CHECKLIST

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

gsketefian commented 10 months ago

@RatkoVasic-NOAA If this is going to update METplus, we should wait to merge this until PR #973 is in, along with another PR lined up after that (by me) which changes many of the METplus config files.

natalie-perlin commented 9 months ago

@MichaelLueken @RatkoVasic-NOAA - some updates on the issues with Gaea C5, where the runtime error occurs during the make_grid task (and likely the following ones).

Bringing in changes from release/public-v2.2.0 did not solve the problem (PR to Ratko's ss150 branch: https://github.com/RatkoVasic-NOAA/ufs-srweather-app/pull/4). The issue is indeed related to the changes where conda is installed as part of the SRW build. The libstdc++.so.6 used to link the regional_esg_grid executable and the one needed by another conda library at runtime come from different locations/paths, which creates a conflict at runtime. A likely solution is to explicitly specify the library path (using rpath?) when linking the executable.

I'm still looking for a way to fix this issue.

More details below, in case someone has hit similar issues and found a quick solution.

The library used by the locally installed conda during the SRW build: libstdc++.so.6 => /lustre/f2/scratch/ncep/Natalie.Perlin/C5/SRW/srw-ss150/conda/lib/././libstdc++.so.6 (the directory /lustre/f2/scratch/ncep/Natalie.Perlin/C5/SRW/srw-ss150/ is equivalent to ./ufs-srweather-app/ )

The library used when building the executable: libstdc++.so.6 => /opt/cray/pe/gcc/10.3.0/snos/lib/../lib64/libstdc++.so.6
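A minimal sketch of the rpath approach mentioned above, assuming the conda libraries live under a `conda/lib` directory inside the App (the path below is an illustrative placeholder, not the real install location):

```shell
# Sketch only: bake the conda lib directory into the executable's rpath at
# link time so the runtime loader resolves the same libstdc++.so.6 that the
# linker saw. CONDA_LIB is an illustrative placeholder path.
CONDA_LIB="/path/to/ufs-srweather-app/conda/lib"

# Explicit rpath passed to the linker (e.g. via CMAKE_EXE_LINKER_FLAGS):
LDFLAGS="-Wl,-rpath,${CONDA_LIB}"
echo "${LDFLAGS}"

# After relinking, confirm which library the loader would actually pick:
#   ldd regional_esg_grid | grep 'libstdc++'
```

An rpath baked into the binary takes precedence over the default system search paths, which is what makes the two libstdc++ copies stop conflicting.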

natalie-perlin commented 8 months ago

PR https://github.com/RatkoVasic-NOAA/ufs-srweather-app/pull/5 addresses the changes needed for Gaea C5 and fixes a bug in the devclean.sh script.

The Gaea C5 fundamental tests passed, except for the one that was corrected later:
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              19.72
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              25.59
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE              13.38
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  DEAD                  26.13
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR          COMPLETE              35.14
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0              COMPLETE              33.92
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              49.47
----------------------------------------------------------------------------------------------------

The grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot test after correction of the modulefile for plotting task:

(workflow_tools) [Natalie.Perlin@gaea55:/lustre/f2/scratch/ncep/Natalie.Perlin/C5/SRW/expt_dirs/grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot]$ rocotostat -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10
       CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
201907010000               make_grid                   134695715           SUCCEEDED                   0         1          26.0
201907010000               make_orog                   134695724           SUCCEEDED                   0         1          61.0
201907010000          make_sfc_climo                   134695735           SUCCEEDED                   0         1          47.0
201907010000           get_extrn_ics                    77187460           SUCCEEDED                   0         1          22.0
201907010000          get_extrn_lbcs                    77187461           SUCCEEDED                   0         1          17.0
201907010000         make_ics_mem000                   134695744           SUCCEEDED                   0         1          69.0
201907010000        make_lbcs_mem000                   134695745           SUCCEEDED                   0         1          99.0
201907010000         run_fcst_mem000                   134695761           SUCCEEDED                   0         1         569.0
201907010000    run_post_mem000_f000                   134695771           SUCCEEDED                   0         1          24.0
201907010000    run_post_mem000_f001                   134695780           SUCCEEDED                   0         1          26.0
201907010000    run_post_mem000_f002                   134695781           SUCCEEDED                   0         1          25.0
201907010000    run_post_mem000_f003                   134695801           SUCCEEDED                   0         1          25.0
201907010000    run_post_mem000_f004                   134695802           SUCCEEDED                   0         1          26.0
201907010000    run_post_mem000_f005                   134695803           SUCCEEDED                   0         1          27.0
201907010000    run_post_mem000_f006                   134695804           SUCCEEDED                   0         1          24.0
201907010000            plot_allvars                   134695999           SUCCEEDED                   0         1         331.0

MichaelLueken commented 8 months ago

@RatkoVasic-NOAA - Thank you very much for making this quick change! The automated coverage tests all successfully pass now. Since the SRW App is now successfully building and running on Gaea C5, am I clear to change the status of this PR back to In Review and launch the automated comprehensive WE2E tests? Thanks!

RatkoVasic-NOAA commented 8 months ago

> am I clear to change the status of this PR back to In Review and launch the automated comprehensive WE2E tests?

Yes, please.

MichaelLueken commented 8 months ago

The automated comprehensive tests have been submitted. The pipeline can be found here:

https://jenkins.epic.oarcloud.noaa.gov/blue/organizations/jenkins/ufs-srweather-app%2Fpipeline/detail/PR-969/2/pipeline

MichaelLueken commented 8 months ago

@RatkoVasic-NOAA -

The ss150 branch failed to build on Derecho. The error is:

Lmod is automatically replacing "intel/2023.0.0" with "intel-classic/2023.0.0".

Lmod has detected the following error:  The following module(s) are unknown: "nemsio/2.5.2" "w3emc/2.10.0" "ip/4.3.0"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "nemsio/2.5.2" "w3emc/2.10.0" "ip/4.3.0"

Derecho is still using hpc-stack, rather than spack-stack. On Derecho, there is only w3emc/2.9.2 and ip/4.1.0. The modulefiles/srw_common.lua file will need to be updated to use these module versions.
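That edit can be sketched as a small helper (illustrative only; it assumes the version strings appear verbatim in srw_common.lua, and the correct nemsio version on Derecho would still need to be confirmed):

```shell
# Sketch: pin srw_common.lua to the module versions actually present in
# Derecho's hpc-stack. pin_derecho_versions is a hypothetical helper,
# not part of the App.
pin_derecho_versions() {
  sed -i \
    -e 's|w3emc/2.10.0|w3emc/2.9.2|' \
    -e 's|ip/4.3.0|ip/4.1.0|' \
    "$1"   # path to modulefiles/srw_common.lua
}
```

Usage would be something like `pin_derecho_versions modulefiles/srw_common.lua` from the App's top level.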

MichaelLueken commented 8 months ago

@RatkoVasic-NOAA -

On Gaea and Gaea C5, the Functional Workflow Task Tests failed.

The reason for the failure on Gaea C5 was that jinja2 was not loaded. Looking in .cicd/scripts/srw_ftest.sh, I see the following:

conda activate srw_app

Similar to what you did in ush/load_modules_wflow.sh, please add logic to .cicd/scripts/srw_ftest.sh so that the automated tests activate workflow_tools rather than srw_app.
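One way to sketch that selection logic (illustrative only; `pick_env` is a hypothetical helper, not part of srw_ftest.sh):

```shell
# Sketch: choose workflow_tools when it is among the available conda envs,
# otherwise fall back to srw_app. In the real script the env list would
# come from `conda env list`.
pick_env() {
  case "$1" in
    *workflow_tools*) echo "workflow_tools" ;;
    *)                echo "srw_app" ;;
  esac
}
```

The script could then run `conda activate "$(pick_env "$(conda env list)")"`.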

On Gaea, the run_make_sfc_climo task failed with rc=1. Looking in the log, at the very end, the task encountered a segfault. Please see /lustre/f2/dev/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-969/gaea/expt_dirs/test_community/run_make_sfc_climo-log.txt for more details.

RatkoVasic-NOAA commented 8 months ago

@MichaelLueken There were missing directories for year 2024 on Gaea C4 (/lustre/f2/darshan/2024/*/*). I created those directories. Can you please try C4 again?

MichaelLueken commented 8 months ago

@RatkoVasic-NOAA -

Gaea and Gaea C5 have successfully cleared the Functional Workflow Task Tests phase in the pipeline.

The pipeline was able to successfully build on Derecho. However, after passing the build, the Functional Workflow Task Tests phase is now failing. The error message can be seen in the pipeline:

https://jenkins.epic.oarcloud.noaa.gov/blue/organizations/jenkins/ufs-srweather-app%2Fpipeline/detail/PR-969/3/pipeline/213

mkstemp: No such file or directory
qsub: could not create/open tmp file /glade/scratch/epicufsrt/.tmp/pbsscrpt3ZukW7

It isn't clear what the issue is. Additionally, there are no output files, which makes it harder to figure out the reason for the failure.

MichaelLueken commented 8 months ago

@RatkoVasic-NOAA -

My attempt to replicate the failure from the automated Jenkins tests didn't pan out - the Workflow Task Tests script worked when I manually ran it in my own directory:

# Try derecho with the first few simple SRW tasks ...
run_make_grid: COMPLETE
run_get_ics: COMPLETE
run_get_lbcs: COMPLETE
run_make_orog: COMPLETE
run_make_sfc_climo: COMPLETE
run_make_ics: COMPLETE
run_make_lbcs: COMPLETE
run_fcst: COMPLETE
run_post: COMPLETE

I will retry submitting the test in the morning (hopefully the rest of the comprehensive tests will complete by that time). If it continues to fail, I will manually run the automated test script on Derecho so that we can move forward with this work.

The comprehensive tests have successfully passed on Hercules.

MichaelLueken commented 8 months ago

@RatkoVasic-NOAA - Here is the status report for the Jenkins tests:

The Jet tests are still running, but it looks like they will successfully pass without issue.

The Gaea C5 comprehensive tests have all passed successfully.

The Hercules comprehensive tests have all passed successfully.

Unfortunately, there were several failures while running the comprehensive tests on Orion. All of the verification WE2E tests have failed. The following error message is found in the log files:

Loading modules for task "run_vx" ...
Lmod has detected the following error: These module(s) or extension(s) exist
but cannot be loaded as requested: "python/3.10.8"
   Try: "module spider python/3.10.8" to see how to load the module(s).

It's not clear to me why the App is encountering issues with loading python/3.10.8 on Orion. Attempting to add:

prepend_path("MODULEPATH", "/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.0/envs/unified-env/install/modulefiles/Core")
prepend_path("MODULEPATH", "/work/noaa/da/role-da/spack-stack/modulefiles")

load("stack-intel/2022.0.2")

directly to modulefiles/tasks/orion/run_vx.local.lua doesn't correct the issue either.

There was a single failure in the comprehensive tests on Gaea. The grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta test failed in the run_fcst task step with the log indicating a bus error:

srun: error: nid00026: task 664: Bus error (core dumped)
srun: Terminating StepId=269461812.0
slurmstepd: error: *** STEP 269461812.0 ON nid00000 CANCELLED AT 2024-01-02T21:43:12 ***

I suspect that a rerun would allow the test to successfully pass.

I went ahead and manually ran the comprehensive tests on Derecho and there were only four failures:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
2020_CAD                                                           COMPLETE              34.55
community                                                          COMPLETE              41.93
custom_ESGgrid                                                     COMPLETE              15.22
custom_ESGgrid_Central_Asia_3km                                    DEAD                   0.83
custom_ESGgrid_IndianOcean_6km                                     COMPLETE              23.09
custom_ESGgrid_NewZealand_3km                                      DEAD                   0.99
custom_ESGgrid_Peru_12km                                           COMPLETE              23.11
custom_ESGgrid_SF_1p1km                                            COMPLETE             147.06
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE      COMPLETE              11.43
custom_GFDLgrid                                                    COMPLETE              10.66
deactivate_tasks                                                   COMPLETE               0.98
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE             693.23
get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS                             COMPLETE              21.32
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16      COMPLETE              16.83
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta   DEAD                   2.12
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot        DEAD                   0.73
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR                 COMPLETE             170.42
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP              COMPLETE              33.14
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              36.34
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR             COMPLETE              32.90
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta      COMPLETE              32.57
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE              12.74
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              26.35
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              26.91
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR             COMPLETE              37.48
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP              COMPLETE              78.53
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta      COMPLETE              39.30
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP                 COMPLETE              19.22
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2        COMPLETE              12.43
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              44.65
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta            COMPLETE              35.14
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2         COMPLETE             228.78
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE             310.50
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16           COMPLETE             308.63
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR              COMPLETE             334.54
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta       COMPLETE             335.57
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16   COMPLETE              32.53
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR           COMPLETE              28.51
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              27.58
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16   COMPLETE              20.58
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR          COMPLETE              31.99
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              17.77
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16    COMPLETE             254.15
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR            COMPLETE             270.10
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta     COMPLETE             270.79
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP                 COMPLETE              79.01
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0         COMPLETE              31.90
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR                COMPLETE              40.62
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0              COMPLETE              31.29
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16               COMPLETE              50.24
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot      COMPLETE              16.92
MET_ensemble_verification_only_vx                                  COMPLETE               0.86
MET_verification_only_vx                                           COMPLETE               0.17
nco                                                                COMPLETE              20.00
nco_ensemble                                                       COMPLETE             113.73
nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16      COMPLETE              31.06
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              23.56
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thom  COMPLETE             304.79
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR       COMPLETE              26.90
pregen_grid_orog_sfc_climo                                         COMPLETE              13.33
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS                              COMPLETE              11.83
specify_template_filenames                                         COMPLETE              14.62
----------------------------------------------------------------------------------------------------
Total                                                              DEAD                4965.05

The failures are caused by issues in make_sfc_climo in UFS_UTILS. Additional work will be required to make the UFS_UTILS repository work properly on Derecho. These are known issues (please see issue #947 for more details).

natalie-perlin commented 8 months ago

@MichaelLueken - It looks like the UFS-WM now supports Derecho and uses spack-stack/1.5.0 for the Intel compilers and spack-stack/1.5.1 for GNU. Could this PR be a good place to update Derecho to spack-stack/1.5.0 for the SRW as well?

MichaelLueken commented 8 months ago

@natalie-perlin - Since this PR is intended to update the spack-stack version to 1.5.0, it would be fine to include transitioning Derecho from hpc-stack to spack-stack, especially since the UFS-WM has made this change as well. With the current issue being encountered on Orion (the verification WE2E tests are failing due to the inability to load python/3.10.8), there is time to try and make spack-stack v1.5.0 work on Derecho. Thanks!

MichaelLueken commented 8 months ago

@RatkoVasic-NOAA -

I have made some progress on the verification issue on Orion. When I added load("build_orion_intel") to the top of modulefiles/tasks/orion/run_vx.local.lua, I was able to successfully load the task modulefile. Following this, I created a test file containing all of the tests that had failed initially:

grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0
MET_ensemble_verification_only_vx
MET_verification_only_vx

With this setup, all of the previously failing tests now pass:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP                 COMPLETE              11.44
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              22.21
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              28.82
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR            COMPLETE             330.09
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0         COMPLETE              14.87
MET_ensemble_verification_only_vx                                  COMPLETE               1.33
MET_verification_only_vx                                           COMPLETE               0.28
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             409.04

MichaelLueken commented 8 months ago

A rerun of the Jenkins tests on Derecho shows that the tests continue to fail in the Workflow Tasks Test phase. Since the Workflow Tasks Test script runs fine manually, I have opened PSD-69 with the Platform team to see if they can think of any reason for this failure. At this point, I can only assume that there is an issue with the epicufsrt environment on Derecho.

MichaelLueken commented 8 months ago

@RatkoVasic-NOAA and @natalie-perlin -

The issue on Derecho has been identified. The /glade/u/home/epicufsrt/.bashrc file hasn't been updated to transition from Cheyenne to Derecho. Within this file, the TMPDIR variable is being set to /glade/scratch/epicufsrt/.tmp. This location doesn't exist on Derecho. It needs to be set to /glade/derecho/scratch/epicufsrt/.tmp. This is the culprit for the failed Workflow Task Tests on the machine (and likely the failure of the standard WE2E tests, if the Workflow Task Tests were deactivated for Derecho).

Unfortunately, I don't have access to the epicufsrt role account on Derecho. I have reached out to Jong and the Platform team to see if they can correct the entry in the .bashrc file, but I don't know if they can. If push comes to shove, we might need to add:

export TMPDIR=/glade/derecho/scratch/epicufsrt/.tmp

directly into .cicd/scripts/wrapper_srw_ftest.sh. I'll let you know what happens.
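A more defensive variant of that workaround could be sketched as follows (`safe_tmpdir` is a hypothetical helper, not part of the App):

```shell
# Sketch: fall back to /tmp when the preferred scratch location does not
# exist, so qsub never sees a dangling TMPDIR. safe_tmpdir is hypothetical.
safe_tmpdir() {
  if [ -d "$1" ]; then echo "$1"; else echo "/tmp"; fi
}

TMPDIR="$(safe_tmpdir /glade/derecho/scratch/epicufsrt/.tmp)"
export TMPDIR
```

This would keep the wrapper script working even if the scratch layout changes again.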

RatkoVasic-NOAA commented 8 months ago

@MichaelLueken Somebody has already changed it in the .bashrc file. :-)

MichaelLueken commented 8 months ago

@RatkoVasic-NOAA - Jong was able to get in and make the necessary change. I'm rerunning the pipeline on Derecho to make sure that everything behaves as expected now.

natalie-perlin commented 7 months ago

@MichaelLueken @RatkoVasic-NOAA -

The CMake issue on Derecho is solved, and the fundamental tests pass, except for the MET verification tasks. Please see the PR into Ratko's ss150 branch: https://github.com/RatkoVasic-NOAA/ufs-srweather-app/pull/9

Errors in the MET tasks look like this:

/glade/derecho/scratch/nperlin/SRW/srw-ss150-upd/scripts/exregional_run_met_pcpcombine.sh: line 362: uw: command not found

Maybe somebody more familiar with the uw tool could provide some help to solve these issues for Derecho?

MichaelLueken commented 7 months ago

@natalie-perlin - I was able to get the grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 WE2E test to run on Derecho by adding:

load("conda")
setenv("SRW_ENV", "srw_app")

to the end of the modulefiles/tasks/derecho/run_vx.local.lua file. Attempting to just add:

load("python_srw")

to the end of the task modulefile resulted in failures associated with python/3.10.8. The above method loads conda and the necessary conda environment (srw_app) to run the verification tasks.
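Applying that edit across machines could be scripted; a sketch (`append_conda_stanza` is a hypothetical helper, not part of the App) that appends the stanza only when it is not already present:

```shell
# Sketch: append the conda activation stanza to a run_vx.local.lua
# modulefile, skipping files that already load conda.
append_conda_stanza() {
  grep -q 'load("conda")' "$1" 2>/dev/null && return 0
  cat >> "$1" <<'EOF'
load("conda")
setenv("SRW_ENV", "srw_app")
EOF
}
```

Usage would be, e.g., `append_conda_stanza modulefiles/tasks/derecho/run_vx.local.lua`, repeated for each machine's run_vx.local.lua that does not load python_srw.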

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024012309205  COMPLETE              51.27
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE              51.27

The resulting modulefiles/tasks/derecho/run_vx.local.lua:

--[[
Compiler-specific modules are used for met and metplus libraries
--]]

local met_ver = (os.getenv("met_ver") or "11.1.0")
local metplus_ver = (os.getenv("metplus_ver") or "5.1.0")
if (mode() == "load") then
  load(pathJoin("met", met_ver))
  load(pathJoin("metplus",metplus_ver))
end
local base_met = os.getenv("met_ROOT") or os.getenv("MET_ROOT")
local base_metplus = os.getenv("metplus_ROOT") or os.getenv("METPLUS_ROOT")

setenv("MET_INSTALL_DIR", base_met)
setenv("MET_BIN_EXEC",    pathJoin(base_met,"bin"))
setenv("MET_BASE",        pathJoin(base_met,"share/met"))
setenv("MET_VERSION",     met_ver)
setenv("METPLUS_VERSION", metplus_ver)
setenv("METPLUS_ROOT",    base_metplus)
setenv("METPLUS_PATH",    base_metplus)

if (mode() == "unload") then
  unload(pathJoin("met", met_ver))
  unload(pathJoin("metplus",metplus_ver))
end
load("conda")
setenv("SRW_ENV", "srw_app")

natalie-perlin commented 7 months ago

@MichaelLueken @RatkoVasic-NOAA - confirming, all fundamental tests pass successfully now:

Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              20.29
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              26.21
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE              16.28
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              31.57
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024012  COMPLETE              36.74
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240124072  COMPLETE              33.92
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024012407234  COMPLETE              50.45
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             215.46

Detailed summary written to /glade/derecho/scratch/nperlin/SRW/expt_dirs/WE2E_summary_20240124075353.txt

MichaelLueken commented 7 months ago

@natalie-perlin and @RatkoVasic-NOAA - Would you like for me to go ahead and add a PR to ss150 to correct the verification issues on the machine?

RatkoVasic-NOAA commented 7 months ago

@MichaelLueken yes, please! I just saw Natalie's PR. Was that the same one you planned to do?

MichaelLueken commented 7 months ago

> @MichaelLueken yes, please! I just saw Natalie's PR. Was that the same one you planned to do?

@RatkoVasic-NOAA - Yes, the update from Natalie's PR would have been the same as the PR I would have created. Also, after reaching out to the Jet sysadmins, I was able to find a fix that allows the service partition to work once again. I will go ahead and open a PR into ss150 with the necessary modifications. Once it has been merged, I will launch the comprehensive tests on Derecho and Jet, and then we can move forward with this PR. Thanks!

MichaelLueken commented 7 months ago

@RatkoVasic-NOAA - While running a quick test with the updates for Jet, I noted that the grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 test was failing because the verification scripts didn't know what to do with uw. I went ahead and made the necessary change, similar to what was done on Derecho, to the modulefiles/tasks/jet/run_vx.local.lua file, as well as to all machine run_vx.local.lua files that don't load python_srw. Once my current set of tests shows that the tasks run, I will commit and open the PR into ss150.

MichaelLueken commented 7 months ago

@RatkoVasic-NOAA - I have kicked off the comprehensive tests for Derecho and Jet. Once they complete, I will move forward with merging this PR. Thanks!

MichaelLueken commented 7 months ago

@RatkoVasic-NOAA - All comprehensive tests have successfully passed on Derecho:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
2020_CAD_20240125074415                                            COMPLETE              38.50
community_20240125074418                                           COMPLETE              44.32
custom_ESGgrid_20240125074419                                      COMPLETE              17.90
custom_ESGgrid_IndianOcean_6km_20240125074420                      COMPLETE              26.31
custom_ESGgrid_Peru_12km_20240125074422                            COMPLETE              25.84
custom_ESGgrid_SF_1p1km_20240125074423                             COMPLETE             156.42
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE_202  COMPLETE              13.63
custom_GFDLgrid_20240125074426                                     COMPLETE              12.85
deactivate_tasks_20240125074427                                    COMPLETE               1.19
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE             683.30
get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS_20240125074430              COMPLETE              23.33
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202  COMPLETE              20.15
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240125074433  COMPLETE             179.04
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240125074  COMPLETE              38.04
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              40.30
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024012507  COMPLETE              37.37
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202  COMPLETE              36.82
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE              15.25
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              28.71
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              30.42
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024012507  COMPLETE              39.51
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240125074  COMPLETE              88.66
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202  COMPLETE              43.18
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240125074449  COMPLETE              21.71
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240  COMPLETE              14.94
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024012507445  COMPLETE              49.53
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta_202401250  COMPLETE              38.04
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202401  COMPLETE             230.42
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE             327.41
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240125  COMPLETE             318.35
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240125074  COMPLETE             347.89
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             343.14
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              35.61
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR_20240125  COMPLETE              33.26
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              32.20
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              22.50
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024012  COMPLETE              35.76
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              19.85
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_2  COMPLETE             259.44
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202401250  COMPLETE             279.74
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_20  COMPLETE             276.85
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240125074518  COMPLETE              85.63
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202401  COMPLETE              34.23
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024012507452  COMPLETE              45.11
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240125074  COMPLETE              33.57
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16_202401250745  COMPLETE              52.32
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot_202  COMPLETE              19.10
MET_ensemble_verification_only_vx_20240125074528                   COMPLETE               1.40
MET_ensemble_verification_winter_wx_20240125074531                 COMPLETE             236.10
MET_verification_only_vx_20240125074534                            COMPLETE               0.35
nco_20240125074536                                                 COMPLETE              21.82
nco_ensemble_20240125074538                                        COMPLETE             133.71
nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202  COMPLETE              35.15
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              25.56
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thom  COMPLETE             323.29
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR_2024  COMPLETE              29.05
pregen_grid_orog_sfc_climo_20240125074549                          COMPLETE              16.54
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS_20240125074551               COMPLETE              14.31
specify_template_filenames_20240125074553                          COMPLETE              17.31
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            5452.23

The final Jet comprehensive tests have started now.

RatkoVasic-NOAA commented 7 months ago

@MichaelLueken , @ulmononian caught one thing. We forgot to change the file system in the build and workflow scripts for Gaea-C5 from F2 to F5. I just committed these two changes. Do you think there's any other place (like in Jenkins) that needs updating?

MichaelLueken commented 7 months ago

@MichaelLueken , @ulmononian caught one thing. We forgot to change the file system in the build and workflow scripts for Gaea-C5 from F2 to F5. I just committed these two changes. Do you think there's any other place (like in Jenkins) that needs updating?

@RatkoVasic-NOAA - While queuing up the Jenkins tests this morning, it came to my attention that Gaea C5 is no longer using the gaea-c5 label, but gaeac5. I'm unsure whether this will also require renaming the gaea-c5 modulefiles to gaeac5, or whether the SRW_PLATFORM setting is still being set as gaea-c5.

RatkoVasic-NOAA commented 7 months ago

@MichaelLueken I found two more files pointing to the old file system. As for the name of the machine, I don't think we changed anything, so I believe it will work with gaea-c5. But if the consensus is to go without the hyphen, I'm OK with changing it everywhere.

MichaelLueken commented 7 months ago

@RatkoVasic-NOAA - There are no longer any nodes associated with gaea-c5 in Jenkins. In order to run Jenkins on Gaea C5 moving forward, we will need to set gaeac5 in .cicd/Jenkinsfile.

RatkoVasic-NOAA commented 7 months ago

@MichaelLueken I'm looking into UFS WM PRs, and they are changing Gaea's name from gaea-c5 to just gaea. Can you check with whoever made the change of dropping the hyphen whether they would also agree on using just gaea? It would be great to have the same name across all applications.

ulmononian commented 7 months ago

@MichaelLueken I'm looking into UFS WM PRs, and they are changing Gaea's name from gaea-c5 to just gaea. Can you check with whoever made the change of dropping the hyphen whether they would also agree on using just gaea? It would be great to have the same name across all applications.

@jkbk2004 @zach1221 do you know how this call was made for the ufs-wm?

MichaelLueken commented 7 months ago

@RatkoVasic-NOAA - I'm using the SRW_App_Jenkinsfile_test sandbox on Jenkins to test the changes on Gaea C5. If any changes are required, I will open one final PR to your ss150 branch to address the issues.

MichaelLueken commented 7 months ago

@RatkoVasic-NOAA and @ulmononian - Talking with Zach, the change in PR #2115 is only for manual runs of the UFS WM regression tests. The Jenkinsfile is still pointing to gaeac5 for the UFS WM Jenkins tests.

MichaelLueken commented 7 months ago

Additional information from Kris Booker:

As of right now, Jenkins is referring to Gaea as node name GaeaC5 with a label name of 'gaeac5'. This was due to an issue with the UFS WM pipeline.

So, for the purposes of the Jenkinsfile, renaming the gaea-c5 label to gaeac5 is the correct method.
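Once the Jenkinsfile label is updated, a quick audit can catch any remaining uses of the old label. This is a hedged sketch, not part of the PR: the paths are the ones discussed in this thread (.cicd, devbuild.sh, modulefiles), and it assumes it is run from the repository root.

```shell
#!/bin/sh
# Sketch: audit a checkout for lingering references to the old label.
# Paths are the ones mentioned in this thread; run from the repo root.
grep -rn "gaea-c5" .cicd devbuild.sh modulefiles 2>/dev/null \
  && echo "found stale gaea-c5 references" \
  || echo "no gaea-c5 references remain"
```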

MichaelLueken commented 7 months ago

@RatkoVasic-NOAA and @natalie-perlin - I have made the necessary modifications to allow the SRW App to successfully build on Gaea C5, but while attempting to run the WE2E coverage tests, the tests are all failing in make_grid with the following error message:

/gpfs/f5/epic/scratch/Michael.Lueken/ufs-srweather-app/gaeac5/install_intel/exec/regional_esg_grid: symbol lookup error: /usr/lib64/libssh.so.4: undefined symbol: EVP_KDF_CTX_new_id, version OPENSSL_1_1_1d

I'll continue to dig around and see what might be happening, but I would appreciate any assistance you can provide, especially if this error message was encountered on other machines transitioning to spack-stack v1.5.0.

My forked branch of @RatkoVasic-NOAA's ss150 branch can be found at https://github.com/MichaelLueken/ufs-srweather-app/tree/ss150
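A first triage step for this kind of runtime failure is to inspect what the dynamic loader resolves for the binary. The sketch below is illustrative only: /bin/ls stands in for the regional_esg_grid executable, and it assumes a glibc-based Linux system with ldd available.

```shell
#!/bin/sh
# Sketch: triage a runtime "symbol lookup error" by checking which
# shared objects the dynamic loader resolves for a binary.
# /bin/ls stands in here for the regional_esg_grid executable.
ldd /bin/ls
# The failing symbol is versioned (EVP_KDF_CTX_new_id@OPENSSL_1_1_1d),
# so the next step would be confirming which libcrypto the offending
# library actually links against, e.g.:
#   ldd /usr/lib64/libssh.so.4 | grep libcrypto
```

A mismatch usually means the loader picked up a libcrypto (for example from a conda environment on LD_LIBRARY_PATH) older than the one libssh.so.4 was built against.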

MichaelLueken commented 7 months ago

@natalie-perlin - Making the changes to modulefiles/wflow_gaea-c5.lua likely caused the issue. I forgot that Gaea C5 requires the old workflow_tools conda environment to work. I'm working on correcting this in my branch, as well as updating the devbuild.sh script to replace gaea-c5 with gaeac5, and then rebuild and rerun the tests.

MichaelLueken commented 7 months ago

@RatkoVasic-NOAA - There were four WE2E comprehensive tests that failed on Jet:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
2020_CAD_20240125173400                                            COMPLETE              37.68
community_20240125173401                                           COMPLETE              17.23
custom_ESGgrid_20240125173403                                      COMPLETE              20.23
custom_ESGgrid_Central_Asia_3km_20240125173404                     COMPLETE              35.80
custom_ESGgrid_Great_Lakes_snow_8km_20240125173405                 COMPLETE              13.14
custom_ESGgrid_IndianOcean_6km_20240125173407                      COMPLETE              17.01
custom_ESGgrid_NewZealand_3km_20240125173408                       COMPLETE              71.01
custom_ESGgrid_Peru_12km_20240125173409                            COMPLETE              21.22
custom_ESGgrid_SF_1p1km_20240125173411                             COMPLETE             219.84
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE_202  COMPLETE               8.84
custom_GFDLgrid_20240125173413                                     COMPLETE               9.03
deactivate_tasks_20240125173414                                    COMPLETE               0.76
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE            1015.01
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200_2024012  COMPLETE               6.60
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019061200_202401  COMPLETE               9.01
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202401  COMPLETE               9.08
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              50.75
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE            1073.64
get_from_HPSS_ics_GSMGFS_lbcs_GSMGFS_20240125173423                COMPLETE               6.76
get_from_HPSS_ics_HRRR_lbcs_RAP_20240125173424                     COMPLETE              13.18
get_from_HPSS_ics_RAP_lbcs_RAP_20240125173426                      COMPLETE              15.41
get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS_20240125173427              DEAD                  12.19
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202  COMPLETE              19.23
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_  COMPLETE             361.94
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20240  COMPLETE             166.24
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240125173432  COMPLETE             231.01
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240125173  COMPLETE              36.20
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              42.18
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024012517  COMPLETE              39.96
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202  COMPLETE              39.95
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE              10.31
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              16.53
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              14.47
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024012517  COMPLETE              15.93
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240125173  COMPLETE              43.60
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202  COMPLETE              18.64
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240125173447  COMPLETE              10.07
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240  COMPLETE               7.01
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024012517345  COMPLETE              18.43
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta_202401251  COMPLETE              14.50
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202401  COMPLETE             328.71
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE            3282.92
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240125  COMPLETE             419.22
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240125173  COMPLETE             514.55
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             520.15
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              33.34
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR_20240125  COMPLETE              31.89
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              30.89
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              10.80
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024012  COMPLETE              24.02
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE               8.62
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_2  COMPLETE             365.49
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202401251  COMPLETE             434.65
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_20  COMPLETE             444.76
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240125173513  COMPLETE              95.37
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202401  COMPLETE              20.92
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024012517351  COMPLETE              23.69
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240125173  COMPLETE              20.38
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16_202401251735  COMPLETE              29.89
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot_202  COMPLETE              12.31
long_fcst_20240125173522                                           COMPLETE              63.52
MET_ensemble_verification_only_vx_20240125173523                   COMPLETE               1.33
MET_ensemble_verification_only_vx_time_lag_20240125173526          DEAD                   4.41
MET_ensemble_verification_winter_wx_20240125173528                 COMPLETE             118.67
MET_verification_only_vx_20240125173531                            COMPLETE               0.27
nco_20240125173533                                                 COMPLETE               7.73
nco_ensemble_20240125173535                                        COMPLETE              73.55
nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202  COMPLETE              33.75
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  DEAD                   3.46
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thom  COMPLETE             434.71
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR_2024  COMPLETE              10.79
pregen_grid_orog_sfc_climo_20240125173546                          COMPLETE               8.81
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS_20240125173548               COMPLETE               7.08
specify_template_filenames_20240125173549                          DEAD                   6.47
----------------------------------------------------------------------------------------------------
Total                                                              DEAD               11216.74

The Jenkins working directory on Jet is /mnt/lfs1/NAGAPE/epic/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-969/jet/expt_dirs.

The get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS test failed in the make_ics task due to a bad allocation. I suspect a rerun will allow this test to pass.

The MET_ensemble_verification_only_vx_time_lag test failed in the run_MET_PcpCombine_fcst_APCP0*h_mem00* tasks due to "OBS_DIR does not exist or is not a directory" errors. The tasks that pulled the necessary data successfully completed, so hopefully a rerun will work here as well.

The nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16 test failed in the make_lbcs task due to OOM kill events. A rerun should work.

The specify_template_filenames test failed in the run_fcst_mem000 task due to a CFL violation. Hopefully a rerun will correct this.

I'm resubmitting the failed jobs now and will let you know how they look.
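Resubmitting a failed task in these experiments is done with Rocoto's rewind-and-boot cycle. The sketch below shows the general shape under assumed names: the cycle and task values are hypothetical, and the calls are guarded since rocotorewind/rocotoboot must be on PATH.

```shell
#!/bin/sh
# Sketch: rewind a failed task and boot it again with Rocoto.
# Cycle and task names are illustrative, not from this PR.
WF=FV3LAM_wflow.xml
DB=FV3LAM_wflow.db
CYCLE=202401251200
TASK=run_fcst_mem000
if command -v rocotorewind >/dev/null 2>&1; then
  rocotorewind -w "$WF" -d "$DB" -c "$CYCLE" -t "$TASK"
  rocotoboot   -w "$WF" -d "$DB" -c "$CYCLE" -t "$TASK"
else
  echo "rocoto not on PATH; commands shown for reference only"
fi
```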

MichaelLueken commented 7 months ago

@RatkoVasic-NOAA - Three of the tests that had failed are now successfully passing on Jet:

  expt_name = "MET_ensemble_verification_only_vx_time_lag"
  wflow_status = "SUCCESS"
  expt_name = "get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS"
  wflow_status = "SUCCESS"
  expt_name = "nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16"
  wflow_status = "SUCCESS"

The specify_template_filenames test is still failing in run_fcst with:

FATAL from PE 1: compute_qs: saturation vapor pressure table overflow, nbad= 1

It doesn't make sense to me that a single test is encountering a CFL violation while the rest of the tests that also use the FV3_GFS_v15p2 SDF are not.

For Gaea C5, running the WE2E fundamental tests on the machine fail in make_grid with:

/gpfs/f5/epic/scratch/Michael.Lueken/ufs-srweather-app/gaeac5/install_intel/exec/regional_esg_grid: symbol lookup error: /usr/lib64/libssh.so.4: undefined symbol: EVP_KDF_CTX_new_id, version OPENSSL_1_1_1d

This is using the BUILD_CONDA option to create the srw_app conda environment on the machine. While applying the changes that @natalie-perlin made to allow the WE2E tests to run on Gaea C5 with the F2 filesystem (skipping the BUILD_CONDA option and using the old workflow_tools conda environment instead), the WE2E tests fail to generate because uwtools isn't in the workflow_tools conda environment. Following the merging of PR #994, uwtools needs to be in the conda environment, otherwise the templater tool will fail to generate.
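Since the uwtools requirement is the blocker here, a simple preflight check of the active environment can save a failed generation attempt. This is a hedged sketch: it assumes python3 is on PATH, and the environment names in the message are illustrative.

```shell
#!/bin/sh
# Sketch: confirm the active environment provides uwtools, which is
# required for experiment generation after PR #994.
if python3 -c "import uwtools" 2>/dev/null; then
  echo "uwtools present: experiment generation should work"
else
  echo "uwtools missing: use BUILD_CONDA or add uwtools to workflow_tools"
fi
```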

RatkoVasic-NOAA commented 7 months ago

@MichaelLueken I'm running now 'specify_template_filenames' test on Jet. Last time I ran it it worked. I'll start from beginning.

MichaelLueken commented 7 months ago

@RatkoVasic-NOAA - After many rocotorewind/rocotoboot to run_fcst in specify_template_filenames, all of the WE2E comprehensive tests successfully passed on Jet:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
2020_CAD_20240125173400                                            COMPLETE              37.68
community_20240125173401                                           COMPLETE              17.23
custom_ESGgrid_20240125173403                                      COMPLETE              20.23
custom_ESGgrid_Central_Asia_3km_20240125173404                     COMPLETE              35.80
custom_ESGgrid_Great_Lakes_snow_8km_20240125173405                 COMPLETE              13.14
custom_ESGgrid_IndianOcean_6km_20240125173407                      COMPLETE              17.01
custom_ESGgrid_NewZealand_3km_20240125173408                       COMPLETE              71.01
custom_ESGgrid_Peru_12km_20240125173409                            COMPLETE              21.22
custom_ESGgrid_SF_1p1km_20240125173411                             COMPLETE             219.84
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE_202  COMPLETE               8.84
custom_GFDLgrid_20240125173413                                     COMPLETE               9.03
deactivate_tasks_20240125173414                                    COMPLETE               0.76
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE            1015.01
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200_2024012  COMPLETE               6.60
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019061200_202401  COMPLETE               9.01
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202401  COMPLETE               9.08
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              50.75
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE            1073.64
get_from_HPSS_ics_GSMGFS_lbcs_GSMGFS_20240125173423                COMPLETE               6.76
get_from_HPSS_ics_HRRR_lbcs_RAP_20240125173424                     COMPLETE              13.18
get_from_HPSS_ics_RAP_lbcs_RAP_20240125173426                      COMPLETE              15.41
get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS_20240125173427              COMPLETE              13.65
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202  COMPLETE              19.23
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_  COMPLETE             361.94
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20240  COMPLETE             166.24
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240125173432  COMPLETE             231.01
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240125173  COMPLETE              36.20
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              42.18
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024012517  COMPLETE              39.96
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202  COMPLETE              39.95
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE              10.31
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              16.53
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              14.47
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024012517  COMPLETE              15.93
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240125173  COMPLETE              43.60
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202  COMPLETE              18.64
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240125173447  COMPLETE              10.07
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240  COMPLETE               7.01
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024012517345  COMPLETE              18.43
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta_202401251  COMPLETE              14.50
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202401  COMPLETE             328.71
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE            3282.92
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240125  COMPLETE             419.22
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240125173  COMPLETE             514.55
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             520.15
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              33.34
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR_20240125  COMPLETE              31.89
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              30.89
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              10.80
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024012  COMPLETE              24.02
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE               8.62
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_2  COMPLETE             365.49
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202401251  COMPLETE             434.65
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_20  COMPLETE             444.76
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240125173513  COMPLETE              95.37
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202401  COMPLETE              20.92
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024012517351  COMPLETE              23.69
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240125173  COMPLETE              20.38
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16_202401251735  COMPLETE              29.89
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot_202  COMPLETE              12.31
long_fcst_20240125173522                                           COMPLETE              63.52
MET_ensemble_verification_only_vx_20240125173523                   COMPLETE               1.33
MET_ensemble_verification_only_vx_time_lag_20240125173526          COMPLETE               6.26
MET_ensemble_verification_winter_wx_20240125173528                 COMPLETE             118.67
MET_verification_only_vx_20240125173531                            COMPLETE               0.27
nco_20240125173533                                                 COMPLETE               7.73
nco_ensemble_20240125173535                                        COMPLETE              73.55
nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202  COMPLETE              33.75
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              14.34
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thom  COMPLETE             434.71
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR_2024  COMPLETE              10.79
pregen_grid_orog_sfc_climo_20240125173546                          COMPLETE               8.81
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS_20240125173548               COMPLETE               7.08
specify_template_filenames_20240125173549                          COMPLETE              11.27
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE           11235.73

MichaelLueken commented 7 months ago

On Gaea C5, while compiling, I am seeing the following messages, which didn't appear before the transition to the F5 filesystem:

[  0%] Building Fortran object sorc/emcsfc_ice_blend.fd/CMakeFiles/emcsfc_ice_blend.dir/emcsfc_ice_blend.f90.o
No supported cpu target is set, CRAY_CPU_TARGET=x86-64 will be used.
Load a valid targeting module or set CRAY_CPU_TARGET

Additionally, I'm trying to see if the modifications to etc/lmod-setup.sh and etc/lmod-setup.csh (replacing the calls to source /lustre/f2/dev/role.epic/contrib/Lmod_init_C5.sh and source /lustre/f2/dev/role.epic/contrib/Lmod_init_C5.csh, respectively, with module reset) might be causing issues.

It's unclear why the regional_esg_grid executable would encounter the "symbol lookup error: /usr/lib64/libssh.so.4: undefined symbol: EVP_KDF_CTX_new_id, version OPENSSL_1_1_1d" failure.