RatkoVasic-NOAA closed this pull request 7 months ago.
@RatkoVasic-NOAA If this is going to update METplus, we should wait to merge this until PR #973 is in, along with another PR lined up after that (by me) that changes much of the METplus config files.
@MichaelLueken @RatkoVasic-NOAA -
Some updates on issues with Gaea C5, where the runtime error occurs during the make_grid task (and likely the following ones).
Bringing in changes from release/public-v2.2.0 did not solve the problem (PR to Ratko's ss150 branch, https://github.com/RatkoVasic-NOAA/ufs-srweather-app/pull/4). The issue is indeed related to the changes where conda is installed as a part of the SRW. The libstdc++.so.6 used for linking the regional_esg_grid executable and the one needed by another conda library at runtime come from different locations/paths, which creates a conflict at runtime. A likely solution could be to explicitly specify the library path (using rpath?) when linking the executable.
I'm still looking for a way to fix this issue.
More details below, in case someone has had similar issues and found quick solutions.
The library used by the local conda install during the SRW build: libstdc++.so.6 => /lustre/f2/scratch/ncep/Natalie.Perlin/C5/SRW/srw-ss150/conda/lib/././libstdc++.so.6 (the directory /lustre/f2/scratch/ncep/Natalie.Perlin/C5/SRW/srw-ss150/ is equivalent to ./ufs-srweather-app/)
The library used when building the executable: libstdc++.so.6 => /opt/cray/pe/gcc/10.3.0/snos/lib/../lib64/libstdc++.so.6
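A minimal diagnostic sketch for this kind of conflict (the paths are the ones above; the conda binary checked and the -Wl,-rpath flag are illustrative assumptions, not the SRW build's actual link line):
# Compare which libstdc++.so.6 the executable and a conda binary resolve to
ldd ./install_intel/exec/regional_esg_grid | grep libstdc++
ldd ./conda/bin/python | grep libstdc++   # hypothetical conda binary to compare against
# Check whether an rpath/runpath is already embedded in the executable
readelf -d ./install_intel/exec/regional_esg_grid | grep -i -E 'rpath|runpath'
# One possible fix at link time: pin the compiler's libstdc++ directory, e.g.
#   ftn ... -Wl,-rpath,/opt/cray/pe/gcc/10.3.0/snos/lib64 -o regional_esg_grid ...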
PR https://github.com/RatkoVasic-NOAA/ufs-srweather-app/pull/5 addresses the changes needed for Gaea C5 and fixes a bug in the devclean.sh script.
The Gaea C5 fundamental tests passed, except for the one that was corrected later:
----------------------------------------------------------------------------------------------------
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta COMPLETE 19.72
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_ COMPLETE 25.59
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 COMPLETE 13.38
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot DEAD 26.13
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR COMPLETE 35.14
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0 COMPLETE 33.92
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 COMPLETE 49.47
----------------------------------------------------------------------------------------------------
The grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot test, after correction of the modulefile for the plotting task:
(workflow_tools) [Natalie.Perlin@gaea55:/lustre/f2/scratch/ncep/Natalie.Perlin/C5/SRW/expt_dirs/grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot]$ rocotostat -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10
CYCLE TASK JOBID STATE EXIT STATUS TRIES DURATION
================================================================================================================================
201907010000 make_grid 134695715 SUCCEEDED 0 1 26.0
201907010000 make_orog 134695724 SUCCEEDED 0 1 61.0
201907010000 make_sfc_climo 134695735 SUCCEEDED 0 1 47.0
201907010000 get_extrn_ics 77187460 SUCCEEDED 0 1 22.0
201907010000 get_extrn_lbcs 77187461 SUCCEEDED 0 1 17.0
201907010000 make_ics_mem000 134695744 SUCCEEDED 0 1 69.0
201907010000 make_lbcs_mem000 134695745 SUCCEEDED 0 1 99.0
201907010000 run_fcst_mem000 134695761 SUCCEEDED 0 1 569.0
201907010000 run_post_mem000_f000 134695771 SUCCEEDED 0 1 24.0
201907010000 run_post_mem000_f001 134695780 SUCCEEDED 0 1 26.0
201907010000 run_post_mem000_f002 134695781 SUCCEEDED 0 1 25.0
201907010000 run_post_mem000_f003 134695801 SUCCEEDED 0 1 25.0
201907010000 run_post_mem000_f004 134695802 SUCCEEDED 0 1 26.0
201907010000 run_post_mem000_f005 134695803 SUCCEEDED 0 1 27.0
201907010000 run_post_mem000_f006 134695804 SUCCEEDED 0 1 24.0
201907010000 plot_allvars 134695999 SUCCEEDED 0 1 331.0
@RatkoVasic-NOAA - Thank you very much for making this quick change! The automated coverage tests all successfully pass now. Since the SRW App is now successfully building and running on Gaea C5, am I clear to change the status of this PR back to In Review and launch the automated comprehensive WE2E tests? Thanks!
"am I clear to change the status of this PR back to In Review and launch the automated comprehensive WE2E tests?" Yes, please.
The automated comprehensive tests have been submitted. The pipeline can be found:
@RatkoVasic-NOAA -
The ss150 branch failed to build on Derecho. The error is:
Lmod is automatically replacing "intel/2023.0.0" with "intel-classic/2023.0.0".
Lmod has detected the following error: The following module(s) are unknown: "nemsio/2.5.2" "w3emc/2.10.0" "ip/4.3.0"
Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
$ module --ignore_cache load "nemsio/2.5.2" "w3emc/2.10.0" "ip/4.3.0"
Derecho is still using hpc-stack, rather than spack-stack. On Derecho, there are only w3emc/2.9.2 and ip/4.1.0. The modulefiles/srw_common.lua file will need to be updated to use these module versions (a quick way to check what's available is sketched below).
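To confirm what is actually installed on Derecho before editing srw_common.lua, standard Lmod commands suffice (nothing SRW-specific assumed here):
module spider nemsio
module spider w3emc
module spider ip
# or, to see versions visible on the current MODULEPATH:
module avail w3emc ip nemsio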
@RatkoVasic-NOAA -
On Gaea and Gaea C5, the Functional Workflow Task Tests failed.
The reason for the failure on Gaea C5 was that jinja2 was not being loaded. Looking in .cicd/scripts/srw_ftest.sh, I see the following:
conda activate srw_app
Similar to what you did in ush/load_modules_wflow.sh, please add logic to .cicd/scripts/srw_ftest.sh so that the automated tests will activate workflow_tools, rather than srw_app (see the sketch below).
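A rough sketch of the kind of logic that could go into .cicd/scripts/srw_ftest.sh; the check on SRW_PLATFORM is an assumption about how the script distinguishes machines:
# Gaea/Gaea C5 still need the pre-built workflow_tools environment;
# other platforms use the srw_app environment built alongside the App.
if [[ "${SRW_PLATFORM}" == gaea* ]]; then
  conda activate workflow_tools
else
  conda activate srw_app
fi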
On Gaea, the run_make_sfc_climo task failed with rc=1. At the very end of the log, the task encountered a segfault. Please see /lustre/f2/dev/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-969/gaea/expt_dirs/test_community/run_make_sfc_climo-log.txt for more details.
@MichaelLueken There were missing directories for year 2024 on Gaea C4 (/lustre/f2/darshan/2024/*/*). I created those directories. Can you please try C4 again?
@RatkoVasic-NOAA -
Gaea and Gaea C5 have successfully cleared the Functional Workflow Task Tests phase in the pipeline.
The pipeline was able to successfully build on Derecho. However, after passing the build, the Functional Workflow Task Tests phase is now failing. The error message can be seen in the pipeline:
mkstemp: No such file or directory
qsub: could not create/open tmp file /glade/scratch/epicufsrt/.tmp/pbsscrpt3ZukW7
It isn't clear what the issue is. Additionally, there are no output files, so it is harder to figure out the reason for the failure.
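One guess at the failure mode, since mkstemp errors usually mean the target directory is missing; a quick check on a Derecho login node:
# Does the directory PBS is writing its temporary job script into exist?
echo "TMPDIR=${TMPDIR}"
ls -ld "${TMPDIR}"
# If not, creating it (or pointing TMPDIR at a valid scratch path) should let qsub proceed
mkdir -p "${TMPDIR}"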
@RatkoVasic-NOAA -
My attempt to replicate the failure from the automated Jenkins tests didn't pan out - the Workflow Task Tests script worked when I manually ran it in my own directory:
# Try derecho with the first few simple SRW tasks ...
run_make_grid: COMPLETE
run_get_ics: COMPLETE
run_get_lbcs: COMPLETE
run_make_orog: COMPLETE
run_make_sfc_climo: COMPLETE
run_make_ics: COMPLETE
run_make_lbcs: COMPLETE
run_fcst: COMPLETE
run_post: COMPLETE
I will retry submitting the test in the morning (hopefully the rest of the comprehensive tests will complete by that time). If it continues to fail, I will manually run the automated test script on Derecho so that we can move forward with this work.
The comprehensive tests have successfully passed on Hercules.
@RatkoVasic-NOAA - Here is the status report for the Jenkins tests:
The Jet tests are still running, but it looks like they will successfully pass without issue.
The Gaea C5 comprehensive tests have all passed successfully.
The Hercules comprehensive tests have all passed successfully.
Unfortunately, there were several failures while running the comprehensive tests on Orion. All of the verification WE2E tests have failed. The following error message is found in the log files:
Loading modules for task "run_vx" ...
Lmod has detected the following error: These module(s) or extension(s) exist
but cannot be loaded as requested: "python/3.10.8"
Try: "module spider python/3.10.8" to see how to load the module(s).
It's not clear to me why the App is encountering issues with loading python/3.10.8 on Orion. Attempting to add:
prepend_path("MODULEPATH", "/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.0/envs/unified-env/install/modulefiles/Core")
prepend_path("MODULEPATH", "/work/noaa/da/role-da/spack-stack/modulefiles")
load("stack-intel/2022.0.2")
directly to modulefiles/tasks/orion/run_vx.local.lua doesn't correct the issue either.
There was a single failure in the comprehensive tests on Gaea. The grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta test failed in the run_fcst task with the log indicating a bus error:
srun: error: nid00026: task 664: Bus error (core dumped)
srun: Terminating StepId=269461812.0
slurmstepd: error: *** STEP 269461812.0 ON nid00000 CANCELLED AT 2024-01-02T21:43:12 ***
I suspect that a rerun would allow the test to successfully pass.
I went ahead and manually ran the comprehensive tests on Derecho and there were only four failures:
----------------------------------------------------------------------------------------------------
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
2020_CAD COMPLETE 34.55
community COMPLETE 41.93
custom_ESGgrid COMPLETE 15.22
custom_ESGgrid_Central_Asia_3km DEAD 0.83
custom_ESGgrid_IndianOcean_6km COMPLETE 23.09
custom_ESGgrid_NewZealand_3km DEAD 0.99
custom_ESGgrid_Peru_12km COMPLETE 23.11
custom_ESGgrid_SF_1p1km COMPLETE 147.06
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE COMPLETE 11.43
custom_GFDLgrid COMPLETE 10.66
deactivate_tasks COMPLETE 0.98
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me COMPLETE 693.23
get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS COMPLETE 21.32
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 COMPLETE 16.83
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta DEAD 2.12
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot DEAD 0.73
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR COMPLETE 170.42
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP COMPLETE 33.14
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot COMPLETE 36.34
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR COMPLETE 32.90
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta COMPLETE 32.57
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 COMPLETE 12.74
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot COMPLETE 26.35
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot COMPLETE 26.91
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR COMPLETE 37.48
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP COMPLETE 78.53
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta COMPLETE 39.30
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP COMPLETE 19.22
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2 COMPLETE 12.43
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 COMPLETE 44.65
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta COMPLETE 35.14
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 COMPLETE 228.78
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson COMPLETE 310.50
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 COMPLETE 308.63
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR COMPLETE 334.54
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta COMPLETE 335.57
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 COMPLETE 32.53
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR COMPLETE 28.51
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta COMPLETE 27.58
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 COMPLETE 20.58
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR COMPLETE 31.99
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta COMPLETE 17.77
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 COMPLETE 254.15
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR COMPLETE 270.10
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta COMPLETE 270.79
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP COMPLETE 79.01
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0 COMPLETE 31.90
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR COMPLETE 40.62
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0 COMPLETE 31.29
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16 COMPLETE 50.24
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot COMPLETE 16.92
MET_ensemble_verification_only_vx COMPLETE 0.86
MET_verification_only_vx COMPLETE 0.17
nco COMPLETE 20.00
nco_ensemble COMPLETE 113.73
nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 COMPLETE 31.06
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_ COMPLETE 23.56
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thom COMPLETE 304.79
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR COMPLETE 26.90
pregen_grid_orog_sfc_climo COMPLETE 13.33
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS COMPLETE 11.83
specify_template_filenames COMPLETE 14.62
----------------------------------------------------------------------------------------------------
Total DEAD 4965.05
The failures are caused by issues in make_sfc_climo in UFS_UTILS. Additional work will be required to make the UFS_UTILS repository work properly on Derecho. These are known issues (please see issue #947 for more details).
@MichaelLueken - It looks like the UFS-WM now supports Derecho and uses spack-stack/1.5.0 for Intel and spack-stack/1.5.1 for GNU compilers. Could this PR be a good place to update Derecho to spack-stack/1.5.0 for the SRW as well?
@natalie-perlin - Since this PR is intended to update the spack-stack version to 1.5.0, it would be fine to include transitioning Derecho from hpc-stack to spack-stack, especially since the UFS-WM has made this change as well. With the current issue being encountered on Orion (the verification WE2E tests are failing due to the inability to load python/3.10.8), there is time to try and make spack-stack v1.5.0 work on Derecho. Thanks!
@RatkoVasic-NOAA -
I have made some progress on the verification issue on Orion. When I added load("build_orion_intel") to the top of modulefiles/tasks/orion/run_vx.local.lua, I was able to successfully load the task modulefile. Following this, I created a test file containing all of the tests that had failed initially:
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0
MET_ensemble_verification_only_vx
MET_verification_only_vx
Two tests still failed with this setup:
The grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta test failed in make_sfc_climo. Using rocotoboot on the failed task allowed the test to pass successfully.
The grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 test failed in make_ics_mem002. Using rocotoboot on the failed task allowed the test to pass successfully (the rocoto commands are sketched after the summary below).
----------------------------------------------------------------------------------------------------
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP COMPLETE 11.44
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 COMPLETE 22.21
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta COMPLETE 28.82
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR COMPLETE 330.09
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0 COMPLETE 14.87
MET_ensemble_verification_only_vx COMPLETE 1.33
MET_verification_only_vx COMPLETE 0.28
----------------------------------------------------------------------------------------------------
Total COMPLETE 409.04
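For reference, rebooting a single failed task looked roughly like this (cycle and task names are illustrative, taken from earlier in this thread):
# reset the failed task's state, then force it to run again
rocotorewind -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -c 201907010000 -t make_sfc_climo
rocotoboot -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -c 201907010000 -t make_sfc_climo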
A rerun of the Jenkins tests on Derecho shows that the tests continue to fail in the Workflow Tasks Test phase. Since the Workflow Tasks Test script runs fine manually, I have opened PSD-69 with the Platform team to see if they can think of any reason for this failure. At this point, I can only assume that there is an issue with the epicufsrt environment on Derecho.
@RatkoVasic-NOAA and @natalie-perlin -
The issue on Derecho has been identified. The /glade/u/home/epicufsrt/.bashrc file hasn't been updated for the transition from Cheyenne to Derecho. Within this file, the TMPDIR variable is being set to /glade/scratch/epicufsrt/.tmp. This location doesn't exist on Derecho. It needs to be set to /glade/derecho/scratch/epicufsrt/.tmp. This is the culprit for the failed Workflow Task Tests on the machine (and likely the failure of the standard WE2E tests, if the Workflow Task Tests were deactivated for Derecho).
Unfortunately, I don't have access to the epicufsrt role account on Derecho. I have reached out to Jong and the Platform team to see if they can correct the entry in the .bashrc file, but I don't know if they can. If push comes to shove, we might need to add:
export TMPDIR=/glade/derecho/scratch/epicufsrt/.tmp
directly into .cicd/scripts/wrapper_srw_ftest.sh. I'll let you know what happens.
@MichaelLueken Somebody already changed it in the .bashrc file. :-)
@RatkoVasic-NOAA - Jong was able to get in and make the necessary change. I'm rerunning the pipeline on Derecho to make sure that everything behaves as expected now.
@MichaelLueken @RatkoVasic-NOAA -
The CMake issue on Derecho is solved, and the fundamental tests pass except for the MET verification tasks. Please see a PR into Ratko's ss150 branch: https://github.com/RatkoVasic-NOAA/ufs-srweather-app/pull/9
The errors in the MET tasks look like this:
/glade/derecho/scratch/nperlin/SRW/srw-ss150-upd/scripts/exregional_run_met_pcpcombine.sh: line 362: uw: command not found
Maybe somebody more familiar with the uw tool could provide some help to solve these issues for Derecho?
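A quick check of whether uw is visible in the environment the verification tasks activate (the srw_app environment name is taken from the discussion below; adjust as needed):
conda activate srw_app
which uw || echo "uw not on PATH"
python -c "import uwtools; print(uwtools.__file__)"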
@natalie-perlin - I was able to get the grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 WE2E test to run on Derecho by adding:
load("conda")
setenv("SRW_ENV", "srw_app")
to the end of the modulefiles/tasks/derecho/run_vx.local.lua file. Attempting to just add:
load("python_srw")
to the end of the task modulefile resulted in failures associated with python/3.10.8. The above method loads conda and the necessary conda environment (srw_app) to run the verification tasks.
----------------------------------------------------------------------------------------------------
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024012309205 COMPLETE 51.27
----------------------------------------------------------------------------------------------------
Total COMPLETE 51.27
For reference, the updated run_vx.local.lua task modulefile (with the conda activation appended at the end) reads:
--[[
Compiler-specific modules are used for met and metplus libraries
--]]
local met_ver = (os.getenv("met_ver") or "11.1.0")
local metplus_ver = (os.getenv("metplus_ver") or "5.1.0")
if (mode() == "load") then
  load(pathJoin("met", met_ver))
  load(pathJoin("metplus", metplus_ver))
end
local base_met = os.getenv("met_ROOT") or os.getenv("MET_ROOT")
local base_metplus = os.getenv("metplus_ROOT") or os.getenv("METPLUS_ROOT")
setenv("MET_INSTALL_DIR", base_met)
setenv("MET_BIN_EXEC", pathJoin(base_met, "bin"))
setenv("MET_BASE", pathJoin(base_met, "share/met"))
setenv("MET_VERSION", met_ver)
setenv("METPLUS_VERSION", metplus_ver)
setenv("METPLUS_ROOT", base_metplus)
setenv("METPLUS_PATH", base_metplus)
if (mode() == "unload") then
  unload(pathJoin("met", met_ver))
  unload(pathJoin("metplus", metplus_ver))
end
-- load conda and point SRW_ENV at the environment used by the vx tasks
load("conda")
setenv("SRW_ENV", "srw_app")
@MichaelLueken @RatkoVasic-NOAA - Confirming that all fundamental tests pass successfully now:
Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2 COMPLETE 20.29
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_ COMPLETE 26.21
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240 COMPLETE 16.28
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot COMPLETE 31.57
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024012 COMPLETE 36.74
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240124072 COMPLETE 33.92
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024012407234 COMPLETE 50.45
----------------------------------------------------------------------------------------------------
Total COMPLETE 215.46
Detailed summary written to /glade/derecho/scratch/nperlin/SRW/expt_dirs/WE2E_summary_20240124075353.txt
@natalie-perlin and @RatkoVasic-NOAA - Would you like for me to go ahead and add a PR to ss150 to correct the verification issues on the machine?
@MichaelLueken yes, please! I just saw Natalie's PR. Was that the same one you planned to do?
@RatkoVasic-NOAA - Yes, the update from Natalie's PR would have been the same as the PR I would have created. Also, reaching out to Jet sys admins, I was able to find a fix to allow the service partition to work once again. I will go ahead and open a PR into ss150 with the necessary modifications. Once this has been merged, I will go ahead and launch the comprehensive tests on Derecho and Jet, then we can move forward with this PR. Thanks!
@RatkoVasic-NOAA - While running a quick test with the updates for Jet, I noted that the grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 test was failing due to not knowing what to do with uw in the verification scripts. I went ahead and made the necessary change, similar to what was done on Derecho, to the modulefiles/tasks/jet/run_vx.local.lua file, as well as to all machine run_vx.local.lua files that don't load python_srw. Once my current set of tests shows that the tasks run, I will commit and open the PR into ss150.
@RatkoVasic-NOAA - I have kicked off the comprehensive tests for Derecho and Jet. Once they complete, I will move forward with merging this PR. Thanks!
@RatkoVasic-NOAA - All comprehensive tests have successfully passed on Derecho:
----------------------------------------------------------------------------------------------------
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
2020_CAD_20240125074415 COMPLETE 38.50
community_20240125074418 COMPLETE 44.32
custom_ESGgrid_20240125074419 COMPLETE 17.90
custom_ESGgrid_IndianOcean_6km_20240125074420 COMPLETE 26.31
custom_ESGgrid_Peru_12km_20240125074422 COMPLETE 25.84
custom_ESGgrid_SF_1p1km_20240125074423 COMPLETE 156.42
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE_202 COMPLETE 13.63
custom_GFDLgrid_20240125074426 COMPLETE 12.85
deactivate_tasks_20240125074427 COMPLETE 1.19
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me COMPLETE 683.30
get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS_20240125074430 COMPLETE 23.33
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202 COMPLETE 20.15
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240125074433 COMPLETE 179.04
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240125074 COMPLETE 38.04
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20 COMPLETE 40.30
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024012507 COMPLETE 37.37
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202 COMPLETE 36.82
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240 COMPLETE 15.25
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20 COMPLETE 28.71
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot COMPLETE 30.42
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024012507 COMPLETE 39.51
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240125074 COMPLETE 88.66
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202 COMPLETE 43.18
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240125074449 COMPLETE 21.71
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240 COMPLETE 14.94
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024012507445 COMPLETE 49.53
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta_202401250 COMPLETE 38.04
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202401 COMPLETE 230.42
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson COMPLETE 327.41
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240125 COMPLETE 318.35
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240125074 COMPLETE 347.89
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024 COMPLETE 343.14
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_ COMPLETE 35.61
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR_20240125 COMPLETE 33.26
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2 COMPLETE 32.20
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_ COMPLETE 22.50
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024012 COMPLETE 35.76
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2 COMPLETE 19.85
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_2 COMPLETE 259.44
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202401250 COMPLETE 279.74
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_20 COMPLETE 276.85
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240125074518 COMPLETE 85.63
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202401 COMPLETE 34.23
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024012507452 COMPLETE 45.11
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240125074 COMPLETE 33.57
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16_202401250745 COMPLETE 52.32
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot_202 COMPLETE 19.10
MET_ensemble_verification_only_vx_20240125074528 COMPLETE 1.40
MET_ensemble_verification_winter_wx_20240125074531 COMPLETE 236.10
MET_verification_only_vx_20240125074534 COMPLETE 0.35
nco_20240125074536 COMPLETE 21.82
nco_ensemble_20240125074538 COMPLETE 133.71
nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202 COMPLETE 35.15
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_ COMPLETE 25.56
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thom COMPLETE 323.29
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR_2024 COMPLETE 29.05
pregen_grid_orog_sfc_climo_20240125074549 COMPLETE 16.54
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS_20240125074551 COMPLETE 14.31
specify_template_filenames_20240125074553 COMPLETE 17.31
----------------------------------------------------------------------------------------------------
Total COMPLETE 5452.23
The final Jet comprehensive tests have started now.
@MichaelLueken, @ulmononian caught one thing. We forgot to change the file system in the build and workflow scripts for Gaea C5 from F2 to F5. I just committed these two changes. Do you think there's any other place (like in Jenkins)?
@RatkoVasic-NOAA - While queuing up the Jenkins tests this morning, it came to my attention that Gaea C5 is no longer using the gaea-c5 label, but gaeac5. I'm unsure whether this will also require renaming the gaea-c5 modulefiles to gaeac5, or if the SRW_PLATFORM setting is still being set as gaea-c5.
@MichaelLueken I found two more files pointing to the old file system. As for the name of the machine, I don't think we changed anything, so I believe it will work with gaea-c5. But if the consensus is to go without the hyphen, I'm OK with changing it everywhere.
@RatkoVasic-NOAA - There are no longer any nodes associated with gaea-c5 in Jenkins. In order to run Jenkins on Gaea C5 moving forward, we will need to set gaeac5 in .cicd/Jenkinsfile.
@MichaelLueken I'm looking into the UFS WM PRs, and they are changing Gaea's name from gaea-c5 to just gaea. Can you check with whoever changed the name by just taking off the hyphen whether they also agree on having just gaea? It would be great to have the same name across all applications.
@jkbk2004 @zach1221 do you know how this call was made for the ufs-wm?
@RatkoVasic-NOAA - I'm using the SRW_App_Jenkinsfile_test sandbox on Jenkins to test the changes on Gaea C5. If any changes are required, I will open one final PR to your ss150 branch to address the issues.
Additional information from Kris Booker:
As of right now, Jenkins is referring to Gaea by the node name GaeaC5 with a label name of 'gaeac5'. This was due to an issue with the UFS WM pipeline.
So, for the purposes of the Jenkinsfile, renaming the gaea-c5 label to gaeac5 is the correct method (a grep/sed sketch follows).
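A simple way to catch every remaining reference to the old label (plain grep/sed; the directories searched are assumptions about where the label appears):
grep -rn "gaea-c5" .cicd/ modulefiles/ ush/ || echo "no remaining references"
# if the consensus is to rename everywhere:
sed -i 's/gaea-c5/gaeac5/g' .cicd/Jenkinsfile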
@RatkoVasic-NOAA and @natalie-perlin - I have made the necessary modifications to allow the SRW App to successfully build on Gaea C5, but while attempting to run the WE2E coverage tests, the tests are all failing in make_grid with the following error message:
/gpfs/f5/epic/scratch/Michael.Lueken/ufs-srweather-app/gaeac5/install_intel/exec/regional_esg_grid: symbol lookup error: /usr/lib64/libssh.so.4: undefined symbol: EVP_KDF_CTX_new_id, version OPENSSL_1_1_1d
I'll continue to dig around and see what might be happening, but I would appreciate any assistance you can provide, especially if this error message was encountered on other machines transitioning to spack-stack v1.5.0.
My fork of @RatkoVasic-NOAA's ss150 branch can be found at https://github.com/MichaelLueken/ufs-srweather-app/tree/ss150
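To see where the bad OpenSSL symbol binding comes from, the dynamic loader itself can be interrogated (LD_DEBUG is a standard glibc feature; the executable path is the one from the error above):
# Which libssh/libssl/libcrypto does the executable actually resolve?
ldd /gpfs/f5/epic/scratch/Michael.Lueken/ufs-srweather-app/gaeac5/install_intel/exec/regional_esg_grid | grep -E 'libssh|libssl|libcrypto'
# Trace symbol bindings to see which library fails to provide EVP_KDF_CTX_new_id
LD_DEBUG=bindings ./regional_esg_grid 2>&1 | grep -i EVP_KDF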
@natalie-perlin - Making the changes to modulefiles/wflow_gaea-c5.lua likely caused the issue. I forgot that Gaea C5 requires the old workflow_tools conda environment to work. I'm working on correcting this in my branch, as well as updating the devbuild.sh script to replace gaea-c5 with gaeac5, and will then rebuild and rerun the tests.
@RatkoVasic-NOAA - There were four WE2E comprehensive tests that failed on Jet:
----------------------------------------------------------------------------------------------------
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
2020_CAD_20240125173400 COMPLETE 37.68
community_20240125173401 COMPLETE 17.23
custom_ESGgrid_20240125173403 COMPLETE 20.23
custom_ESGgrid_Central_Asia_3km_20240125173404 COMPLETE 35.80
custom_ESGgrid_Great_Lakes_snow_8km_20240125173405 COMPLETE 13.14
custom_ESGgrid_IndianOcean_6km_20240125173407 COMPLETE 17.01
custom_ESGgrid_NewZealand_3km_20240125173408 COMPLETE 71.01
custom_ESGgrid_Peru_12km_20240125173409 COMPLETE 21.22
custom_ESGgrid_SF_1p1km_20240125173411 COMPLETE 219.84
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE_202 COMPLETE 8.84
custom_GFDLgrid_20240125173413 COMPLETE 9.03
deactivate_tasks_20240125173414 COMPLETE 0.76
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me COMPLETE 1015.01
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200_2024012 COMPLETE 6.60
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019061200_202401 COMPLETE 9.01
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202401 COMPLETE 9.08
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20 COMPLETE 50.75
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2 COMPLETE 1073.64
get_from_HPSS_ics_GSMGFS_lbcs_GSMGFS_20240125173423 COMPLETE 6.76
get_from_HPSS_ics_HRRR_lbcs_RAP_20240125173424 COMPLETE 13.18
get_from_HPSS_ics_RAP_lbcs_RAP_20240125173426 COMPLETE 15.41
get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS_20240125173427 DEAD 12.19
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202 COMPLETE 19.23
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_ COMPLETE 361.94
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20240 COMPLETE 166.24
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240125173432 COMPLETE 231.01
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240125173 COMPLETE 36.20
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20 COMPLETE 42.18
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024012517 COMPLETE 39.96
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202 COMPLETE 39.95
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240 COMPLETE 10.31
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20 COMPLETE 16.53
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot COMPLETE 14.47
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024012517 COMPLETE 15.93
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240125173 COMPLETE 43.60
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202 COMPLETE 18.64
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240125173447 COMPLETE 10.07
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240 COMPLETE 7.01
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024012517345 COMPLETE 18.43
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta_202401251 COMPLETE 14.50
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202401 COMPLETE 328.71
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson COMPLETE 3282.92
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240125 COMPLETE 419.22
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240125173 COMPLETE 514.55
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024 COMPLETE 520.15
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_ COMPLETE 33.34
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR_20240125 COMPLETE 31.89
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2 COMPLETE 30.89
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_ COMPLETE 10.80
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024012 COMPLETE 24.02
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2 COMPLETE 8.62
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_2 COMPLETE 365.49
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202401251 COMPLETE 434.65
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_20 COMPLETE 444.76
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240125173513 COMPLETE 95.37
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202401 COMPLETE 20.92
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024012517351 COMPLETE 23.69
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240125173 COMPLETE 20.38
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16_202401251735 COMPLETE 29.89
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot_202 COMPLETE 12.31
long_fcst_20240125173522 COMPLETE 63.52
MET_ensemble_verification_only_vx_20240125173523 COMPLETE 1.33
MET_ensemble_verification_only_vx_time_lag_20240125173526 DEAD 4.41
MET_ensemble_verification_winter_wx_20240125173528 COMPLETE 118.67
MET_verification_only_vx_20240125173531 COMPLETE 0.27
nco_20240125173533 COMPLETE 7.73
nco_ensemble_20240125173535 COMPLETE 73.55
nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202 COMPLETE 33.75
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_ DEAD 3.46
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thom COMPLETE 434.71
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR_2024 COMPLETE 10.79
pregen_grid_orog_sfc_climo_20240125173546 COMPLETE 8.81
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS_20240125173548 COMPLETE 7.08
specify_template_filenames_20240125173549 DEAD 6.47
----------------------------------------------------------------------------------------------------
Total DEAD 11216.74
The Jenkins working directory on Jet is /mnt/lfs1/NAGAPE/epic/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-969/jet/expt_dirs.
The get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS test failed in the make_ics task due to a bad allocation. I suspect a rerun will allow this test to pass.
The MET_ensemble_verification_only_vx_time_lag test failed in the run_MET_PcpCombine_fcst_APCP0*h_mem00* tasks with "OBS_DIR does not exist or is not a directory". The tasks that pulled the necessary data successfully completed, so hopefully a rerun will work here as well.
The nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16 test failed in the make_lbcs task due to OOM kill events. A rerun should work.
The specify_template_filenames test failed in the run_fcst_mem000 task due to a CFL violation. Hopefully a rerun will correct this.
I'm resubmitting the failed jobs now and will let you know how they look.
@RatkoVasic-NOAA - Three of the tests that had failed are now successfully passing on Jet:
expt_name = "MET_ensemble_verification_only_vx_time_lag"
wflow_status = "SUCCESS"
expt_name = "get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS"
wflow_status = "SUCCESS"
expt_name = "nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16"
wflow_status = "SUCCESS"
The specify_template_filenames test is still failing in run_fcst with:
FATAL from PE 1: compute_qs: saturation vapor pressure table overflow, nbad= 1
It doesn't make sense to me that a single test is encountering a CFL violation while the rest of the tests that also use the FV3_GFS_v15p2 SDF are not.
For Gaea C5, running the WE2E fundamental tests on the machine fails in make_grid with:
/gpfs/f5/epic/scratch/Michael.Lueken/ufs-srweather-app/gaeac5/install_intel/exec/regional_esg_grid: symbol lookup error: /usr/lib64/libssh.so.4: undefined symbol: EVP_KDF_CTX_new_id, version OPENSSL_1_1_1d
This is using the BUILD_CONDA option to create the srw_app conda environment on the machine. While applying the changes that @natalie-perlin made to allow the WE2E tests to run on Gaea C5 with the F2 filesystem (not using the BUILD_CONDA option and using the old workflow_tools conda environment instead), the WE2E tests fail to generate because uwtools isn't in the workflow_tools conda environment. Following the merging of PR #994, uwtools needs to be in the conda environment, otherwise the templater tool will fail to generate the experiment.
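If the old workflow_tools environment has to stay for now, one possible stopgap (assuming uwtools is pip-installable into that environment; untested here):
conda activate workflow_tools
pip install uwtools
python -c "import uwtools" && echo "uwtools available"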
@MichaelLueken I'm now running the 'specify_template_filenames' test on Jet. Last time I ran it, it worked. I'll start from the beginning.
@RatkoVasic-NOAA - After many rocotorewind/rocotoboot cycles on run_fcst in specify_template_filenames, all of the WE2E comprehensive tests successfully passed on Jet:
----------------------------------------------------------------------------------------------------
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
2020_CAD_20240125173400 COMPLETE 37.68
community_20240125173401 COMPLETE 17.23
custom_ESGgrid_20240125173403 COMPLETE 20.23
custom_ESGgrid_Central_Asia_3km_20240125173404 COMPLETE 35.80
custom_ESGgrid_Great_Lakes_snow_8km_20240125173405 COMPLETE 13.14
custom_ESGgrid_IndianOcean_6km_20240125173407 COMPLETE 17.01
custom_ESGgrid_NewZealand_3km_20240125173408 COMPLETE 71.01
custom_ESGgrid_Peru_12km_20240125173409 COMPLETE 21.22
custom_ESGgrid_SF_1p1km_20240125173411 COMPLETE 219.84
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE_202 COMPLETE 8.84
custom_GFDLgrid_20240125173413 COMPLETE 9.03
deactivate_tasks_20240125173414 COMPLETE 0.76
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me COMPLETE 1015.01
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200_2024012 COMPLETE 6.60
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019061200_202401 COMPLETE 9.01
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202401 COMPLETE 9.08
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20 COMPLETE 50.75
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2 COMPLETE 1073.64
get_from_HPSS_ics_GSMGFS_lbcs_GSMGFS_20240125173423 COMPLETE 6.76
get_from_HPSS_ics_HRRR_lbcs_RAP_20240125173424 COMPLETE 13.18
get_from_HPSS_ics_RAP_lbcs_RAP_20240125173426 COMPLETE 15.41
get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS_20240125173427 COMPLETE 13.65
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202 COMPLETE 19.23
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_ COMPLETE 361.94
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20240 COMPLETE 166.24
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240125173432 COMPLETE 231.01
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240125173 COMPLETE 36.20
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20 COMPLETE 42.18
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024012517 COMPLETE 39.96
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202 COMPLETE 39.95
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240 COMPLETE 10.31
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20 COMPLETE 16.53
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot COMPLETE 14.47
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024012517 COMPLETE 15.93
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240125173 COMPLETE 43.60
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202 COMPLETE 18.64
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240125173447 COMPLETE 10.07
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240 COMPLETE 7.01
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024012517345 COMPLETE 18.43
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta_202401251 COMPLETE 14.50
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202401 COMPLETE 328.71
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson COMPLETE 3282.92
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240125 COMPLETE 419.22
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240125173 COMPLETE 514.55
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024 COMPLETE 520.15
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_ COMPLETE 33.34
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR_20240125 COMPLETE 31.89
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2 COMPLETE 30.89
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_ COMPLETE 10.80
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024012 COMPLETE 24.02
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2 COMPLETE 8.62
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_2 COMPLETE 365.49
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202401251 COMPLETE 434.65
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_20 COMPLETE 444.76
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240125173513 COMPLETE 95.37
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202401 COMPLETE 20.92
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024012517351 COMPLETE 23.69
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240125173 COMPLETE 20.38
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16_202401251735 COMPLETE 29.89
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot_202 COMPLETE 12.31
long_fcst_20240125173522 COMPLETE 63.52
MET_ensemble_verification_only_vx_20240125173523 COMPLETE 1.33
MET_ensemble_verification_only_vx_time_lag_20240125173526 COMPLETE 6.26
MET_ensemble_verification_winter_wx_20240125173528 COMPLETE 118.67
MET_verification_only_vx_20240125173531 COMPLETE 0.27
nco_20240125173533 COMPLETE 7.73
nco_ensemble_20240125173535 COMPLETE 73.55
nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202 COMPLETE 33.75
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_ COMPLETE 14.34
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thom COMPLETE 434.71
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR_2024 COMPLETE 10.79
pregen_grid_orog_sfc_climo_20240125173546 COMPLETE 8.81
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS_20240125173548 COMPLETE 7.08
specify_template_filenames_20240125173549 COMPLETE 11.27
----------------------------------------------------------------------------------------------------
Total COMPLETE 11235.73
On Gaea C5, while compiling, I am seeing the following messages, which didn't appear before the transition to the F5 filesystem:
[ 0%] Building Fortran object sorc/emcsfc_ice_blend.fd/CMakeFiles/emcsfc_ice_blend.dir/emcsfc_ice_blend.f90.o
No supported cpu target is set, CRAY_CPU_TARGET=x86-64 will be used.
Load a valid targeting module or set CRAY_CPU_TARGET
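The warning itself is usually harmless, but it can be silenced by loading a Cray targeting module or setting the variable explicitly (the Milan target is an assumption about the C5 node type):
module load craype-x86-milan   # or whichever craype targeting module C5 provides
# alternatively:
export CRAY_CPU_TARGET=x86-milan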
Additionally, I'm trying to see if the modifications to etc/lmod-setup.sh and etc/lmod-setup.csh (replacing the calls to source /lustre/f2/dev/role.epic/contrib/Lmod_init_C5.sh and source /lustre/f2/dev/role.epic/contrib/Lmod_init_C5.csh, respectively, with module reset) might be causing issues.
It's unclear why the regional_esg_grid executable would encounter symbol lookup error: /usr/lib64/libssh.so.4: undefined symbol: EVP_KDF_CTX_new_id, version OPENSSL_1_1_1d.
DESCRIPTION OF CHANGES:
Update SRW with spack-stack 1.5.0. Machines affected:
TESTS CONDUCTED:
Fundamental tests performed.
DEPENDENCIES:
PR #973 and its follow-up PR
ISSUE:
This solves issue #946. Issue to be solved: #991