ufs-community / regional_workflow

THIS REPOSITORY IS NOW DEPRECATED; SEE UFS SRW APP FOR CURRENT CODE
https://github.com/ufs-community/ufs-srweather-app

Update default data locations to make "config.community.sh" case work correctly #820

Closed mkavulich closed 2 years ago

mkavulich commented 2 years ago

DESCRIPTION OF CHANGES:

On a few platforms (Cheyenne, Hera, and Orion), the default case specified in the file ush/config.community.sh does not work correctly because the input data cannot be found. This PR changes the default input data location on those platforms to point to each platform's staged data directories, so that this test case succeeds "out of the box" (the user only needs to specify the machine and account).
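As a rough illustration of what "out of the box" means here, the fragment below shows the two settings a user would still have to edit. The variable names `MACHINE` and `ACCOUNT` are assumed from the PR description ("machine and account"), and both values are placeholders, not recommendations.

```shell
# Sketch only: the two user-specific settings the default case should need.
# Variable names are assumed from the PR description; values are placeholders.
MACHINE="hera"        # placeholder: the platform you are running on
ACCOUNT="an_account"  # placeholder: replace with your own allocation/project account
```

Everything else, including the input data location, is meant to come from the platform defaults this PR updates.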

TESTS CONDUCTED:

Ran the case specified by ush/config.community.sh on Cheyenne, Hera, and Orion; all succeeded after these changes. Also ran it on Jet; this succeeded because the already-specified default location on Jet has the appropriate data.

Did not run any WE2E tests, although this should not change any of those results. Will run for Hera and Jet using CI tags to make sure nothing major has broken.

DEPENDENCIES:

None

DOCUMENTATION:

None

venitahagerty commented 2 years ago

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1022938469/20220810165016/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back: ci-hera-intel-WE

Experiment Failed on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
2022-08-10 17:22:10 +0000 :: hfe03 :: Task make_grid, jobid=34594822, in state DEAD (FAILED), ran for 6.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
2022-08-10 17:20:06 +0000 :: hfe12 :: Task make_grid, jobid=34594757, in state DEAD (FAILED), ran for 8.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
2022-08-10 17:20:09 +0000 :: hfe04 :: Task make_grid, jobid=34594760, in state DEAD (FAILED), ran for 6.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
2022-08-10 17:20:05 +0000 :: hfe11 :: Task make_grid, jobid=34594752, in state DEAD (FAILED), ran for 9.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
2022-08-10 17:20:14 +0000 :: hfe01 :: Task make_grid, jobid=34594763, in state DEAD (FAILED), ran for 8.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
2022-08-10 17:20:14 +0000 :: hfe03 :: Task make_grid, jobid=34594762, in state DEAD (FAILED), ran for 9.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
2022-08-10 17:20:05 +0000 :: hfe08 :: Task make_grid, jobid=34594764, in state DEAD (FAILED), ran for 6.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
2022-08-10 17:20:06 +0000 :: hfe01 :: Task make_grid, jobid=34594756, in state DEAD (FAILED), ran for 8.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
2022-08-10 17:24:12 +0000 :: hfe02 :: Task make_ics, jobid=34594847, in state DEAD (FAILED), ran for 7.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
2022-08-10 17:24:12 +0000 :: hfe02 :: Task make_lbcs, jobid=34594848, in state DEAD (FAILED), ran for 8.0 seconds, exit status=256, try=2 (of 2)

All experiments completed

venitahagerty commented 2 years ago

Machine: jet
Compiler: intel
Job: WE
Repo location: /lfs1/BMC/nrtrr/rrfs_ci/autoci/pr/1022938469/20220810165015/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back: ci-jet-intel-WE

Experiment Failed on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
2022-08-10 17:36:14 +0000 :: fe1 :: Task make_grid, jobid=8661302, in state DEAD (FAILED), ran for 5.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
2022-08-10 17:36:10 +0000 :: fe6 :: Task make_grid, jobid=8661303, in state DEAD (FAILED), ran for 6.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
2022-08-10 17:36:08 +0000 :: fe3 :: Task get_extrn_lbcs, jobid=8661292, in state DEAD (OUT_OF_MEMORY), ran for 107.0 seconds, exit status=253, try=1 (of 1)
Experiment Failed on jet: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
2022-08-10 17:36:05 +0000 :: fe1 :: Task make_ics, jobid=8661350, in state DEAD (FAILED), ran for 7.0 seconds, exit status=256, try=1 (of 1)
Experiment Failed on jet: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
2022-08-10 17:36:05 +0000 :: fe1 :: Task make_lbcs, jobid=8661351, in state DEAD (FAILED), ran for 7.0 seconds, exit status=256, try=1 (of 1)
Experiment Failed on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
2022-08-10 17:34:10 +0000 :: fe7 :: Task make_grid, jobid=8661280, in state DEAD (FAILED), ran for 6.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
2022-08-10 17:34:10 +0000 :: fe7 :: Task get_extrn_lbcs, jobid=8661167, in state DEAD (OUT_OF_MEMORY), ran for 196.0 seconds, exit status=253, try=1 (of 1)
Experiment Failed on jet: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
2022-08-10 17:42:09 +0000 :: fe6 :: Task make_grid, jobid=8661372, in state DEAD (FAILED), ran for 5.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
2022-08-10 17:42:12 +0000 :: fe8 :: Task make_grid, jobid=8661347, in state DEAD (FAILED), ran for 4.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
2022-08-10 17:42:07 +0000 :: fe7 :: Task make_grid, jobid=8661343, in state DEAD (FAILED), ran for 5.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
2022-08-10 18:04:11 +0000 :: fe5 :: Task make_grid, jobid=8661387, in state DEAD (TIMEOUT), ran for 1214.0 seconds, exit status=15, try=2 (of 2)

JeffBeck-NOAA commented 2 years ago

@mkavulich, I approved, but do we know why the CI tests aren't succeeding on Jet for FV3GFS, HRRR, and RAP data? We should probably remove all GSMGFS WE2E tests at this point.

venitahagerty commented 2 years ago

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1022938469/20220818165012/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back: ci-hera-intel-WE

Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta

All experiments completed

venitahagerty commented 2 years ago

Machine: jet
Compiler: intel
Job: WE
Repo location: /lfs1/BMC/nrtrr/rrfs_ci/autoci/pr/1022938469/20220818165012/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back: ci-jet-intel-WE

Experiment Succeeded on jet: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Failed on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
2022-08-18 18:20:11 +0000 :: fe4 :: Task make_orog, jobid=9242707, in state DEAD (TIMEOUT), ran for 1206.0 seconds, exit status=15, try=2 (of 2)
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2

mkavulich commented 2 years ago

@christinaholtNOAA This problem came about because the config.community.sh case simply fails on most platforms: the data is staged, but there is no logic that points the workflow to look there by default. All the documentation says about "EXTRN_MDL_SYSBASEDIR_ICS" is that it's the "Base directory on the local machine containing external model files for generating ICs on the native grid." It doesn't say anything about real-time streams, so if that is the intended use, the variable needs a fuller definition. Based on the current info, on platforms without real-time streams, I thought it made sense to just point to the staged data so that this default case will work.

I am curious about your edit comment "I think that setting the ush/config.community.sh to find the staged disk files in the same way the test configs do it would be a better solution." Do you think that we should introduce platform-specific logic to this config file? I think that would not work well as an "example" config.sh file.

christinaholtNOAA commented 2 years ago

@mkavulich I think it's worth documenting the assumption I just made if others are not opposed. Taking up this mechanism with staged data for canned cases seems redundant and provides no alternative for folks who want to know where supported data streams are on platforms where they may like to run in near-real time.

Jet, Hera, and potentially some user-defined cloud platforms will have data streams that support real time runs.

I think that using the "staged file" approach wouldn't need machine-specific logic since all the supported platforms provide the necessary information under TEST_EXTRN_MDL_SOURCE_BASEDIR. You could continue to leave a machine-specific example in as a commented guideline if you wanted, I think. Or tell users to change this directory to one they choose. In all likelihood, real Community users are going to stage their own data anyway, and will need to know how to do that.
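To make the "staged file" suggestion concrete, here is a minimal sketch of how a config could derive its input directory from `TEST_EXTRN_MDL_SOURCE_BASEDIR` without any per-machine branches. Only `TEST_EXTRN_MDL_SOURCE_BASEDIR` comes from this discussion; the helper `staged_dir_for`, the variables `EXTRN_MDL_NAME_ICS` and `STAGED_ICS_DIR`, the fallback path, and the one-subdirectory-per-model layout are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: resolve a staged-data directory without machine-specific logic.
# TEST_EXTRN_MDL_SOURCE_BASEDIR is named in the discussion; every other name,
# the fallback path, and the directory layout are hypothetical.

# Assumed to be provided by the platform's own settings; placeholder fallback.
TEST_EXTRN_MDL_SOURCE_BASEDIR="${TEST_EXTRN_MDL_SOURCE_BASEDIR:-/path/to/staged/input_model_data}"

# Hypothetical helper: build the staged directory for one external model,
# assuming one subdirectory per model under the base directory.
staged_dir_for() {
  local model="$1"
  printf '%s/%s\n' "${TEST_EXTRN_MDL_SOURCE_BASEDIR}" "${model}"
}

EXTRN_MDL_NAME_ICS="FV3GFS"   # assumed name for the model that supplies ICs
STAGED_ICS_DIR="$(staged_dir_for "${EXTRN_MDL_NAME_ICS}")"
echo "ICs would be read from: ${STAGED_ICS_DIR}"
```

Because the base directory comes from the platform rather than from the example config, the same file could serve every supported machine, and a user staging their own data would only need to override that one variable.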

One more thought, and a good life motto: if it's not tested, assume it's broken. That's exactly why we have this PR. Is there a possibility that users could be directed to the 80+ examples of how to do this in the currently tested WE2E tests, and this one could disappear altogether? Alternatively, can this example script (and its NCO mode counterpart) be tested as a WE2E test so that it doesn't end up broken for extended periods until a user needs it?
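Short of a full WE2E test, even a small pre-flight check would catch the failure this PR addresses (a default data location that does not exist). The function below is a hypothetical sketch, not part of the workflow; the name `check_data_dir` and its return codes are invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of the workflow.
# Returns 0 if the directory exists and is non-empty,
# 1 if it is missing, 2 if it exists but is empty.
check_data_dir() {
  local dir="$1"
  if [ ! -d "${dir}" ]; then
    echo "Input data directory not found: ${dir}" >&2
    return 1
  fi
  if [ -z "$(ls -A "${dir}")" ]; then
    echo "Input data directory is empty: ${dir}" >&2
    return 2
  fi
  return 0
}
```

A CI job could call this on the default data location for each platform and fail early, rather than waiting for a task to die several steps into an experiment.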

mkavulich commented 2 years ago

We will continue this discussion after the merge; see https://github.com/ufs-community/ufs-srweather-app/issues/341