ufs-community / ufs-srweather-app

UFS Short-Range Weather Application

MET_ensemble_verification_only_vx_time_lag no longer works on Tier 1 machines #900

Closed natalie-perlin closed 11 months ago

natalie-perlin commented 11 months ago

MET verification tests use the met and metplus modules from the software stacks on Tier 1 machines; these changes were implemented in PR #826 (https://github.com/ufs-community/ufs-srweather-app/pull/826). Changes implemented since then have affected the MET verification tasks, and MET_ensemble_verification_only_vx_time_lag no longer seems to work (tested on Hera, Gaea, Orion, and the new platform Derecho). The tasks get_obs_ccpa, get_obs_mrms, and get_obs_ndas fail.

Log files could be viewed on Hera: /scratch1/NCEPDEV/stmp2/Natalie.Perlin/SRW/expt_dirs/MET_ensemble_verification_only_vx_time_lag/log/get_obs_ndas_2021050500.log, get_obs_mrms_2021050500.log, get_obs_ccpa_2021050500.log

Attached are the getobs*_2021050500.log files, var_defns.sh and generated FV3LAM_wflow.xml workflow.

Expected behavior

The MET_ensemble_verification_only_vx_time_lag test passes successfully on Hera (Intel and GNU), Gaea, Orion, Jet, and Derecho. The tasks get_obs_ccpa, get_obs_mrms, and get_obs_ndas do not need to be run, as the data is staged on these systems.

Current behavior

Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
MET_ensemble_verification_only_vx_time_lag                         DEAD                   0.00
----------------------------------------------------------------------------------------------------
Total                                                              DEAD                   0.00

Tasks that fail are get_obs_ccpa, get_obs_mrms, get_obs_ndas.

Machines affected

Any system running SRW

Steps To Reproduce

Example for Orion:

git clone -b develop https://github.com/ufs-community/ufs-srweather-app.git
cd ufs-srweather-app/
./manage_externals/checkout_externals 
source etc/lmod-setup.sh orion
module use $PWD/modulefiles
./devbuild.sh -p=orion -c=intel 
cd tests/WE2E
module load wflow_orion
conda activate workflow_tools
./run_WE2E_tests.py -t MET_ensemble_verification_only_vx_time_lag  -m orion -a epic

See the bug...

calling function that monitors jobs, prints summary
Writing information for all experiments to WE2E_tests_20230905195031.yaml
Checking tests available for monitoring...
Starting experiment MET_ensemble_verification_only_vx_time_lag running
Updating database for experiment MET_ensemble_verification_only_vx_time_lag
Setup complete; monitoring 1 experiments
Use ctrl-c to pause job submission/monitoring
09/05/23 19:50:45 UTC :: FV3LAM_wflow.xml :: Cycle 202105050000, Task get_obs_ccpa, jobid=49135445, in state DEAD (FAILED), ran for 6.0 seconds, exit status=256, try=1 (of 1)
09/05/23 19:50:45 UTC :: FV3LAM_wflow.xml :: Cycle 202105050000, Task get_obs_mrms, jobid=49135446, in state DEAD (FAILED), ran for 5.0 seconds, exit status=256, try=1 (of 1)
09/05/23 19:50:45 UTC :: FV3LAM_wflow.xml :: Cycle 202105050000, Task get_obs_ndas, jobid=49135447, in state DEAD (FAILED), ran for 5.0 seconds, exit status=256, try=1 (of 1)
Experiment MET_ensemble_verification_only_vx_time_lag is DEAD
Took 0:00:23.103369; will no longer monitor.
All 1 experiments finished
Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
MET_ensemble_verification_only_vx_time_lag                         DEAD                   0.00
----------------------------------------------------------------------------------------------------
Total                                                              DEAD                   0.00

In MET_ensemble_verification_only_vx_time_lag tests run before PR #826 was merged, the get_obs_ccpa, get_obs_mrms, and get_obs_ndas tasks were not run, as all of the data were staged on each machine.

An example of a successful MET_ensemble_verification_only_vx_time_lag test on Hera:

SRW base directory: /scratch1/NCEPDEV/stmp2/Natalie.Perlin/SRW/srw-dev-met
Experiment directory: /scratch1/NCEPDEV/stmp2/Natalie.Perlin/SRW/INTEL/MET_ensemble_verification_only_vx_time_lag

Detailed Description of Fix (optional)

The fix may be related to the configurations in the parm/wflow/*.yaml and ./ush/machine/verify_*.yaml files.

Additional Information (optional)

There are differences between the machine files used in PR #826 (i.e., the settings for the OBS data directories) and their current versions (and also the earlier versions from before PR #826). For example, on Hera, PR #826 used the following settings:

  CCPA_OBS_DIR: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/obs_data/ccpa/proc
  MRMS_OBS_DIR: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/obs_data/mrms/proc
  NDAS_OBS_DIR: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/obs_data/ndas/proc

The current hera.yaml and the machine file from before the PR #826 merge contain the following:

  TEST_CCPA_OBS_DIR: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/obs_data/ccpa/proc
  TEST_MRMS_OBS_DIR: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/obs_data/mrms/proc
  TEST_NDAS_OBS_DIR: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/obs_data/ndas/proc

Possible Implementation (optional)
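
Based on the directory settings quoted above, one possible workaround would be to point the experiment back at the staged paths via a user-level configuration override. This is an illustrative sketch only, not a confirmed fix: the paths are copied from the machine-file snippets above, but the placement under a platform: section is an assumption about the SRW App configuration layout.

```yaml
# Hypothetical user config.yaml override for Hera, pointing the
# verification tasks at the staged observation data instead of HPSS.
# Paths copied from the machine-file snippets quoted in this issue;
# the "platform:" section placement is an assumption.
platform:
  CCPA_OBS_DIR: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/obs_data/ccpa/proc
  MRMS_OBS_DIR: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/obs_data/mrms/proc
  NDAS_OBS_DIR: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/obs_data/ndas/proc
```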

Output (optional)

get_obs_ccpa_2021050500.log get_obs_ndas_2021050500.log get_obs_mrms_2021050500.log var_defns.sh.txt FV3LAM_wflow.xml.txt

natalie-perlin commented 11 months ago

@mkavulich - some input on the changes from PR #864 and whether they could have affected this test would be really helpful! Some changes that could be relevant: the verification task names likely changed from get_obs_* to get_verif_obs as part of PR #864, yet the tasks that fail in MET_ensemble_verification_only_vx_time_lag are still named get_obs_ccpa, get_obs_mrms, and get_obs_ndas, despite being part of a verification workflow.

MichaelLueken commented 11 months ago

@natalie-perlin -

I was able to clone the develop branch on Orion, build the SRW App, and then submit the MET_ensemble_verification_only_vx_time_lag test. My test failed because HPSS is not available on Orion. The get_obs_* tasks should be pointing to the JREGIONAL_GET_VERIF_OBS j-job file, i.e., <command>&LOAD_MODULES_RUN_TASK_FP; "get_obs" "&JOBSdir;/JREGIONAL_GET_VERIF_OBS"</command>

The test was fundamentally changed in PR #864 to require the verification data to be pulled from HPSS (please see lines 31-34 of the MET_ensemble_verification_only_vx_time_lag configuration file); the test no longer uses the staged data. With this change, the test can only be run on Hera and Jet. It should also be noted that the tarballs in question appear to contain restricted data: if you aren't a member of the rstprod project, you will be unable to pull the necessary data from HPSS, and the test will fail.
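
The rstprod requirement described above can be checked from a login shell before submitting the test. This is a generic sketch using standard Linux utilities; the group name rstprod comes from the comment above, and everything else is an assumption about the environment:

```shell
# Check whether the current user belongs to a given group (e.g. rstprod),
# which is needed to pull the restricted-data tarballs from HPSS.
has_group() {
  id -nG | tr ' ' '\n' | grep -qx -- "$1"
}

if has_group rstprod; then
  echo "rstprod membership: yes"
else
  echo "rstprod membership: no (request access via AIM)"
fi
```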

MichaelLueken commented 11 months ago

@natalie-perlin - I can confirm that removing lines 31-34 of the MET_ensemble_verification_only_vx_time_lag configuration file allows the test to run without issue using the staged data. However, as noted above, the purpose of the test is to try to pull the verification data from HPSS and then run the test.

mkavulich commented 11 months ago

@MichaelLueken thanks for jumping in with a reply. Your summary is correct: these two WE2E tests are intended to only check for data on HPSS. I used HPSS data for the MET_ensemble_verification_only_vx_time_lag test because, at the time, data was only staged for that test on Hera, so it couldn't be run on other machines anyway. In addition, the function of the tasks get_obs_ccpa, get_obs_mrms, and get_obs_ndas changed with that PR. They should be run regardless of whether data is being pulled from HPSS or disk (this makes the formatting of config.yaml much easier for most cases); in the latter case the task checks to ensure all the necessary data is available on disk.
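
The dual behavior described above (verify staged data on disk when available, fall back to HPSS otherwise) can be sketched roughly as follows. This is not the actual JREGIONAL_GET_VERIF_OBS j-job, and the file name is purely illustrative:

```shell
# Rough sketch of the revised get_obs_* task logic described above:
# if the expected file already exists under the staged obs directory,
# use it; otherwise an HPSS retrieval would be attempted (which requires
# rstprod access and an HPSS-connected machine such as Hera or Jet).
obs_source() {
  if [ -f "$1" ]; then
    echo "disk"
  else
    echo "hpss"
  fi
}

# Illustrative file name only; CCPA_OBS_DIR as in the machine files.
staged_file="${CCPA_OBS_DIR:-/tmp/no_staged_obs}/ccpa.example.gb2"
case "$(obs_source "$staged_file")" in
  disk) echo "Using staged data: $staged_file" ;;
  hpss) echo "Staged data missing; an HPSS pull would be attempted here" ;;
esac
```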

If there is a desire to make the time-lag test use staged data that would be fine, but at least one of the verification tests should be run pulling data from HPSS to test that functionality.

natalie-perlin commented 11 months ago

@mkavulich @MichaelLueken - thank you for your comments

MichaelLueken commented 11 months ago

@natalie-perlin - As per today's meeting, please ensure that you log into AIM and request access to the rstprod project. You will be asked to provide justification to be granted permission. If you include:

The Short-Range Weather Application (SRW App) runs workflow end-to-end tests to ensure that modifications to the code don't adversely affect development. Among these tests, there are verification tests that require observational data from HPSS. Unfortunately, these data sets are included in tarballs that also contain restricted data. Due to this, verification tests that need to pull data from HPSS are failing due to the lack of rstprod project access.

access should be granted. Once you have access to rstprod on RDHPCS, you will need to let HPSS know that you have been granted access to rstprod on RDHPCS so that you can pull the tarball that contains restricted data from HPSS. The email for the HPSS helpdesk is rdhpcs.hpss.help@noaa.gov. Skylar Nelson is the lead for the HPSS helpdesk, so including an email to him might expedite the process.

Closing this issue now.