Closed MichaelLueken closed 3 months ago
Remove `ush/set_ozone_param.py` (the ozphys schemes in the SDFs were removed in the weather model)
For more background on this point: the stratospheric ozone physics schemes were reorganized (see https://github.com/ufs-community/ufs-weather-model/pull/1851, https://github.com/NOAA-EMC/fv3atm/pull/661, and https://github.com/ufs-community/ccpp-physics/pull/75) so that they are now controlled by `input.nml`, where previously they were controlled by both the namelist and the suite definition file. Any future ozone physics changes will therefore need to be tied to the namelist options. Currently the only supported ozone suite is the NRL 2015 ozone scheme (`oz_phys_2015 = .true.`), so there is no need for any special scheme-specific logic, hence the removal of that file.
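Since the selection now happens entirely in the namelist, the relevant configuration looks something like the following illustrative `input.nml` fragment (the exact surrounding contents of `&gfs_physics_nml` will vary by suite; only the two ozone flags shown here are the point):

```fortran
&gfs_physics_nml
  ! Select the NRL 2015 ozone photochemistry scheme, currently the only
  ! supported option. Previously this choice was also encoded in the
  ! suite definition file; now the namelist alone controls it.
  oz_phys      = .false.
  oz_phys_2015 = .true.
/
```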
@MichaelLueken I have run into a problem on Derecho that I'm unable to solve: this occurred both in my preliminary branch and your current branch. It has to do with the installation of the srw conda packages:
```
error libmamba Bad conversion of Python version '3.10.12': filesystem error: temp_directory_path: No such file or directory
./Miniforge3-Linux-x86_64.sh: line 339: 109438 Segmentation fault (core dumped) CONDA_SAFETY_CHECKS=disabled CONDA_EXTRA_SAFETY_CHECKS=no CONDA_CHANNELS="conda-forge" CONDA_PKGS_DIRS="$PREFIX/pkgs" "$CONDA_EXEC" install --offline --file "$PREFIX/pkgs/env.txt" -yp "$PREFIX"
./devbuild.sh: line 228: conda/etc/profile.d/conda.sh: No such file or directory
./devbuild.sh: line 233: conda: command not found
./devbuild.sh: line 234: conda: command not found
./devbuild.sh: line 235: mamba: command not found
./devbuild.sh: line 237: conda: command not found
./devbuild.sh: line 238: mamba: command not found
```
I assume we'll need to enlist the help of the unified workflow team on this one? Or it could be related to the updated spack-stack build. Regardless, I'll wait to see if you can replicate the problem to make sure it's not just a problem with my environment.
@mkavulich I have just cloned a fresh copy of the `feature/hash_update` branch on Derecho and I was able to successfully build the App using `./devbuild.sh -p=derecho`:

```
[100%] Built target ufs-weather-model
Install the project...
-- Install configuration: "RELEASE"
-- Installing: /glade/derecho/scratch/mlueken/ufs-srweather-app/derecho/exec/ufs_srweather_app.settings
mlueken@derecho6:/glade/derecho/scratch/mlueken/ufs-srweather-app/derecho>
```
@MichaelLueken Did you check that the conda package installed correctly as well? The code actually builds successfully for me; it's the conda package that fails to install.
@mkavulich Yes, the conda package was correctly installed. Both the `srw_app` and `srw_graphics` conda environments were also created. The fundamental tests were run and all passed successfully.

My working copy on Derecho can be found at /glade/derecho/scratch/mlueken/ufs-srweather-app/derecho. The fundamental test results can be found at /glade/derecho/scratch/mlueken/ufs-srweather-app/expt_dirs.
Thanks for confirming, and sorry for cluttering up the PR with my own issues. I did confirm that it works on Hera, so I'll continue my testing there while I try to figure out this Derecho issue.
Edit: for future reference, this conda error was caused by my environment containing a `TMPDIR` environment variable that pointed to a non-existent directory. This is the issue that helped me solve it: https://github.com/conda-forge/miniforge/issues/474
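For anyone hitting the same `temp_directory_path: No such file or directory` failure, a minimal pre-flight check along these lines (a sketch, not part of the SRW scripts) can catch a stale `TMPDIR` before running the installer:

```python
import os
import tempfile

# Sketch of the root cause: the Miniforge installer's bundled tooling asks
# for a temp directory, and that lookup fails when $TMPDIR points at a
# directory that no longer exists.
tmpdir = os.environ.get("TMPDIR")
if tmpdir and not os.path.isdir(tmpdir):
    # Either unset TMPDIR in the shell before running ./devbuild.sh,
    # or recreate the directory it points to, as done here.
    os.makedirs(tmpdir, exist_ok=True)

# After the fix, temp-directory resolution succeeds again.
print(tempfile.gettempdir())
```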
@MichaelLueken, I was able to build the app on Derecho successfully. However, after yesterday's PM, it fails on Hera with the following message:
```
Error running link command: Segmentation fault
make[5]: *** [FV3/ccpp/framework/src/CMakeFiles/ccpp_framework.dir/build.make:97: FV3/ccpp/framework/src/libccpp_framework.a] Error 1
make[5]: *** Deleting file 'FV3/ccpp/framework/src/libccpp_framework.a'
make[4]: *** [CMakeFiles/Makefile2:469: FV3/ccpp/framework/src/CMakeFiles/ccpp_framework.dir/all] Error 2
make[4]: *** Waiting for unfinished jobs....
```
This may be a system issue. Do you have any idea what is happening there?
Thanks for the review, @chan-hoo! At the moment, I suspect that the issue is due to the Rocky8 transition. Once PR #1054 is merged, I will update my `feature/hash_update` branch to the HEAD of `develop`. Hopefully that is all that will be required.
@chan-hoo - I have updated my branch to the HEAD of `develop`. The SRW App is successfully building once again on the default Rocky front ends:

```
[100%] Completed 'ufs-weather-model'
[100%] Built target ufs-weather-model
Install the project...
-- Install configuration: "RELEASE"
-- Installing: /scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/hera/exec/ufs_srweather_app.settings
```
Please let me know if you continue to encounter issues while compiling or running on Hera.
@MichaelLueken, it works well now. :) Thanks!
The update made to `aqm_environment.yml` was enough to allow the AQM WE2E test to successfully run on Hera Rocky8:
```
----------------------------------------------------------------------------------------------------
Experiment name                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
aqm_grid_AQM_NA13km_suite_GFS_v16_20240320175754   COMPLETE              4890.58
----------------------------------------------------------------------------------------------------
Total                                              COMPLETE              4890.58
```
The Jenkins runner on Hera appears to have connected to a CentOS front end when maintenance concluded. The Jenkins tests on Hera failed to compile the SRW App. Reached out to the Platform Team via PSD-85 to request that they connect to a Rocky8 front end.
Additionally, on Hera and Jet, the Functional `WorkflowTaskTests` are failing in `run_fcst` with the following error message:

```
FATAL from PE 0: mpp_domains_define.inc: At least one pe in pelist is not used by any tile in the mosaic
```
Further investigation is necessary to see why the tests run fine on Derecho, Hercules, and Orion, but not on Hera and Jet (why is PBSPro okay, but not Slurm?). Additionally, why is there an issue with the wrapper scripts, but not when the tasks are run as part of the workflow?
The WE2E coverage tests were manually launched on Hera Intel and successfully passed:
```
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Peru_12km_20240321180137                            COMPLETE                31.36
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200_2024032  COMPLETE                 6.84
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE              1515.46
get_from_HPSS_ics_HRRR_lbcs_RAP_20240321180143                     COMPLETE                14.76
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE                 7.43
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE                14.16
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240321180147  COMPLETE                10.66
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240  COMPLETE                 7.74
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202403  COMPLETE               447.16
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240321  COMPLETE               587.99
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202403211  COMPLETE              1024.58
pregen_grid_orog_sfc_climo_20240321180154                          COMPLETE                 7.67
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE              3675.81
```
As part of this PR, I removed the use of the PET list and added back in the original `atmos_nthreads` capability. This works fine while running the workflow and using the wrapper scripts (Functional WorkflowTaskTests - `wrapper_srw_ftest.sh`) on systems that use PBSPro, but the wrapper scripts on systems that use Slurm fail due to bad PET list bounds. Adding the PET list back in allows the wrapper scripts to pass on Slurm systems, but causes the workflow to fail in `run_fcst`.

I'm now doing a deep dive to see whether the PET list aspect of the weather model has been updated without a corresponding update to the documentation.
I had been updating the `pe_member01_m1` entry in `ufs.configure`, rather than updating the `PE_MEMBER01` entry in `config_defaults.yaml`. This produced the correct output in the `ufs.configure` file, but an incorrect value for `PE_MEMBER01`, leading to failure. Applying the necessary update to `PE_MEMBER01` now allows the majority of the fundamental tests to run properly.
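To make the relationship concrete, here is a small sketch of how the member-1 task count is typically derived in SRW-style configurations (an illustration, not the SRW source code): the compute decomposition contributes `LAYOUT_X * LAYOUT_Y` tasks, and the write component adds more when quilting is enabled.

```python
def pe_member01(layout_x, layout_y, quilting=True,
                write_groups=1, write_tasks_per_group=2):
    """Illustrative sketch: total MPI tasks for member 1 is the compute
    decomposition (LAYOUT_X * LAYOUT_Y) plus the write-component tasks
    (groups * tasks per group) when quilting is enabled."""
    pes = layout_x * layout_y
    if quilting:
        pes += write_groups * write_tasks_per_group
    return pes

# A 5x2 compute layout plus one write group of 2 tasks:
print(pe_member01(5, 2))                  # 12
print(pe_member01(5, 2, quilting=False))  # 10
```

Editing only `ufs.configure` changes what the model is told, but downstream workflow logic that sizes the batch job reads `PE_MEMBER01` from the YAML, which is why the two must stay consistent.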
Unfortunately, the `grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR` WE2E test is still failing. The failure message is:

```
FATAL from PE 6: mpp_domains_define.inc: At least one pe in pelist is not used by any tile in the mosaic
```

It appears as though the modification to `DT_ATMOS`, `LAYOUT_X`, `LAYOUT_Y`, or `BLOCKSIZE` is having an adverse effect on the test. I will investigate further.
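As a hypothetical diagnostic (this check is not in FV3 or the SRW; it just restates the likely meaning of the error): the "pe in pelist is not used by any tile" FATAL usually indicates that the number of compute PEs handed to the mosaic does not match what the decomposition can actually use, e.g. `LAYOUT_X * LAYOUT_Y` for a single regional tile.

```python
def check_pelist(layout_x, layout_y, compute_pes):
    """Hypothetical sanity check: compare the compute PEs assigned to a
    single-tile regional domain against what the layout can use."""
    needed = layout_x * layout_y
    if compute_pes > needed:
        # Surplus PEs are assigned to no tile -> the mosaic FATAL above.
        return f"{compute_pes - needed} PEs unused by the mosaic"
    if compute_pes < needed:
        return f"short {needed - compute_pes} PEs for the layout"
    return "OK"

print(check_pelist(5, 2, 12))  # 2 PEs unused by the mosaic
print(check_pelist(5, 2, 10))  # OK
```

Under this reading, reverting the `LAYOUT_X`/`LAYOUT_Y` change for the failing test (or adjusting the assigned PE count to match) would be the first thing to try.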
The Jenkins tests successfully passed. After addressing conflicts, the AQM WE2E test was run one last time and also successfully passed:
```
----------------------------------------------------------------------------------------------------
Experiment name                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
aqm_grid_AQM_NA13km_suite_GFS_v16_20240327143007   COMPLETE              4865.32
----------------------------------------------------------------------------------------------------
Total                                              COMPLETE              4865.32
```
Merging this PR now.
DESCRIPTION OF CHANGES:
This PR will update the ufs-weather-model hash to 8518c2c (March 1), the UPP hash to 945cb2c (January 23), and the UFS_UTILS hash to 57bd832 (February 6).
This work also required several modifications to allow the updated weather model and UFS_UTILS hashes to work in the SRW:
Type of change
TESTS CONDUCTED:
DEPENDENCIES:
None
DOCUMENTATION:
Documentation in ConfigWorkflow.rst has been updated to show renaming of NEMS/nems to UFS/ufs.
ISSUE:
Fixes #1049
CHECKLIST
CONTRIBUTORS (optional):
@mkavulich