
[develop] Update weather model, UPP, and UFS_UTILS hashes #1050

Closed. MichaelLueken closed this PR 3 months ago.

MichaelLueken commented 4 months ago

DESCRIPTION OF CHANGES:

This PR will update the ufs-weather-model hash to 8518c2c (March 1), the UPP hash to 945cb2c (January 23), and the UFS_UTILS hash to 57bd832 (February 6).

This work also required several modifications to allow the updated weather model and UFS_UTILS hashes to work in the SRW App.

Type of change

TESTS CONDUCTED:

DEPENDENCIES:

None

DOCUMENTATION:

Documentation in ConfigWorkflow.rst has been updated to reflect the renaming of NEMS/nems to UFS/ufs.

ISSUE:

Fixes #1049

CHECKLIST

CONTRIBUTORS (optional):

@mkavulich

mkavulich commented 4 months ago

Remove ush/set_ozone_param.py (the ozphys schemes were removed from the SDFs in the weather model)

For more background on this point: the stratospheric ozone physics schemes were reorganized (see https://github.com/ufs-community/ufs-weather-model/pull/1851, https://github.com/NOAA-EMC/fv3atm/pull/661, and https://github.com/ufs-community/ccpp-physics/pull/75) so that they are now controlled by input.nml, whereas previously they were controlled by both the namelist and the suite definition file. Any future ozone physics changes will therefore need to be tied to the namelist options. Currently the only supported ozone suite is the NRL 2015 ozone scheme (oz_phys_2015 = .true.), so there is no need for any special scheme-specific logic, hence the removal of that file.
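
For anyone verifying a run after this change, a quick way to confirm which ozone scheme is active is to inspect the forecast namelist directly. The sketch below is illustrative only; the flag names (oz_phys, oz_phys_2015) are assumed from the usual gfs_physics_nml conventions rather than taken from this PR.

```bash
# Minimal sketch: check which ozone flags the forecast namelist sets.
# Flag names assumed from the standard gfs_physics_nml layout; adjust the
# path to point at your experiment's input.nml.
grep -E "oz_phys(_2015)?" input.nml
# Expected output when the NRL 2015 scheme described above is in use:
#   oz_phys      = .false.
#   oz_phys_2015 = .true.
```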

mkavulich commented 4 months ago

@MichaelLueken I have run into a problem on Derecho that I'm unable to solve; it occurred in both my preliminary branch and your current branch. It has to do with the installation of the SRW conda packages:

error    libmamba Bad conversion of Python version '3.10.12': filesystem error: temp_directory_path: No such file or directory
./Miniforge3-Linux-x86_64.sh: line 339: 109438 Segmentation fault      (core dumped) CONDA_SAFETY_CHECKS=disabled CONDA_EXTRA_SAFETY_CHECKS=no CONDA_CHANNELS="conda-forge" CONDA_PKGS_DIRS="$PREFIX/pkgs" "$CONDA_EXEC" install --offline --file "$PREFIX/pkgs/env.txt" -yp "$PREFIX"
./devbuild.sh: line 228: conda/etc/profile.d/conda.sh: No such file or directory
./devbuild.sh: line 233: conda: command not found
./devbuild.sh: line 234: conda: command not found
./devbuild.sh: line 235: mamba: command not found
./devbuild.sh: line 237: conda: command not found
./devbuild.sh: line 238: mamba: command not found

I assume we'll need to enlist the help of the unified workflow team on this one, unless it is related to the updated spack-stack build. Regardless, I'll wait to see if you can replicate the problem, to make sure it's not just an issue with my environment.

MichaelLueken commented 4 months ago

@mkavulich I have just cloned a fresh copy of the feature/hash_update branch on Derecho and I was able to successfully build the App using ./devbuild.sh -p=derecho:

[100%] Built target ufs-weather-model
Install the project...
-- Install configuration: "RELEASE"
-- Installing: /glade/derecho/scratch/mlueken/ufs-srweather-app/derecho/exec/ufs_srweather_app.settings
mlueken@derecho6:/glade/derecho/scratch/mlueken/ufs-srweather-app/derecho>

mkavulich commented 4 months ago

@MichaelLueken Did you check that the conda package installed correctly as well? The code actually builds successfully for me; it's the conda package that fails to install.

MichaelLueken commented 4 months ago

@mkavulich Yes, the conda package was installed correctly. Both the srw_app and srw_graphics conda environments were also created. The fundamental tests were run and all passed successfully.

My working copy on Derecho can be found at /glade/derecho/scratch/mlueken/ufs-srweather-app/derecho. The fundamental test results can be found in /glade/derecho/scratch/mlueken/ufs-srweather-app/expt_dirs.

mkavulich commented 4 months ago

Thanks for confirming, and sorry for cluttering up the PR with my own issues. I did confirm that it works on Hera, so I'll continue my testing there while I try to figure out this Derecho issue.

Edit: for future reference, this conda error was caused by my environment containing a TMPDIR environment variable that pointed to a non-existent directory. This is the issue that helped me solve it: https://github.com/conda-forge/miniforge/issues/474
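
For anyone who hits the same segfault, a minimal sketch of the workaround, assuming the cause is an invalid TMPDIR as it was here, is to make sure TMPDIR points at a real directory before rerunning the build:

```bash
# Check whether TMPDIR is set and actually exists.
echo "TMPDIR=${TMPDIR:-<unset>}"
if [ -n "${TMPDIR:-}" ] && [ ! -d "${TMPDIR}" ]; then
    echo "TMPDIR points to a non-existent directory"
fi

# Either create it or point it at a valid location, then rerun the build.
mkdir -p "${TMPDIR:-/tmp}"        # or: export TMPDIR=/tmp
./devbuild.sh -p=derecho
```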

chan-hoo commented 3 months ago

@MichaelLueken, I was able to build the app on Derecho successfully. However, after yesterday's PM (planned maintenance), it fails on Hera with the following message:

Error running link command: Segmentation fault
make[5]: *** [FV3/ccpp/framework/src/CMakeFiles/ccpp_framework.dir/build.make:97: FV3/ccpp/framework/src/libccpp_framework.a] Error 1
make[5]: *** Deleting file 'FV3/ccpp/framework/src/libccpp_framework.a'
make[4]: *** [CMakeFiles/Makefile2:469: FV3/ccpp/framework/src/CMakeFiles/ccpp_framework.dir/all] Error 2
make[4]: *** Waiting for unfinished jobs....

This may be a system issue. Do you have any idea what is happening there?

MichaelLueken commented 3 months ago

Thanks for the review, @chan-hoo! At the moment, I suspect that the issue is due to the Rocky8 transition on Hera. Once PR #1054 is merged, I will update my feature/hash_update branch to the HEAD of develop. Hopefully that is all that will be required.
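
For reference, bringing the feature branch up to the HEAD of develop is the usual fork workflow; a hedged sketch (remote names assumed, not taken from this PR) is:

```bash
# Sketch only: update feature/hash_update with the latest develop.
# "upstream" is assumed to point at ufs-community/ufs-srweather-app.
git checkout feature/hash_update
git fetch upstream
git merge upstream/develop        # or: git rebase upstream/develop
git push origin feature/hash_update
```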

MichaelLueken commented 3 months ago

@chan-hoo -

I have updated my branch to the HEAD of develop. The SRW is successfully building once again on the default Rocky front ends:

[100%] Completed 'ufs-weather-model'
[100%] Built target ufs-weather-model
Install the project...
-- Install configuration: "RELEASE"
-- Installing: /scratch2/NAGAPE/epic/Michael.Lueken/ufs-srweather-app/hera/exec/ufs_srweather_app.settings

Please let me know if you continue to encounter issues while compiling or running on Hera.

chan-hoo commented 3 months ago

@MichaelLueken, it works well now. :) Thanks!

MichaelLueken commented 3 months ago

The update made to aqm_environment.yml was enough to allow the AQM WE2E test to successfully run on Hera Rocky8:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
aqm_grid_AQM_NA13km_suite_GFS_v16_20240320175754                   COMPLETE            4890.58
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            4890.58

MichaelLueken commented 3 months ago

The Jenkins runner on Hera appears to have connected to a CentOS front end when maintenance concluded, and the Jenkins tests on Hera failed to compile the SRW App. I reached out to the Platform Team via PSD-85 to request that the runner be connected to a Rocky8 front end.

Additionally, on Hera and Jet, the Functional WorkflowTaskTests are failing in run_fcst with the following error message:

FATAL from PE 0: mpp_domains_define.inc: At least one pe in pelist is not used by any tile in the mosaic

Further investigation is necessary to see why the tests are running fine on Derecho, Hercules, and Orion, but not on Hera and Jet (why is PBSPro okay, but not Slurm?). Additionally, why is there an issue when the tasks are launched through the wrapper scripts, but not when they are run as part of the workflow?

The WE2E coverage tests were manually launched on Hera Intel and successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Peru_12km_20240321180137                            COMPLETE              31.36
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200_2024032  COMPLETE               6.84
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE            1515.46
get_from_HPSS_ics_HRRR_lbcs_RAP_20240321180143                     COMPLETE              14.76
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               7.43
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              14.16
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240321180147  COMPLETE              10.66
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240  COMPLETE               7.74
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202403  COMPLETE             447.16
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240321  COMPLETE             587.99
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202403211  COMPLETE            1024.58
pregen_grid_orog_sfc_climo_20240321180154                          COMPLETE               7.67
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            3675.81

MichaelLueken commented 3 months ago

As part of this PR, I removed the use of the PET list and added back in the original atmos_nthreads capability. This works fine when running the workflow and when using the wrapper scripts (Functional WorkflowTaskTests - wrapper_srw_ftest.sh) on systems that use PBSPro, but the wrapper scripts fail on systems that use Slurm due to bad PET list bounds. Adding back in the PET list allows the wrapper scripts to pass on Slurm systems, but causes the workflow to fail in run_fcst.

I'm now doing a deep dive to see if the PET list aspect of the weather model has been updated without a corresponding update to the documentation.
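
For context while that deep dive is underway, the PET list referred to here lives in ufs.configure. The fragment below is illustrative only; the key name (ATM_petlist_bounds) is assumed from the stock UFS configure templates and may differ in the updated weather model:

```bash
# Illustrative only: the ESMF PET list bounds are set in ufs.configure.
# Key name assumed from the stock UFS configure templates.
grep -i "petlist_bounds" ufs.configure
# e.g.  ATM_petlist_bounds:             0 11
```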

MichaelLueken commented 3 months ago

I had been updating the pe_member01_m1 entry in ufs.configure, rather than updating the PE_MEMBER01 entry in config_defaults.yaml. This led to the correct output in the ufs.configure file, but an incorrect value for PE_MEMBER01, which caused the failure. Applying the necessary update to PE_MEMBER01 now allows the majority of the fundamental tests to run properly.
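
To make the failure mode concrete, here is a hypothetical sketch of the arithmetic involved; it is not the actual config_defaults.yaml expression, and the write-component variable names (WRTCMP_write_groups, WRTCMP_write_tasks_per_group) are assumptions for illustration:

```bash
# Hypothetical illustration only (not the actual config_defaults.yaml logic):
# the PE count handed to the forecast must cover the FV3 compute layout plus
# any write-component tasks, otherwise FMS aborts with the
# "At least one pe in pelist is not used by any tile in the mosaic" error.
LAYOUT_X=5; LAYOUT_Y=2
WRTCMP_write_groups=1; WRTCMP_write_tasks_per_group=2
PE_MEMBER01=$(( LAYOUT_X * LAYOUT_Y + WRTCMP_write_groups * WRTCMP_write_tasks_per_group ))
echo "PE_MEMBER01=${PE_MEMBER01}"   # 12 for this example layout
```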

Unfortunately, the grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR WE2E test is still failing. The failure message is:

FATAL from PE 6: mpp_domains_define.inc: At least one pe in pelist is not used by any tile in the mosaic

It appears as though the modification to DT_ATMOS, LAYOUT_X, LAYOUT_Y, or BLOCKSIZE is having an adverse effect on the test. I will investigate further.

MichaelLueken commented 3 months ago

The Jenkins tests passed successfully. After addressing conflicts, the AQM WE2E test was run one last time and also passed successfully:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
aqm_grid_AQM_NA13km_suite_GFS_v16_20240327143007                   COMPLETE            4865.32
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            4865.32

Merging this PR now.