ufs-community / ufs-srweather-app

UFS Short-Range Weather Application

CRTM issues encountered on new platforms (Derecho, Hercules, and Gaea C5) #916

Open MichaelLueken opened 1 year ago

MichaelLueken commented 1 year ago

Expected behavior

Updating the UFS-WM hash to the version associated with PR #1823 causes the SRW App to either not run or fail on Derecho, Hercules, and Gaea C5. This hash updated the UPP to 520cc23, which requires changing the post configuration file from postxconfig-NT-fv3lam.txt to postxconfig-NT-fv3lam_rrfs.txt (postxconfig-NT-fv3lam.txt was removed from the UPP repository). The new postxconfig-NT-fv3lam_rrfs.txt file includes simulated radiances, which means that the CRTM needs to be run and the CRTM coefficients need to be made available.

Changes made to use the updated hashes must run successfully on all of the new platforms (Derecho, Hercules, Gaea C5).

Current behavior

On Hercules and Gaea C5, while the path that would normally contain the CRTM coefficients is present, no fix files are available:

FileNotFoundError: 
USE_CRTM has been set, but the external CRTM fix file directory:
CRTM_DIR = /work/noaa/epic/role-epic/contrib/hercules/hpc-stack/intel-2022.2.1/intel-oneapi-compilers-2022.2.1/intel-oneapi-mpi-2021.7.1/crtm/2.4.0/fix
could not be found.

FileNotFoundError: 
USE_CRTM has been set, but the external CRTM fix file directory:
CRTM_DIR = /lustre/f2/dev/role.epic/contrib/C5/hpc-stack/intel-classic-2023.1.0/intel-classic-2023.1.0/cray-mpich-8.1.25/crtm/2.4.0/fix
could not be found.
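
The kind of pre-flight check that produces these messages can be sketched as follows. This is a minimal illustration, not the SRW App's own code: the helper name `check_crtm_fix_dir` and the wording of the empty-directory message are assumptions, and only the "could not be found" text is taken from the logs above.

```python
from pathlib import Path


def check_crtm_fix_dir(crtm_dir):
    """Return the files under crtm_dir, or raise if the directory is unusable.

    Hypothetical helper for illustration; not part of the SRW App.
    """
    path = Path(crtm_dir)
    if not path.is_dir():
        # Message text mirrors the error reported in this issue.
        raise FileNotFoundError(
            "USE_CRTM has been set, but the external CRTM fix file directory:\n"
            f"CRTM_DIR = {path}\n"
            "could not be found."
        )
    # A directory that exists but holds no files is just as unusable.
    files = sorted(p for p in path.rglob("*") if p.is_file())
    if not files:
        # Assumed wording; the real workflow may report this differently.
        raise FileNotFoundError(
            f"CRTM_DIR = {path} exists but contains no fix files."
        )
    return files
```

Running such a check against the CRTM_DIR value before launching an experiment separates a missing directory from one that exists but was never populated with coefficients.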

On Derecho, both inline and offline post are failing in the CRTM with the following error message:

forrtl: severe (122): invalid attempt to assign into a pointer that is not associated
Image              PC                Routine            Line        Source
ufs_model          00000000042221E5  crtm_forward_modu         356  CRTM_Forward_Module.f90
libiomp5.so        000014F8970FB053  __kmp_invoke_micr     Unknown  Unknown
libiomp5.so        000014F897069A64  __kmp_fork_call       Unknown  Unknown
libiomp5.so        000014F897023223  __kmpc_fork_call      Unknown  Unknown
ufs_model          0000000004221C83  crtm_forward_modu         353  CRTM_Forward_Module.f90
ufs_model          0000000003E8123B  calrad_wcloud_           1725  CALRAD_WCLOUD_newcrtm.f

Machines affected

Derecho, Hercules, and Gaea C5

Steps To Reproduce

  1. Clone my branch, feature/upp_2d_decomp: git clone -b feature/upp_2d_decomp git@github.com:MichaelLueken/ufs-srweather-app.git
  2. No changes are necessary to run on Derecho - compile and move on to step 5 below. Please follow steps 3 and 4 for Hercules and Gaea C5.
  3. Merge Natalie's develop_hercules and develop_gaea_c5 branches into my branch
  4. Add the paths for CRTM_DIR to the ush/machine/hercules|gaea_c5.yaml files:
    • Hercules - /work/noaa/epic/role-epic/contrib/hercules/hpc-stack/intel-2022.2.1/intel-oneapi-compilers-2022.2.1/intel-oneapi-mpi-2021.7.1/crtm/2.4.0/fix
    • Gaea C5 - /lustre/f2/dev/role.epic/contrib/C5/hpc-stack/intel-classic-2023.1.0/intel-classic-2023.1.0/cray-mpich-8.1.25/crtm/2.4.0/fix
  5. Run the fundamental WE2E test suite: ./run_WE2E_tests.py -t fundamental -m derecho|hercules|gaea_c5 -a NRAL0032|epic
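
The `derecho|hercules|gaea_c5` and `NRAL0032|epic` shorthand in step 5 expands to one command per machine. The sketch below spells that out. The machine-to-account pairing (NRAL0032 for Derecho, epic for Hercules and Gaea C5) is an assumption inferred from the shorthand, and `we2e_command` is a hypothetical helper. Only the `-t`, `-m`, and `-a` flags come from the step itself.

```python
# Assumed pairing of machines to accounts, inferred from "NRAL0032|epic".
ACCOUNTS = {
    "derecho": "NRAL0032",
    "hercules": "epic",
    "gaea_c5": "epic",
}


def we2e_command(machine, suite="fundamental"):
    """Build the argument list for one run_WE2E_tests.py invocation."""
    if machine not in ACCOUNTS:
        raise ValueError(f"unknown machine: {machine}")
    return [
        "./run_WE2E_tests.py",
        "-t", suite,
        "-m", machine,
        "-a", ACCOUNTS[machine],
    ]


if __name__ == "__main__":
    # Print the fully expanded command for each platform in this issue.
    for name in ACCOUNTS:
        print(" ".join(we2e_command(name)))
```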

Detailed Description of Fix (optional)

mkavulich commented 1 year ago

I assume that the problems you are seeing on Derecho are a bug in UPP; it's probably worth opening an issue in that repository as well.

MichaelLueken commented 1 year ago

Thanks, @mkavulich! I have opened issue #789 in the UPP repository.

natalie-perlin commented 1 year ago

@MichaelLueken - I'll take care of verifying fix coefficients are installed in current locations on these platforms.

MichaelLueken commented 1 year ago

Thanks, @natalie-perlin!

MichaelLueken commented 1 year ago

@natalie-perlin - Out of curiosity, while running the GSI regression tests on Cheyenne, did you encounter any issues with the CRTM?

The fact that the post is failing in the CRTM's forward model only on Derecho suggests that something extra might need to be done when compiling the CRTM on that machine. If you had to make changes to the CRTM build on Cheyenne to allow the GSI regression tests to run there, then the same changes will likely be needed for the post to work on Derecho.

natalie-perlin commented 1 year ago

@MichaelLueken - CRTM fix files are now in the correct locations on Hercules, Gaea C5, and Derecho. What is the best way to test that the issues are resolved?

MichaelLueken commented 1 year ago

@natalie-perlin - I'll attempt running the fundamental tests on Hercules, Gaea C5, and Derecho using my fork's feature/upp_2d_decomp branch. If you would like to try testing as well, you should be able to clone my branch that contains all the necessary changes for Hercules, Gaea C5, and Derecho by using the following command:

git clone -b feature/upp_2d_decomp git@github.com:MichaelLueken/ufs-srweather-app.git

natalie-perlin commented 1 year ago

@MichaelLueken - testing for Derecho now

MichaelLueken commented 1 year ago

@natalie-perlin - I can confirm that the SRW App successfully runs using the newly added CRTM coefficients on Hercules:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              19.09
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              17.79
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE              21.28
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              26.60
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR          COMPLETE              46.05
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0              COMPLETE              27.23
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              44.40
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             202.44

and on Gaea C5:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              20.28
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              29.75
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE              24.86
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              30.16
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR          COMPLETE              41.27
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0              COMPLETE              34.28
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              52.63
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             233.23
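
The reported totals can be cross-checked against the per-experiment rows with a few lines of Python. This is only a sketch: `parse_summary` is a hypothetical helper that assumes each data row ends with a status and a core-hours value, and the sample text is the Hercules summary copied from above.

```python
HERCULES_SUMMARY = """\
Experiment name                                                  | Status    | Core hours used
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              19.09
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              17.79
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE              21.28
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              26.60
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR          COMPLETE              46.05
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0              COMPLETE              27.23
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              44.40
Total                                                              COMPLETE             202.44
"""


def parse_summary(text):
    """Return ({experiment: core hours}, reported total) from a summary table."""
    hours = {}
    total = None
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # blank lines and horizontal rules
        try:
            value = float(parts[-1])
        except ValueError:
            continue  # header row: last token is not a number
        name = " ".join(parts[:-2])  # everything before "<status> <hours>"
        if name == "Total":
            total = value
        else:
            hours[name] = value
    return hours, total


if __name__ == "__main__":
    hours, total = parse_summary(HERCULES_SUMMARY)
    # True when the rows add up to the reported total (within rounding).
    print(abs(sum(hours.values()) - total) < 0.005)
```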

I'm finding that Derecho is still failing in the post with the following error message in the CRTM_Forward_Module:

forrtl: severe (122): invalid attempt to assign into a pointer that is not associated

natalie-perlin commented 1 year ago

@MichaelLueken - yes, the experiments are still failing on Derecho. I've done the following in an attempt to correct the issue:

MichaelLueken commented 1 year ago

@natalie-perlin - With the error message in the logs, I don't think that the issue is with the coefficient files on Derecho. Looking at line 356 in CRTM_Forward_Module.f90, I see the following: Opt = Default_Options

In the function, both Default_Options and Opt are declared as derived-type variables, not pointers: TYPE(CRTM_Options_type) :: Default_Options, Opt

It is unclear to me why the CRTM is failing due to an invalid attempt to assign into a pointer that is not associated, since neither variable is a pointer.

At this point, I would recommend attempting to rebuild the CRTM library. I will also reach out to Ben Johnson, who I believe is still the code manager for the CRTM at JCSDA, to see if he has encountered this type of issue before. I will CC you as well so that you are kept in the loop.

MichaelLueken commented 1 year ago

@natalie-perlin - It looks like Ben's suggestion would be to go into CRTM_Forward_Module.f90 and make the following changes:

Unfortunately, I don't have access to the epicufsrt account on Cheyenne/Derecho. If you would like, please let me know once the modifications have been made, then I will look over the changes before you rebuild the CRTM.

natalie-perlin commented 1 year ago

@MichaelLueken - this suggested fix did not seem to make a difference. Similarly, crtm/2.4.0 downloaded from JCSDA/crtm and built with or without the "Opt" fix - all resulted in the same error. In summary, things tested:

MichaelLueken commented 1 year ago

@natalie-perlin - Hopefully Ben can think of something else to try. It didn't look like the issue was with Post_Process_RTSolution, since the failure appears to happen before any calls to that subroutine. As nicely documented and cleanly coded as the CRTM is, it is prone to compiler issues.

MichaelLueken commented 1 year ago

To circumvent this issue and to allow the ufs-weather-model and UPP hashes to be brought up to date, the postxconfig-NT-fv3lam.txt file found in ufs-weather-model/tests/parm will be used in lieu of postxconfig-NT-fv3lam_rrfs.txt. Once the SRW App transitions to spack-stack on Derecho, hopefully the CRTM issue will be corrected and we can move forward with using the postxconfig-NT-fv3lam_rrfs.txt file from UPP.