Open MichaelLueken opened 1 year ago
I assume that the problems you are seeing on Derecho are a bug in UPP, so it's probably worth opening an issue in that repository as well.
Thanks, @mkavulich! I have opened issue #789 in the UPP repository.
@MichaelLueken - I'll take care of verifying that the fix coefficients are installed in the correct locations on these platforms.
Thanks, @natalie-perlin!
@natalie-perlin - Out of curiosity, while running the GSI regression tests on Cheyenne, did you encounter any issues with the CRTM?
The fact that the post is failing in the CRTM's forward model only on Derecho suggests that something extra might need to be done while compiling the CRTM on that machine. If you had to make changes to the CRTM build on Cheyenne to allow the GSI regression tests to run there, then the same changes will likely be needed to allow the post to work on Derecho.
@MichaelLueken - CRTM fix files are now in the correct location on Hercules, Gaea C5, and Derecho. What is the best way to test that the issues are resolved?
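A quick check that can be done before launching a full test suite is to confirm that each fix directory exists and actually contains files. The following is a minimal sketch using only the Python standard library; the helper name `fix_dir_status` is hypothetical and not part of the SRW App, and the path passed in would be whichever `crtm/2.4.0/fix` directory applies on the machine being checked:

```python
from pathlib import Path


def fix_dir_status(fix_dir):
    """Classify a CRTM fix directory as 'missing', 'empty', or 'populated'.

    Hypothetical helper for illustration; not part of the SRW App.
    """
    path = Path(fix_dir)
    if not path.is_dir():
        return "missing"
    # Look for at least one regular file anywhere beneath the directory.
    has_files = any(p.is_file() for p in path.rglob("*"))
    return "populated" if has_files else "empty"
```

A "populated" result only shows that files are present. It does not show that the CRTM can read them, so running the fundamental tests is still the real verification.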
@natalie-perlin - I'll attempt running the fundamental tests on Hercules, Gaea C5, and Derecho using my fork's feature/upp_2d_decomp branch. If you would like to try testing as well, you should be able to clone my branch, which contains all the necessary changes for Hercules, Gaea C5, and Derecho, by using the following command:
git clone -b feature/upp_2d_decomp git@github.com:MichaelLueken/ufs-srweather-app.git
@MichaelLueken - testing for Derecho now
@natalie-perlin - I can confirm that the SRW App successfully runs using the newly added CRTM coefficients on Hercules:
----------------------------------------------------------------------------------------------------
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta COMPLETE 19.09
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_ COMPLETE 17.79
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 COMPLETE 21.28
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot COMPLETE 26.60
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR COMPLETE 46.05
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0 COMPLETE 27.23
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 COMPLETE 44.40
----------------------------------------------------------------------------------------------------
Total COMPLETE 202.44
and on Gaea C5:
----------------------------------------------------------------------------------------------------
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta COMPLETE 20.28
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_ COMPLETE 29.75
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 COMPLETE 24.86
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot COMPLETE 30.16
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR COMPLETE 41.27
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0 COMPLETE 34.28
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16 COMPLETE 52.63
----------------------------------------------------------------------------------------------------
Total COMPLETE 233.23
I'm finding that Derecho is still failing in the post with the following error message in the CRTM_Forward_Module:
forrtl: severe (122): invalid attempt to assign into a pointer that is not associated
@MichaelLueken - yes, the experiments are still failing on Derecho. I've done the following in an attempt to correct the issue:
Tarred up the directory on Hercules/Orion with the CRTM fix coefficients that are used in successful runs of the fundamental tests on Hercules, /work/noaa/epic/role-epic/contrib/orion/hpc-stack/intel-2022.1.2/intel-2022.1.2/impi-2022.1.2/crtm/2.4.0/fix/, and ported/untarred them in the corresponding location for Derecho, /glade/work/epicufsrt/contrib/derecho/hpc-stack/intel-classic-2023.0.0/intel-classic-2023.0.0/cray-mpich-8.1.25/crtm/2.4.0/fix
In all cases the tests failed. Any ideas on what the next step to debug the issue could be?
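When coefficient files are copied between machines as a tarball, one way to rule out a damaged or incomplete transfer is to compare checksums of the source and destination directories. The sketch below assumes only the Python standard library; `checksum_manifest` and `manifest_diff` are hypothetical helper names, and each manifest would be generated on its own machine and then compared:

```python
import hashlib
from pathlib import Path


def checksum_manifest(root):
    """Map each file's path relative to root to its SHA-256 hex digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads the whole file into memory; acceptable for a sketch.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def manifest_diff(source, target):
    """Return (files missing from target, files whose contents differ)."""
    missing = sorted(set(source) - set(target))
    changed = sorted(k for k in source if k in target and source[k] != target[k])
    return missing, changed
```

If both lists come back empty, the fix files on the two machines are identical, which would point away from the coefficients and toward the CRTM build or the compiler.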
@natalie-perlin - With the error message in the logs, I don't think that the issue is with the coefficient files on Derecho. Looking at line 356 in CRTM_Forward_Module.f90, I see the following:
Opt = Default_Options
In the function, both Default_Options and Opt are declared as derived-type variables, not pointers:
TYPE(CRTM_Options_type) :: Default_Options, Opt
It is unclear to me why the CRTM is failing due to an invalid attempt to assign into a pointer that is not associated, since neither is a pointer.
At this point, I would recommend attempting to rebuild the CRTM library. I will also reach out to Ben Johnson, who I believe is still the code manager for the CRTM at JCSDA, to see if he has encountered this type of issue before. I will CC you as well so that you are kept in the loop.
@natalie-perlin - It looks like Ben's suggestion would be to go into CRTM_Forward_Module.f90 and make the following changes:
Opt
- CALL Post_Process_RTSolution(Opt,RTSolution(ln,m), &
Opt
- CALL Post_Process_RTSolution(Opt,RTSolution(ln,m), &
Opt
- SUBROUTINE Post_Process_RTSolution(Opt,rts, &
TYPE(CRTM_Options_Type), INTENT(IN) :: Opt
Unfortunately, I don't have access to the epicufsrt account on Cheyenne/Derecho. If you would like, please let me know once the modifications have been made, then I will look over the changes before you rebuild the CRTM.
@MichaelLueken - this suggested fix did not seem to make a difference. Similarly, crtm/2.4.0 downloaded from JCSDA/crtm and built with or without the "Opt" fix - all resulted in the same error. In summary, things tested:
@natalie-perlin - Hopefully Ben can think of something else to try. It didn't look like the issue was with Post_Process_RTSolution, since the failure appears to be happening before any calls to that subroutine. For as nicely documented and cleanly coded as the CRTM is, it is prone to compiler issues.
To circumvent this issue and to allow the ufs-weather-model and UPP hashes to be brought up to date, the postxconfig-NT-fv3lam.txt file found in ufs-weather-model/tests/parm will be used in lieu of postxconfig-NT-fv3lam_rrfs.txt. Once the SRW App transitions to spack-stack on Derecho, hopefully the CRTM issue will be corrected and we can move forward with using the postxconfig-NT-fv3lam_rrfs.txt file from UPP.
Expected behavior
Updating the UFS-WM hash to the version associated with PR #1823 is causing the SRW App to either not run or fail on Derecho, Hercules, and Gaea C5. This hash updated the UPP to 520cc23, which requires changing the postxconfig-NT-fv3lam.txt post configuration file to postxconfig-NT-fv3lam_rrfs.txt (postxconfig-NT-fv3lam.txt was removed from the UPP repository). The new postxconfig-NT-fv3lam_rrfs.txt file includes simulated radiances, which means that the CRTM needs to be run and CRTM coefficients need to be made available.
Changes which were made to use the updated hashes must successfully run on all of the new platforms (Derecho, Hercules, Gaea C5).
Current behavior
On Hercules and Gaea C5, while the path that would normally contain the CRTM coefficients is present, there are no fix files available:
On Derecho, both inline and offline post are failing in the CRTM with the following error message:
Machines affected
Derecho, Hercules, and Gaea C5
Steps To Reproduce
feature/upp_2d_decomp, git clone -b feature/upp_2d_decomp git@github.com:MichaelLueken/ufs-srweather-app.git
develop_hercules and develop_gaea_c5 branches into my branch
ush/machine/hercules|gaea_c5.yaml files:
/work/noaa/epic/role-epic/contrib/hercules/hpc-stack/intel-2022.2.1/intel-oneapi-compilers-2022.2.1/intel-oneapi-mpi-2021.7.1/crtm/2.4.0/fix
/lustre/f2/dev/role.epic/contrib/C5/hpc-stack/intel-classic-2023.1.0/intel-classic-2023.1.0/cray-mpich-8.1.25/crtm/2.4.0/fix
./run_WE2E_tests.py -t fundamental -m derecho|hercules|gaea_c5 -a NRAL0032|epic
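The command above uses `|` as shorthand for "pick one" rather than as literal syntax. As a rough sketch, the per-machine invocations can be spelled out as follows. The pairing of machine to account (NRAL0032 for Derecho, epic for Hercules and Gaea C5) is an assumption inferred from the order of the shorthand, not something stated in this issue, and `we2e_command` is a hypothetical helper name:

```python
# Assumed pairing of machine to account, inferred from the shorthand
# "-m derecho|hercules|gaea_c5 -a NRAL0032|epic"; adjust if it differs.
ACCOUNTS = {"derecho": "NRAL0032", "hercules": "epic", "gaea_c5": "epic"}


def we2e_command(machine):
    """Return the argument list for one fundamental-suite run."""
    return ["./run_WE2E_tests.py", "-t", "fundamental",
            "-m", machine, "-a", ACCOUNTS[machine]]


# Print one fully expanded command line per machine.
for machine in ACCOUNTS:
    print(" ".join(we2e_command(machine)))
```

Each printed line is a complete command to run on the matching platform.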
Detailed Description of Fix (optional)