hafs-community / HAFS

Hurricane Analysis and Forecast System

Re-enable HAFS develop branch support for Orion after Rocky 9 OS upgrade #279

Closed BijuThomas-NOAA closed 2 months ago

BijuThomas-NOAA commented 2 months ago

Description of changes

Restore HAFS support for Orion after its Rocky 9 OS upgrade.

Issues addressed (optional)

If this PR addresses one or more issues, please provide link(s) to the issue(s) here.

Tests conducted

Technical testing has been done on Orion.

Application-level regression test status

Running the HAFS application-level regression tests is currently performed by code reviewers after the developer creates the initial PR. As regression tests are conducted, the testers should use the checklist below to indicate successful regression tests. You may add other tests as needed. If a test fails, do not check the box. Instead, describe the failure in the PR comments, noting the platform where the test failed.

BinLiu-NOAA commented 2 months ago

@LinZhu-NOAA, @mrinalbiswas, @nriveratorres-NOAA, @ChuankaiWang-NOAA, @JunghoonShin-NOAA, could you please help to conduct the HAFS application regression tests for the following platforms, respectively?

Thanks!

P.S., git clone -b feature/hafs_orion_rocky9 --recursive https://github.com/BijuThomas-NOAA/HAFS.git ./

JunghoonShin-NOAA commented 2 months ago

Regression tests for the feature/hafs_orion_rocky9 branch completed successfully on Jet. Thank you.

ChuankaiWang-NOAA commented 2 months ago

RT passed on Hercules.

nriveratorres-NOAA commented 2 months ago

The analysis task failed for the following regression tests on Orion:

hafs_orion_rocky9_20240628_rt_hfsa_dev_ww3
hafs_orion_rocky9_20240628_rt_hfsb_dev
hafs_orion_rocky9_20240628_rt_regional_static_C192s1n4_atm_3denvar

The log files are in the following directories:

/work/noaa/hwrf/scrub/natrt/hafs_orion_rocky9_20240628_rt_hfsa_dev_ww3/2020082506/13L
/work/noaa/hwrf/scrub/natrt/hafs_orion_rocky9_20240628_rt_hfsb_dev/2020082506/13L
/work/noaa/hwrf/scrub/natrt/hafs_orion_rocky9_20240628_rt_regional_static_C192s1n4_atm_3denvar/2020082512/00L

The log files contain the following error messages:

-- FATAL ERROR: /work/noaa/hwrf/noscrub/hafs-input/COMGFSv16/gfs.20200825/06/atmos//gfs.t06z.prepbufr does not exist or is empty. Exiting ...
-- FATAL ERROR: /work/noaa/hwrf/noscrub/hafs-input/COMGFSv16/gfs.20200825/12/atmos//gfs.t12z.prepbufr does not exist or is empty. Exiting ...
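A quick way to confirm this kind of failure before rerunning is to check that each expected input exists and is non-empty. The sketch below is a minimal illustration, not part of the HAFS scripts; the helper name check_inputs is hypothetical, and the paths in the usage comment are the ones reported in the error messages above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of HAFS): report whether each given
# input file exists and is non-empty. Returns 1 if any file fails.
check_inputs() {
  local missing=0 f
  for f in "$@"; do
    if [ -s "$f" ]; then
      echo "OK: $f"
    else
      echo "MISSING OR EMPTY: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage on Orion (paths taken from the error messages):
#   check_inputs \
#     /work/noaa/hwrf/noscrub/hafs-input/COMGFSv16/gfs.20200825/06/atmos/gfs.t06z.prepbufr \
#     /work/noaa/hwrf/noscrub/hafs-input/COMGFSv16/gfs.20200825/12/atmos/gfs.t12z.prepbufr
```

The -s test is true only for a file that exists with a size greater than zero, which matches the "does not exist or is empty" condition in the log.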

LinZhu-NOAA commented 2 months ago

Regression tests completed successfully on WCOSS2.

nriveratorres-NOAA commented 2 months ago

The analysis task failed for the following regression tests using Orion:

hafs_orion_rocky9_20240628_rt_hfsa_dev_ww3
hafs_orion_rocky9_20240628_rt_hfsb_dev
hafs_orion_rocky9_20240628_rt_regional_static_C192s1n4_atm_3denvar

From hafs_analysis.log file:

GENSTATS_GPS: no profiles to process (nprof_gfs= 0 ), EXIT routine
SpcCoeff_ReadFile(Binary)(FAILURE) : Error reading channel data. input statement requires too much data, unit 11, file /work/noaa/hwrf/scrub/natrt/hafs_orion_rocky9_20240628_rt_hfsa_dev_ww3/2020082512/13L/analysis_d02/./hirs4_n19.SpcCoeff.bin
CRTM_SpcCoeff_Load(FAILURE) : Error reading SpcCoeff file #1, ./hirs4_n19.SpcCoeff.bin; Process ID: 269
CRTM_Init(FAILURE) : Error loading SpcCoeff data; Process ID: 269
crtm_interface*init_crtm: ERROR crtm_init error_status= 3 TERMINATE PROGRAM EXECUTION
Abort(71) on node 269 (rank 269 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 71) - process 269
SpcCoeff_ReadFile(Binary)(FAILURE) : Error reading channel data. input statement requires too much data, unit 11, file /work/noaa/hwrf/scrub/natrt/hafs_orion_rocky9_20240628_rt_hfsa_dev_ww3/2020082512/13L/analysis_d02/./hirs4_n19.SpcCoeff.bin
CRTM_SpcCoeff_Load(FAILURE) : Error reading SpcCoeff file #1, ./hirs4_n19.SpcCoeff.bin; Process ID: 321
CRTM_Init(FAILURE) : Error loading SpcCoeff data; Process ID: 321
SpcCoeff_ReadFile(Binary)(FAILURE) : Error reading channel data. input statement requires too much data, unit 11, file /work/noaa/hwrf/scrub/natrt/hafs_orion_rocky9_20240628_rt_hfsa_dev_ww3/2020082512/13L/analysis_d02/./hirs4_n19.SpcCoeff.bin
CRTM_SpcCoeff_Load(FAILURE) : Error reading SpcCoeff file #1, ./hirs4_n19.SpcCoeff.bin; Process ID: 323
CRTM_Init(FAILURE) : Error loading SpcCoeff data; Process ID: 323
crtm_interface*init_crtm: ERROR crtm_init error_status= 3 TERMINATE PROGRAM EXECUTION
Abort(71) on node 321 (rank 321 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 71) - process 321
crtm_interface*init_crtm: ERROR crtm_init error_status= 3 TERMINATE PROGRAM EXECUTION
Abort(71) on node 323 (rank 323 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 71) - process 323
slurmstepd: error: STEP 18308553.0 ON orion-23-17 CANCELLED AT 2024-07-01T16:20:20
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
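To see at a glance which routines reported failures in a log like this, the (FAILURE) tags can be counted. This is a minimal sketch, assuming a bash shell and a grep that supports -o and -E; summarize_failures is a hypothetical helper name, not a HAFS or GSI utility.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count how often each routine reports "(FAILURE)"
# in a log file. Prints one "<count> <routine tag>" line per routine.
summarize_failures() {
  # Match e.g. "CRTM_Init(FAILURE)" or "SpcCoeff_ReadFile(Binary)(FAILURE)".
  grep -oE '[A-Za-z_]+(\([A-Za-z]+\))?\(FAILURE\)' "$1" \
    | sort | uniq -c | awk '{print $1, $2}'
}

# Example usage:
#   summarize_failures hafs_analysis.log
```

Because grep -o emits every match separately, the count is correct even when several messages share one physical line of the log.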

BinLiu-NOAA commented 2 months ago

@BijuThomas-NOAA Given the HAFS RT errors that @nriveratorres-NOAA reported on Orion, it looks like we cannot switch to crtm/2.4.0.1 for the analysis/GSI job yet. How about changing it back to crtm/2.4.0 on Orion for now? That would be local crtm_ver=os.getenv("crtm_ver") or "2.4.0", as in: https://github.com/hafs-community/GSI/blob/bbdb610d71e0a66cdc8cabc7bea47b66ee070331/modulefiles/gsi_hercules.lua
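For reference, the Lua expression quoted above falls back to "2.4.0" only when crtm_ver is unset: os.getenv returns nil for an undefined variable, and an empty string is still truthy in Lua. A shell sketch of the same fallback, for illustration only (resolve_crtm_ver is a hypothetical name, not something in the GSI modulefiles):

```shell
#!/usr/bin/env bash
# Hypothetical illustration of the modulefile fallback in shell terms.
# "${var-default}" (no colon) substitutes the default only when the
# variable is unset, which matches Lua's `os.getenv(...) or "2.4.0"`.
resolve_crtm_ver() {
  echo "${crtm_ver-2.4.0}"
}

# Example usage:
#   unset crtm_ver;      resolve_crtm_ver   # falls back to the default
#   crtm_ver=2.4.0.1;    resolve_crtm_ver   # uses the value that was set
```

With this pattern, the default pins the older CRTM while still letting a tester override the version without editing the file.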

BijuThomas-NOAA commented 2 months ago

@nriveratorres-NOAA and @BinLiu-NOAA Please update with the following files and repeat the failed RTs

/work/noaa/hwrf/save/bthomas/hafs_dev_bt/parm/analysis/gsi/gsiparm.anl.tmp
/work/noaa/hwrf/save/bthomas/hafs_dev_bt/scripts/exhafs_analysis.sh

BijuThomas-NOAA commented 2 months ago

@nriveratorres-NOAA After discussing with @BinLiu-NOAA, we can just revert the crtm to 2.4.0 for this PR to fix the Analysis job failures. Here are the steps:

cd /work/noaa/hwrf/save/natrt/hafs_orion_rocky9_20240628
git pull origin feature/hafs_orion_rocky9
git submodule update --init --recursive
cd sorc
./build_gsi.sh  
./install_all.sh

Then cd /work/noaa/hwrf/save/natrt/hafs_orion_rocky9_20240628/rocoto, rewind the failed Analysis jobs and resubmit them. Please let me know if you have any issues. Thanks,
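The rewind-and-resubmit step can be sketched with Rocoto's rocotorewind and rocotorun commands. This is a dry-run illustration that only prints the commands. The workflow XML and database file names, the cycle (assuming Rocoto's YYYYMMDDHHMM cycle format), and the task name are placeholders that must be taken from the actual rocoto directory, and rewind_and_resubmit is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print the Rocoto commands that would
# rewind one task for one cycle and then resubmit the workflow.
# Arguments: <workflow.xml> <workflow.db> <cycle YYYYMMDDHHMM> <task>
rewind_and_resubmit() {
  local xml=$1 db=$2 cycle=$3 task=$4
  echo "rocotorewind -w $xml -d $db -c $cycle -t $task"
  echo "rocotorun -w $xml -d $db"
}

# Example usage (all four arguments are placeholders):
#   rewind_and_resubmit workflow.xml workflow.db 202008250600 analysis
```

Printing the commands first makes it easy to check the cycle and task names before anything in the workflow database is changed.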

nriveratorres-NOAA commented 2 months ago

Regression tests completed successfully on Orion

BinLiu-NOAA commented 2 months ago

Thanks a lot, @JunghoonShin-NOAA, @ChuankaiWang-NOAA, @LinZhu-NOAA, and @nriveratorres-NOAA! Since Hera is under maintenance today, we will skip the RT on Hera. With that, this PR is ready for merge now.

mrinalbiswas commented 2 months ago

@BinLiu-NOAA RTs passed on Hera.