hafs-community / HAFS

Hurricane Analysis and Forecast System

Current Quickstart guide instructions do not work on Jet #297

Open mkavulich opened 1 month ago

Description

Running the instructions in the Quick Start chapter of the Users Guide fails to run to completion on Jet for the develop branch. The failure occurs in the forecast step: the PET* files indicate there's some problem with the PETlist. There are also a lot of broken links in the forecast directory; not sure if that's related or a separate problem.
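To tell whether the broken links are related to the failure, it may help to list them first. The following is a minimal sketch, not part of HAFS: `find_broken_symlinks` is a hypothetical helper that walks a directory and reports symlinks whose targets do not exist.

```python
import os


def find_broken_symlinks(root):
    """Return a sorted list of symlinks under root whose targets are missing."""
    broken = []
    # os.walk does not follow symlinked directories by default
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it to the target
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)
```

Calling `find_broken_symlinks(forecast_dir)` on the forecast directory and printing `os.readlink(path)` for each result shows where each dangling link was meant to point.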

In addition, the instructions under XML File to Run the Workflow instruct you to open the wrong file (vi system.conf instead of vi hafs_workflow.xml.in, presumably a copy-paste error from the previous section).

To Reproduce:

  1. Follow the build and run instructions in the Quick Start guide from the develop branch
  2. Observe that the forecast step fails

Additional context (optional)

Running the first regression test from ./cronjob_hafs_rt.sh (suggested by @mrinalbiswas) succeeds with the same environment and settings, so it is not a problem with the environment.

Output (optional)

The job seems to fail almost immediately after the executable starts, apparently due to a problem with the PETlist.

Output logs

In /mnt/lfs5/HFIP/dtc-hurr/Michael.Kavulich/HAFS/test_instructions/hafstmp/HAFS/2020082512/13L/forecast/PET0000.ESMF_LogFile:

20241025 144712.109 ERROR            PET0000 ESMF_Comp.F90:758 ESMF_CompConstruct Value unrecognized or out of range  - Conflict between petlist and global pet count
20241025 144712.110 ERROR            PET0000 ESMF_GridComp.F90:568 ESMF_GridCompCreate Value unrecognized or out of range  - Internal subroutine call returned Error
20241025 144712.110 ERROR            PET0000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:4627 Value unrecognized or out of range  - Passing error in return code
20241025 144712.110 ERROR            PET0000 UFSDriver.F90:392 Value unrecognized or out of range  - Passing error in return code
20241025 144712.110 ERROR            PET0000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:794 Value unrecognized or out of range  - Passing error in return code
20241025 144712.110 ERROR            PET0000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:483 Value unrecognized or out of range  - Passing error in return code
20241025 144712.110 ERROR            PET0000 UFS.F90:386 Value unrecognized or out of range  - Aborting UFS
20241025 144712.110 INFO             PET0000 Finalizing ESMF
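With over a thousand PET log files, pulling out only the ERROR entries can make the first failure easier to spot. Below is a rough sketch that assumes every log line follows the layout shown above (date, time, level, PET id, message). `esmf_errors` is a hypothetical helper, not an ESMF or HAFS tool.

```python
import re

# Layout assumed from the excerpt above: "YYYYMMDD HHMMSS.mmm LEVEL PETnnnn message"
LOG_LINE = re.compile(r"^(\d{8}) (\d{6}\.\d{3})\s+(\w+)\s+(PET\d+)\s+(.*)$")


def esmf_errors(lines):
    """Return one dict per ERROR line, keeping the timestamp, PET id, and message."""
    errors = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        date, time, level, pet, message = match.groups()
        if level == "ERROR":
            errors.append(
                {"date": date, "time": time, "pet": pet, "message": message}
            )
    return errors
```

Passing an open file object (for example `esmf_errors(open(path))`) works because the function only iterates over lines. Sorting the combined results from all PET files by `time` should point to the earliest error.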

In /mnt/lfs5/HFIP/dtc-hurr/Michael.Kavulich/HAFS/test_instructions/hafstmp/HAFS/2020082512/13L/hafs_forecast.log:

+ 31 + source prep_step
++ 31 + '[' -n '' ']'
++ 31 + '[' -f errfile ']'
++ 31 + export FORT01=0
++ 31 + FORT01=0
+++ 31 + env
+++ 31 + grep '^FORT[0-9]\{1,\}='
+++ 31 + awk -F= '{print $1}'
++ 31 + unset FORT01
+ 31 + tee forecast.log
+ 31 + srun --mem=0 --ntasks=1080 --ntasks-per-node=12 --cpus-per-task=2 ./hafs_forecast.x

* . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
     PROGRAM ufs       HAS BEGUN. COMPILED       0.00     ORG: np23
     STARTING DATE-TIME  OCT 25,2024  14:47:11.899  299  FRI   2460609

Abort(1) on node 729 (rank 729 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 729
Abort(1) on node 169 (rank 169 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 169
Abort(1) on node 353 (rank 353 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 353
Abort(1) on node 698 (rank 698 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 698
Abort(1) on node 647 (rank 647 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 647
Abort(1) on node 978 (rank 978 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 978
Abort(1) on node 631 (rank 631 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 631
Abort(1) on node 236 (rank 236 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 236
Abort(1) on node 817 (rank 817 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 817
Abort(1) on node 1039 (rank 1039 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 1039
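One possible reading of the ESMF message is that a component's PET list refers to a rank outside the range of tasks that srun actually launched (1080 here). The sketch below checks for that. It assumes the run configuration contains lines of the form `<COMP>_petlist_bounds: <first> <last>`, and `check_petlists` is a hypothetical helper. The bounds used in the usage note are made-up values, not taken from this run.

```python
import re

# Assumed config line format: "ATM_petlist_bounds: 0 1079"
BOUNDS = re.compile(
    r"^\s*(\w+)_petlist_bounds:\s*(-?\d+)\s+(-?\d+)", re.MULTILINE
)


def check_petlists(config_text, ntasks):
    """Return (component, first, last) for each PET range that does not fit in ntasks."""
    problems = []
    for comp, first, last in BOUNDS.findall(config_text):
        first, last = int(first), int(last)
        # Valid PET indices run from 0 to ntasks - 1 inclusive
        if first < 0 or last < first or last >= ntasks:
            problems.append((comp, first, last))
    return problems
```

For example, `check_petlists("ATM_petlist_bounds: 0 1199\n", 1080)` flags the ATM component, while bounds of `0 1079` pass. If the real configuration passes this check, the conflict likely lies elsewhere.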