Description
Following the instructions in the Quick Start chapter of the Users Guide does not run to completion on Jet with the develop branch. The failure occurs in the forecast step: the PET* log files indicate a problem with the PETlist. The forecast directory also contains many broken links; I'm not sure whether that is related or a separate problem.
In addition, the instructions under "XML File to Run the Workflow" tell you to open the wrong file (vi system.conf instead of vi hafs_workflow.xml.in, presumably a copy-paste error from the previous section).
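For anyone trying to check the broken links mentioned above, here is a minimal sketch that lists symbolic links whose targets do not exist. The function name `find_broken_links` is hypothetical, and the script is not part of HAFS; it only uses the Python standard library.

```python
import os

def find_broken_links(root):
    """Return a sorted list of symlinks under root whose targets do not exist."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Links to directories show up in dirnames; dangling links in filenames.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)
```

Calling `find_broken_links("forecast")` from the storm's work directory would give the list to compare against a successful regression-test run.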
To Reproduce:
1. Follow the build and run instructions in the Quick Start guide using the develop branch.
2. Observe that the forecast step fails.
Additional context (optional)
Running the first regression test from ./cronjob_hafs_rt.sh (suggested by @mrinalbiswas) succeeds with the same environment and settings, so it is not a problem with the environment.
Output (optional)
The job fails almost immediately after the executable starts, apparently because of a problem with the PETlist.
output logs
In /mnt/lfs5/HFIP/dtc-hurr/Michael.Kavulich/HAFS/test_instructions/hafstmp/HAFS/2020082512/13L/forecast/PET0000.ESMF_LogFile:
20241025 144712.109 ERROR PET0000 ESMF_Comp.F90:758 ESMF_CompConstruct Value unrecognized or out of range - Conflict between petlist and global pet count
20241025 144712.110 ERROR PET0000 ESMF_GridComp.F90:568 ESMF_GridCompCreate Value unrecognized or out of range - Internal subroutine call returned Error
20241025 144712.110 ERROR PET0000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:4627 Value unrecognized or out of range - Passing error in return code
20241025 144712.110 ERROR PET0000 UFSDriver.F90:392 Value unrecognized or out of range - Passing error in return code
20241025 144712.110 ERROR PET0000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:794 Value unrecognized or out of range - Passing error in return code
20241025 144712.110 ERROR PET0000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:483 Value unrecognized or out of range - Passing error in return code
20241025 144712.110 ERROR PET0000 UFS.F90:386 Value unrecognized or out of range - Aborting UFS
20241025 144712.110 INFO PET0000 Finalizing ESMF
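The first ERROR entry is usually the most informative line in a log like the one above, since the later ones only pass the return code up the call stack. A minimal sketch for pulling it out follows, assuming each line has the layout seen here (date, time, level, PET name, then the message); the helper names `parse_esmf_line` and `first_error` are hypothetical.

```python
def parse_esmf_line(line):
    """Split one log line into date, time, level, PET name, and message."""
    parts = line.split(None, 4)
    if len(parts) < 5:
        return None  # blank or otherwise unexpected line
    date, time, level, pet, message = parts
    return {"date": date, "time": time, "level": level,
            "pet": pet, "message": message.strip()}

def first_error(lines):
    """Return the first record whose level is ERROR, or None if there is none."""
    for line in lines:
        record = parse_esmf_line(line)
        if record is not None and record["level"] == "ERROR":
            return record
    return None
```

Applied to PET0000.ESMF_LogFile, this would return the ESMF_CompConstruct entry reporting the conflict between the petlist and the global PET count.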
In /mnt/lfs5/HFIP/dtc-hurr/Michael.Kavulich/HAFS/test_instructions/hafstmp/HAFS/2020082512/13L/hafs_forecast.log:
+ 31 + source prep_step
++ 31 + '[' -n '' ']'
++ 31 + '[' -f errfile ']'
++ 31 + export FORT01=0
++ 31 + FORT01=0
+++ 31 + env
+++ 31 + grep '^FORT[0-9]\{1,\}='
+++ 31 + awk -F= '{print $1}'
++ 31 + unset FORT01
+ 31 + tee forecast.log
+ 31 + srun --mem=0 --ntasks=1080 --ntasks-per-node=12 --cpus-per-task=2 ./hafs_forecast.x
* . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
PROGRAM ufs HAS BEGUN. COMPILED 0.00 ORG: np23
STARTING DATE-TIME OCT 25,2024 14:47:11.899 299 FRI 2460609
Abort(1) on node 729 (rank 729 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 729
Abort(1) on node 169 (rank 169 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 169
Abort(1) on node 353 (rank 353 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 353
Abort(1) on node 698 (rank 698 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 698
Abort(1) on node 647 (rank 647 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 647
Abort(1) on node 978 (rank 978 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 978
Abort(1) on node 631 (rank 631 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 631
Abort(1) on node 236 (rank 236 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 236
Abort(1) on node 817 (rank 817 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 817
Abort(1) on node 1039 (rank 1039 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 1039
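One possible way to narrow down the PETlist error is to compare the task count passed to srun with the PET ranges requested by each model component. The sketch below is only an illustration under stated assumptions: it assumes the run configuration contains lines of the form `<COMP>_petlist_bounds: <first> <last>`, and the function names are hypothetical. Whether such a mismatch is the actual cause of this failure has not been confirmed.

```python
import re

def srun_ntasks(log_text):
    """Extract the value of --ntasks from the first srun command in a log."""
    match = re.search(r"srun\b.*?--ntasks=(\d+)", log_text)
    return int(match.group(1)) if match else None

def petlist_conflicts(config_text, ntasks):
    """Return (component, first, last) for ranges that do not fit in ntasks PETs."""
    pattern = re.compile(
        r"^\s*(\w+)_petlist_bounds:\s*(-?\d+)\s+(-?\d+)", re.MULTILINE)
    conflicts = []
    for match in pattern.finditer(config_text):
        name = match.group(1)
        first, last = int(match.group(2)), int(match.group(3))
        # PETs are numbered from 0, so the largest valid index is ntasks - 1.
        if last >= ntasks or first > last:
            conflicts.append((name, first, last))
    return conflicts
```

With the srun line from hafs_forecast.log, `srun_ntasks` gives 1080, so any component range ending at 1080 or higher would be reported as a conflict.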