oceanmodeling / ondemand-storm-workflow


spinup run failed due to issue with the boundary node #33

Closed FariborzDaneshvar-NOAA closed 11 months ago

FariborzDaneshvar-NOAA commented 11 months ago

@SorooshMani-NOAA, here is the error message from the spinup run of Irene with the BEST track (run directory on NHC_COLAB_2: /lustre/hurricanes/irene_2011_05f1c18a-5bd7-471f-bfd2-66b66655cb8b):

part 1:

+ pushd /lustre/hurricanes/irene_2011_05f1c18a-5bd7-471f-bfd2-66b66655cb8b/setup/ensemble.dir/spinup
/lustre/hurricanes/irene_2011_05f1c18a-5bd7-471f-bfd2-66b66655cb8b/setup/ensemble.dir/spinup ~/ondemand-storm-workflow/singularity/scripts
+ mkdir -p outputs
+ mpirun -np 36 singularity exec --bind /lustre /lustre/singularity_images//solve.sif pschism_PAHM_TVD-VL 4
   1: ABORT:  Illegal bnd node      368218          -1           0
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI COMMUNICATOR 3 DUP FROM 0
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
   3: ABORT:  Illegal bnd node      368218          -1           0
   2: ABORT:  Illegal bnd node      368218          -1           0
[sorooshmani-nhccolab2-00011-1-0005:17146] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[previous line repeated 30 more times]
FariborzDaneshvar-NOAA commented 11 months ago

part 2 (cont.):

   5: ABORT:  Illegal bnd node      368218          -1           0
   6: ABORT:  Illegal bnd node      368218          -1           0
   7: ABORT:  Illegal bnd node      368218          -1           0
   8: ABORT:  Illegal bnd node      368218          -1           0
   9: ABORT:  Illegal bnd node      368218          -1           0
  10: ABORT:  Illegal bnd node      368218          -1           0
  11: ABORT:  Illegal bnd node      368218          -1           0
  12: ABORT:  Illegal bnd node      368218          -1           0
  13: ABORT:  Illegal bnd node      368218          -1           0
  14: ABORT:  Illegal bnd node      368218          -1           0
  15: ABORT:  Illegal bnd node      368218          -1           0
  16: ABORT:  Illegal bnd node      368218          -1           0
  17: ABORT:  Illegal bnd node      368218          -1           0
             
  18: ABORT:  Illegal bnd node      368218          -1           0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
             
  19: ABORT:  Illegal bnd node      368218          -1           0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
             
  20: ABORT:  Illegal bnd node      368218          -1           0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
             
  21: ABORT:  Illegal bnd node      368218          -1           0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
             
  22: ABORT:  Illegal bnd node      368218          -1           0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
             
  23: ABORT:  Illegal bnd node      368218          -1           0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
             
  24: ABORT:  Illegal bnd node      368218          -1           0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
             
  25: ABORT:  Illegal bnd node      368218          -1           0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
             
  26: ABORT:  Illegal bnd node      368218          -1           0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
             
  27: ABORT:  Illegal bnd node      368218          -1           0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
             
  28: ABORT:  Illegal bnd node      368218          -1           0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
             
  29: ABORT:  Illegal bnd node      368218          -1           0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
             
  30: ABORT:  Illegal bnd node      368218          -1           0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
             
  31: ABORT:  Illegal bnd node      368218          -1           0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
             
   0: ABORT:  Illegal bnd node      368218          -1           0

   4: ABORT:  Illegal bnd node      368218          -1           0

[sorooshmani-nhccolab2-00011-1-0005:17146] 31 more processes have sent help message help-mpi-api.txt / mpi-abort
[sorooshmani-nhccolab2-00011-1-0005:17146] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
+ '[' 0 -eq 0 ']'
+ echo 'Combining outputs...'
Combining outputs...
+ date
Thu Sep 21 17:50:00 UTC 2023
+ pushd outputs
/lustre/hurricanes/irene_2011_05f1c18a-5bd7-471f-bfd2-66b66655cb8b/setup/ensemble.dir/spinup/outputs /lustre/hurricanes/irene_2011_05f1c18a-5bd7-471f-bfd2-66b66655cb8b/setup/ensemble.dir/spinup ~/ondemand-storm-workflow/singularity/scripts
+ ls 'hotstart*'
+ popd
/lustre/hurricanes/irene_2011_05f1c18a-5bd7-471f-bfd2-66b66655cb8b/setup/ensemble.dir/spinup ~/ondemand-storm-workflow/singularity/scripts
+ singularity exec --bind /lustre /lustre/singularity_images//solve.sif expect -f /scripts/combine_gr3.exp maxelev 1
spawn combine_gr3
 Input file name (e.g.: maxelev):
maxelev
 Input # of scalar fields:
1
At line 56 of file /schism/src/Utility/Combining_Scripts/combine_gr3.f90 (unit = 10)
Fortran runtime error: Cannot open file 'outputs/maxelev_000000': No such file or directory

Error termination. Backtrace:
#0  0x2b130e026ad0 in ???
#1  0x2b130e027649 in ???
#2  0x2b130e2771f6 in ???
#3  0x55c33fdcc712 in MAIN__
#4  0x55c33fdcc1ce in main
+ singularity exec --bind /lustre /lustre/singularity_images//solve.sif expect -f /scripts/combine_gr3.exp maxdahv 3
spawn combine_gr3
 Input file name (e.g.: maxelev):
maxdahv
 Input # of scalar fields:
3
At line 56 of file /schism/src/Utility/Combining_Scripts/combine_gr3.f90 (unit = 10)
Fortran runtime error: Cannot open file 'outputs/maxdahv_000000': No such file or directory

Error termination. Backtrace:
#0  0x2b5843e68ad0 in ???
#1  0x2b5843e69649 in ???
#2  0x2b58440b91f6 in ???
#3  0x5611f4b3e712 in MAIN__
#4  0x5611f4b3e1ce in main
+ mv maxdahv.gr3 maxelev.gr3 -t outputs
mv: cannot stat 'maxdahv.gr3': No such file or directory
mv: cannot stat 'maxelev.gr3': No such file or directory
SorooshMani-NOAA commented 11 months ago

@FariborzDaneshvar-NOAA after the latest mesh fix I reran Irene, and this issue is already resolved by the fix made for Florence. I'm running a 2-member ensemble (8 hours so far for the members!). Since it takes time, instead of running your own version, I suggest inspecting my run from last night and closing the ticket if the mesh looks good. Thanks!

FariborzDaneshvar-NOAA commented 11 months ago

Thanks @SorooshMani-NOAA!