oceanmodeling / ondemand-storm-workflow


Some runs of an ensemble failed (orte_init failed for some reason) #27

Open FariborzDaneshvar-NOAA opened 1 year ago

FariborzDaneshvar-NOAA commented 1 year ago

Run directory on NHC_COLAB_2 cluster: /lustre/hurricanes/florence_2018_eb4f92f0-a1d8-427e-be70-f05d9886789b/setup/ensemble.dir/runs

3 out of 11 runs failed with errors like this:

+ pushd /lustre/hurricanes/florence_2018_eb4f92f0-a1d8-427e-be70-f05d9886789b/setup/ensemble.dir/runs/vortex_4_variable_korobov_3
/lustre/hurricanes/florence_2018_eb4f92f0-a1d8-427e-be70-f05d9886789b/setup/ensemble.dir/runs/vortex_4_variable_korobov_3 /contrib/Fariborz.Daneshvar/home/ondemand-storm-workflow/singularity/scripts
+ mkdir -p outputs
+ mpirun -np 36 singularity exec --bind /lustre /lustre/singularity_images//solve.sif pschism_PAHM_TVD-VL 4
--------------------------------------------------------------------------
A call to mkdir was unable to create the desired directory:

  Directory: /lustre
  Error:     File exists

Please check to ensure you have adequate permissions to perform
the desired operation.
--------------------------------------------------------------------------
[sorooshmani-nhccolab2-00006-1-0014:12853] [[30827,1],17] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12853] [[30827,1],17] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12853] [[30827,1],17] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12898] [[30827,1],1] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12898] [[30827,1],1] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12898] [[30827,1],1] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12889] [[30827,1],10] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12889] [[30827,1],10] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12889] [[30827,1],10] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12920] [[30827,1],28] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12920] [[30827,1],28] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12920] [[30827,1],28] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12893] [[30827,1],19] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12893] [[30827,1],19] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12893] [[30827,1],19] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12869] [[30827,1],20] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12869] [[30827,1],20] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12869] [[30827,1],20] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12910] [[30827,1],27] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12910] [[30827,1],27] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12910] [[30827,1],27] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12887] [[30827,1],33] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12887] [[30827,1],33] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12887] [[30827,1],33] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12871] [[30827,1],34] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12871] [[30827,1],34] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12871] [[30827,1],34] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[sorooshmani-nhccolab2-00006-1-0014:12874] [[30827,1],24] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12874] [[30827,1],24] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12874] [[30827,1],24] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12882] [[30827,1],26] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12882] [[30827,1],26] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12882] [[30827,1],26] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12865] [[30827,1],32] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12865] [[30827,1],32] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12865] [[30827,1],32] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12904] [[30827,1],4] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12904] [[30827,1],4] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12904] [[30827,1],4] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12908] [[30827,1],5] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12908] [[30827,1],5] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12908] [[30827,1],5] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:13089] [[30827,1],15] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:13089] [[30827,1],15] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:13089] [[30827,1],15] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12943] [[30827,1],9] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12943] [[30827,1],9] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12943] [[30827,1],9] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12906] [[30827,1],12] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12906] [[30827,1],12] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12906] [[30827,1],12] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12911] [[30827,1],7] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12911] [[30827,1],7] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12911] [[30827,1],7] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:13035] [[30827,1],21] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:13035] [[30827,1],21] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:13035] [[30827,1],21] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12948] [[30827,1],23] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12948] [[30827,1],23] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12948] [[30827,1],23] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12876] [[30827,1],3] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12876] [[30827,1],3] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12876] [[30827,1],3] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12866] [[30827,1],18] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12866] [[30827,1],18] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12866] [[30827,1],18] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12875] [[30827,1],30] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12875] [[30827,1],30] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12875] [[30827,1],30] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12873] [[30827,1],0] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12873] [[30827,1],0] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12873] [[30827,1],0] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12945] [[30827,1],6] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12945] [[30827,1],6] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12945] [[30827,1],6] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12870] [[30827,1],8] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12870] [[30827,1],8] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12870] [[30827,1],8] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12885] [[30827,1],35] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12885] [[30827,1],35] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12885] [[30827,1],35] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12914] [[30827,1],22] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12914] [[30827,1],22] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12914] [[30827,1],22] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12872] [[30827,1],31] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12872] [[30827,1],31] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12872] [[30827,1],31] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12877] [[30827,1],11] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12877] [[30827,1],11] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12877] [[30827,1],11] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12884] [[30827,1],13] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12884] [[30827,1],13] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12884] [[30827,1],13] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12912] [[30827,1],29] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12912] [[30827,1],29] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12912] [[30827,1],29] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12909] [[30827,1],14] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12909] [[30827,1],14] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12909] [[30827,1],14] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12883] [[30827,1],25] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12883] [[30827,1],25] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12883] [[30827,1],25] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00006-1-0014:12907] [[30827,1],16] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12907] [[30827,1],16] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12907] [[30827,1],16] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12898] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[sorooshmani-nhccolab2-00006-1-0014:12889] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12853] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12920] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12893] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12910] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12887] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12871] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12869] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12908] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[sorooshmani-nhccolab2-00006-1-0014:12874] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12882] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12865] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:13089] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12904] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12911] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12906] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12876] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[sorooshmani-nhccolab2-00006-1-0014:12943] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:13035] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[sorooshmani-nhccolab2-00006-1-0014:12948] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12945] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[sorooshmani-nhccolab2-00006-1-0014:12866] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[sorooshmani-nhccolab2-00006-1-0014:12875] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12885] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[sorooshmani-nhccolab2-00006-1-0014:12870] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12914] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12884] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12872] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12877] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12912] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12883] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12909] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12907] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12873] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[sorooshmani-nhccolab2-00006-1-0014:12967] [[30827,1],2] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00006-1-0014:12967] [[30827,1],2] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00006-1-0014:12967] [[30827,1],2] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00006-1-0014:12967] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[30827,1],18]
  Exit code:    1
--------------------------------------------------------------------------
[sorooshmani-nhccolab2-00006-1-0014:12124] 35 more processes have sent help message help-opal-util.txt / mkdir-failed
[sorooshmani-nhccolab2-00006-1-0014:12124] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[sorooshmani-nhccolab2-00006-1-0014:12124] 35 more processes have sent help message help-orte-runtime.txt / orte_init:startup:internal-failure
[sorooshmani-nhccolab2-00006-1-0014:12124] 35 more processes have sent help message help-orte-runtime / orte_init:startup:internal-failure
[sorooshmani-nhccolab2-00006-1-0014:12124] 35 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure

FariborzDaneshvar-NOAA commented 1 year ago

Update: All three failed runs completed on the second try.
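
Since the failed runs went through on a second attempt, the failure looks transient rather than tied to a particular ensemble member. A minimal sketch of an automatic retry around the solver launch could look like the following. It is not part of the workflow: `run_with_retries` and its `runner` parameter are hypothetical names introduced here for illustration, and retrying only works around the symptom without explaining why the session directory could not be created.

```python
import subprocess
import time
from typing import Callable, Sequence


def _default_runner(cmd: Sequence[str]) -> int:
    """Run the command and return its exit status."""
    return subprocess.run(list(cmd)).returncode


def run_with_retries(
    cmd: Sequence[str],
    max_attempts: int = 2,
    delay_seconds: float = 0.0,
    runner: Callable[[Sequence[str]], int] = _default_runner,
) -> int:
    """Run cmd until it exits with status 0 or max_attempts is reached.

    Returns the number of attempts used on success and raises
    RuntimeError if every attempt fails. The runner argument is a
    hypothetical hook that makes the function testable without MPI.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        if runner(cmd) == 0:
            return attempt
        # Pause before the next attempt, but not after the last one.
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(
        f"command failed after {max_attempts} attempts: {' '.join(cmd)}"
    )
```

With the command from the log above, a call might be `run_with_retries(["mpirun", "-np", "36", "singularity", "exec", "--bind", "/lustre", "/lustre/singularity_images//solve.sif", "pschism_PAHM_TVD-VL", "4"], max_attempts=2, delay_seconds=30)`, executed from the run directory. The 30-second delay is an arbitrary placeholder, not a tested value.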