oceanmodeling / ondemand-storm-workflow


Post-processing fails when there's an ensemble member run failure #20

Open FariborzDaneshvar-NOAA opened 1 year ago

FariborzDaneshvar-NOAA commented 1 year ago

I was running the workflow for 10 ensemble members of Florence 2018 with the OFCL storm track, but one of the runs (run #9) failed. Here is the content of the slurm-##.out file for that run:

+ pushd /lustre/hurricanes/florence_2018_96fbb8d9-47d2-41a7-8343-69942e8200be/setup/ensemble.dir/runs/vortex_4_variable_korobov_9
/lustre/hurricanes/florence_2018_96fbb8d9-47d2-41a7-8343-69942e8200be/setup/ensemble.dir/runs/vortex_4_variable_korobov_9 ~/ondemand-storm-workflow/singularity/scripts
+ mkdir -p outputs
+ mpirun -np 36 singularity exec --bind /lustre /lustre/singularity_images//solve.sif pschism_PAHM_TVD-VL 4
--------------------------------------------------------------------------
A call to mkdir was unable to create the desired directory:

  Directory: /lustre/.tmp/ompi.sorooshmani-nhccolab2-00004-1-0031.21145/pid.11872/1/26
  Error:     File exists

Please check to ensure you have adequate permissions to perform
the desired operation.
--------------------------------------------------------------------------
[sorooshmani-nhccolab2-00004-1-0031:12681] [[55680,1],26] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0031:12681] [[55680,1],26] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0031:12681] [[55680,1],26] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[sorooshmani-nhccolab2-00004-1-0031:12678] [[55680,1],22] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0031:12678] [[55680,1],22] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0031:12678] [[55680,1],22] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0031:12681] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0031:12678] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
input in flex scanner failed
[sorooshmani-nhccolab2-00004-1-0031:12591] [[55680,1],18] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0031:12591] [[55680,1],18] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0031:12591] [[55680,1],18] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0031:12591] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
input in flex scanner failed
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[55680,1],26]
  Exit code:    1
--------------------------------------------------------------------------
[sorooshmani-nhccolab2-00004-1-0031:11872] 2 more processes have sent help message help-opal-util.txt / mkdir-failed
[sorooshmani-nhccolab2-00004-1-0031:11872] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[sorooshmani-nhccolab2-00004-1-0031:11872] 2 more processes have sent help message help-orte-runtime.txt / orte_init:startup:internal-failure
[sorooshmani-nhccolab2-00004-1-0031:11872] 2 more processes have sent help message help-orte-runtime / orte_init:startup:internal-failure
[sorooshmani-nhccolab2-00004-1-0031:11872] 2 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
FariborzDaneshvar-NOAA commented 1 year ago

Then I ran only the post-processing step, but it failed with KeyError: "not all values found in index 'run'". Here is the content of the slurm-##.out file for the post-processing step:

+ singularity run --bind /lustre /lustre/singularity_images//prep.sif combine_ensemble --ensemble-dir /lustre/hurricanes/florence_2018_Fariborz_v1_10_OFCL/setup/ensemble.dir/ --tracks-dir /lustre/hurricanes/florence_2018_Fariborz_v1_10_OFCL/setup/ensemble.dir//track_files
[2023-08-01 22:27:54,730] parsing.schism  INFO    : parsing from "/lustre/hurricanes/florence_2018_Fariborz_v1_10_OFCL/setup/ensemble.dir"
[2023-08-01 22:27:54,799] parsing.schism  WARNING : could not find any run directories with all the specified output patterns
[2023-08-01 22:27:54,826] parsing.schism  WARNING : could not find any run directories with all the specified output patterns
[2023-08-01 22:27:54,845] parsing.schism  WARNING : could not find any run directories with all the specified output patterns
[2023-08-01 22:27:54,867] parsing.schism  INFO    : found 10 run directories with all the specified output patterns
[2023-08-01 22:27:56,618] parsing.schism  WARNING : Files don't contain all the required variables!
[2023-08-01 22:27:56,640] parsing.schism  INFO    : found 10 run directories with all the specified output patterns
[2023-08-01 22:30:19,440] parsing.schism  INFO    : found 10 run directories with all the specified output patterns
[2023-08-01 22:30:22,689] parsing.schism  WARNING : Files don't contain all the required variables!
[2023-08-01 22:30:22,711] parsing.schism  INFO    : found 10 run directories with all the specified output patterns
[2023-08-01 22:30:22,731] parsing.schism  WARNING : Files don't contain all the required variables!
[2023-08-01 22:30:22,753] parsing.schism  INFO    : found 10 run directories with all the specified output patterns
[2023-08-01 22:30:36,075] parsing.schism  INFO    : found 10 run directories with all the specified output patterns
[2023-08-01 22:30:39,237] parsing.schism  WARNING : Files don't contain all the required variables!
[2023-08-01 22:30:39,279] parsing.schism  INFO    : found 10 run directories with all the specified output patterns
[2023-08-01 22:34:34,160] parsing.schism  INFO    : found 10 run directories with all the specified output patterns
[2023-08-01 22:34:49,219] parsing.schism  INFO    : found 2 variable(s) in "['out2d_*.nc']": "max_elevation" (10, 558697), "max_elevation_times" (10, 558697)
[2023-08-01 22:34:49,286] parsing.schism  INFO    : subsetted 558697 out of 558697 total nodes (100.00%)
[2023-08-01 22:34:49,286] parsing.schism  INFO    : found 2 variable(s) in "['horizontalVelX_*.nc', 'horizontalVelY_*.nc']": "max_velocity" (10, 558697, 2), "max_velocity_times" (10, 558697, 2)
[2023-08-01 22:34:49,409] parsing.schism  INFO    : subsetted 558697 out of 558697 total nodes (100.00%)
[2023-08-01 22:34:49,419] parsing.schism  INFO    : writing to "/lustre/hurricanes/florence_2018_Fariborz_v1_10_OFCL/setup/ensemble.dir/analyze/perturbations.nc"
[2023-08-01 22:34:49,432] parsing.schism  INFO    : writing to "/lustre/hurricanes/florence_2018_Fariborz_v1_10_OFCL/setup/ensemble.dir/analyze/maxele.63.nc"
[2023-08-01 22:36:11,957] parsing.schism  INFO    : writing to "/lustre/hurricanes/florence_2018_Fariborz_v1_10_OFCL/setup/ensemble.dir/analyze/fort.63.nc"
[2023-08-01 22:37:22,830] parsing.schism  INFO    : writing to "/lustre/hurricanes/florence_2018_Fariborz_v1_10_OFCL/setup/ensemble.dir/analyze/maxvel.63.nc"
[2023-08-01 22:39:24,688] parsing.schism  INFO    : writing to "/lustre/hurricanes/florence_2018_Fariborz_v1_10_OFCL/setup/ensemble.dir/analyze/fort.64.nc"
+ singularity run --bind /lustre /lustre/singularity_images//prep.sif analyze_ensemble --ensemble-dir /lustre/hurricanes/florence_2018_Fariborz_v1_10_OFCL/setup/ensemble.dir/ --tracks-dir /lustre/hurricanes/florence_2018_Fariborz_v1_10_OFCL/setup/ensemble.dir//track_files
[2023-08-01 22:43:44,770] klpc_wetonly    INFO    : dividing 70/30% for training/testing the model
/opt/conda/envs/prep/lib/python3.9/site-packages/stormevents/nhc/track.py:173: UserWarning: It is recommended to specify the file_deck and/or advisories when reading from file
  warnings.warn(
/opt/conda/envs/prep/lib/python3.9/site-packages/stormevents/nhc/track.py:173: UserWarning: It is recommended to specify the file_deck and/or advisories when reading from file
  warnings.warn(
/opt/conda/envs/prep/lib/python3.9/site-packages/stormevents/nhc/track.py:173: UserWarning: It is recommended to specify the file_deck and/or advisories when reading from file
  warnings.warn(
/opt/conda/envs/prep/lib/python3.9/site-packages/stormevents/nhc/track.py:173: UserWarning: It is recommended to specify the file_deck and/or advisories when reading from file
  warnings.warn(
/opt/conda/envs/prep/lib/python3.9/site-packages/stormevents/nhc/track.py:173: UserWarning: It is recommended to specify the file_deck and/or advisories when reading from file
  warnings.warn(
/opt/conda/envs/prep/lib/python3.9/site-packages/stormevents/nhc/track.py:173: UserWarning: It is recommended to specify the file_deck and/or advisories when reading from file
  warnings.warn(
/opt/conda/envs/prep/lib/python3.9/site-packages/stormevents/nhc/track.py:173: UserWarning: It is recommended to specify the file_deck and/or advisories when reading from file
  warnings.warn(
/opt/conda/envs/prep/lib/python3.9/site-packages/stormevents/nhc/track.py:173: UserWarning: It is recommended to specify the file_deck and/or advisories when reading from file
  warnings.warn(
/opt/conda/envs/prep/lib/python3.9/site-packages/stormevents/nhc/track.py:173: UserWarning: It is recommended to specify the file_deck and/or advisories when reading from file
  warnings.warn(
/opt/conda/envs/prep/lib/python3.9/site-packages/stormevents/nhc/track.py:173: UserWarning: It is recommended to specify the file_deck and/or advisories when reading from file
  warnings.warn(
/opt/conda/envs/prep/lib/python3.9/site-packages/stormevents/nhc/track.py:173: UserWarning: It is recommended to specify the file_deck and/or advisories when reading from file
  warnings.warn(
[2023-08-01 22:43:49,700] klpc_wetonly    INFO    : subsetting nodes
/opt/conda/envs/prep/lib/python3.9/site-packages/stormevents/nhc/track.py:173: UserWarning: It is recommended to specify the file_deck and/or advisories when reading from file
  warnings.warn(
[2023-08-01 22:44:07,824] parsing.adcirc  INFO    : subsetted down to 228054 nodes (40.8%)
[2023-08-01 22:44:07,824] parsing.adcirc  INFO    : saving subset to "/lustre/hurricanes/florence_2018_Fariborz_v1_10_OFCL/setup/ensemble.dir/analyze/linear_k1_p1_n0.025/subset.nc"
[2023-08-01 22:44:59,492] klpc_wetonly    INFO    : loading subset from "/lustre/hurricanes/florence_2018_Fariborz_v1_10_OFCL/setup/ensemble.dir/analyze/linear_k1_p1_n0.025/subset.nc"
Traceback (most recent call last):
  File "/opt/conda/envs/prep/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/prep/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/scripts/analyze_ensemble.py", line 392, in <module>
    main(parser.parse_args())
  File "/scripts/analyze_ensemble.py", line 48, in main
    analyze(tracks_dir, ensemble_dir/'analyze')
  File "/scripts/analyze_ensemble.py", line 56, in analyze
    _analyze(tracks_dir, analyze_dir, mann_coef)
  File "/scripts/analyze_ensemble.py", line 221, in _analyze
    training_set = subset.sel(run=training_perturbations['run'])
  File "/opt/conda/envs/prep/lib/python3.9/site-packages/xarray/core/dataarray.py", line 1550, in sel
    ds = self._to_temp_dataset().sel(
  File "/opt/conda/envs/prep/lib/python3.9/site-packages/xarray/core/dataset.py", line 2794, in sel
    query_results = map_index_queries(
  File "/opt/conda/envs/prep/lib/python3.9/site-packages/xarray/core/indexing.py", line 190, in map_index_queries
    results.append(index.sel(labels, **options))
  File "/opt/conda/envs/prep/lib/python3.9/site-packages/xarray/core/indexes.py", line 498, in sel
    raise KeyError(f"not all values found in index {coord_name!r}")
KeyError: "not all values found in index 'run'"
ERROR conda.cli.main_run:execute(47): `conda run python -m analyze_ensemble --ensemble-dir /lustre/hurricanes/florence_2018_Fariborz_v1_10_OFCL/setup/ensemble.dir/ --tracks-dir /lustre/hurricanes/florence_2018_Fariborz_v1_10_OFCL/setup/ensemble.dir//track_files` failed. (See above for error)
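
For context, the KeyError above comes from xarray label selection in analyze_ensemble.py: the perturbations still list all 10 runs, but the combined subset only contains the members that actually produced outputs, so subset.sel(run=...) asks for a 'run' label that is missing from the index. Below is a minimal, hypothetical sketch that reproduces the error and shows one possible guard; the names are illustrative, not the workflow's actual code.

import numpy as np
import xarray as xr

# Pretend run #9 never produced outputs, so the combined subset has 9 members.
completed = [f"vortex_4_variable_korobov_{i}" for i in range(10) if i != 9]
subset = xr.DataArray(
    np.random.rand(len(completed), 5),
    coords={"run": completed},
    dims=("run", "node"),
)

# Perturbations were generated for all 10 members.
requested = [f"vortex_4_variable_korobov_{i}" for i in range(10)]

# subset.sel(run=requested)  # raises KeyError: "not all values found in index 'run'"

# One possible guard: only select the runs that actually exist in the subset.
available = [r for r in requested if r in set(subset["run"].values)]
training_set = subset.sel(run=available)
print(training_set.sizes)  # Frozen({'run': 9, 'node': 5})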
FariborzDaneshvar-NOAA commented 1 year ago

@SorooshMani-NOAA, the same issue happened for the BEST track: run #6 (out of 10 ensemble members) failed with a similar error message! Here is the content of the slurm-##.out file for the failed run:

+ pushd /lustre/hurricanes/florence_2018_8d560103-aa84-4745-9084-9df60f28ea10/setup/ensemble.dir/runs/vortex_4_variable_korobov_6
/lustre/hurricanes/florence_2018_8d560103-aa84-4745-9084-9df60f28ea10/setup/ensemble.dir/runs/vortex_4_variable_korobov_6 ~/ondemand-storm-workflow/singularity/scripts
+ mkdir -p outputs
+ mpirun -np 36 singularity exec --bind /lustre /lustre/singularity_images//solve.sif pschism_PAHM_TVD-VL 4
--------------------------------------------------------------------------
A call to mkdir was unable to create the desired directory:

  Directory: /lustre
  Error:     File exists

Please check to ensure you have adequate permissions to perform
the desired operation.
--------------------------------------------------------------------------
[sorooshmani-nhccolab2-00004-1-0022:12838] [[2549,1],0] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12838] [[2549,1],0] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12838] [[2549,1],0] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12826] [[2549,1],12] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12826] [[2549,1],12] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12826] [[2549,1],12] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12841] [[2549,1],22] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12841] [[2549,1],22] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12841] [[2549,1],22] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[sorooshmani-nhccolab2-00004-1-0022:12837] [[2549,1],18] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12837] [[2549,1],18] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12837] [[2549,1],18] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12885] [[2549,1],16] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12885] [[2549,1],16] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12885] [[2549,1],16] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12844] [[2549,1],32] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12844] [[2549,1],32] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12844] [[2549,1],32] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12823] [[2549,1],2] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12823] [[2549,1],2] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12823] [[2549,1],2] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12898] [[2549,1],14] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12898] [[2549,1],14] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12898] [[2549,1],14] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12836] [[2549,1],11] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12836] [[2549,1],11] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12836] [[2549,1],11] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12851] [[2549,1],20] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12851] [[2549,1],20] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12851] [[2549,1],20] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12825] [[2549,1],28] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12825] [[2549,1],28] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12825] [[2549,1],28] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12827] [[2549,1],30] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12827] [[2549,1],30] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12827] [[2549,1],30] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12849] [[2549,1],6] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12849] [[2549,1],6] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12849] [[2549,1],6] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12843] [[2549,1],29] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12843] [[2549,1],29] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12843] [[2549,1],29] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12834] [[2549,1],33] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12834] [[2549,1],33] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12834] [[2549,1],33] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12855] [[2549,1],35] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12855] [[2549,1],35] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12855] [[2549,1],35] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12842] [[2549,1],26] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12842] [[2549,1],26] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12842] [[2549,1],26] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12845] [[2549,1],17] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12845] [[2549,1],17] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12845] [[2549,1],17] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12824] [[2549,1],4] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12824] [[2549,1],4] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12824] [[2549,1],4] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12828] [[2549,1],15] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12828] [[2549,1],15] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12828] [[2549,1],15] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12822] [[2549,1],34] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12822] [[2549,1],34] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12822] [[2549,1],34] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12896] [[2549,1],7] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12896] [[2549,1],7] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12896] [[2549,1],7] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12852] [[2549,1],24] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12852] [[2549,1],24] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12852] [[2549,1],24] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12839] [[2549,1],9] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12839] [[2549,1],9] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12839] [[2549,1],9] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12921] [[2549,1],21] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12921] [[2549,1],21] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12921] [[2549,1],21] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12829] [[2549,1],5] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12829] [[2549,1],5] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12829] [[2549,1],5] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12831] [[2549,1],1] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12831] [[2549,1],1] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12831] [[2549,1],1] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12830] [[2549,1],3] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12830] [[2549,1],3] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12830] [[2549,1],3] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12833] [[2549,1],19] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12833] [[2549,1],19] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12833] [[2549,1],19] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12956] [[2549,1],27] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12956] [[2549,1],27] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12956] [[2549,1],27] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:13001] [[2549,1],13] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:13001] [[2549,1],13] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:13001] [[2549,1],13] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12835] [[2549,1],23] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12835] [[2549,1],23] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12835] [[2549,1],23] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12821] [[2549,1],25] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12821] [[2549,1],25] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12821] [[2549,1],25] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
[sorooshmani-nhccolab2-00004-1-0022:12850] [[2549,1],31] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12850] [[2549,1],31] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12850] [[2549,1],31] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12838] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12826] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12841] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12837] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12885] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12844] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12823] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12898] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12836] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12851] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12843] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[sorooshmani-nhccolab2-00004-1-0022:12825] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[sorooshmani-nhccolab2-00004-1-0022:12827] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12849] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12834] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12855] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12845] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12822] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12896] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[sorooshmani-nhccolab2-00004-1-0022:12852] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[sorooshmani-nhccolab2-00004-1-0022:12839] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12829] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12921] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12835] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12850] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12824] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12831] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[sorooshmani-nhccolab2-00004-1-0022:12833] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12828] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[sorooshmani-nhccolab2-00004-1-0022:13001] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12821] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12842] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12956] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12830] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[sorooshmani-nhccolab2-00004-1-0022:12886] [[2549,1],8] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 107
[sorooshmani-nhccolab2-00004-1-0022:12886] [[2549,1],8] ORTE_ERROR_LOG: Error in file ../../../orte/util/session_dir.c at line 346
[sorooshmani-nhccolab2-00004-1-0022:12886] [[2549,1],8] ORTE_ERROR_LOG: Error in file ../../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sorooshmani-nhccolab2-00004-1-0022:12886] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
input in flex scanner failed
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[2549,1],9]
  Exit code:    1
--------------------------------------------------------------------------
[sorooshmani-nhccolab2-00004-1-0022:12127] 34 more processes have sent help message help-opal-util.txt / mkdir-failed
[sorooshmani-nhccolab2-00004-1-0022:12127] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[sorooshmani-nhccolab2-00004-1-0022:12127] 34 more processes have sent help message help-orte-runtime.txt / orte_init:startup:internal-failure
[sorooshmani-nhccolab2-00004-1-0022:12127] 34 more processes have sent help message help-orte-runtime / orte_init:startup:internal-failure
[sorooshmani-nhccolab2-00004-1-0022:12127] 34 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
SorooshMani-NOAA commented 1 year ago

@FariborzDaneshvar-NOAA it seems it might be related to permissions: https://github.com/open-mpi/ompi/issues/8510. Let me take a look at the directory where it failed to create the file.
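
If the failing mkdir under /lustre/.tmp is Open MPI's per-job session directory, one thing that might be worth trying is pointing the temporary directory at node-local storage before launching. This is only a hedged sketch of that idea, not a confirmed fix; the command mirrors the one in the slurm logs above.

import os
import subprocess

# Assumption: relocating Open MPI's session directory off Lustre avoids the
# "A call to mkdir was unable to create the desired directory" failure.
env = dict(os.environ, TMPDIR="/tmp")  # keep session files off /lustre/.tmp

cmd = [
    "mpirun", "-np", "36",
    "singularity", "exec", "--bind", "/lustre",
    "/lustre/singularity_images/solve.sif",
    "pschism_PAHM_TVD-VL", "4",
]
subprocess.run(cmd, check=True, env=env)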

FariborzDaneshvar-NOAA commented 1 year ago

@SorooshMani-NOAA Thanks for following up. I ran it again and, unlike the first time, all runs completed!

SorooshMani-NOAA commented 1 year ago

No problem! I think it might just be some internal Open MPI issue, or the fact that the versions don't align correctly between PW and the Singularity image. In any case, I'm building another image with the same version using Spack; we can try running with that image as well and see if we hit the same issue!

FariborzDaneshvar-NOAA commented 1 year ago

@SorooshMani-NOAA, I was running lustre/scripts/run_schism_ensemble.sh to generate new SCHISM outputs with wind and air pressure for the faked BEST track ensembles, and one of the runs failed with this error message!

+ pushd /lustre/hurricanes/florence_2018_Fariborz_OFCL_10_v2/setup/ensemble.dir/runs/vortex_4_variable_korobov_5
/lustre/hurricanes/florence_2018_Fariborz_OFCL_10_v2/setup/ensemble.dir/runs/vortex_4_variable_korobov_5 /lustre/scripts
+ mkdir -p outputs
+ mpirun -np 36 singularity exec --bind /lustre /lustre/singularity_images/solve.sif pschism_PAHM_TVD-VL 4
At line 96 of file /schism/src/Core/scribe_io.F90 (unit = 16)
Fortran runtime error: Cannot open file './/outputs/mirror.out.scribe': Input/output error

Error termination. Backtrace:
#0  0x2ad95bf44ad0 in ???
#1  0x2ad95bf45649 in ???
#2  0x2ad95c1951f6 in ???
#3  0x5609a0ed2e8b in __scribe_io_MOD_scribe_init
#4  0x5609a0ea9d7f in schism_main_
#5  0x5609a0ea9ef0 in MAIN__
#6  0x5609a0ea9c4e in main
At line 242 of file /schism/src/Hydro/schism_init.F90 (unit = 11)
Fortran runtime error: Cannot open file './/outputs/fatal.error': No such device

Error termination. Backtrace:
#0  0x2b2861667ad0 in ???
#1  0x2b2861668649 in ???
#2  0x2b28618b81f6 in ???
#3  0x563278bf34a6 in schism_init_
#4  0x563278bc7e1a in schism_main_
#5  0x563278bc7ef0 in MAIN__
#6  0x563278bc7c4e in main
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[15010,1],32]
  Exit code:    2
--------------------------------------------------------------------------

Also, one of the runs for the BEST track failed with a similar message!

+ pushd /lustre/hurricanes/florence_2018_Fariborz_BEST_10_v3/setup/ensemble.dir/runs/vortex_4_variable_korobov_6
/lustre/hurricanes/florence_2018_Fariborz_BEST_10_v3/setup/ensemble.dir/runs/vortex_4_variable_korobov_6 /lustre/scripts
+ mkdir -p outputs
+ mpirun -np 36 singularity exec --bind /lustre /lustre/singularity_images/solve.sif pschism_PAHM_TVD-VL 4
At line 96 of file /schism/src/Core/scribe_io.F90 (unit = 16)
Fortran runtime error: Cannot open file './/outputs/mirror.out.scribe': Input/output error

Error termination. Backtrace:
#0  0x2b78a3e70ad0 in ???
#1  0x2b78a3e71649 in ???
#2  0x2b78a40c11f6 in ???
#3  0x55edf25f0e8b in __scribe_io_MOD_scribe_init
#4  0x55edf25c7d7f in schism_main_
#5  0x55edf25c7ef0 in MAIN__
#6  0x55edf25c7c4e in main
At line 242 of file /schism/src/Hydro/schism_init.F90 (unit = 11)
Fortran runtime error: Cannot open file './/outputs/fatal.error': No such device

Error termination. Backtrace:
#0  0x2afc6c994ad0 in ???
#1  0x2afc6c995649 in ???
#2  0x2afc6cbe51f6 in ???
#3  0x55e73181b4a6 in schism_init_
#4  0x55e7317efe1a in schism_main_
#5  0x55e7317efef0 in MAIN__
#6  0x55e7317efc4e in main
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[2288,1],32]
  Exit code:    2
--------------------------------------------------------------------------

The rest of the jobs completed successfully! Any thoughts?

SorooshMani-NOAA commented 1 year ago

@FariborzDaneshvar-NOAA I can't think why it is failing. The only thing I can think of is permissions, but even that doesn't make sense! Unless we can find a way to reliably reproduce this error, it's hard to say why it's happening or how to fix it!
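
Until the root cause is pinned down, one cheap mitigation is to detect failed members before post-processing so they can be rerun or excluded instead of breaking combine_ensemble/analyze_ensemble. Below is a hypothetical helper sketch along those lines, assuming the runs/vortex_*/outputs layout and the out2d_*.nc pattern shown in the logs above; the directory path in the example is a placeholder.

from pathlib import Path

def check_ensemble_outputs(ensemble_dir, pattern="out2d_*.nc"):
    """Split run directories into those with and without the expected outputs."""
    completed, failed = [], []
    for run_dir in sorted((Path(ensemble_dir) / "runs").glob("vortex_*")):
        if any((run_dir / "outputs").glob(pattern)):
            completed.append(run_dir.name)
        else:
            failed.append(run_dir.name)
    return completed, failed

# Example usage (placeholder ensemble directory):
completed, failed = check_ensemble_outputs(
    "/lustre/hurricanes/florence_2018_example/setup/ensemble.dir"
)
print(f"{len(completed)} completed, {len(failed)} failed: {failed}")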