openmc-dev / openmc

OpenMC Monte Carlo Code
https://docs.openmc.org

Error opening the statepoint file to write the source bank during parallel execution of `MicroXS.from_model()` via a job scheduler #2177

Closed: yardasol closed this issue 2 years ago

yardasol commented 2 years ago

I am trying to run a validation simulation for the new transport-independent depletion feature on an HPC cluster. To set up the MPI intracommunicator correctly, I am running Model.init_lib() before passing my Model object to MicroXS.from_model(). With the current source code this produces an error, as detailed in #2172, so I am using the following work-around on line 97 of microxs.py:

            if init_lib:
                model.settings.output = {'path': temp_dir,
                                         'summary': True,
                                         'tallies': False}
                model.settings.sourcepoint = {'write': False}
                model.settings.write_initial_source = False
                model.init_lib()

This eliminates the error described in #2172; however, a new error now arises:

       49/1    1.48717    1.47637 +/- 0.00931
       50/1    1.60745    1.47965 +/- 0.00965
 Creating state point /scratch/tmp0ut1vraq/statepoint.50.h5...
 ERROR: Failed to open HDF5 file with mode 'a': statepoint.50.h5
 ERROR: Failed to open HDF5 file with mode 'a':
        /scratch/tmp0ut1vraq/statepoint.50.h5
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
 ERROR: Failed to open HDF5 file with mode 'a':
        /scratch/tmp0ut1vraq/statepoint.50.h5
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
 ERROR: Failed to open HDF5 file with mode 'a':
        /scratch/tmpjfmb7bjg/statepoint.50.h5
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
slurmstepd: error: *** STEP 2542513.0 ON bdw-0074 CANCELLED AT 2022-08-16T15:10:43 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: bdw-0093: tasks 2-3: Killed
srun: error: bdw-0074: tasks 0-1: Killed

After some debugging, I tracked this down to line 318 in state_point.cpp:

#ifdef PHDF5
  bool parallel = true;
#else
  bool parallel = false;
#endif

  // Write the source bank if desired
  if (write_source_) {
    if (mpi::master || parallel)
      file_id = file_open(filename_, 'a', true);
    write_source_bank(file_id, false);
    if (mpi::master || parallel)
      file_close(file_id);
  }

I verified this by setting Settings.sourcepoint to {'write': False} and Settings.write_initial_source to False, which eliminated that error message but led to a different error.

In summary, there are two issues:

  1. OpenMC can't correctly open the statepoint file to write the source bank
  2. There is a lack of proper machinery to support parallel execution in MicroXS.from_model()
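
For reference, here is the overall workflow being exercised, in sketch form. The `geometry`, `materials`, `settings`, and `fuel` objects are placeholders for what the validation problem actually builds, and the MicroXS.from_model() call is schematic (its full argument list is omitted and may differ by version); the essential point is that Model.init_lib() runs before it.

import openmc
from openmc.deplete import MicroXS

# Placeholders for the objects built in the actual validation problem
model = openmc.Model(geometry=geometry, materials=materials, settings=settings)

# Initialize the C API first so the MPI intracommunicator is set up (see #2172)
model.init_lib()

# Schematic call: arguments other than `model` (depletable domain, chain file,
# etc.) are omitted here and depend on the OpenMC version in use
micro_xs = MicroXS.from_model(model, fuel)

model.finalize_lib()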

System specs: OS: CentOS 7; MPI: MPICH 4.0.2; HDF5: 1.12.2, parallel (compiled from source)

yardasol commented 2 years ago

I did a test run with the base settings (just specifying the number of batches and particles) and the same geometry and materials. I used the following sbatch file:

#!/bin/bash

#SBATCH --nodes=2
#SBATCH --time=01:00:00
#SBATCH --partition=bdwall
#SBATCH --account=openmcvalidation
#SBATCH --job-name=openmc_test_run
#SBATCH --mail-user=oyardas@anl.gov
#SBATCH --mail-type=BEGIN,END,FAIL

# Set up environment (slurm doesn't run a shell, so no bashrc/profile by default)
source $HOME/.bashrc
module unload intel
#conda activate openmc-env
conda activate openmc-parallel

# Go to working dir and set cross sections
export OPENMC_CROSS_SECTIONS=$(pwd)/../cross-section-libraries/endfb71_hdf5/cross_sections.xml
#export HDF5_USE_FILE_LOCKING=FALSE

# Determine number of MPI ranks
NUM_RANKS=$((SLURM_JOB_NUM_NODES * 2))

# Run job
srun -N $SLURM_JOB_NUM_NODES \
     -n $NUM_RANKS\
      --cpu-bind=socket \
     mpiexec -n $NUM_RANKS openmc

and ran into the exact same error message. This may point to a larger issue when using parallel HDF5.

yardasol commented 2 years ago

I recompiled the openmc executable after adding the -DHDF5_PREFER_PARALLEL=on option to the cmake step, and the test case above then ran without error. I reran the original problem (MicroXS.from_model()) with the base settings and hit the original error opening the HDF5 file with mode 'a'.

Could it be that, because we are running multiple instances of MicroXS.from_model(), we are executing multiple instances of openmc.run(), which is causing a concurrent-access issue?

yardasol commented 2 years ago

I created a test Python script that uses the same materials, geometry, and settings as before. I put them into a Model object and called model.init_lib() and model.run(). This ran without any issues.

I then broke the Python script up into two files to mimic the structure of our reference problem and ran two cases: a) calling model.init_lib() and model.run() as usual, and b) creating a TemporaryDirectory and directing the statepoint files into it before calling model.init_lib() and model.run() (the same way that MicroXS.from_model() does). A minimal sketch of case (b) is shown below.
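
A minimal sketch of case (b), assuming the same `geometry`, `materials`, and `settings` objects as before (case (a) is identical except that it skips the temporary directory):

from tempfile import TemporaryDirectory

import openmc

# Same materials, geometry, and settings as the reference problem (built elsewhere)
model = openmc.Model(geometry=geometry, materials=materials, settings=settings)

with TemporaryDirectory() as temp_dir:
    # Direct output, including the statepoint file, into the temporary directory
    # before initializing the library, as MicroXS.from_model() does
    model.settings.output = {'path': temp_dir}
    model.init_lib()
    model.run()
    model.finalize_lib()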

Case (a) executed successfully, but case (b) produced the same error as before.

So this bug must be related to our use of temporary directories.

yardasol commented 2 years ago

Yeah, this one is really confusing... I did some testing, and it looks like running the Python script with mpiexec works fine, but the error occurs when using a job scheduler like SLURM. @paulromano, could this be a BEBOP-specific issue?

paulromano commented 2 years ago

@yardasol In your batch script, I see you are running srun ... mpiexec ... openmc, which is not right. srun is effectively a replacement for mpiexec that is aware of the SLURM environment, so you should just be running srun ... openmc. Many MPI implementations also offer native support for SLURM, which means you may be able to get by running mpiexec directly (instead of srun). I would recommend looking through the SLURM MPI User Guide.

yardasol commented 2 years ago

Whoops, good catch @paulromano. I'm fairly certain I added the mpiexec command there as a sanity check, but you are correct that we don't need both. I've submitted a job using only the srun command as you suggested, and still ran into the error.

yardasol commented 2 years ago

After extensive testing with @paulromano, we believe we have found a one-line fix for this issue and #2172. I'm waiting for HPC access to verify things on my end.