I did a test run with the base settings (just specifying the number of batches and particles) and the same geometry and materials. I used the following sbatch file:
```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --time=01:00:00
#SBATCH --partition=bdwall
#SBATCH --account=openmcvalidation
#SBATCH --job-name=openmc_test_run
#SBATCH --mail-user=oyardas@anl.gov
#SBATCH --mail-type=BEGIN,END,FAIL
# Setup environment (slurm doesn't run a shell, so no bashrc/profile by default)
source $HOME/.bashrc
module unload intel
#conda activate openmc-env
conda activate openmc-parallel
# Go to working dir and set cross sections
export OPENMC_CROSS_SECTIONS=$(pwd)/../cross-section-libraries/endfb71_hdf5/cross_sections.xml
#export HDF5_USE_FILE_LOCKING=FALSE
# Determine number of MPI ranks
NUM_RANKS=$((SLURM_JOB_NUM_NODES * 2))
# Run job
srun -N $SLURM_JOB_NUM_NODES \
     -n $NUM_RANKS \
     --cpu-bind=socket \
     mpiexec -n $NUM_RANKS openmc
```
and ran into the exact same error message. This may point to a larger issue when using parallel HDF5.
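For reference, a minimal sketch of what "base settings" means here: only the number of batches and particles is specified, everything else is left at its OpenMC defaults. The values below are placeholders, not the actual inputs used.

```python
import openmc

# Bare-bones run settings: just batches and particles, all other options default.
# The existing geometry.xml and materials.xml from the reference problem are reused.
settings = openmc.Settings()
settings.batches = 20       # placeholder value
settings.particles = 1000   # placeholder value
settings.export_to_xml()    # writes settings.xml next to the existing model files
```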
I recompiled the openmc executable after adding the -DHDF5_PREFER_PARALLEL=on option to the cmake step, and the test case above worked without error. I then reran the original problem (MicroXS.from_model()) with the base settings and ran into the original error when opening the HDF5 file with 'a'.
It may be the case that, since we are running multiple instances of MicroXS.from_model(), we are executing multiple instances of openmc.run(), which could be causing the concurrent access issue.
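To illustrate the concern, here is a purely hypothetical sketch (not code from MicroXS.from_model()): two transport runs launched in the same working directory both try to create statepoint.<N>.h5 and summary.h5 there, which is the kind of concurrent HDF5 access that could produce this error.

```python
# Hypothetical illustration of the suspected failure mode: two openmc.run()
# calls sharing one working directory race on statepoint.<N>.h5 and summary.h5.
# 'shared_dir' is assumed to already contain the exported model XML files.
from multiprocessing import Process
import openmc

def transport_run():
    openmc.run(cwd='shared_dir')  # each call writes its HDF5 output into shared_dir

if __name__ == '__main__':
    procs = [Process(target=transport_run) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```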
I created a test Python script that uses the same materials, geometry, and settings as before. I then put them into a Model object and called model.init_lib() and model.run(). This ran without any issues.

I then broke the Python script up into two files to mimic the structure of our reference problem and ran two cases: a) calling model.init_lib() and model.run() as usual, and b) initializing a TemporaryDirectory and setting the statepoint files to be written inside the temporary directory before calling model.init_lib() and model.run() (in the same way that MicroXS.from_model() does).

Case a executed successfully, but case b produced the same error as before. So this bug must be related to our use of temporary directories.
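For concreteness, a rough sketch of case b, assuming the statepoint is redirected into the temporary directory the way MicroXS.from_model() does; the toy geometry, materials, and settings below are placeholders, not the actual test inputs.

```python
from tempfile import TemporaryDirectory
import openmc

# Placeholder model (the real test reused the same geometry/materials/settings as before).
fuel = openmc.Material()
fuel.add_nuclide('U235', 1.0)
fuel.set_density('g/cm3', 10.0)
sphere = openmc.Sphere(r=10.0, boundary_type='vacuum')
geometry = openmc.Geometry([openmc.Cell(fill=fuel, region=-sphere)])
settings = openmc.Settings()
settings.batches = 20
settings.particles = 1000
model = openmc.Model(geometry=geometry, settings=settings)

# Case a: model.init_lib() and model.run() in the current directory -- this worked.
# Case b: write the statepoint inside a TemporaryDirectory, mimicking
# MicroXS.from_model() -- this reproduced the HDF5 error under SLURM.
with TemporaryDirectory() as tmpdir:
    model.init_lib()                     # initialize the in-memory C API / MPI state
    statepoint = model.run(cwd=tmpdir)   # statepoint.<N>.h5 is written inside tmpdir
    model.finalize_lib()
```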
Yeah, this one is really confusing... I did some testing, and it looks like running the Python script with mpiexec works fine, but when using a job scheduler like SLURM, the error occurs. @paulromano, could this be a BEBOP-specific issue?
@yardasol In your batch script, I see you are running srun ... mpiexec ... openmc, which is not right. srun is effectively a replacement for mpiexec that is supposed to be aware of the SLURM environment, so you should just be running srun ... openmc. Many MPI implementations also offer native support for SLURM, which means you may be able to get by running mpiexec directly (instead of srun). I would recommend looking through the SLURM MPI User Guide.
Whoops, good catch @paulromano. I'm fairly certain I added the mpiexec command there as a sanity check, but you are correct that we don't need both. I've submitted a job using only the srun command as you suggested, and still ran into the error.
After extensive testing with @paulromano, we believe we have discovered a one-line fix for this issue and #2172. I'm waiting for HPC access to verify things on my end.
I am trying to run a validation simulation for the new transport-independent depletion feature on HPC architecture. To set up the right MPI intracommunicator, I am running Model.init_lib() before passing my Model object to MicroXS.from_model(). In the current source code, this produces an error as detailed in #2172. However, I am using the following work-around on line 97 of microxs.py:

This eliminates the error described in #2172; however, a new error now arises:

After some debugging, I tracked this down to line 318 in state_point.cpp:
I verified this by setting Settings.sourcepoint to {'write': False} and Settings.write_initial_source to False (sketched below), which eliminated that error message and returned the following new error:

In summary, there are two issues with MicroXS.from_model().

System specs:
OS: CentOS 7
MPI: MPICH 4.0.2
HDF5: 1.12.2, parallel (compiled from source)
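A minimal sketch of the settings change mentioned above, assuming the rest of the model is set up elsewhere; the batch and particle counts are placeholders.

```python
import openmc

# Settings used to verify the diagnosis: disable writing of source data with the
# statepoint and of the initial source file.
settings = openmc.Settings()
settings.batches = 20                      # placeholder value
settings.particles = 1000                  # placeholder value
settings.sourcepoint = {'write': False}    # don't write source data with statepoints
settings.write_initial_source = False      # don't write initial_source.h5
```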