neams-th-coe / cardinal

High-Fidelity Multiphysics
https://cardinal.cels.anl.gov/

Add build instructions for Bitterroot #932

Open aprilnovak opened 4 months ago

aprilnovak commented 4 months ago

Reason

New machine coming to INL, let's make sure we know how to build Cardinal on it.

Design

Add Bitterroot as a system to Cardinal's HPC documents.

Impact

Better user experience.

lewisgross1296 commented 4 months ago

I was able to ssh into Bitterroot, but upon opening my terminal, the ~/.bashrc I set up for Sawtooth complained:

Lmod has detected the following error:  The following module(s) are unknown: "openmpi/4.1.6-gcc-12.3.0-panw"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "openmpi/4.1.6-gcc-12.3.0-panw"

Also make sure that all modulefiles written in TCL start with the string #%Module

Lmod has detected the following error:  The following module(s) are unknown: "cmake/3.27.7-gcc-12.3.0-5cfk"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "cmake/3.27.7-gcc-12.3.0-5cfk"

Also make sure that all modulefiles written in TCL start with the string #%Module

Lmod has detected the following error:  The following module(s) are unknown: "gcc/12.3.0-gcc-10.5.0-vx2f"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "gcc/12.3.0-gcc-10.5.0-vx2f"

Also make sure that all modulefiles written in TCL start with the string #%Module

It seems like those modules exist when I run module avail, but perhaps the exact names are causing an issue. Maybe I should remove this block from my ~/.bashrc:

###################### CARDINAL ENVIRONMENT ######################
module purge
module load use.moose
module load moose-tools
module load openmpi/4.1.6-gcc-12.3.0-panw
module load cmake/3.27.7-gcc-12.3.0-5cfk
module load gcc/12.3.0-gcc-10.5.0-vx2f # needed for NekRS

Might it just be better to load them in the terminal when building Cardinal? I do get complaints that these modules don't exist when I scp or log onto inlhpclogin, but I just ignore the messages.
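
A first debugging step might be to check whether the same packages exist on Bitterroot under different version strings. These are standard Lmod commands; the trailing hashes like -panw and -5cfk in the Sawtooth module names are machine-specific and may differ:

# Rebuild the Lmod cache and list anything matching the Sawtooth module names
module --ignore_cache avail openmpi cmake gcc

# module spider searches every module tree, including ones not currently in use
module spider openmpi
module spider cmake/3.27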

aprilnovak commented 4 months ago

You can put if statements in your ~/.bashrc to detect which system you're on. We have a similar setup at OLCF. Here's what we do for Frontier vs. Summit; maybe a similar approach will work for Bitterroot/Sawtooth.

if [ "$LMOD_SYSTEM_NAME" = frontier ]; then
    module purge
    module load PrgEnv-gnu craype-accel-amd-gfx90a cray-mpich rocm cray-python/3.9.13.1 cmake/3.21.3
    module unload cray-libsci

    # Revise for your Cardinal repository location
    DIRECTORY_WHERE_YOU_HAVE_CARDINAL=$HOME/frontier
    cd $DIRECTORY_WHERE_YOU_HAVE_CARDINAL

    HOME_DIRECTORY_SYM_LINK=$(realpath -P $DIRECTORY_WHERE_YOU_HAVE_CARDINAL)
    export NEKRS_HOME=$HOME_DIRECTORY_SYM_LINK/cardinal/install

    export OPENMC_CROSS_SECTIONS=/lustre/orion/fus166/proj-shared/novak/cross_sections/endfb-vii.1-hdf5/cross_sections.xml
fi
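
A similar guard could be adapted for the INL machines. LMOD_SYSTEM_NAME may not be defined there, so this sketch keys off the hostname instead; the hostname patterns and the Bitterroot branch are assumptions to adjust, while the Sawtooth module loads are the ones from the ~/.bashrc above:

# NOTE: hostname patterns below are guesses and may need adjusting
if [[ $(hostname) == sawtooth* ]]; then
    module purge
    module load use.moose moose-tools
    module load openmpi/4.1.6-gcc-12.3.0-panw
    module load cmake/3.27.7-gcc-12.3.0-5cfk
    module load gcc/12.3.0-gcc-10.5.0-vx2f   # needed for NekRS
elif [[ $(hostname) == bitterroot* ]]; then
    # Bitterroot module names still to be determined in this issue
    module purge
    module load use.moose
fi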

lewisgross1296 commented 4 months ago

Thanks to @loganharbour, this submit script using module load cardinal-mpich worked for me on Bitterroot. It's running a pretty hefty job quickly. Maybe his Apptainer knowledge could be useful for more detailed build-from-source instructions.

#!/bin/sh
#This file is called submit-script.sh
#SBATCH --partition=general      # default general (option short or hbm)
#SBATCH --time=7-00:00:00        # run time in days-hh:mm:ss (6 hours is the max for short)
#SBATCH --nodes=32               # number of job nodes (max is 168 nodes on general, 336 nodes on short)
#SBATCH --ntasks-per-node=1      # mpi ranks per node
#SBATCH --cpus-per-task=112      # threads per mpi rank
#SBATCH --wckey=moose            # project code
#SBATCH --error=small_inf_assembly.err.%J
#SBATCH --output=small_inf_assembly.txt.%J

module purge
module load use.moose moose-containers cardinal-mpich

JOB_DIR=/home/groslewi/gcmr/mwes/25kp_dt1e-2_small_inf_assembly

export MV2_USE_ALIGNED_ALLOC=1
export MV2_THREADS_PER_PROCESS=${SLURM_CPUS_PER_TASK}
mpiexec cardinal-opt -i ${JOB_DIR}/openmc.i --n-threads=${SLURM_CPUS_PER_TASK}
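
For completeness, a script like this is submitted and monitored with the usual Slurm commands; the job ID below is a placeholder filled in from sbatch's output:

sbatch submit-script.sh                   # queue the job; prints the assigned job ID
squeue -u $USER                           # check whether it is pending or running
tail -f small_inf_assembly.txt.<JOBID>    # follow the output file named in --output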

aprilnovak commented 4 months ago

Does cardinal-mpich include NekRS in it?

loganharbour commented 4 months ago

Does cardinal-mpich include NekRS in it?

It does. It's the base for what's being used for the Docker image: OpenMC, DAGMC, and NekRS.
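
One way to check what the loaded module actually provides, using standard Lmod/shell commands (what cardinal-mpich sets isn't documented in this thread, so treat the command output as the source of truth):

module load use.moose moose-containers
module show cardinal-mpich      # print the paths and environment variables the module sets
which cardinal-opt              # verify the Cardinal executable is on PATH after loading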

lewisgross1296 commented 4 months ago

@AyaHegazy22 and I chatted a bit and she was unable to reproduce my success. This makes sense, though: we discovered that I was also only able to run the job on a few select nodes. Thanks to Logan, it should now work on every node.

I just launched a job that is running. Aya, if you get a chance, try again. Here's my working submit script (it has a few better defaults for the #SBATCH options).

#!/bin/sh
#This file is called submit-script.sh
#SBATCH --partition=general      # default general (option short or hbm)
#SBATCH --time=0-06:00:00        # run time in days-hh:mm:ss (6 hours is the max for short)
#SBATCH --nodes=24               # number of job nodes (max is 168 nodes on general, 336 nodes on short)
#SBATCH --ntasks-per-node=1      # mpi ranks per node
#SBATCH --cpus-per-task=112      # threads per mpi rank
#SBATCH --wckey=moose            # project code
#SBATCH --error=small_inf_assembly.err.%J
#SBATCH --output=small_inf_assembly.txt.%J

module purge
module load use.moose moose-containers cardinal-mpich/2024.07.12-b44370a
JOB_DIR=/home/groslewi/gcmr/mwes/small_inf_assembly

export MV2_USE_ALIGNED_ALLOC=1
export MV2_THREADS_PER_PROCESS=${SLURM_CPUS_PER_TASK}
mpiexec cardinal-opt -i ${JOB_DIR}/openmc.i --n-threads=${SLURM_CPUS_PER_TASK}
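
One visible change from the earlier script is loading a pinned cardinal-mpich version rather than the default. To see which versions are installed before pinning one, the standard Lmod queries apply (module names as used above):

module load use.moose moose-containers
module avail cardinal-mpich      # list installed cardinal-mpich versions
module spider cardinal-mpich     # search all module trees for available versions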

meltawila commented 2 months ago

@lewisgross1296, is there any update on this? With the above, it looks like you were still using only the pre-built Cardinal, right?

lewisgross1296 commented 2 months ago

I have not tried to build from source on Bitterroot, since it seems the suggested way is to use the provided Apptainer container. The container has worked pretty well so far, though.

I have yet to try a Nek case, so I can't confirm the behavior there. If @loganharbour is able to share the Apptainer build script, that might be useful for others trying to build from source.
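
In the meantime, for anyone who wants to interact with the container image directly rather than through the module, the generic Apptainer workflow looks roughly like this; the .sif path is a placeholder, since the actual image location on Bitterroot isn't documented in this thread:

# Open an interactive shell inside the container image (path is a placeholder)
apptainer shell /path/to/cardinal-mpich.sif

# Or run Cardinal directly, binding the current working directory into the container
apptainer exec --bind $PWD /path/to/cardinal-mpich.sif cardinal-opt -i openmc.i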