Closed: danielpert closed this issue 1 year ago
Thanks @danielpert. That new OFI error message is interesting. It looks like others have encountered it on Perlmutter/Crusher at OLCF. On Perlmutter they suggested two fixes:
export MPICH_COLL_SYNC=MPI_Bcast
or
export FI_CXI_DEFAULT_CQ_SIZE=71680
export FI_CXI_REQ_BUF_SIZE=12582912
export FI_UNIVERSE_SIZE=4096
Based on this comment it sounds like the first method was more reliable. Would you be willing to test?
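For clarity, here are the two suggested workarounds as they would appear in a batch script (values verbatim from the suggestions above; they are alternatives, so try one at a time):

```shell
# Workaround 1 (reported as the more reliable fix):
# synchronize collectives via MPI_Bcast.
export MPICH_COLL_SYNC=MPI_Bcast

# Workaround 2 (alternative; not meant to be combined with workaround 1):
# enlarge the libfabric CXI completion queue and request buffers.
export FI_CXI_DEFAULT_CQ_SIZE=71680
export FI_CXI_REQ_BUF_SIZE=12582912
export FI_UNIVERSE_SIZE=4096
```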
Relevant issues: Crusher Perlmutter
It would surprise me a lot if Bcast synchronization mattered to NWChem; I haven't seen it used anywhere on the critical path.
With MPI-PR, you'll want to look at settings that impact send-receive flow control.
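For example, the flow-control knobs that come into play with MPI-PR on this system include the COMEX eager/rendezvous thresholds and the CXI receive match mode. The values below are the ones used later in this thread, shown here as an illustration rather than a recommendation:

```shell
# COMEX (MPI-PR) flow control: cap the number of outstanding
# non-blocking operations per rank.
export COMEX_MAX_NB_OUTSTANDING=6
# Switch from the eager to the rendezvous protocol at 16 KiB.
export COMEX_EAGER_THRESHOLD=16384
export FI_CXI_RDZV_THRESHOLD=16384
# Fall back to software matching when hardware match entries run out.
export FI_CXI_RX_MATCH_MODE=hybrid
```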
Thanks. I don't have enough knowledge to know if 1) this kind of behavior suggests that we just need to find the correct setting (and if so, I'd appreciate any pointers) or 2) this could reflect a problem with our network. If it's the latter, it would be helpful to know as soon as possible so we can engage with our vendor.
Sorry for the delay; the job was waiting a long time in the queue and then Perlmutter was also down for a bit. I tested with export MPICH_COLL_SYNC=MPI_Bcast
and it failed with one of the same errors I saw before:
[63] header operation not recognized: -212919191
[63] ../../ga-5.8.1/comex/src-mpi-pr/comex.c:3277: _progress_server: Assertion `0' failed
[63] Received an Error in Communication: (-1) comex_assert_fail
MPICH ERROR [Rank 63] [job id 8441431.0] [Fri May 5 09:40:55 2023] [nid005664] - Abort(-1) (rank 63 in comm 496): application called MPI_Abort(comm=0x84000001, -1) - process 63
I will also test with the other method
@danielpert
Could you try the following Slurm script that uses a NWChem 7.2.0 Shifter image (nacl16_1co.nw is the input file name in this example)?
#!/bin/bash
#SBATCH -C cpu
#SBATCH -t 0:29:00
#SBATCH -q debug
#SBATCH -N 8
#SBATCH -A XXXX
#SBATCH --cpus-per-task=2
#SBATCH --ntasks-per-node=64
#SBATCH -J nacl16_1co
#SBATCH -o nacl16_1co.%j.out
#SBATCH -e nacl16_1co.%j.out
#SBATCH --image=ghcr.io/nwchemgit/nwchem-dev.nersc.mpich4.mpi-pr:20230203_160345
echo image nwchemgit/nwchem-dev 20230203_160345
module purge
module load PrgEnv-gnu
module load cudatoolkit
module load cray-pmi
module list
export OMP_NUM_THREADS=1
export OMP_PROC_BIND=true
export COMEX_MAX_NB_OUTSTANDING=6
export FI_CXI_RX_MATCH_MODE=hybrid
export COMEX_EAGER_THRESHOLD=16384
export FI_CXI_RDZV_THRESHOLD=16384
export FI_CXI_OFLOW_BUF_COUNT=6
export MPICH_SMP_SINGLE_COPY_MODE=CMA
srun -N $SLURM_NNODES --cpu-bind=cores shifter --module=mpich nwchem nacl16_1co.nw
I tried that; the job did not fail, but it essentially stalled and made no progress until it hit the wall time. I got this warning:
PE 191: MPICH WARNING: OFI is failing to make progress on posting a receive. MPICH suspects a hang due to completion queue exhaustion. Setting environment variable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...
I set FI_CXI_DEFAULT_CQ_SIZE=71680 but got the same issue.
Should I try increasing it further, to 143360?
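For what it's worth, 143360 is simply double the 71680 suggested earlier, so one could keep doubling in the same way if the warning persists:

```shell
# 143360 is double the earlier suggestion of 71680.
echo $(( 71680 * 2 ))
```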
I also got this message:
Unloading the cpe module is insufficient to restore the system defaults.
Please run 'source /opt/cray/pe/cpe/23.03/restore_lmod_system_defaults.[csh|sh]'.
I can try adding source /opt/cray/pe/cpe/23.03/restore_lmod_system_defaults.sh to my script after module purge. This seems to work without any warnings when I run it in the terminal, though I'm not sure whether this is the issue.
Update: I am still getting the same issue
@danielpert I have a fix for the poorly parallelized code that was causing the error posted in https://github.com/nwchemgit/nwchem/issues/775#issuecomment-1539325239
This fix is applied to the image ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_14411
Could you please try the same Slurm batch script I posted in https://github.com/nwchemgit/nwchem/issues/775#issuecomment-1538741540 with the new shifter image?
#SBATCH --image=ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_14411
echo image ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_14411
I cannot seem to use that image; when I submit the submission script I get this error:
sbatch: error: Failed to lookup image. Aborting.
Sorry about giving the wrong image name. I missed one last 1 character.
Here are the correct lines for the Slurm script:
#SBATCH --image=ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_144111
echo image ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_144111
Yes, when I try that image my job runs successfully!
Thank you very much for this feedback. Let me do more testing on this change just to be sure it does not break any other functionality.
This fix is now present in the default NERSC Shifter images ghcr.io/nwchemgit/nwchem-720.nersc.mpich4.mpi-pr:latest ghcr.io/nwchemgit/nwchem-dev.nersc.mpich4.mpi-pr:latest
The NERSC documentation for NWChem was updated with the current Shifter image information for Perlmutter:
https://docs.nersc.gov/applications/nwchem/#slurm-script-for-nwchem-shifter-image-on-perlmutter-cpus
Thanks @edoapra!
Just a heads up that we're working on a new container runtime called podman-hpc: https://docs.nersc.gov/development/podman-hpc/overview/
It's still in an early phase with several known issues, but I wanted to put it on your radar since we may eventually retire Shifter in favor of podman-hpc (timeframe ~years, so no urgent action required).
Is podman available for any user on Perlmutter at this point in time?
Yes, it's open to all users without any additional configuration required. Anyone can test today.
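For anyone who wants to experiment, a rough translation of the Shifter invocation from the script above to podman-hpc might look like the sketch below. This is a hypothetical example based on the podman-hpc overview docs; check them for the current flags, since the tool is still in an early phase:

```shell
# Pull the image once from a login node (reusing the Shifter image
# tag from this thread).
podman-hpc pull ghcr.io/nwchemgit/nwchem-720.nersc.mpich4.mpi-pr:latest

# Inside a batch script, roughly analogous to the shifter srun line:
srun -N $SLURM_NNODES --cpu-bind=cores \
  podman-hpc run --rm --mpi \
  ghcr.io/nwchemgit/nwchem-720.nersc.mpich4.mpi-pr:latest \
  nwchem nacl16_1co.nw
```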
Describe the bug
When I try to run a geometry optimization followed by a DFT frequency calculation, the program fails after the geometry optimization. The last thing in the .out file is "Multipole analysis of the density". The error message I am getting is:
Describe settings used
I am using these environment variables:
export OMP_NUM_THREADS=2
export OMP_PROC_BIND=spread
export MPICH_GNI_MAX_EAGER_MSG_SIZE=131026
export MPICH_GNI_NUM_BUFS=80
export MPICH_GNI_NDREG_MAXSIZE=16777216
export MPICH_GNI_MBOX_PLACEMENT=nic
export MPICH_GNI_RDMA_THRESHOLD=65536
export COMEX_MAX_NB_OUTSTANDING=6
At first I got this error after 3 minutes:
I added these environment variables with help from @lastephey, which allowed the geometry optimization to run, but then I got the error described above when it tried to start calculating the vibrational frequencies:
export CXI_FORK_SAFE=1
export CXI_FORK_SAFE_HP=1
export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_DEFAULT_CQ_SIZE=128000
Report what operating system and distribution you are using. SUSE Linux Enterprise Server 15 SP4
Attach log files
files.zip contains my submission script, NWChem input, starting geometry, and stdout/stderr.
To Reproduce
Expected behavior
I expected the program to complete and calculate the energy and frequencies.
Additional context
I am running this on the Perlmutter cluster at the National Energy Research Scientific Computing Center (NERSC).