nwchemgit / nwchem

NWChem: Open Source High-Performance Computational Chemistry
http://nwchemgit.github.io

NWChem Shifter image fails with MPI errors #775

Closed danielpert closed 1 year ago

danielpert commented 1 year ago

Describe the bug: When I try to run a geometry optimization followed by a DFT frequency calculation, the program fails after the geometry optimization completes. The last thing written to the .out file is "Multipole analysis of the density". The error message I am getting is:

MPICH ERROR [Rank 63] [job id 8408833.0] [Wed May  3 01:50:11 2023] [nid005166] - Abort(874629263) (rank 63 in comm 496): Fatal error in PMPI_Recv: Other MPI error, error stack:
PMPI_Recv(177).................: MPI_Recv(buf=0x7f53e3523c58, count=8, MPI_CHAR, src=96, tag=27624, comm=0x84000001, status=0x7fffeec663b0) failed
MPIR_Wait_impl(41).............:
MPID_Progress_wait(184)........:
MPIDI_Progress_test(80)........:
MPIDI_OFI_handle_cq_error(1062): OFI poll failed (ofi_events.c:1064:MPIDI_OFI_handle_cq_error:Message too long - OK)

Describe settings used: I am using these environment variables:

export OMP_NUM_THREADS=2
export OMP_PROC_BIND=spread
export MPICH_GNI_MAX_EAGER_MSG_SIZE=131026
export MPICH_GNI_NUM_BUFS=80
export MPICH_GNI_NDREG_MAXSIZE=16777216
export MPICH_GNI_MBOX_PLACEMENT=nic
export MPICH_GNI_RDMA_THRESHOLD=65536
export COMEX_MAX_NB_OUTSTANDING=6

At first I got this error after 3 minutes:

[191] ../../ga-5.8.1/comex/src-mpi-pr/comex.c:3337: _put_handler: Assertion `reg_entry' failed[191] Received an Error in Communication: (-1) comex_assert_fail
MPICH ERROR [Rank 191] [job id 8235628.0] [Fri Apr 28 12:44:46 2023] [nid005178] - Abort(-1) (rank 191 in comm 496): application called MPI_Abort(comm=0x84000001, -1) - process 191

srun: error: nid005178: tasks 160,162,168,178,186: Exited with exit code 255
srun: Terminating StepId=8235628.0
[127] header operation not recognized: -431467067
[127] ../../ga-5.8.1/comex/src-mpi-pr/comex.c:3277: _progress_server: Assertion `0' failed[127] Received an Error in Communication: (-1) comex_assert_fail
MPICH ERROR [Rank 127] [job id 8235628.0] [Fri Apr 28 12:44:47 2023] [nid005146] - Abort(-1) (rank 127 in comm 496): application called MPI_Abort(comm=0x84000001, -1) - process 127

I added these environment variables with help from @lastephey, which allowed the geometry optimization to run, but then I got the error described above when it tried to start calculating the vibrational frequencies:

export CXI_FORK_SAFE=1
export CXI_FORK_SAFE_HP=1
export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_DEFAULT_CQ_SIZE=128000

Report what operating system and distribution you are using: SUSE Linux Enterprise Server 15 SP4

Attach log files: files.zip contains my submission script, NWChem input, starting geometry, and stdout/stderr.

To Reproduce

  1. Steps to reproduce the behavior: run NWChem using the attached input and environment variables with the Docker image (see the launch sketch after this list).
  2. Attach all the input files required to run.
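
In outline, the launch looks like the following minimal sketch; the image tag, node count, and input file name here are placeholders rather than the exact values in files.zip:

#!/bin/bash
#SBATCH -C cpu
#SBATCH -N 4
#SBATCH --ntasks-per-node=64
#SBATCH --image=ghcr.io/nwchemgit/nwchem-dev.nersc.mpich4.mpi-pr:latest
export OMP_NUM_THREADS=2
export OMP_PROC_BIND=spread
# ... remaining MPICH_GNI_* / COMEX_* exports from "Describe settings used" above ...
srun -N $SLURM_NNODES --cpu-bind=cores shifter --module=mpich nwchem input.nw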

Expected behavior: I expected the program to complete and calculate the energy and frequencies.

Additional context: I am running this on the Perlmutter cluster at the National Energy Research Scientific Computing Center (NERSC).

lastephey commented 1 year ago

Thanks @danielpert. That new OFI error message is interesting. It looks like others have encountered it on Perlmutter and on Crusher at OLCF. On Perlmutter they suggested two fixes:

export MPICH_COLL_SYNC=MPI_Bcast

or

export FI_CXI_DEFAULT_CQ_SIZE=71680
export FI_CXI_REQ_BUF_SIZE=12582912
export FI_UNIVERSE_SIZE=4096

Based on this comment it sounds like the first method was more reliable. Would you be willing to test?

Relevant issues: Crusher Perlmutter

jeffhammond commented 1 year ago

It would surprise me a lot if Bcast synchronization mattered to NWChem. NWChem doesn't use it on the critical path anywhere that I've read.

With MPI-PR, you'll want to look at settings that impact send-receive flow control.
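
(For concreteness, the settings in this thread that touch that send-receive flow-control path look like the sketch below; the values are the ones that appear in the Slurm script later in the thread, and the comments are a rough gloss rather than authoritative documentation.)

export COMEX_MAX_NB_OUTSTANDING=6    # cap on outstanding non-blocking ComEx operations
export COMEX_EAGER_THRESHOLD=16384   # message size above which ComEx MPI-PR stops sending eagerly
export FI_CXI_RDZV_THRESHOLD=16384   # matching libfabric CXI rendezvous threshold
export FI_CXI_RX_MATCH_MODE=hybrid   # fall back to software receive matching when hardware slots run out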

lastephey commented 1 year ago

Thanks. I don't have enough knowledge to know if 1) this kind of behavior suggests that we just need to find the correct setting (and if so, I'd appreciate any pointers) or 2) this could reflect a problem with our network. If it's the latter, it would be helpful to know as soon as possible so we can engage with our vendor.

danielpert commented 1 year ago

Sorry for the delay; the job was waiting a long time in the queue and then Perlmutter was also down for a bit. I tested with export MPICH_COLL_SYNC=MPI_Bcast and it failed with one of the same errors I saw before:

[63] header operation not recognized: -212919191
[63] ../../ga-5.8.1/comex/src-mpi-pr/comex.c:3277: _progress_server: Assertion `0' failed[63] Received an Error in Communication: (-1) comex_assert_fail
MPICH ERROR [Rank 63] [job id 8441431.0] [Fri May  5 09:40:55 2023] [nid005664] - Abort(-1) (rank 63 in comm 496): application called MPI_Abort(comm=0x84000001, -1) - process 63

I will also test with the other method.

edoapra commented 1 year ago

@danielpert Could you try the following Slurm script that uses a NWChem 7.2.0 Shifter image (nacl16_1co.nw is the input file name in this example)?

#!/bin/bash
#SBATCH -C cpu
#SBATCH -t 0:29:00
#SBATCH -q debug
#SBATCH -N 8
#SBATCH -A XXXX
#SBATCH --cpus-per-task=2
#SBATCH --ntasks-per-node=64
#SBATCH -J nacl16_1co
#SBATCH -o nacl16_1co.%j.out
#SBATCH -e nacl16_1co.%j.out
#SBATCH --image=ghcr.io/nwchemgit/nwchem-dev.nersc.mpich4.mpi-pr:20230203_160345
echo image nwchemgit/nwchem-dev 20230203_160345
module purge
module load PrgEnv-gnu
module load cudatoolkit
module load cray-pmi
module list
export OMP_NUM_THREADS=1
export OMP_PROC_BIND=true
export COMEX_MAX_NB_OUTSTANDING=6
export FI_CXI_RX_MATCH_MODE=hybrid
export COMEX_EAGER_THRESHOLD=16384
export FI_CXI_RDZV_THRESHOLD=16384
export FI_CXI_OFLOW_BUF_COUNT=6
export MPICH_SMP_SINGLE_COPY_MODE=CMA
srun -N $SLURM_NNODES --cpu-bind=cores shifter --module=mpich nwchem nacl16_1co.nw

danielpert commented 1 year ago

I tried that. The job did not fail, but it just stalled and made no progress until it hit the wall time. I got this warning:

PE 191: MPICH WARNING: OFI is failing to make progress on posting a receive. MPICH suspects a hang due to completion queue exhaustion. Setting environment variable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...

I set FI_CXI_DEFAULT_CQ_SIZE=71680 but got the same issue.

danielpert commented 1 year ago

I can try increasing it further, to 143360?

danielpert commented 1 year ago

I also got this message:

Unloading the cpe module is insufficient to restore the system defaults.
Please run 'source /opt/cray/pe/cpe/23.03/restore_lmod_system_defaults.[csh|sh]'.

I can try adding source /opt/cray/pe/cpe/23.03/restore_lmod_system_defaults.sh to my script after module purge. This seems to work without any warnings when I run it in the terminal. Not sure if this is the issue, though.
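
A sketch of how the module section of the batch script would look with that change (the path is the one from the warning above):

module purge
source /opt/cray/pe/cpe/23.03/restore_lmod_system_defaults.sh   # restore Lmod system defaults after the purge, per the warning
module load PrgEnv-gnu
module load cudatoolkit
module load cray-pmi
module list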

Update: I am still getting the same issue.

edoapra commented 1 year ago

@danielpert I have a fix for the poorly parallelized code that was causing the error posted in https://github.com/nwchemgit/nwchem/issues/775#issuecomment-1539325239

This fix is applied to the image ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_14411

Could you please try the same Slurm batch script I posted in https://github.com/nwchemgit/nwchem/issues/775#issuecomment-1538741540 with the new shifter image?

#SBATCH --image=ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_14411
echo image ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_14411

danielpert commented 1 year ago

I cannot seem to use that image; when I submit the submission script I get this error:

sbatch: error: Failed to lookup image. Aborting.

edoapra commented 1 year ago

Sorry about giving the wrong image name; I missed the last character. Here are the correct lines for the Slurm script:

#SBATCH --image=ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_144111
echo image ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_144111

danielpert commented 1 year ago

Yes, when I try that image my job runs successfully!

edoapra commented 1 year ago

Yes, when I try that image my job runs successfully!

Thank you very much for this feedback. Let me do more testing on this change just to be sure it does not break any other functionality.

edoapra commented 1 year ago

This fix is now present in the default NERSC Shifter images:

ghcr.io/nwchemgit/nwchem-720.nersc.mpich4.mpi-pr:latest
ghcr.io/nwchemgit/nwchem-dev.nersc.mpich4.mpi-pr:latest
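
If sbatch reports "Failed to lookup image" (as happened earlier in this thread), the image likely still needs to be pulled to Perlmutter first; a sketch of the usual Shifter workflow, where the shifterimg step reflects my understanding of the NERSC setup rather than anything specific to NWChem:

shifterimg pull ghcr.io/nwchemgit/nwchem-720.nersc.mpich4.mpi-pr:latest
# then reference it in the Slurm script:
#SBATCH --image=ghcr.io/nwchemgit/nwchem-720.nersc.mpich4.mpi-pr:latest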

edoapra commented 1 year ago

The NERSC documentation for NWChem was updated with information about the current Shifter images for Perlmutter:

https://docs.nersc.gov/applications/nwchem/#slurm-script-for-nwchem-shifter-image-on-perlmutter-cpus

lastephey commented 1 year ago

Thanks @edoapra!

Just a heads up that we're working on a new container runtime called podman-hpc: https://docs.nersc.gov/development/podman-hpc/overview/

It's still in an early phase with several known issues, but I wanted to put it on your radar since we may eventually retire Shifter in favor of podman-hpc (timeframe ~years, so no urgent action required).

edoapra commented 1 year ago

Thanks @edoapra!

Just a heads up that we're working on a new container runtime called podman-hpc: https://docs.nersc.gov/development/podman-hpc/overview/

It's still in an early phase with several known issues, but I wanted to put it on your radar since we may eventually retire Shifter in favor of podman-hpc (timeframe ~years, so no urgent action required).

Is podman available for any user on Perlmutter at this point in time?

lastephey commented 1 year ago

Yes, it's open to all users without any additional configuration required. Anyone can test today.
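
For anyone who wants to experiment, a rough sketch based on the podman-hpc documentation linked above; the pull/run command names, the --mpi flag, and the input file name are assumptions drawn from those docs rather than something tested in this thread:

# pull the NWChem image once, then launch under Slurm with MPI support
podman-hpc pull ghcr.io/nwchemgit/nwchem-dev.nersc.mpich4.mpi-pr:latest
srun -N 2 --ntasks-per-node=64 podman-hpc run --rm --mpi ghcr.io/nwchemgit/nwchem-dev.nersc.mpich4.mpi-pr:latest nwchem input.nw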