Can you post your job file as well for this? If you did not, make sure you include gpu_bind=none in the SBATCH lines. (This became necessary after the last Cray update, and I don't think I have updated the page yet.)
Additionally, your example job cannot really use 8 GPUs since it has a single k-point (nStates = 1). For a more realistic test, also run a calculation that has at least as many nStates as GPUs you are using.
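To make that concrete: JDFTx distributes its k-points (reported as nStates) across MPI processes, and with the GPU build each process drives one GPU, so a single-k-point job leaves seven of your eight GPUs idle. A rough sanity check along these lines (a sketch with assumed numbers, not taken from your job files):
# With one GPU per MPI task, each task gets roughly nStates / ntasks states;
# tasks that receive none leave their GPU idle.
NSTATES=27   # e.g. what a 3x3x3 k-point mesh would give (assumed value)
NTASKS=8     # one MPI task per GPU
echo "states per task ~ $(( NSTATES / NTASKS ))"   # 3 here; 0 would mean idle GPUs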
Best, Shankar
Thank you for your reply, Shankar.
The input file comes from the JDFTx tutorial. I have now added a 3x3x3 k-point mesh.
By the way, there is no "gpu_bind=none" option in the srun usage output. Attachments: submit.txt, water.txt
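(For context, adding such a mesh in JDFTx amounts to a single extra input line; the sketch below shows the kind of change, not the exact contents of the attached water.txt.)
# Fold the Gamma point of the tutorial input into a 3x3x3 k-point mesh.
# More k-points means more nStates, so the 8 MPI tasks / GPUs in the job
# can all get work.
kpoint-folding 3 3 3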
The relevant line is:
#SBATCH --gpu-bind=none
Also, you can test a preliminary version of a module we are building for public use, if you prefer. Here's an example job file using that module:
#!/bin/bash
#SBATCH -A <ACCOUNT>_g
#SBATCH -C gpu
#SBATCH -q regular_ss11
#SBATCH -t 5:00
#SBATCH -N 2
#SBATCH -n 8
#SBATCH --ntasks-per-node=4
#SBATCH -c 32
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=none
#SBATCH -o gpu.o%j
module use /global/cfs/cdirs/m4025/Software/Perlmutter/modules
module load jdftx/gpu
export SLURM_CPU_BIND="cores"
export JDFTX_MEMPOOL_SIZE=8192
srun jdftx_gpu -i in
Note that the module takes care of specifying some of the environment variables (like MPICH_GPU_SUPPORT_ENABLED), so you don't need to set them in the job file. Of course, the mempool size still depends on the job and should be specified explicitly.
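As a rough sizing guide (a sketch with assumed numbers; it assumes one MPI task per 40 GB A100 as in the example above, and that JDFTX_MEMPOOL_SIZE is given in MB):
GPU_MEM_MB=40960                                  # 40 GB A100 per MPI task (assumption)
export JDFTX_MEMPOOL_SIZE=$(( GPU_MEM_MB / 4 ))   # 10240 MB here; raise it for large jobs,
                                                  # but stay well below GPU_MEM_MB to leave
                                                  # headroom for MPI buffers and other allocations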
Best, Shankar
It works now. Thank you so much.
I compiled JDFTx on NERSC Perlmutter following this page: https://jdftx.org/Supercomputers.html
The run shows an error in the output file; the details are in the attached slurm-3246040.txt. (I also compiled JDFTx on NERSC Cori, where there is no error and everything works fine.) The relevant excerpt:
----- Setting up reduced wavefunction bases (one per k-point) -----
average nbasis = 4337.000 , ideal nbasis = 4272.076
(GTL DEBUG: 1) cuIpcOpenMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 272
(GTL DEBUG: 2) cuIpcOpenMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 272
MPICH ERROR [Rank 1] [job id 3246040.0] [Fri Sep 23 17:26:26 2022] [nid001616] - Abort(475555330) (rank 1 in comm 0): Fatal error in MPIR_CRAY_Bcast_Tree: Invalid count, error stack:
MPIR_CRAY_Bcast_Tree(183).................: message sizes do not match across processes in the collective routine: Received -32766 but expected 20992