Closed by mlei012 2 years ago
Can you post your job file as well for this? If you did not, make sure you include gpu_bind=none in the SBATCH lines. (This became necessary after the last Cray update, and I don't think I have updated the page yet.)
Additionally, your example job cannot really use 8 GPUs since it has a single k-point (nStates = 1). For a more realistic test, also run a calculation that has at least as many nStates as GPUs you are using.
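As a rough sanity check before picking a GPU count, the comparison above can be sketched in shell. The helper name `enough_states` is hypothetical and not part of JDFTx, and the nStates value is assumed to be read by hand from the JDFTx output of a short test run.

```shell
# Hypothetical helper (not part of JDFTx): compare nStates with the GPU count.
# Usage: enough_states <nStates> <nGPUs>
enough_states() {
    if [ "$1" -ge "$2" ]; then
        echo "ok: $1 states for $2 GPUs"
    else
        echo "warning: only $1 states for $2 GPUs"
        return 1
    fi
}

# The single k-point case from this thread: 1 state, 8 GPUs requested.
enough_states 1 8 || true
```

A non-zero return status marks the case where there are fewer states than GPUs, so the check can also gate a submission script.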
Best, Shankar
Thank you for your reply, Shankar.
The input file comes from the JDFTx tutorial. I have now added 333 k-points.
BTW, there is no "gpu_bind=none" option in the srun usage. Attached: submit.txt, water.txt
The relevant line is:
#SBATCH --gpu-bind=none
Also, you can test a preliminary version of a module we are building for public use, if you prefer. Here's an example job file using that module:
#!/bin/bash
#SBATCH -C gpu
#SBATCH -q regular_ss11
#SBATCH -t 5:00
#SBATCH -n 8
#SBATCH --ntasks-per-node=4
#SBATCH -c 32
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=none
#SBATCH -o gpu.o%j
module use /global/cfs/cdirs/m4025/Software/Perlmutter/modules
module load jdftx/gpu
export SLURM_CPU_BIND="cores"
srun jdftx_gpu -i in
Note that the module takes care of setting some of the environment variables (like MPICH_GPU_SUPPORT_ENABLED), so you don't need them in the job file. Of course, the MEMPOOL size still depends on the job and should be specified explicitly.
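For illustration, the environment lines in a job file might look like the sketch below. The variable name JDFTX_MEMPOOL_SIZE and its unit (MB) are assumptions that should be checked against the JDFTx documentation, and the value 8192 is only a placeholder, not a recommendation.

```shell
# Only needed when NOT using the jdftx/gpu module, which sets this itself.
export MPICH_GPU_SUPPORT_ENABLED=1

# Assumed variable name and unit (MB); the value is a placeholder and
# should be chosen for the memory needs of the actual job.
export JDFTX_MEMPOOL_SIZE=8192
```

These lines would go before the srun command in the job file.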
Best, Shankar
It works now. Thank you so much.
I compiled jdftx on NERSC Perlmutter following this website:
The output file shows an error; the details are in the attached slurm-3246040.txt. (I also compiled JDFTx on NERSC Cori, where everything works without errors.)
----- Setting up reduced wavefunction bases (one per k-point) -----
average nbasis = 4337.000 , ideal nbasis = 4272.076
(GTL DEBUG: 1) cuIpcOpenMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 272
(GTL DEBUG: 2) cuIpcOpenMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 272
MPICH ERROR [Rank 1] [job id 3246040.0] [Fri Sep 23 17:26:26 2022] [nid001616] - Abort(475555330) (rank 1 in comm 0): Fatal error in MPIR_CRAY_Bcast_Tree: Invalid count, error stack:
MPIR_CRAY_Bcast_Tree(183).................: message sizes do not match across processes in the collective routine: Received -32766 but expected 20992