marekandreas / elpa

A scalable eigensolver for dense, symmetric (hermitian) matrices (fork of https://gitlab.mpcdf.mpg.de/elpa/elpa.git)

ELPA GPU kernels are not working on A100 #15

Closed shiba-h closed 2 years ago

shiba-h commented 2 years ago

We have built cp2k-9.1 for NVIDIA A100 and installed elpa-2021.11.001 via its toolchain.

We get the following message when ELPA is called from cp2k-9.1:

 Initializing the GPU devices

Found 8 GPUs
MPI rank 0 uses GPU #0
 ELPA: Warning, GPU usage has been requested but compute kernel is set by the user as non-GPU!
 The compute kernel will be executed on CPUs!

I notice that this message comes from a conditional branch around line 796 in src/elpa2/elpa2_template.F90. It is triggered when both of the following variables are TRUE: WITH_REAL_NVIDIA_SM80_GPU_KERNEL and GPU_KERNEL. The executables we built via the cp2k-9.1 toolchain contain both the normal GPU kernel and the kernel for NVIDIA A100, and this seems to be the source of the problem.

For our purposes, it would suffice if we could run the normal GPU kernel (instead of the one for A100). Is it possible to disable the build of the new A100 GPU kernel via the configure options?

The following are our current configure options. Our system is an Intel Xeon Platinum 8360Y (two sockets), equipped with eight A100 GPUs. The compilers are the Intel oneAPI compilers (2021.2.0) and CUDA 11.2.

../configure --libdir="${pkg_install_dir}/${TARGET}/lib" \
   --enable-openmp=yes \
   --enable-shared=no \
   --enable-static=yes \
   ${other_kernel_flags} \
   --enable-nvidia-gpu=yes \
   --with-cuda-path=${CUDA_PATH} \
   --with-NVIDIA-GPU-compute-capability=sm_80 \
   ${other_config_flags}

I appreciate your help on this issue. Thank you in advance.

marekandreas commented 2 years ago

Good point! I will implement an option to switch the building of the A100 kernel on or off as soon as possible. Currently this is coupled to "--with-NVIDIA-GPU-compute-capability=sm_80". For the time being, I suggest that you set "--with-NVIDIA-GPU-compute-capability=sm_70". This will not slow down the execution of the standard GPU kernel, but it will not enable the A100 kernel.
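As a rough sketch of this workaround, the configure options from the report could be assembled as below, with only the compute capability changed from sm_80 to sm_70. The default values for CUDA_PATH, pkg_install_dir and TARGET are hypothetical placeholders (the cp2k toolchain normally sets them), and GPU_CAPABILITY is a name introduced here for illustration. The snippet only prints the resulting command line so that it can be inspected first.

```shell
#!/bin/sh
# Workaround sketch: same options as in the report, but built for sm_70,
# so that the A100-specific (sm_80) kernel is not enabled.
GPU_CAPABILITY="sm_70"   # illustrative variable; the report used sm_80

# Hypothetical placeholder defaults; the cp2k toolchain sets the real values.
: "${CUDA_PATH:=/usr/local/cuda}"
: "${pkg_install_dir:=/opt/elpa}"
: "${TARGET:=nvidia}"

# Collect the configure options as positional parameters.
# other_kernel_flags and other_config_flags come from the toolchain and
# simply expand to nothing if they are unset.
set -- \
    --libdir="${pkg_install_dir}/${TARGET}/lib" \
    --enable-openmp=yes \
    --enable-shared=no \
    --enable-static=yes \
    ${other_kernel_flags} \
    --enable-nvidia-gpu=yes \
    --with-cuda-path="${CUDA_PATH}" \
    --with-NVIDIA-GPU-compute-capability="${GPU_CAPABILITY}" \
    ${other_config_flags}

# Dry run: print the command line instead of executing it.
configure_line="../configure $*"
echo "${configure_line}"
```

Once the printed line looks right, the same options can be passed on from the ELPA build directory with ../configure "$@".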

Independent of this workaround, you have stumbled across a bug in line 796 of src/elpa2/elpa2_template.F90. I will prepare a point release of ELPA 2021.11 in the next few days to fix this for the case where both the A100 and the standard GPU kernel are available.