marekandreas / elpa

A scalable eigensolver for dense, symmetric (hermitian) matrices (fork of https://gitlab.mpcdf.mpg.de/elpa/elpa.git)

Issues in the bandred routines #22

Closed fstein93 closed 1 year ago

fstein93 commented 1 year ago

I am one of the CP2K developers and I am currently attempting to enforce block sizes that are powers of 2 whenever we use ELPA to solve eigenvalue problems (see https://github.com/cp2k/cp2k/pull/2407 and the sketch at the end of this comment). The code works fine on CPU, but not on GPU, where ELPA occasionally throws

ELPA2: bandred returned an error. Aborting...
Problem getting option for debug settings. Aborting...

It happens repeatedly with different kinds of tests. We suppose that ELPA was never run on GPU in these cases.
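For context, the block-size restriction from the pull request amounts to choosing a power-of-two block size before it is handed to ELPA; an illustrative sketch (the actual logic is in the pull request linked above; rounding down is an assumption here):

```c
/* Illustrative sketch only: choose the largest power of two that does not
 * exceed the block size that would otherwise be requested. */
static int power_of_two_block_size(int requested_nblk) {
  int nblk = 1;
  while (2 * nblk <= requested_nblk)
    nblk *= 2;
  return nblk;  /* e.g. a requested block size of 6 becomes 4 */
}
```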

marekandreas commented 1 year ago

Dear @fstein93, ELPA1 and ELPA2 are routinely run on large GPU systems: for example, a 1.7M matrix on 24,000 GPUs on Summit, and, as of late, a 2000M matrix on more than 4000 MI250X GPUs on LUMI (CSC). So ELPA supports and works on GPUs in general. That said, this of course does not rule out a problem with the specific setup you have.

Can you check whether this also happens with the ELPA 1stage GPU solver?

In order to debug this, could you give some more details on when the problem appears in your case?

Thanks

fstein93 commented 1 year ago

All tests are run on an NVIDIA Tesla P4, 1 node (24x Intel Xeon W-2000 / D-2100), 1 GPU per node, 2 MPI tasks per node, 2 threads per rank. The CUDA environment is the nvidia/cuda:11.8.0-devel-ubuntu22.04 container image. Square blocks are employed.

In one case, we have the following parameters (printout by CP2K):

ELPA| Matrix diagonalization information
ELPA| Matrix order (NA)                 23
ELPA| Matrix block size (NBLK)           4
ELPA| Number of eigenvectors (NEV)      23
ELPA| Local rows (LOCAL_NROWS)          12
ELPA| Local columns (LOCAL_NCOLS)       23
ELPA| Kernel                    NVIDIA_GPU
ELPA| QR step requested                 NO

In this case, there are 2 process rows and 1 process column.
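For reference, these parameters map onto an ELPA handle roughly as in the sketch below, using ELPA's C interface (CP2K actually drives ELPA through its own Fortran wrappers, so this is only an illustration; the "nvidia-gpu" and "debug" option names follow recent ELPA releases and are an assumption here):

```c
/* Hedged sketch: configure an ELPA handle with the parameters reported above.
 * CP2K itself uses ELPA's Fortran interface; option names may differ between
 * ELPA versions ("gpu" in older releases vs. "nvidia-gpu" in newer ones). */
#include <elpa/elpa.h>
#include <mpi.h>

void solve_reported_case(double *a, double *ev, double *z,
                         int my_prow, int my_pcol) {
  int error;
  if (elpa_init(20211125) != ELPA_OK) return;       /* request a supported API level */

  elpa_t handle = elpa_allocate(&error);

  /* values taken from the CP2K printout above */
  elpa_set(handle, "na",          23, &error);      /* matrix order (NA) */
  elpa_set(handle, "nev",         23, &error);      /* number of eigenvectors (NEV) */
  elpa_set(handle, "nblk",         4, &error);      /* block size (NBLK) */
  elpa_set(handle, "local_nrows", 12, &error);      /* LOCAL_NROWS */
  elpa_set(handle, "local_ncols", 23, &error);      /* LOCAL_NCOLS */
  elpa_set(handle, "mpi_comm_parent", MPI_Comm_c2f(MPI_COMM_WORLD), &error);
  elpa_set(handle, "process_row", my_prow, &error); /* 2 process rows */
  elpa_set(handle, "process_col", my_pcol, &error); /* 1 process column */

  error = elpa_setup(handle);

  elpa_set(handle, "solver", ELPA_SOLVER_2STAGE, &error);
  elpa_set(handle, "nvidia-gpu", 1, &error);        /* assumed option name, see above */
  elpa_set(handle, "debug",      1, &error);        /* more verbose error reporting */

  elpa_eigenvectors(handle, a, ev, z, &error);      /* bandred runs inside the 2-stage solver */

  elpa_deallocate(handle, &error);
  elpa_uninit(&error);
}
```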

I will still check with the 1stage solver.

marekandreas commented 1 year ago

Dear @fstein93,

thank you for the update. I will try to reproduce the problem you encountered.

fstein93 commented 1 year ago

A short update: with the 1stage solver, the given error does not occur, but we run into a timeout instead, probably related to #17.

fstein93 commented 1 year ago

I have just realized that I missed a word in my original post: we have probably never run our regression tests (or at least the failing ones) with the GPU kernel. I did not mean that ELPA itself was never run on GPU.

marekandreas commented 1 year ago

No worries, I thought you meant it like that; I just wanted to be sure.

fstein93 commented 1 year ago

Interestingly, I can run the same calculations on the Piz Daint supercomputer without any error (Intel® Xeon® E5-2690 v3 @ 2.60 GHz, 12 cores, 64 GB RAM, one NVIDIA® Tesla® P100 16 GB per node). I have tried 12 ranks per node with 1 thread per rank, and 6 ranks per node with 2 threads per rank. Both use ELPA version 2022.05.001. On Daint, however, I do not use ELPA's OpenMP version.

marekandreas commented 1 year ago

Hello @fstein93, can you test again with the ELPA 1stage solver and ELPA 2022.11.001.rc2?

fstein93 commented 1 year ago

We have already switched to the release candidate, and your fix(es) resolved all observed issues with the 2-stage solver.