marekandreas / elpa

A scalable eigensolver for dense, symmetric (hermitian) matrices (fork of https://gitlab.mpcdf.mpg.de/elpa/elpa.git)

Latest ELPA-2021.11.001 complains about blacsgrid #13

Closed · toxa81 closed this issue 1 year ago

toxa81 commented 2 years ago

I switched to the latest ELPA and now this error is reported:

ELPA_SETUP ERROR: your provided blacsgrid is not ok!
BLACS_GRIDINFO returned an error! Aborting...

What exactly is not ok, and how should I fix the BLACS grid (4x4 in my case)?

marekandreas commented 2 years ago

When you set up your BLACS grid, please check the return value (info) of the "descinit" routine. Info must be zero; otherwise ELPA will not work either, since it performs the same check internally.
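
For reference, a minimal sketch of the check meant here, assuming only standard BLACS/ScaLAPACK routines (blacs_get, blacs_gridinit, blacs_gridinfo, numroc, descinit); the subroutine name and argument list are illustrative, not taken from ELPA or the reporter's code:

subroutine setup_blacs_desc(na, nblk, nprow, npcol, ictxt, sc_desc)
  implicit none
  integer, intent(in)  :: na, nblk, nprow, npcol
  integer, intent(out) :: ictxt, sc_desc(9)
  integer :: nprow_chk, npcol_chk, myrow, mycol, na_rows, na_cols, info
  integer, external :: numroc

  call blacs_get(-1, 0, ictxt)                  ! default system context
  call blacs_gridinit(ictxt, 'R', nprow, npcol) ! row-major process grid
  call blacs_gridinfo(ictxt, nprow_chk, npcol_chk, myrow, mycol)

  na_rows = numroc(na, nblk, myrow, 0, nprow)   ! local rows on this rank
  na_cols = numroc(na, nblk, mycol, 0, npcol)   ! local columns on this rank

  ! descinit must report info == 0, otherwise ELPA's internal check fails too
  call descinit(sc_desc, na, na, nblk, nblk, 0, 0, ictxt, na_rows, info)
  if (info /= 0) then
    print *, 'descinit failed with info =', info
    call blacs_abort(ictxt, 1)
  end if
end subroutine setup_blacs_desc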

toxa81 commented 2 years ago

Hi @marekandreas! I'm checking the info returned by descinit() and it's zero. I'm pretty sure the BLACS grid is set up correctly on my side of the code because 1) I can use ScaLAPACK and 2) I used the previous version of ELPA. I will investigate further.

marekandreas commented 2 years ago

OK, that is interesting: if descinit() returns 0, then the internal check in ELPA should not fail either. Can you give me the matrix size, the number of eigenvectors, the block size, and the number of MPI processes you want to use?

toxa81 commented 2 years ago

Could it be related to the small matrix size? I need to solve different eigenproblems, both small and large. The small ones start at a matrix size of 10; the large ones will be around ~4K. To avoid dealing with different solvers, I use ELPA in both cases with a block size of 32 and a 4x4 MPI grid. For the small problem the local matrix will be non-zero only on the first rank; the rest will be idle. My guess is that ELPA doesn't like this setup.

marekandreas commented 2 years ago

This is indeed a problematic approach: you cannot use the same BLACS grid for all possible combinations of matrix sizes. As an example, a 10x10 matrix cannot be correctly BLACS-distributed over 16 MPI tasks. descinit() will fail for such a setup with "DESCINIT parameter number 9 had an illegal value". You can check this, for example, by calling one of the ELPA test programs with the parameters "10 10 32". So I do not understand why in your application the call to descinit() does not return info != 0. Is it possible that you do the BLACS setup for a larger matrix, and ELPA only encounters the error later for a small matrix? The call to descinit() depends, among other things, on the matrix size...

If your matrix sizes vary that strongly within one application, you will have to create multiple MPI communicators which contain just a subset of MPI_COMM_WORLD. You can then create an independent ELPA object with each sub-communicator, one for small and one for large matrices. Your problem is similar to #9
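
A hedged sketch of the suggested sub-communicator approach, assuming the MPI Fortran bindings; n_active (the number of ranks that should take part in the small problem) and the routine name are illustrative:

subroutine make_small_comm(n_active, small_comm)
  use mpi
  implicit none
  integer, intent(in)  :: n_active
  integer, intent(out) :: small_comm
  integer :: rank, color, ierr

  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Only the first n_active ranks join the sub-communicator; the others
  ! pass MPI_UNDEFINED and receive MPI_COMM_NULL.
  color = MPI_UNDEFINED
  if (rank < n_active) color = 0
  call MPI_Comm_split(MPI_COMM_WORLD, color, rank, small_comm, ierr)

  ! Ranks with small_comm /= MPI_COMM_NULL can now build their own
  ! (smaller) BLACS grid on small_comm and create an independent ELPA
  ! object for the small matrices, while MPI_COMM_WORLD is kept for the
  ! large ones.
end subroutine make_small_comm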

rasolca commented 2 years ago

A 10x10 matrix CAN be correctly BLACS-distributed over 16 MPI tasks, using the correct parameters. DESCINIT parameter number 9 is the leading dimension; if it fails for such small matrices, it means the leading dimension is set to 0 instead of 1. In fact, the leading dimension is not handled correctly in your tests, e.g. https://github.com/marekandreas/elpa/blob/cc266ee2cc8db94837892775b58caa42efba3840/test_project_C/src/test_blacs_infrastructure.F90#L122 .

marekandreas commented 2 years ago

OK, now I am a bit confused. In the ELPA test programs, the leading dimension na_rows in

call descinit(sc_desc, na, na, nblk, nblk, 0, 0, my_blacs_ctxt, na_rows, info)

is computed via

na_rows = numroc(na, nblk, my_prow, 0_BLAS_KIND, np_rows)

For np_rows = 4 and na = 10 you will obtain na_rows = 0 (on every process row except the first). I am not sure how you want to avoid this.
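
For illustration, a tiny stand-alone check (assuming only the standard ScaLAPACK numroc routine) of the computation above, for na = 10, nblk = 32 on a grid with np_rows = 4:

program check_numroc
  implicit none
  integer, external :: numroc
  integer :: my_prow

  ! na = 10 and nblk = 32: the whole matrix fits into a single block,
  ! so only process row 0 owns any rows; the other rows get na_rows = 0.
  do my_prow = 0, 3
    print '(a,i0,a,i0)', 'my_prow = ', my_prow, '  na_rows = ', &
          numroc(10, 32, my_prow, 0, 4)
  end do
end program check_numroc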

marekandreas commented 2 years ago

But independent of this: a setup in which only the first rank contains non-zeros and the rest are idle is a setup that ELPA has not been written for. The idea was always to use an ELPA object with an MPI sub-communicator that does not create this problem.

rasolca commented 2 years ago

The leading dimension is not the number of rows of the matrix; it describes how the matrix is stored in memory (the distance between two consecutive columns). For an empty local matrix this value has no meaning and can therefore be set to any value. Since 0 can create problems, the leading dimension has to be at least 1. The simplest solution is lld = max(1, na_rows).

Besides this, I don't think it is a good idea to constrain the matrix sizes with artificial limits. In any case, it should be documented.
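
A minimal sketch of the fix proposed above, assuming an already initialised BLACS context (my_blacs_ctxt) and the same descriptor layout as in the ELPA test programs; only the clamped leading dimension differs, and the subroutine name is illustrative:

subroutine setup_desc_with_lld_fix(na, nblk, my_blacs_ctxt, sc_desc, na_rows, na_cols)
  implicit none
  integer, intent(in)  :: na, nblk, my_blacs_ctxt
  integer, intent(out) :: sc_desc(9), na_rows, na_cols
  integer :: np_rows, np_cols, my_prow, my_pcol, lld, info
  integer, external :: numroc

  call blacs_gridinfo(my_blacs_ctxt, np_rows, np_cols, my_prow, my_pcol)
  na_rows = numroc(na, nblk, my_prow, 0, np_rows)
  na_cols = numroc(na, nblk, my_pcol, 0, np_cols)

  ! Clamp the leading dimension to at least 1 so that descinit (and
  ! e.g. MKL routines) accept an empty local matrix.
  lld = max(1, na_rows)
  call descinit(sc_desc, na, na, nblk, nblk, 0, 0, my_blacs_ctxt, lld, info)
  if (info /= 0) print *, 'descinit failed with info =', info
end subroutine setup_desc_with_lld_fix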

marekandreas commented 2 years ago

I am aware of the definition of the leading dimension, and I do understand the point you want to make. However, ELPA is not designed to work the way you would like to use it, I am sorry: empty ranks are not allowed. The ELPA algorithms aim for maximum speed and do not cover every corner case the way ScaLAPACK does (I am sure your intended setup would work just fine with the ScaLAPACK solvers). Furthermore, ELPA was never intended to work with matrix sizes smaller than a couple of hundred, which is why you can run into this problem for small matrices.

So I can only suggest that for small matrices you use an MPI sub-communicator. It is also not difficult to decide, based on the matrix size, block size, and the number of MPI processes you intend to use, whether ELPA will work with a given setup or whether a smaller number of ranks should be used; in the latter case, create a sub-communicator and allocate a separate ELPA object. I agree that this should be documented better.
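
One possible way to make that decision, written as a hypothetical helper (not an ELPA routine), based on the block-cyclic distribution: a process-grid dimension is only usable if every process row/column owns at least one block, i.e. if it does not exceed ceil(na/nblk):

integer function max_usable_grid_dim(na, nblk)
  implicit none
  integer, intent(in) :: na, nblk
  ! Number of blocks along one dimension; with the distribution starting
  ! at process 0, a larger grid dimension leaves ranks with na_rows = 0,
  ! which ELPA does not support.
  max_usable_grid_dim = (na + nblk - 1) / nblk
end function max_usable_grid_dim

For na = 10 and nblk = 32 this gives 1, i.e. only a single rank can be used for the small problem, which matches the observation above.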

You are more than welcome to contribute to ELPA and to extend it to the use case you have in mind.

toxa81 commented 2 years ago

I agree with @rasolca that lld = max(1, na_rows_local) would be an easy and harmless fix. We do it in our code, and that is the reason why descinit doesn't fail for us. BTW, MKL in particular doesn't like a zero leading dimension.

marekandreas commented 2 years ago

Of course, using lld = max(1, na_rows_local) will allow descinit to run correctly. However, this does not change the fact that for ELPA this is still not a correct setup if na_rows_local = 0, and ELPA will eventually fail. So let me be clear again: you cannot run setups with ELPA where the combination of a small matrix and too many MPI ranks produces na_rows = 0. Extending ELPA to work with such a setup would be considerable work.