Open Never-settle opened 2 years ago
Hi, I'm sorry for the delay. Does this error also happen when you create a completely new solver in each iteration? Can you try a different ordering algorithm? For instance:
spss->options().set_reordering_method(ReorderingStrategy::AND);
Do you properly delete the spss object?
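To illustrate the question about deleting the solver, here is a minimal sketch that does not use STRUMPACK at all. `MockSolver`, `Counts`, `run_scoped` and `run_deferred` are hypothetical names made up for this example; `MockSolver` only counts live instances and stands in for the real solver class. It shows that a solver owned by a `std::unique_ptr` declared inside the loop body is released at the end of every iteration, whereas solvers that are kept alive pile up.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for the real solver class. It only tracks how many
// instances are alive and the largest number alive at the same time.
struct MockSolver {
  static int live;
  static int peak;
  MockSolver() { if (++live > peak) peak = live; }
  ~MockSolver() { --live; }
  MockSolver(const MockSolver&) = delete;
  MockSolver& operator=(const MockSolver&) = delete;
};
int MockSolver::live = 0;
int MockSolver::peak = 0;

struct Counts {
  int peak;         // largest number of solvers alive at once
  int live_at_end;  // solvers still alive right after the loop
};

// One solver per iteration, owned by a unique_ptr scoped to the loop body.
Counts run_scoped(int iterations) {
  MockSolver::live = 0;
  MockSolver::peak = 0;
  for (int it = 0; it < iterations; ++it) {
    auto spss = std::make_unique<MockSolver>();
    // set matrix, reorder, factor, solve would go here
  }  // spss is destroyed here, before the next iteration starts
  return {MockSolver::peak, MockSolver::live};
}

// Mimics solvers that are never deleted inside the loop. The vector frees
// them when the function returns, so the sketch itself does not leak.
Counts run_deferred(int iterations) {
  MockSolver::live = 0;
  MockSolver::peak = 0;
  std::vector<std::unique_ptr<MockSolver>> kept;
  for (int it = 0; it < iterations; ++it) {
    kept.push_back(std::make_unique<MockSolver>());
  }
  Counts c{MockSolver::peak, MockSolver::live};  // measured before cleanup
  return c;
}
```

With the scoped version at most one solver exists at any time. Whether undeleted solvers are what exhausts memory in the report is not established here; the sketch only shows how to rule that out.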
Recently, we have been using STRUMPACK to solve a sparse linear system with a matrix of roughly 300000 x 300000, which is solved for 200 iterations. In each iteration, the matrix and right-hand side keep the same structure and differ only in their values. We ran into the following problems:
My code is modified from the example “testMMdoubleMPIDist.cpp” provided in the STRUMPACK library and uses MPI and OpenMP hybrid parallel programming.
In the first version, we recreate the solver in each iteration: “StrumpackSparseSolverMPIDist<double,int>* spss = new StrumpackSparseSolverMPIDist<double,int>(MPI_COMM_WORLD);”, then call “(*spss).set_distributed_csr_matrix(local_n, local_row_ptr.data(), local_col_ind.data(), local_values.data(), dist /*, false*/);”. After reordering (“(*spss).reorder()”) and numerical factorization (“(*spss).factor()”), we solve the linear system (“(*spss).solve(local_b.data(), laocal_x.data())”). The first iteration is solved correctly. During the second iteration, however, the program fails in the reordering step with the following error message: “Intel MKL BLACS fatal error: cannot allocate memory, aborted.”
【 More complete error display】
Initializing STRUMPACK
using 1 OpenMP thread(s)
using 24 MPI processes
matrix equilibration, r_cond = 1 , c_cond = 1 , type = N
initial matrix:
- number of unknowns = 349,272
- number of nonzeros = 5,048,562
nested dissection reordering:
- Metis reordering
- used METIS_NodeNDP (iso METIS_NodeND)
- supernodal tree from METIS_NodeNDP is used
- strategy parameter = 8
- number of separators = 43,659
- number of levels = 12
- nd time = 4.54
- symmetrization time = 0.0249
Intel MKL BLACS fatal error: cannot allocate memory, aborted.
Intel MKL BLACS fatal error: cannot allocate memory, aborted.
Intel MKL BLACS fatal error: cannot allocate memory, aborted.
Intel MKL BLACS fatal error: cannot allocate memory, aborted.
Intel MKL BLACS fatal error: cannot allocate memory, aborted.
Intel MKL BLACS fatal error: cannot allocate memory, aborted.
【End】
In the second version, we create only one solver and update the matrix values in it in each iteration (since the structure of the matrix does not change). We call “(*spss).update_matrix_values(local_n, local_row_ptr.data(), local_col_ind.data(), local_values.data(), dist /*, false*/);”, then “(*spss).solve(local_b.data(), laocal_x.data())”. This avoids reordering in every iteration, but we ran into a new error in the second iteration: “{ -1, -1}: On entry to \n DESCINIT parameter number 6 had an illegal value \n ERROR: Could not create DistributedMatrix descriptor!”
【 More complete error display】
multifrontal factorization:
- estimated memory usage (exact solver) = 2.64e+03 MB
- minimum pivot, sqrt(eps)*|A|_1 = 5.36e-07
- replacing of small pivots is not enabled
{ -1, -1}: On entry to DESCINIT parameter number 6 had an illegal value ERROR: Could not create DistributedMatrix descriptor!
{ -1, -1}: On entry to DESCINIT parameter number 6 had an illegal value ERROR: Could not create DistributedMatrix descriptor!
{ -1, -1}: On entry to DESCINIT parameter number 6 had an illegal value ERROR: Could not create DistributedMatrix descriptor!
{ -1, -1}: On entry to DESCINIT parameter number 6 had an illegal value ERROR: Could not create DistributedMatrix descriptor!
{ -1, -1}: On entry to DESCINIT parameter number 6 had an illegal value ERROR: Could not create DistributedMatrix descriptor!
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 227459 RUNNING AT ca0602
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 227460 RUNNING AT ca0602
= KILLED BY SIGNAL: 6 (Aborted)
【End】
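The second version relies on the sparsity structure being identical in every iteration. A possible sanity check for that assumption, before updating the values, is sketched below. It is plain C++ and does not call STRUMPACK; `LocalCSR`, `is_consistent` and `same_pattern` are hypothetical helpers written for this example, and nothing here claims that a pattern mismatch is the cause of the DESCINIT error.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the locally owned rows of a CSR matrix.
struct LocalCSR {
  std::vector<int> row_ptr;     // local_n + 1 entries
  std::vector<int> col_ind;     // one entry per local nonzero
  std::vector<double> values;   // one entry per local nonzero
};

// Basic internal consistency: row pointers are non-decreasing and the
// index/value arrays have the length the row pointers imply.
bool is_consistent(const LocalCSR& m) {
  if (m.row_ptr.empty()) return false;
  if (!std::is_sorted(m.row_ptr.begin(), m.row_ptr.end())) return false;
  const std::size_t nnz =
      static_cast<std::size_t>(m.row_ptr.back() - m.row_ptr.front());
  return m.col_ind.size() == nnz && m.values.size() == nnz;
}

// True when two matrices share the same sparsity pattern, i.e. only the
// numerical values may differ.
bool same_pattern(const LocalCSR& a, const LocalCSR& b) {
  return a.row_ptr == b.row_ptr && a.col_ind == b.col_ind;
}
```

In a loop over iterations, one could keep a copy of the first iteration's `row_ptr` and `col_ind` on each rank and compare against it before every value update, aborting with a clear message if `same_pattern` returns false on any rank.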
We investigated this for a long time but could not find the cause. We also made a simple modification to the original example (changing the matrix values and looping many times) and found that it runs correctly without reporting any error. We do not know why the error occurs once the same approach is ported to our code.