Closed dlkarnitz closed 3 years ago
@trilinos/muelu @trilinos/amesos2
This appears to be hitting a singular case in klu2_factor.hpp:573 which then deletes the Numeric object causing the null field. This matrix was not trouble for MueLu in v12.10.1 nor for ML in v12.10.1. I've attached a description of the matrix via Tpetra_CrsMatrix::describe().
Work-around: Try changing the direct solver that Muelu uses?
I first tried a few of the RELAXATION course solvers (Jacobi, SGS, MTSGS), all caused Belos to produce NaN for the convergence check. I have found success using the CHEBYSHEV coarse solver type so far in my unit testing.
@dlkarnit You could try using something other than KLU2, e.g., build with SuperLU and try that. Relaxation solvers and Chebyshev may work as coarse-grid solvers sometimes, but not for all kinds of problems.
@mhoemmen Any recommendations on using SuperLU, SuperLU_MT, or SuperLU_DIST? Do users find these work better than most KLU/KLU2 solvers in practice?
@mhoemmen Maybe a better question here is; are there better ways to set up MueLu/Belos to handle singular matrices. Since for my equations the matrix will almost always be singular. This looks like it could be an issue in using SuperLU too according to their documentation.
The MueLu tutorial should explain how to tell MueLu about the null space of your matrix, if you have a basis for the null space. You'll have to talk to the MueLu developers @trilinos/muelu about what to do if you don't have a basis for your matrix's null space.
Sparse LU factorizations generally assume that the input matrix is nonsingular. Relaxation solvers and Chebyshev do not promise that they will work in this case. Belos does provide an LSQR solver which you could try, though I'm not sure if MueLu has hooks for that in its input deck.
I'm not sure the new title is appropriate. I am using MueLu as a preconditioner for the BICG method in Belos. Previously, I used ML as a preconditioner for BICG in AztecOO for these very same problems without any trouble. In fact, I can solve these problems correctly with Trilinos release v12.10.1, so a bug has very much so been introduced between then and the current develop branch.
To clarify, I could solve the problems in v12.10.1 with both Belos/MueLu/Tpetra, and AztecOO/ML/Epetra, no crash, no divergence, with expected results produced for all cases.
@dlkarnit Would a more accurate title be "MueLu behavior changed between 12.10.1 and develop, such that preconditioner setup now fails with a singular matrix"?
That feels right. My first two posts describe the crash region. I looked a bit at the history of the Trilinos/packages/amesos2/src/Amesos2_KLU2_def.hpp source and noticed recent changes from issue #1127 that are directly causing the crash (though may not be actually responsible).
@dlkarnit I adjusted the title a bit to include your investigations of Amesos2. Thanks for reporting!
@trilinos/muelu @trilinos/amesos2
@dlkarnit : Just seeing the issue due to some github notification issues (I am not getting any e-mails from github). May be it is the same problems with @ndellingwood and @jhux2 or @csiefer2 as well.
@srajama1 @dlkarnit This came up in issue #1827. @bathmatt had a similar diagnosis for an issue detected by a debugger (KLU2 received a singular matrix and deleted the Numeric object causing the null field). When I investigated that case, the matrix being given to Amesos2 was singular, which I'm guessing is likely the case here, and the singular matrix comes 'upstream' before the matrix is passed to Amesos2.
As far as Amesos2 behavior in this case, I can add a TEUCHOS_TEST_FOR_EXCEPTION
to throw if the numeric object returns null (due to the matrix being singular for example) so this gets caught before the solve, this doesn't address the issue of a singular matrix being passed to Amesos2 however.
@ndellingwood : Let us add that exception with a message the matrix could be singular. Something upstream has changed so we are starting to see this more often.
@srajama1 Are there any pseudo-inverse solvers in Amesos2 or in some of the Trilinos TPLs?
Do you really want to solve singular systems? One work-around is to perturb to make the matrix non-singular, e.g. add a small diagonal shift.
Nope, not in ones we have. It might be an option in some TPL, but it is not something I have gone looking for. Why did this "work" before ?
@srajama1 I'm having github notification issues too, here's a link to wiki PR #1674 with steps for contacting github to fix (I just submitted fix request)
@ndellingwood : Thanks! Getting all the e-mails now.
FYI - MueLu has an option for catching hard zero diagonals by setting them to one. Set "rap: fix zero diagonals" to true.
@jhux2 is working on a less intrusive diagonal boost for other purposes which could also be used for this.
@egboman I do want to solve a singular system. My concern was that in 12.10.1 my code did not crash, and now using the current develop branch it does crash. I had to change the coarse solve type to CHEBYSHEV to solve the system on the develop branch. My particular problem is the singular matrix that arises from a 3D Poisson problem.
Is the problem here a zero on the diagonal? There are other singular matrices...
@csiefer2 @egboman I do have a number of rows in my matrix that are all zero, I have not made an effort to trim them out.
@dlkarnit If that's the only problem, then the "rap: fix zero diagonals" will fix that.
@dlkarnit : Are you saying you had zero rows and it worked in 12.10.1 ?
@srajama1 Yes.
@csiefer2 If I manually change the diagonal to 1 for the empty rows in the matrix set up I can use the default smoother for "Poisson-3D" without a crash. Which parameter list does the "rap: fix zero diagonals" need to go into? It did not work when I added it to the MueLu preconditioner parameter list.
@dlkarnit Yes, it goes in the MueLu parameter list. By "it did not work", do you mean that it had no effect, or that the MueLu parameter validator threw an exception?
@jhux2 It seems to have no effect, the code crashes in the same place.
Huh. The fix zero diagonals should do basically what you did by hand.
@jhux2 Does the MueLu Dirichlet detection do the right thing for completely zero rows, or do you really need a one on the diagonal?
To answer my own question --- the "rap: fix zero diagonals" setting checks for hard zeros on the diagonal (a blank row should trigger this),
If you manually do ones-and-zeros, these guys get eliminated via the Dirichlet detection.
If the rows are completely empty, the Dirichlet detection won't trigger, which might lead to a subtle error that "rap: fix zero diagonals" won't catch (still not 100% sure how, though).
I would recommend doing the manual ones-and-zeros,.
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE
label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE
.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.
This issue was closed due to inactivity for 395 days.
I'm experiencing a crash while attempting to solve a 3D poisson problem with Belos/MueLu/Tpetra built from the develop branch that I retrieved yesterday afternoon. Previously this code would crash at exit related to issue #1889 when It was fully integrated into our product. This crash though is now occurring within our unit tests. I can provide some sample code describing our problem setup that, however, does not reproduce this behavior. I've trace the crash to a NULL field in the KLU numeric factorization step, specifically:
Amesos2_KLU2def.hpp:327 at this point the data.numeric pointer is 0x0 The state of data is: {symbolic = 0x561e4d0, numeric = 0x0, common_ = {tol = 0.001, memgrow = 1.2, initmem_amd = 1.2, initmem = 10, maxwork = 0, btf = 1, ordering = 0, scale = 2, malloc_memory = 0x40e580 malloc@plt, realloc_memory = 0x40d8e0 <realloc@p lt>, free_memory = 0x40f1b0 free@plt, calloc_memory = 0x40e540 calloc@plt, user_order = 0x0, user_data = 0x0, halt_if_singular = 1, status = 1, nrealloc = 0, structural_rank = 16, numerical_rank = 14, singular_col = 6, noffdiag = 0, fl ops = -1, rcond = -1, condest = -1, rgrowth = -1, work = 0, memusage = 420, mempeak = 3404}}
I've also attached the "verbose: high" output from the preconditioner parameters and the stack trace.
muelu-verbose.txt stack-trace.txt