trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/
Other
1.22k stars 570 forks source link

MueLu or Amesos2 behavior changed between 12.10.1 and develop, such that preconditioner setup now fails with a singular matrix (Was: "Amesos2 KLU2 Crash") #1937

Closed dlkarnitz closed 3 years ago

dlkarnitz commented 7 years ago

I'm experiencing a crash while attempting to solve a 3D poisson problem with Belos/MueLu/Tpetra built from the develop branch that I retrieved yesterday afternoon. Previously this code would crash at exit related to issue #1889 when It was fully integrated into our product. This crash though is now occurring within our unit tests. I can provide some sample code describing our problem setup that, however, does not reproduce this behavior. I've trace the crash to a NULL field in the KLU numeric factorization step, specifically:

Amesos2_KLU2def.hpp:327 at this point the data.numeric pointer is 0x0 The state of data is: {symbolic = 0x561e4d0, numeric = 0x0, common_ = {tol = 0.001, memgrow = 1.2, initmem_amd = 1.2, initmem = 10, maxwork = 0, btf = 1, ordering = 0, scale = 2, malloc_memory = 0x40e580 malloc@plt, realloc_memory = 0x40d8e0 <realloc@p lt>, free_memory = 0x40f1b0 free@plt, calloc_memory = 0x40e540 calloc@plt, user_order = 0x0, user_data = 0x0, halt_if_singular = 1, status = 1, nrealloc = 0, structural_rank = 16, numerical_rank = 14, singular_col = 6, noffdiag = 0, fl ops = -1, rcond = -1, condest = -1, rgrowth = -1, work = 0, memusage = 420, mempeak = 3404}}

I've also attached the "verbose: high" output from the preconditioner parameters and the stack trace.

muelu-verbose.txt stack-trace.txt

aprokop commented 7 years ago

@trilinos/muelu @trilinos/amesos2

dlkarnitz commented 7 years ago

This appears to be hitting a singular case in klu2_factor.hpp:573 which then deletes the Numeric object causing the null field. This matrix was not trouble for MueLu in v12.10.1 nor for ML in v12.10.1. I've attached a description of the matrix via Tpetra_CrsMatrix::describe().

matrix-description.txt

mhoemmen commented 7 years ago

Work-around: Try changing the direct solver that Muelu uses?

dlkarnitz commented 7 years ago

I first tried a few of the RELAXATION course solvers (Jacobi, SGS, MTSGS), all caused Belos to produce NaN for the convergence check. I have found success using the CHEBYSHEV coarse solver type so far in my unit testing.

mhoemmen commented 7 years ago

@dlkarnit You could try using something other than KLU2, e.g., build with SuperLU and try that. Relaxation solvers and Chebyshev may work as coarse-grid solvers sometimes, but not for all kinds of problems.

dlkarnitz commented 7 years ago

@mhoemmen Any recommendations on using SuperLU, SuperLU_MT, or SuperLU_DIST? Do users find these work better than most KLU/KLU2 solvers in practice?

dlkarnitz commented 7 years ago

@mhoemmen Maybe a better question here is; are there better ways to set up MueLu/Belos to handle singular matrices. Since for my equations the matrix will almost always be singular. This looks like it could be an issue in using SuperLU too according to their documentation.

mhoemmen commented 7 years ago

The MueLu tutorial should explain how to tell MueLu about the null space of your matrix, if you have a basis for the null space. You'll have to talk to the MueLu developers @trilinos/muelu about what to do if you don't have a basis for your matrix's null space.

Sparse LU factorizations generally assume that the input matrix is nonsingular. Relaxation solvers and Chebyshev do not promise that they will work in this case. Belos does provide an LSQR solver which you could try, though I'm not sure if MueLu has hooks for that in its input deck.

dlkarnitz commented 7 years ago

I'm not sure the new title is appropriate. I am using MueLu as a preconditioner for the BICG method in Belos. Previously, I used ML as a preconditioner for BICG in AztecOO for these very same problems without any trouble. In fact, I can solve these problems correctly with Trilinos release v12.10.1, so a bug has very much so been introduced between then and the current develop branch.

dlkarnitz commented 7 years ago

To clarify, I could solve the problems in v12.10.1 with both Belos/MueLu/Tpetra, and AztecOO/ML/Epetra, no crash, no divergence, with expected results produced for all cases.

mhoemmen commented 7 years ago

@dlkarnit Would a more accurate title be "MueLu behavior changed between 12.10.1 and develop, such that preconditioner setup now fails with a singular matrix"?

dlkarnitz commented 7 years ago

That feels right. My first two posts describe the crash region. I looked a bit at the history of the Trilinos/packages/amesos2/src/Amesos2_KLU2_def.hpp source and noticed recent changes from issue #1127 that are directly causing the crash (though may not be actually responsible).

mhoemmen commented 7 years ago

@dlkarnit I adjusted the title a bit to include your investigations of Amesos2. Thanks for reporting!

@trilinos/muelu @trilinos/amesos2

srajama1 commented 7 years ago

@dlkarnit : Just seeing the issue due to some github notification issues (I am not getting any e-mails from github). May be it is the same problems with @ndellingwood and @jhux2 or @csiefer2 as well.

ndellingwood commented 7 years ago

@srajama1 @dlkarnit This came up in issue #1827. @bathmatt had a similar diagnosis for an issue detected by a debugger (KLU2 received a singular matrix and deleted the Numeric object causing the null field). When I investigated that case, the matrix being given to Amesos2 was singular, which I'm guessing is likely the case here, and the singular matrix comes 'upstream' before the matrix is passed to Amesos2. As far as Amesos2 behavior in this case, I can add a TEUCHOS_TEST_FOR_EXCEPTION to throw if the numeric object returns null (due to the matrix being singular for example) so this gets caught before the solve, this doesn't address the issue of a singular matrix being passed to Amesos2 however.

srajama1 commented 7 years ago

@ndellingwood : Let us add that exception with a message the matrix could be singular. Something upstream has changed so we are starting to see this more often.

aprokop commented 7 years ago

@srajama1 Are there any pseudo-inverse solvers in Amesos2 or in some of the Trilinos TPLs?

egboman commented 7 years ago

Do you really want to solve singular systems? One work-around is to perturb to make the matrix non-singular, e.g. add a small diagonal shift.

srajama1 commented 7 years ago

Nope, not in ones we have. It might be an option in some TPL, but it is not something I have gone looking for. Why did this "work" before ?

ndellingwood commented 7 years ago

@srajama1 I'm having github notification issues too, here's a link to wiki PR #1674 with steps for contacting github to fix (I just submitted fix request)

srajama1 commented 7 years ago

@ndellingwood : Thanks! Getting all the e-mails now.

csiefer2 commented 7 years ago

FYI - MueLu has an option for catching hard zero diagonals by setting them to one. Set "rap: fix zero diagonals" to true.

@jhux2 is working on a less intrusive diagonal boost for other purposes which could also be used for this.

dlkarnitz commented 7 years ago

@egboman I do want to solve a singular system. My concern was that in 12.10.1 my code did not crash, and now using the current develop branch it does crash. I had to change the coarse solve type to CHEBYSHEV to solve the system on the develop branch. My particular problem is the singular matrix that arises from a 3D Poisson problem.

egboman commented 7 years ago

Is the problem here a zero on the diagonal? There are other singular matrices...

dlkarnitz commented 7 years ago

@csiefer2 @egboman I do have a number of rows in my matrix that are all zero, I have not made an effort to trim them out.

csiefer2 commented 7 years ago

@dlkarnit If that's the only problem, then the "rap: fix zero diagonals" will fix that.

srajama1 commented 7 years ago

@dlkarnit : Are you saying you had zero rows and it worked in 12.10.1 ?

dlkarnitz commented 7 years ago

@srajama1 Yes.

@csiefer2 If I manually change the diagonal to 1 for the empty rows in the matrix set up I can use the default smoother for "Poisson-3D" without a crash. Which parameter list does the "rap: fix zero diagonals" need to go into? It did not work when I added it to the MueLu preconditioner parameter list.

jhux2 commented 7 years ago

@dlkarnit Yes, it goes in the MueLu parameter list. By "it did not work", do you mean that it had no effect, or that the MueLu parameter validator threw an exception?

dlkarnitz commented 7 years ago

@jhux2 It seems to have no effect, the code crashes in the same place.

csiefer2 commented 7 years ago

Huh. The fix zero diagonals should do basically what you did by hand.

@jhux2 Does the MueLu Dirichlet detection do the right thing for completely zero rows, or do you really need a one on the diagonal?

csiefer2 commented 7 years ago

To answer my own question --- the "rap: fix zero diagonals" setting checks for hard zeros on the diagonal (a blank row should trigger this),

If you manually do ones-and-zeros, these guys get eliminated via the Dirichlet detection.

If the rows are completely empty, the Dirichlet detection won't trigger, which might lead to a subtle error that "rap: fix zero diagonals" won't catch (still not 100% sure how, though).

I would recommend doing the manual ones-and-zeros,.

github-actions[bot] commented 3 years ago

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label. If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE. If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

github-actions[bot] commented 3 years ago

This issue was closed due to inactivity for 395 days.