Closed: jhux2 closed this issue 7 years ago.
@allaffa
@jhux2 Which test?
@mhoemmen Ifpack2's AdditiveSchwarz SuperLU unit test.
How was the test disabled? I looked at UnitTestAdditiveSchwarz.cpp and it doesn't seem to be commented out. I was worried because I pushed changes 6 days before this issue cropped up, even though my changes shouldn't have affected this test.
@brian-kelley @jhux2 I'm confused too. It looks like the SuperLU test in ifpack2/test/unit_tests/Ifpack2_UnitTestAdditiveSchwarz.cpp should run, as long as Amesos2 and SuperLU are enabled. The CMakeLists.txt file looks like it's building that .cpp file.
@brian-kelley You have to brag more when you push a big change; I had no idea ;-)
It wasn't really a major change, just the Container Factory for block relaxation. Additive Schwarz never uses containers, and I never touched any additive Schwarz related tests or sources.
The Ifpack2::AdditiveSchwarz SuperLU unit test is guarded by the macro HAVE_AMESOS2_SUPERLU. However, Amesos2_config.h is never included, so the macro is undefined and hence the test doesn't run.
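For concreteness, here is a minimal sketch of the guard pattern involved; the test macro and names below are illustrative, not the actual test code in Ifpack2_UnitTestAdditiveSchwarz.cpp.

```cpp
#include <Teuchos_UnitTestHarness.hpp>
// Without this include, HAVE_AMESOS2_SUPERLU is never defined and the
// guarded test silently compiles to nothing.
#include "Amesos2_config.h"

#ifdef HAVE_AMESOS2_SUPERLU
// Illustrative test name; the real test is more involved.
TEUCHOS_UNIT_TEST(AdditiveSchwarz, SuperLU)
{
  // ... set up AdditiveSchwarz with a SuperLU subdomain solver and check it ...
  TEST_ASSERT(true);
}
#endif // HAVE_AMESOS2_SUPERLU
```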
If I set overlap=0, the test passes. Another AdditiveSchwarz unit test, which uses RILUK as the subdomain solve, passes regardless of the overlap. What is the difference between these two tests? In the failing one, the error is actually an exception thrown by Amesos2 that reports
`x->getGlobalLength() != matrixA_->getGlobalNumCols()`
This would seem to indicate that the local problem setup is different between the two.
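For context, the check behind that message presumably looks something like the following sketch; the helper name and template parameters are made up, and this is not Amesos2's actual code.

```cpp
#include <Teuchos_TestForException.hpp>
#include <stdexcept>

// Hypothetical helper illustrating the consistency check whose failure
// message is quoted above (not Amesos2's actual source).
template <class MultiVectorType, class MatrixType>
void checkSolveDims (const MultiVectorType& x, const MatrixType& A)
{
  TEUCHOS_TEST_FOR_EXCEPTION(
    x.getGlobalLength () != A.getGlobalNumCols (), std::invalid_argument,
    "x->getGlobalLength() != matrixA_->getGlobalNumCols()");
}
```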
I've re-enabled the test with overlap=0 (commit cb801b6). It passes, but nonzero overlap makes it fail. I also generalized the test to use either KLU2 or SuperLU.
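For reference, here is the kind of parameter setup such a generalized test might use. All parameter names below are assumptions on my part and should be checked against the Ifpack2 and Amesos2 documentation.

```cpp
#include <Teuchos_ParameterList.hpp>
#include <string>

// Hedged sketch: parameter names are assumptions, not verified against the
// Ifpack2/Amesos2 documentation.
Teuchos::ParameterList
makeSchwarzParams (const std::string& directSolverName, const int overlapLevel)
{
  Teuchos::ParameterList params;
  params.set ("schwarz: overlap level", overlapLevel);
  params.set ("subdomain solver name", "AMESOS2");
  Teuchos::ParameterList& inner = params.sublist ("subdomain solver parameters");
  inner.set ("Amesos2 solver name", directSolverName);  // e.g. "klu2" or "superlu"
  return params;
}
```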
Hm, Ifpack2's Amesos2 wrapper just lets Amesos2 apply the factorization; it does not use Ifpack2::LocalSparseTriangularSolver. This suggests that #558 is about overlap.
@jhux2 See commit https://github.com/trilinos/Trilinos/commit/46075d305ad43c906ff7bba84ec4c6d4488e4773, which includes a test showing that LocalSparseTriangularSolver is correct. Thus, the issue is either with the sparse factorization interface or with AdditiveSchwarz.
@ambrad did your recent changes fix this one?
@mhoemmen, no. I put overlap = 1 back in the test, and it still throws as reported above.
@ambrad Thanks for the update! I'm looking at Ifpack2's Amesos2 wrapper now to see if it has the same problem that LocalSparseTriangularSolver had with respect to Imports.
Weird things about Ifpack2's Amesos2 wrapper:
- It takes an RCP<const Matrix>, so it promises not to change the matrix.

For some reason, this test wasn't ever building for me. Trying again, and explicitly enabling Amesos2 this time.
I built the test, verified that it built (including the SuperLU test), and it passes for me. @jhux2, could you show me exactly what test is failing, and what it prints when it fails?
Did you change the overlap to 1 in test SparseDirectSolver? See the comment around line 562.
@jhux2 wrote:
Did you change the overlap to 1 in test SparseDirectSolver? See the comment around line 562.
Ah, I see, this is not the Amesos2 wrapper test. I am looking at Ifpack2_UnitTestAdditiveSchwarz.cpp now.
I'm sorry I don't know what this is about and why I'm receiving the emails. Swing by my desk tomorrow so we can connect please. Thanks :)
You were @-mentioned by mistake. I fixed the comment in question. Please unsubscribe from this issue.
@jhux2 I changed the overlap to 1 (around line 562 of Ifpack2_UnitTestAdditiveSchwarz.cpp, as you mentioned) and got the test to fail.
I'm wondering if the same thing as in @ambrad 's fix for RILUK's use of LocalSparseTriangularSolver in AdditiveSchwarz (get the Maps right) will work here. I started trying it out.
@bmpersc I thought be89a0d was just moving the test, not a fix.
@jhux2 This seems to have been an automated thing that GitHub did, based on the commit log in be89a0d, which said "This will help us fix #460." Apparently the word "fix" plus an issue number is enough to close a ticket automatically.
@bmpersc Ok, good to know.
@jhux2 explained just now that the "user" in his original report was a summer student working with Ray.
I think I solved this issue. In AdditiveSchwarz::apply(), with overlap, the OverlappingB and OverlappingY vectors are still distributed across all processes when localApply() is called (which in turn runs the inner solver). For some reason, Amesos2 expects those vectors to have a global length matching the number of rows of the local matrix, which doesn't hold on parallel problems. I just copied the local entries of OverlappingB and OverlappingY into non-distributed MVs, and it worked with or without overlap (0, 1, or 2).
Awesome! Thanks @brian-kelley ! I think you could even just have the non-distributed MVs view the data of the distributed MVs, without needing a copy.
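A minimal sketch of that view-based idea, assuming Tpetra::MultiVector::offsetViewNonConst accepts a locally replicated map whose local length matches; the function and variable names are illustrative.

```cpp
#include <Teuchos_DefaultSerialComm.hpp>
#include <Tpetra_Map.hpp>
#include <Tpetra_MultiVector.hpp>

// Sketch only: wrap the locally owned rows of a distributed MultiVector in a
// non-distributed ("locally replicated") view, without copying.
template <class MV>
Teuchos::RCP<MV> localView (const Teuchos::RCP<MV>& X)
{
  using map_type = typename MV::map_type;
  const size_t numLocalRows = X->getLocalLength ();
  auto serialComm = Teuchos::rcp (new Teuchos::SerialComm<int> ());
  auto localMap = Teuchos::rcp (new map_type (numLocalRows, 0, serialComm,
                                              Tpetra::LocallyReplicated));
  // View rows [0, numLocalRows) of X through the non-distributed map.
  return X->offsetViewNonConst (localMap, 0);
}
```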
Alright, so I did fix the SparseDirectSolver test completely, but I wanted to get your input before checking it in. (See my fork, branch SchwarzFix460).
There's definitely some weirdness with the way Amesos2 validates solver input. I had to create new Tpetra::Maps that are "locally replicated" and have the same number of global rows as the real B/Y have local rows, then make a shallow copy into new multivectors (one map and one new MV per process). Then I also had to run the solver one column at a time; passing the whole MV at once still didn't work. All of this is done only if the "subdomain solver name" parameter is "AMESOS2".
So I have a fix but it's pretty kludgy and there's probably a better way.
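Roughly what I mean by running the solver one column at a time; the solver handle and names are illustrative, and this is a sketch of the workaround rather than the actual patch.

```cpp
#include <Teuchos_Array.hpp>

// Hedged sketch of the column-at-a-time workaround described above.  The
// solver is assumed to expose Amesos2's solve(X, B) interface; names are
// illustrative, not Ifpack2's code.
template <class SolverType, class MV>
void solveColumnByColumn (SolverType& solver, MV& X_local, const MV& B_local)
{
  for (size_t j = 0; j < B_local.getNumVectors (); ++j) {
    Teuchos::Array<size_t> col (1, j);
    auto Xj = X_local.subViewNonConst (col ());  // single-column view of X
    auto Bj = B_local.subView (col ());          // single-column view of B
    solver.solve (Xj.ptr (), Bj.ptr ());
  }
}
```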
Can you give the @trilinos/amesos2 team a test case? I also don't quite get what you mean about the weirdness. If you want to stop by, please feel free. I would like to understand this better.
I was wrong; there's nothing wrong with Amesos2. Ifpack2's Amesos2Wrapper just detected parallel problems incorrectly, by checking only whether A_'s row map's communicator has more than one process. If A is local only (as with AdditiveSchwarz), the wrapper thought the problem was serial and did not create local versions of X and Y. I just made it also check X's and Y's maps to see whether they are distributed.
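For reference, a minimal sketch of the kind of check I mean; names are illustrative and this is not the actual patch.

```cpp
#include <Tpetra_MultiVector.hpp>
#include <Tpetra_RowMatrix.hpp>

// Sketch: treat the problem as parallel if either the matrix's row map or
// the input/output vectors' maps are distributed.
template <class MatrixType, class MV>
bool problemIsDistributed (const MatrixType& A, const MV& X, const MV& Y)
{
  return A.getRowMap ()->isDistributed () ||
         X.getMap ()->isDistributed () ||
         Y.getMap ()->isDistributed ();
}
```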
I made a PR to make sure my changes are OK.
The change is checked in so this test is fixed.
Hi All, could this fix be related to my observation from several months ago that the subdomain solves with the direct solver were requiring too much memory? Basically, I noticed that the total memory use for the factorizations, instead of going down with the increasing number of subdomains, went up significantly.
@dridzal That's a good idea, could you run your problem again (with this fix) and report back?
When I tried to profile MueLu + AdditiveSchwarz + KLU2 to diagnose your bad memory scaling, I didn't actually replicate the issue you saw. Once I got my parameter list right, the total memory usage stayed almost exactly constant with 1, 2, 4 or 8 processes.
Edit: I was using this parameter list with muelu/test/scaling/MueLu_Driver.exe.
@brian-kelley I was using a very similar parameter list, so I don't think that's the issue. I have a question about your comment "memory usage stayed almost exactly constant". Take the extreme case where the LU factorization always results in fully dense triangular factors (we have a case close to this). Then, as you increase the number of subdomains, the memory use should go down. In fact, for N subdomains you should only require 1/N-th of the memory to store the factors.
Structurally, DD with 3 subdomains and no overlap should transform the dense factor
```
x
x x
x x x
x x x x
x x x x x
x x x x x x
```
into
```
x
x x
o o x
o o x x
o o o o x
o o o o x x
```
Could you please comment?
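To make the expected saving concrete, here is the arithmetic under the assumption of fully dense triangular factors and N equal, non-overlapping subdomains of an n-unknown problem:

```latex
\text{serial factor storage: } \frac{n(n+1)}{2} \approx \frac{n^2}{2},
\qquad
\text{with } N \text{ subdomains: } N \cdot \frac{(n/N)\,(n/N+1)}{2} \approx \frac{n^2}{2N}
```

so the total factor storage should drop roughly like 1/N.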
@dridzal Yes, I agree with everything you said - you should see total memory decrease with more subdomains. The test runs I used definitely didn't result in fully dense triangular factors - I was using a very simple and very sparse 500x500 2D Laplacian problem, which has 1,248,000 nonzeros. For reference, it takes 10 MB to store those double values directly. Here is how much memory the AdditiveSchwarz and KLU2 compute() used:
Processes aka subdomains | Per process (MB) | Total (MB) |
---|---|---|
1 | 18.1 | 18.1 |
2 | 9.2 | 18.4 |
4 | 4.56 | 18.24 |
8 | 2.4 | 19.2 |
When I say "memory used", I mean total allocated blocks after minus total allocated blocks before. Edit: I used overlap=1 by the way.
@dridzal I don't really know the mathematical differences between KLU2 and ILU, but in case this is useful, I just tried running the exact same matrix through ILU(0) in MATLAB, and L and U were each 9.988 MB (full CRS); A was 15.98 MB by itself. However, running full LU produced L and U factors that were each 1.5 GB, which is not fully dense but clearly much denser.
@brian-kelley I agree that your example is too simple. It looks like your LU factor size is essentially linear in the number of variables, and not quadratic as we would expect for more realistic problems. My guess is that the factors are essentially banded, possibly with only a few subdiagonals or superdiagonals. In this case, I don't expect any memory savings. Do you have a more challenging example?
I could look around Galeri for a better example. Actually, the MueLu scaling test driver might allow that already; I'll have a look tomorrow.
@dridzal Could you provide more detail on your example where you get dense factors? Galeri provides only standard 2D/3D Laplacian and elasticity problems, which would certainly not reproduce that.
@aprokop Are there any existing MueLu tests that use a denser matrix?
@brian-kelley Not that I know of. The densest ones are probably elasticity and possibly the high order stuff Chris is using.
@aprokop @brian-kelley I need to solve a fully coupled thermal-fluids problem in 3D. The @trilinos/rol team have a 2D implementation right now in the rol/examples/PDE-OPT/thermal-fluid directory. We'll have the 3D implementation done in a few weeks. However, aren't there Matrix Market examples that you could try? Something like
http://math.nist.gov/MatrixMarket/data/SPARSKIT/fidap/fidapm37.html
You would have to partition the problem somehow for parallel studies, but my guess is that Zoltan can help. Basically, any 3D example with slightly more interesting physics (fluids, structures) and a few hundred nonzeros per row would do.
We have support in MueLu for reading in a MatrixMarket file and running a multigrid hierarchy with it. This way we could use a single-level multigrid preconditioner with the desired Ifpack2 smoother. @brian-kelley What do you think?
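A minimal sketch of reading such a Matrix Market file into a Tpetra::CrsMatrix; the file name and typedefs are illustrative.

```cpp
#include <Tpetra_Core.hpp>
#include <Tpetra_CrsMatrix.hpp>
#include <MatrixMarket_Tpetra.hpp>

int main (int argc, char* argv[])
{
  Tpetra::ScopeGuard tpetraScope (&argc, &argv);
  {
    using crs_matrix_type = Tpetra::CrsMatrix<>;
    auto comm = Tpetra::getDefaultComm ();
    // Read the Matrix Market file with a default (uniform contiguous) row map.
    auto A = Tpetra::MatrixMarket::Reader<crs_matrix_type>::readSparseFile (
        "fidapm37.mtx", comm);
    // ... hand A to MueLu and/or Ifpack2::AdditiveSchwarz here ...
  }
  return 0;
}
```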
Another great test may be a matrix resulting from a frequency-domain edge-element discretization of Maxwell's equations. The @trilinos/rol team will be able to provide a test matrix in the next month or so. It is possible that @csiefer2 already has such an example.
@aprokop Yes, I could do that.
I have a simple problem set up with fidapm37.mtx from the link above: 9152 rows, 765944 nonzeros. Here are the memory stats from just the additive Schwarz inner preconditioner compute(). This is with the same parameter list from above, with overlap=1. So more processes decrease the total memory, as expected for a denser matrix.
Processes aka subdomains | Per process (MB) | Total (MB) |
---|---|---|
1 | 9.36 | 9.36 |
2 | 4.16 | 8.32 |
4 | 1.89 | 7.56 |
8 | 0.76 | 6.08 |
Edit: I didn't do any extra partitioning or load balancing; I just used a default distributed map with the right number of global rows. I gave averages in the table. For the 8-process run, the min was 0.623 MB and the max was 1.067 MB.
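For the record, the "default distributed map" here is just a uniform contiguous Tpetra::Map; a minimal sketch (the helper name is illustrative):

```cpp
#include <Tpetra_Map.hpp>
#include <Teuchos_RCP.hpp>

// Sketch: a uniform, contiguous distribution of the 9152 rows of fidapm37
// over the given communicator, with no repartitioning or load balancing.
Teuchos::RCP<const Tpetra::Map<>>
makeDefaultRowMap (const Teuchos::RCP<const Teuchos::Comm<int>>& comm)
{
  const Tpetra::global_size_t numGlobalRows = 9152;
  return Teuchos::rcp (new Tpetra::Map<> (numGlobalRows, 0, comm));
}
```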
@brian-kelley Is there a way we could make this a test that runs nightly? Not saying you should necessarily, just wondering if it's possible.
Great, this is going in the right direction -- thanks for following up. I think that for more complex problems, the savings would be even greater. I hope to provide a few examples soon. Two quick questions: 1) Are the subdomain solves and the serial solve all using the same permutation strategy for the factorization? 2) Do the numbers change significantly with overlap=0? As a side note, when I last used Additive Schwarz with Amesos2, I could only select a non-overlapping DD. I guess this has been fixed, right?
A user has reported that Ifpack2's additive Schwarz with sparse direct solves from Amesos2 does not work properly as a smoother in MueLu. After an initial look, I discovered that the Ifpack2 test covering this was accidentally disabled (I don't know when). I enabled it locally and found that the test fails.