Closed: jhux2 closed this issue 7 years ago.
@allaffa
@jhux2 Which test?
@mhoemmen Ifpack2's AdditiveSchwarz SuperLU unit test.
How was the test disabled? I looked at UnitTestAdditiveSchwarz.cpp and it doesn't seem to be commented out. I was worried because I pushed changes 6 days before this issue cropped up, even though my changes shouldn't have affected this test.
@brian-kelley @jhux2 I'm confused too. It looks like the SuperLU test in ifpack2/test/unit_tests/Ifpack2_UnitTestAdditiveSchwarz.cpp should run, as long as Amesos2 and SuperLU are enabled. The CMakeLists.txt file looks like it's building that .cpp file.
@brian-kelley You have to brag more when you push a big change; I had no idea ;-)
It wasn't really a major change, just the Container Factory for block relaxation. Additive Schwarz never uses containers, and I never touched any additive Schwarz related tests or sources.
The Ifpack2::AdditiveSchwarz SuperLU unit test is guarded by the macro HAVE_AMESOS2_SUPERLU. However, Amesos2_config.h is never included, so the macro is undefined and hence the test doesn't run.
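For concreteness, here is a minimal sketch of the guard pattern involved; the test macro and names below are illustrative, not the actual test code in Ifpack2_UnitTestAdditiveSchwarz.cpp.

```cpp
#include <Teuchos_UnitTestHarness.hpp>
// Without this include, HAVE_AMESOS2_SUPERLU is never defined and the
// guarded test silently compiles to nothing.
#include "Amesos2_config.h"

#ifdef HAVE_AMESOS2_SUPERLU
// Illustrative test name; the real test is more involved.
TEUCHOS_UNIT_TEST(AdditiveSchwarz, SuperLU)
{
  // ... set up AdditiveSchwarz with a SuperLU subdomain solver and check it ...
  TEST_ASSERT(true);
}
#endif // HAVE_AMESOS2_SUPERLU
```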
If I set overlap=0, the test passes. Another AdditiveSchwarz unit test, which uses RILUK as the subdomain solve, passes regardless of the overlap. What is the difference between these two tests? In the failing one, the error is actually an exception thrown by Amesos2 that reports
`x->getGlobalLength() != matrixA_->getGlobalNumCols()`
This would seem to indicate that the local problem setup is different between the two.
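For context, the check behind that message presumably looks something like the following sketch; the helper name and template parameters are made up, and this is not Amesos2's actual code.

```cpp
#include <Teuchos_TestForException.hpp>
#include <stdexcept>

// Hypothetical helper illustrating the consistency check whose failure
// message is quoted above (not Amesos2's actual source).
template <class MultiVectorType, class MatrixType>
void checkSolveDims (const MultiVectorType& x, const MatrixType& A)
{
  TEUCHOS_TEST_FOR_EXCEPTION(
    x.getGlobalLength () != A.getGlobalNumCols (), std::invalid_argument,
    "x->getGlobalLength() != matrixA_->getGlobalNumCols()");
}
```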
I've re-enabled the test with overlap=0 (commit cb801b6). It passes, but nonzero overlap makes it fail. I also generalized the test to use either KLU2 or SuperLU.
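For reference, here is the kind of parameter setup such a generalized test might use. All parameter names below are assumptions on my part and should be checked against the Ifpack2 and Amesos2 documentation.

```cpp
#include <Teuchos_ParameterList.hpp>
#include <string>

// Hedged sketch: parameter names are assumptions, not verified against the
// Ifpack2/Amesos2 documentation.
Teuchos::ParameterList
makeSchwarzParams (const std::string& directSolverName, const int overlapLevel)
{
  Teuchos::ParameterList params;
  params.set ("schwarz: overlap level", overlapLevel);
  params.set ("subdomain solver name", "AMESOS2");
  Teuchos::ParameterList& inner = params.sublist ("subdomain solver parameters");
  inner.set ("Amesos2 solver name", directSolverName);  // e.g. "klu2" or "superlu"
  return params;
}
```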
Hm, Ifpack2's Amesos2 wrapper just lets Amesos2 apply the factorization; it does not use Ifpack2::LocalSparseTriangularSolver. This suggests that #558 is about overlap.
@jhux2 See commit https://github.com/trilinos/Trilinos/commit/46075d305ad43c906ff7bba84ec4c6d4488e4773, which includes a test showing that LocalSparseTriangularSolver is correct. Thus, the issue is either with the sparse factorization interface or with AdditiveSchwarz.
@ambrad did your recent changes fix this one?
@mhoemmen, no. I put overlap = 1 back in the test, and it still throws as reported above.
@ambrad Thanks for the update! I'm looking at Ifpack2's Amesos2 wrapper now to see if it has the same problem that LocalSparseTriangularSolver had with respect to Imports.
Weird things about Ifpack2's Amesos2 wrapper:
- It takes an RCP<const Matrix>, so it promises not to change the matrix.

For some reason, this test wasn't ever building for me. Trying again, and explicitly enabling Amesos2 this time.
I built the test, verified that it built (including the SuperLU test), and it passes for me. @jhux2, could you show me exactly what test is failing, and what it prints when it fails?
Did you change the overlap to 1 in test SparseDirectSolver? See the comment around line 562.
@jhux2 wrote:
Did you change the overlap to 1 in test SparseDirectSolver? See the comment around line 562.
Ah, I see, this is not the Amesos2 wrapper test. I am looking at Ifpack2_UnitTestAdditiveSchwarz.cpp now.
I'm sorry I don't know what this is about and why I'm receiving the emails. Swing by my desk tomorrow so we can connect please. Thanks :)
You were @-mentioned by mistake. I fixed the comment in question. Please unsubscribe from this issue.
@jhux2 I changed the overlap to 1 (around line 562 of Ifpack2_UnitTestAdditiveSchwarz.cpp, as you mentioned) and got the test to fail.
I'm wondering if the same thing as in @ambrad 's fix for RILUK's use of LocalSparseTriangularSolver in AdditiveSchwarz (get the Maps right) will work here. I started trying it out.
@bmpersc I thought be89a0d was just moving the test, not a fix.
@jhux2 This seems to have been an automated thing that GitHub did, based on the commit log in be89a0d, which said "This will help us fix #460." Apparently the word "fix" plus an issue number is enough to close a ticket automatically.
@bmpersc Ok, good to know.
@jhux2 explained just now that the "user" in his original report was a summer student working with Ray.
I think I solved this issue. In AdditiveSchwarz::apply(), with overlap, the OverlappingB and OverlappingY vectors are still distributed across all processes when localApply() is called (which in turn runs the inner solver). For some reason, Amesos2 expects those vectors to have a global length matching the number of rows of the local matrix, which doesn't hold on parallel problems. I just copied the local entries of OverlappingB and OverlappingY into non-distributed MVs, and it worked with or without overlap (0, 1, or 2).
Awesome! Thanks @brian-kelley ! I think you could even just have the non-distributed MVs view the data of the distributed MVs, without needing a copy.
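A minimal sketch of that view-based idea, assuming Tpetra::MultiVector::offsetViewNonConst accepts a locally replicated map whose local length matches; the function and variable names are illustrative.

```cpp
#include <Teuchos_DefaultSerialComm.hpp>
#include <Tpetra_Map.hpp>
#include <Tpetra_MultiVector.hpp>

// Sketch only: wrap the locally owned rows of a distributed MultiVector in a
// non-distributed ("locally replicated") view, without copying.
template <class MV>
Teuchos::RCP<MV> localView (const Teuchos::RCP<MV>& X)
{
  using map_type = typename MV::map_type;
  const size_t numLocalRows = X->getLocalLength ();
  auto serialComm = Teuchos::rcp (new Teuchos::SerialComm<int> ());
  auto localMap = Teuchos::rcp (new map_type (numLocalRows, 0, serialComm,
                                              Tpetra::LocallyReplicated));
  // View rows [0, numLocalRows) of X through the non-distributed map.
  return X->offsetViewNonConst (localMap, 0);
}
```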
Alright, so I did fix the SparseDirectSolver test completely, but I wanted to get your input before checking it in. (See my fork, branch SchwarzFix460).
There's definitely some weirdness with the way Amesos2 validates solver input. I had to create new Tpetra::Maps that are "locally replicated" and have the same number of global rows as the real B/Y have local rows, then make a shallow copy into new multivectors (one map and one new MV per process). Then I also had to run the solver one column at a time; passing the whole MV at once still didn't work. All of this is done only if the "subdomain solver name" parameter is "AMESOS2".
So I have a fix but it's pretty kludgy and there's probably a better way.
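Roughly what I mean by running the solver one column at a time; the solver handle and names are illustrative, and this is a sketch of the workaround rather than the actual patch.

```cpp
#include <Teuchos_Array.hpp>

// Hedged sketch of the column-at-a-time workaround described above.  The
// solver is assumed to expose Amesos2's solve(X, B) interface; names are
// illustrative, not Ifpack2's code.
template <class SolverType, class MV>
void solveColumnByColumn (SolverType& solver, MV& X_local, const MV& B_local)
{
  for (size_t j = 0; j < B_local.getNumVectors (); ++j) {
    Teuchos::Array<size_t> col (1, j);
    auto Xj = X_local.subViewNonConst (col ());  // single-column view of X
    auto Bj = B_local.subView (col ());          // single-column view of B
    solver.solve (Xj.ptr (), Bj.ptr ());
  }
}
```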
Can you give the @trilinos/amesos2 team a test case? I also don't quite get what you mean about the weirdness. If you want to stop by, please feel free. I would like to understand this better.
I was wrong; there's nothing wrong with Amesos2. Ifpack2's Amesos2Wrapper just detected parallel problems incorrectly, by checking only whether A_'s row map's communicator has more than one process. If A is local only (as with AdditiveSchwarz), the wrapper thought the problem was serial and did not create local versions of X and Y. I just made it also check X's and Y's maps to see whether they are distributed.
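For reference, a minimal sketch of the kind of check I mean; names are illustrative and this is not the actual patch.

```cpp
#include <Tpetra_MultiVector.hpp>
#include <Tpetra_RowMatrix.hpp>

// Sketch: treat the problem as parallel if either the matrix's row map or
// the input/output vectors' maps are distributed.
template <class MatrixType, class MV>
bool problemIsDistributed (const MatrixType& A, const MV& X, const MV& Y)
{
  return A.getRowMap ()->isDistributed () ||
         X.getMap ()->isDistributed () ||
         Y.getMap ()->isDistributed ();
}
```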
I made a PR to make sure my changes are OK.
The change is checked in so this test is fixed.
Hi All, could this fix be related to my observation from several months ago that the subdomain solves with the direct solver were requiring too much memory? Basically, I noticed that the total memory use for the factorizations, instead of going down with the increasing number of subdomains, went up significantly.
@dridzal That's a good idea, could you run your problem again (with this fix) and report back?
When I tried to profile MueLu + AdditiveSchwarz + KLU2 to diagnose your bad memory scaling, I didn't actually replicate the issue you saw. Once I got my parameter list right, the total memory usage stayed almost exactly constant with 1, 2, 4 or 8 processes.
Edit: I was using this parameter list with muelu/test/scaling/MueLu_Driver.exe.
@brian-kelley I was using a very similar parameter list, so I don't think that's the issue. I have a question about your comment "memory usage stayed almost exactly constant". Take the extreme case where the LU factorization always results in fully dense triangular factors (we have a case close to this). Then, as you increase the number of subdomains, the memory use should go down. In fact, for N subdomains you should only require 1/N-th of the memory to store the factors.
Structurally, DD with 3 subdomains and no overlap should transform the dense factor
```
x
x x
x x x
x x x x
x x x x x
x x x x x x
```
into
```
x
x x
o o x
o o x x
o o o o x
o o o o x x
```
Could you please comment?
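To make the expected saving concrete, here is the arithmetic under the assumption of fully dense triangular factors and N equal, non-overlapping subdomains of an n-unknown problem:

```latex
\text{serial factor storage: } \frac{n(n+1)}{2} \approx \frac{n^2}{2},
\qquad
\text{with } N \text{ subdomains: } N \cdot \frac{(n/N)\,(n/N+1)}{2} \approx \frac{n^2}{2N}
```

so the total factor storage should drop roughly like 1/N.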
@dridzal Yes, I agree with everything you said - you should see total memory decrease with more subdomains. The test runs I used definitely didn't result in fully dense triangular factors - I was using a very simple and very sparse 500x500 2D Laplacian problem, which has 1,248,000 nonzeros. For reference, it takes 10 MB to store those double values directly. Here is how much memory the AdditiveSchwarz and KLU2 compute() used:
Processes aka subdomains | Per process (MB) | Total (MB) |
---|---|---|
1 | 18.1 | 18.1 |
2 | 9.2 | 18.4 |
4 | 4.56 | 18.24 |
8 | 2.4 | 19.2 |
When I say "memory used", I mean total allocated blocks after minus total allocated blocks before. Edit: I used overlap=1 by the way.
@dridzal I don't really know the mathematical differences between KLU2 and ILU, but in case this is useful, I just tried running the exact same matrix through ILU(0) in MATLAB, and L and U were each 9.988 MB (full CRS); A was 15.98 MB by itself. However, running full LU produced L and U factors that were each 1.5 GB, which is not fully dense but clearly much denser.
@brian-kelley I agree that your example is too simple. It looks like your LU factor size is essentially linear in the number of variables, and not quadratic as we would expect for more realistic problems. My guess is that the factors are essentially banded, possibly with only a few subdiagonals or superdiagonals. In this case, I don't expect any memory savings. Do you have a more challenging example?
I could look around Galeri for a better example. Actually, the MueLu scaling test driver might allow that already; I'll have a look tomorrow.
@dridzal Could you provide more detail on your example where you get dense factors? Galeri provides only standard 2D/3D Laplacian and elasticity problems, which would certainly not reproduce that.
@aprokop Are there any existing MueLu tests that use a denser matrix?
@brian-kelley Not that I know of. The densest ones are probably elasticity and possibly the high order stuff Chris is using.
@aprokop @brian-kelley I need to solve a fully coupled thermal-fluids problem in 3D. The @trilinos/rol team have a 2D implementation right now in the rol/examples/PDE-OPT/thermal-fluid directory. We'll have the 3D implementation done in a few weeks. However, aren't there Matrix Market examples that you could try? Something like
http://math.nist.gov/MatrixMarket/data/SPARSKIT/fidap/fidapm37.html
You would have to partition the problem somehow for parallel studies, but my guess is that Zoltan can help. Basically, any 3D example with slightly more interesting physics (fluids, structures) and a few hundred nonzeros per row would do.
We have support in MueLu for reading in a MatrixMarket file and running a multigrid hierarchy with it. This way we could use a single-level multigrid preconditioner with the desired Ifpack2 smoother. @brian-kelley What do you think?
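A minimal sketch of reading such a Matrix Market file into a Tpetra::CrsMatrix; the file name and typedefs are illustrative.

```cpp
#include <Tpetra_Core.hpp>
#include <Tpetra_CrsMatrix.hpp>
#include <MatrixMarket_Tpetra.hpp>

int main (int argc, char* argv[])
{
  Tpetra::ScopeGuard tpetraScope (&argc, &argv);
  {
    using crs_matrix_type = Tpetra::CrsMatrix<>;
    auto comm = Tpetra::getDefaultComm ();
    // Read the Matrix Market file with a default (uniform contiguous) row map.
    auto A = Tpetra::MatrixMarket::Reader<crs_matrix_type>::readSparseFile (
        "fidapm37.mtx", comm);
    // ... hand A to MueLu and/or Ifpack2::AdditiveSchwarz here ...
  }
  return 0;
}
```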
Another great test may be a matrix resulting from a frequency-domain edge-element discretization of Maxwell's equations. The @trilinos/rol team will be able to provide a test matrix in the next month or so. It is possible that @csiefer2 already has such an example.
@aprokop Yes, I could do that.
I have a simple problem set up with fidapm37.mtx from the link above: 9152 rows, 765944 nonzeros. Here are the memory stats from just the additive Schwarz inner preconditioner compute(). This is with the same parameter list from above, with overlap=1. So more processes decrease the total memory, as expected for a denser matrix.
Processes aka subdomains | Per process (MB) | Total (MB) |
---|---|---|
1 | 9.36 | 9.36 |
2 | 4.16 | 8.32 |
4 | 1.89 | 7.56 |
8 | 0.76 | 6.08 |
Edit: I didn't do any extra partitioning or load balancing; I just used a default distributed map with the right number of global rows. I gave averages in the table. For the 8-process run, the min was 0.623 MB and the max was 1.067 MB.
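For the record, the "default distributed map" here is just a uniform contiguous Tpetra::Map; a minimal sketch (the helper name is illustrative):

```cpp
#include <Tpetra_Map.hpp>
#include <Teuchos_RCP.hpp>

// Sketch: a uniform, contiguous distribution of the 9152 rows of fidapm37
// over the given communicator, with no repartitioning or load balancing.
Teuchos::RCP<const Tpetra::Map<>>
makeDefaultRowMap (const Teuchos::RCP<const Teuchos::Comm<int>>& comm)
{
  const Tpetra::global_size_t numGlobalRows = 9152;
  return Teuchos::rcp (new Tpetra::Map<> (numGlobalRows, 0, comm));
}
```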
@brian-kelley Is there a way we could make this a test that runs nightly? Not saying you should necessarily, just wondering if it's possible.
Great, this is going in the right direction -- thanks for following up. I think that for more complex problems, the savings would be even greater. I hope to provide a few examples soon. Two quick questions: 1) Are the subdomain solves and the serial solve all using the same permutation strategy for the factorization? 2) Do the numbers change significantly with overlap=0? As a side note, when I last used Additive Schwarz with Amesos2, I could only select a non-overlapping DD. I guess this has been fixed, right?
A user has reported that Ifpack2's additive Schwarz with sparse direct solves from Amesos2 does not work properly as a smoother in MueLu. After an initial look, I discovered that the Ifpack2 test covering this was accidentally disabled (I don't know when). I enabled it locally and found that the test fails.