trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/
Other
1.19k stars 565 forks source link

Nalu diffs due to MueLu_Amesos2Smoother change #12383

Closed spdomin closed 10 months ago

spdomin commented 11 months ago

Bug Report

New diffs in Nalu's nightly test suite due to recent Muelu changes.

Good: 7da08904d1d2a40ae8c33338d22ae4e9bfb99620
Bad: 9c30d791fe38aac21669aa9337d672804ff649b9

Bisect noted @cgcgcg:

eebc8109506cd267837b15d2fe342105d74c2460 is the first bad commit
commit eebc8109506cd267837b15d2fe342105d74c2460
Author: Christian Glusa <caglusa@sandia.gov>
Date:   Fri Sep 29 12:42:23 2023 -0600

    MueLu_Amesos2Smoother: Fixes for projection and non-contiguous maps

    - Projection was incorrect for nullspaces of dim>1.
    - Check for non-contiguous map was insufficient.

This one looks like a bug fix. Let me know if: 1) the change is intended, 2) improves code quality, 3) Nalu diffs should be accepted.

github-actions[bot] commented 11 months ago

Automatic mention of the @trilinos/muelu team

cgcgcg commented 11 months ago

@spdomin Thanks for reporting this. How significant are the diffs? We've had some applications bug out due to a loss of convergence, but that should be fixed. I'm taking it that that's not the case here?

spdomin commented 11 months ago

The loss of convergence that I recently reported was from a new eigenvalue approach that Chris pushed, however, I thought I closed that issue once we agreed it was a better change (I also increased the Chebyshev eigenvalue iteration count). Perhaps you mean another change - or this one?

In my cases, I do not see large diffs - just order-of-opps-like magnitude. In fact, only the "strict" gnu platform found the diffs, e.g., Mac is more lax in diff checking.

Moreover, in most all of the cases I spot checked, the average linear solver iterations for the continuity solve were the same, although this was not an exhaustive check since it is currently done by hand with a "diff" or "tkdiff".

cgcgcg commented 11 months ago

No, I managed to break Albany with the commit you referenced. I did push a fix the next day, but wanted to make sure it's not that. If there is indeed no change in iterations I would suggest to approve the diff.

spdomin commented 11 months ago

Okay, I suppose I will check the diffs between these two changes to make sure all is well.

To be clear, the latest code base in Muelu is an improvement to the previous implementation, correct? If so, I can proceed with the re-bless.

cgcgcg commented 11 months ago

Yes. The commit and the subsequent had two changes:

spdomin commented 11 months ago

I will proceed with the re-bless. Before I do that, do you have the set of Muelu XML parameters that activate this new code path/formulation? It looks like the amount of cases diffing is much smaller than the full set of Muelu cases.

An XML is under:

https://github.com/NaluCFD/Nalu/blob/master/reg_tests/xml/milestone.xml

I have not checked if all of the diffs point to this xml, however, will do that while the cases are running.

cgcgcg commented 11 months ago

In order to use the pseudo-inverse you would need to set

<ParameterList name="coarse: params">
   <Parameter name="fix nullspace" type="bool" value="true"/>
</ParameterList>

I would only recommend using that if you are 100% sure that the near-nullspace of the multigrid preconditioner is the actual nullspace because you are solving a singular problem. For example, Poisson with pure Neumann is such a case. You are specifying (maybe implicitely because MueLu might do that for you) that the near-nullspace is the constant vector.

For the non-contiguous maps nothing has to be changed.

spdomin commented 11 months ago

Hmmm. None of the cases that diffed were a singular system. We only have one test for that, and it was the one that was failing from the eignevalue iteration change a couple of weeks ago.

spdomin commented 11 months ago

Moreover, I have clearly not activated this option. Finally, I do not believe we have non-contiguous maps? Did something else change?

cgcgcg commented 11 months ago

Ok, just to make sure: eebc8109506cd267837b15d2fe342105d74c2460 gives you a "bad" result. Can you also check 1d1614663159e74d9bbb05bdc914b61c3500a0b4? That's the fix for the previous commit.

spdomin commented 11 months ago

It did not seem that the diff magnitude changed over the past week.

cgcgcg commented 11 months ago

So the diff appeared with eebc8109506cd267837b15d2fe342105d74c2460 and did not change since? Weird.

cgcgcg commented 11 months ago

Looking at eebc8109506cd267837b15d2fe342105d74c2460 the first two chunks should not be active since you're not setting "fix nullspace". So the diff must be due to the last chunk that deals with non-contiguous maps. You could revert that and see if that is indeed the culprit.

spdomin commented 11 months ago

Could you refresh my memory on what constitutes a non-contiguous map? I know that in certain use cases, such as sliding mesh, overset, or periodic, we may have some sort of advanced matrix connectivity. However, many of the cases that failed are routine cases with standard inflow, open, and wall boundaries. One was a hybrid mesh for a duct flow.

Any info would be nice to have so that we can all be sure that nothing else changed unknowingly. Possibly, one of the smallest cases might have an opportunity to drop a matrix file.

cgcgcg commented 11 months ago

Have a look at the section "Contiguous or noncontiguous" in the Tpetra documentation: https://docs.trilinos.org/dev/packages/tpetra/doc/html/classTpetra_1_1Map.html

spdomin commented 11 months ago

@alanw0, could you please add perspective on why a routine Nalu inflow/open/wall bc case would have a noncontiguous map? Is this something that STK might be "suggesting" due to global id mapping?

For a review, new Nalu diffs showed up due to a recent Muelu change that touched applications that have noncontinuous maps.

alanw0 commented 11 months ago

We construct a Tpetra_Map like this, on line 317 of TpetraLinearSystem.C in nalu:

ownedRowsMap_ = Teuchos::rcp(new LinSys::Map(Teuchos::OrdinalTraits<Tpetra::global_size_t>::invalid(), ownedGids, 1, tpetraComm));

The ownedGids may indeed be noncontigous, those are the global-IDs coming from the exodus mesh. They can be arbitrary, you could have a mesh of 10 elements with ids starting from 1 billion, etc. But I was under the impression that Tpetra Maps internally map the user IDs to a (globally) zero-based contiguous set of ids. Is a Tpetra Map non contiguous based on the user IDs?

alanw0 commented 11 months ago

@spdomin for the diffing case, is it diffing on 1 processor, or only diffing in parallel? On 1 processor, the ownedGids would be contiguous, whereas in parallel they may be messed up by the decomposition and all bets are off. You can ncdump the un-decomposed exodus mesh and see if it has a node map. If not, then the ids are contiguous in serial. If it has a node map, then that will also show the (potentially arbitrary) global-ids.

spdomin commented 11 months ago

Hi @alanw0, all of the cases are parallel. I can try a test in serial - once we understand the rules of what is and is not contiguous. Based on how the global ids are served up from Exodus, we may never have a contiguous system:)

Thanks for chiming in:)

cgcgcg commented 11 months ago

There should be no change in behavior in parallel.

spdomin commented 11 months ago

Okay, this saves me extra testing, however, we are still not sure why these standard cases are interpreted as non-contiguous maps. Again, the premise set forth is that diffs will only show up when the maps are non-contiguous. As such, the open question is why they are in these simple cases.

I care about this since if the map is contiguous, then something else changed to cause the diffs.

spdomin commented 10 months ago

Diffs accepted; still some questions on maps, however, we can take this offline.