sandialabs / Albany

Sandia National Laboratories' Albany multiphysics code

Failing FO_GIS Analysis tests #533

Closed ikalash closed 5 years ago

ikalash commented 5 years ago

Seems the new tests that were added yesterday are failing:

FO_GIS_Analysis_BasalFriction FO_GIS_Analysis_StiffeningBasalFriction

http://cdash.sandia.gov/CDash-2-3-0/viewTest.php?onlyfailed&buildid=90632

jewatkins commented 5 years ago

Hmm the analysis seems to give one of two results depending on the build: http://cdash.sandia.gov/CDash-2-3-0/testSummary.php?project=10&name=FO_GIS_Analysis_BasalFriction&date=2019-11-07

Do you know if there are any clear differences between the two sets?

jewatkins commented 5 years ago

So the difference between the failing and passing tests is that the passing ones decompose the mesh at run time, logging:

IOSS: Using decomposition method 'RIB' for 3,690 elements on 4 processors.

The failing ones don't have that line.

jewatkins commented 5 years ago

I believe that has something to do with Use Serial Mesh

ikalash commented 5 years ago

So you're saying that you are having Albany auto-decompose the mesh by selecting 'Use Serial Mesh=true' in the input file and not providing a decomposed mesh? If you change 'Use Serial Mesh' to 'false' and decompose the mesh yourself for the failing tests, do they pass?

Seems things may have been calibrated using a different decomposition than the one produced by the decomp that gets called when 'Use Serial Mesh=true', though I would think that the result would be decomposition-insensitive (unless you're doing exodiff w.r.t. a gold file created using a different decomposition, of course, which you are not doing here).

mperego commented 5 years ago

The error seems pretty big. However, the optimization process can be fairly chaotic. We may want to check with Denis Ridzal or Drew Kouri to understand what we can expect in terms of reproducibility.

jewatkins commented 5 years ago

Use Serial Mesh seems to be tied to an environment variable in the file (USE_SERIAL_MESH). Not sure how that works, but it would be my first guess as to why there's a discrepancy (I'll check), and it matches what's seen in the builds. It seems likely to me that a different decomposition would give a different value for the parameter vector (i.e. who knows if the first parameter in the vector corresponds to the same point in space).

mperego commented 5 years ago

Oh I see, we are checking the solution at every point. Wouldn't it be better to compute some norm/response of the solution, as in other tests?

jewatkins commented 5 years ago

We do that already with the solve. This is checking the parameter vector, which currently compares each value one by one. We could compute a norm, but it would require more work: we'd either have to change all the other analysis tests or create a new testing framework for it.

jewatkins commented 5 years ago

Just confirmed that's the issue. Different decompositions give different results; even the gnorm starts to change after 5 iterations. Testing after one iteration would probably be okay, though then the test wouldn't be as strict.

bartgol commented 5 years ago

USE_SERIAL_MESH is a cmake variable that we turn on or off based on the Trilinos installation specs (in particular, you need Iopx to cut the mesh at run time, and seacas to cut it from the command line).

The check on the parameter entries should be handled in terms of GIDs: the input file should specify a handful of GIDs to check, and the SolverFactory should then perform the check on the GIDs that are owned by the current rank.

ikalash commented 5 years ago

Ah I see, that would be the decomposition then. There used to be an option in Main_Solve.cpp to export the solution to a serial map (so that proc 0 has the entire vector) and then write it to a MatrixMarket file, but I guess it got removed in one of the refactors this past year. It's still in some older branches, e.g. https://github.com/SNLComputation/Albany/blob/refactor-disc-fields/src/Main_SolveT.cpp#L550-L556 . I think here you don't have the solution but some other parameter vector; maybe you could do something similar so that the comparisons are done after the vector has been exported to live only on proc 0? That would take the parallel decomposition out of it.

jewatkins commented 5 years ago

Ah okay thanks.

Unfortunately, even if we were to change the testing to compare GIDs, I still think it would fail. If I grab the first value from one vector (serial mesh true), I can't match it to any other value in the other vector (serial mesh false).

bartgol commented 5 years ago

@jewatkins I'm not following. Are you saying that running a test with serial mesh true generates different numbers than a test run on an already decomposed mesh? I guess that's possible (the decomp utility from seacas might cut the mesh differently than the stk/ioss/exodus rebalance), but I would "hope" the difference is small, and we can tune the tolerance of the tests.

jewatkins commented 5 years ago

Piro/ROL gives me the parameter vector, so I don't really have control over the decomposition unless I muck around in there. I think the fact that the optimization is decomposition dependent means we shouldn't be treating runs with different decompositions as the same test, though.

jewatkins commented 5 years ago

@bartgol Yes that's right. Flipping serial mesh on/off gives me a different parameter vector after one iteration. Serial false:

0:1.88116062e+00  
1:1.91906033e+00

Serial true:

IOSS: Using decomposition method 'RIB' for 3,690 elements on 4 processors.
...
  Before rebal nelements 0  923 
  Before rebal 0  548
  Before rebalance: Imbalance threshold is = 1.01921
  ZOLTAN Load balancing method = 3 (RCB)
  After rebalance: Imbalance threshold is = 1.03214
...
0:2.00000000e+00
1:1.97712126e+00

Hmm it's possible the decomposed mesh in the Albany repo needs to be updated to whatever the latest seacas decomposition does...

bartgol commented 5 years ago

We have some decomposed meshes in tests/small/LandIce/ExoMeshes but we don't use them. We always use the monolithic one and then decompose it either offline or online, depending on what resources are available in seacas/stk.

bartgol commented 5 years ago

Are you printing the whole vector (i.e., it's a vector in R^2) or are you sampling it?

jewatkins commented 5 years ago

The analysis input files use those meshes:

Exodus Input File Name: ../ExoMeshes/gis_unstruct_2d.exo

Or are you saying there's a script somewhere that goes in that directory and decomposes?

jewatkins commented 5 years ago

The whole vector gets printed (I think in Piro). AlbanyAnalysis grabs that vector and compares each value in the vector to values specified in the input. I just put in the first 100 values although the vector is of size 2087.

bartgol commented 5 years ago

That's the path in the build tree. The decomposed meshes in that folder of the build tree are generated by cmake. Although cmake does copy the ones from the source tree to the build folder, they are overwritten by the commands run around line 30 of tests/small/LandIce/FO_GIS/CMakeLists.txt. Which means the decomposed gis meshes in tests/small/LandIce/ExoMeshes (in the src folder) could be removed from the repo...

jewatkins commented 5 years ago

Ah okay got it, good to know. Then it's not a matter of outdated files.

bartgol commented 5 years ago

The whole vector gets printed (I think in Piro). AlbanyAnalysis grabs that vector and compares each value in the vector to values specified in the input. I just put in the first 100 values although the vector is of size 2087.

Ah, but you have access to the parameter vector space, and with that, you can use the Thyra utils to access a specific GID, regardless of the vector distribution (if not on rank, simply move on). I'm just concerned that the test in SolverFactory is assuming the vector is replicated on all ranks, and maybe that messes up which numbers it's comparing...

jewatkins commented 5 years ago

Hmm that's true, I thought about that but never checked... let me see what each ranks has in terms of values.

FYI, I tried the write matrix thing but it didn't work. It said the type was incorrect. This is the type:

  p = Thyra::DefaultProductVector<double>{dim=2087}
   numBlocks=1
   Constituent vector objects v[0], v[1], ... v[numBlocks-1]:
    v[0] = Thyra::TpetraVector<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >{spmdSpace=Thyra::TpetraVectorSpace<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >{globalDim=2087,localSubDim=545,localOffset=0,comm=Teuchos::MpiComm<long>{size=4,rank=0,rawMpiComm=0x9081220}}}

bartgol commented 5 years ago

Ah, I think we did not overload writeMatrixMarket for product vectors... It's probably something we should do since, obviously, the cast to Tpetra/Epetra vectors will fail. But if in your test you only have one block, you can call writeMatrixMarket on its first block.

jewatkins commented 5 years ago

Ah okay I figured it was something like that, thanks.

jewatkins commented 5 years ago

I confirmed that the vector that is printed contains all values across all ranks, and that the comparison is done on the global vector (i.e. it uses DetachedVectorView and loops over the global vector). The distribution is indeed different in the two cases (one can look at localSubDim on each rank to see the difference).

@mperego Do you know if the algorithm used in ROL is decomposition dependent? We could do a comparison on the norm, but it would probably be a lot less sensitive to changes in the test.

bartgol commented 5 years ago

@jewatkins sometimes even the fwd problem is somewhat sensitive to decomposition (you can try with 1 or 2 ranks, and see slightly different values). I think this stems from differences in the preconditioner (e.g., aggregates for ML/MueLu or schwarz decompositions for ifpack/ifpack2).

My hope was that the differences were "small" (I think the fwd problems have "small" differences), but what you said implies they're not.

One potential solution could be to always run the test serially. If run in parallel (with a predefined number of ranks), we should always use one method (decomp vs stk rebalance), and error out at config time if that method is not available due to Trilinos config options. That's probably the best we can do.

Ideally, I'd like to avoid checks on norms/average/max/min. However, we already do that for the solution, so maybe it's acceptable. But if we go that route, we need ROL to be converged (or near convergence).

jewatkins commented 5 years ago

Okay, since the other tests are comparing single-value responses and not choosing a specific decomposition to use, I'm inclined to create the infrastructure to test the norm of the parameter vector, to be consistent with the other GIS tests. At least there's a difference between the norm of the initial condition (9.13673902e+01) and the value after one iteration (9.03392887e+01), so it's testing some change.

I think we can look at adding a more rigorous test at a later date. It will most likely require running the optimization for a longer time period.