jewatkins opened this issue 3 years ago (status: Open)
A smaller problem for comparison, after one iteration:
humboldt, 3-20km, ML preconditioner, 20 ranks:
Total linear iterations in first nonlinear iteration: 25
Solution average after nonlinear convergence: 0.1072498668
humboldt, 3-20km, ML preconditioner, 2 ranks:
Total linear iterations in first nonlinear iteration: 23
Solution average after nonlinear convergence: 0.107249664344
The Humboldt results look reasonable to me.
I checked the difference between the Jacobian in the 2-rank case vs. the 20-rank case:
max value: 3.7253e-09
norm: 1.7459e-08
The results are similar if I use a homogeneous DBC on the basalside with no BC on the lateral boundary: 3.7253e-09 and 1.6862e-08.
Is this an absolute difference or a relative one?
absolute
OK. It might be pointing to a wrongly assembled Jacobian.
Meanwhile, I double-checked a basic problem (SteadyHeat2D), running it with 1 and 32 ranks: the NOX residuals start to differ (1e-3 relative diff) only at the 3rd nonlinear iteration. HOWEVER, the linear solver history is quite different already with the 1st Jacobian: 16 (serial) vs 43 (32 ranks) iterations.
Edit: this is a Tpetra run, with Belos solver and Ifpack2 prec (ILUT with fill-in 1).
However, J_serial-J_parallel ~ 1e-16.
So this is interesting: the Jacobian is basically the same, and yet we get very different convergence histories...
I'm not an expert on Ifpack, but: when using several ranks, some subdomains contain no Dirichlet rows. Could this degrade solver performance (due to singular or nearly singular local matrices)?
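As a toy numpy sketch of the general effect being hypothesized here: splitting the preconditioner into more independent subdomain solves typically costs extra Krylov iterations, qualitatively matching the 16 (serial) vs 43 (32-rank) observation. This uses plain (non-overlapping) block Jacobi with a hand-written PCG, not Ifpack2's overlapping ILUT; the matrix, size, and block counts are made up for illustration.

```python
import numpy as np

def laplacian_1d(n):
    # 1D Poisson matrix (Dirichlet rows already eliminated), SPD
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def block_jacobi_prec(A, nblocks):
    # One diagonal block per "rank"; precompute each block inverse
    n = A.shape[0]
    bounds = np.linspace(0, n, nblocks + 1).astype(int)
    blocks = [(s, e, np.linalg.inv(A[s:e, s:e]))
              for s, e in zip(bounds[:-1], bounds[1:])]
    def apply(r):
        z = np.empty_like(r)
        for s, e, Minv in blocks:
            z[s:e] = Minv @ r[s:e]
        return z
    return apply

def pcg(A, b, prec, rtol=1e-8, maxit=500):
    # Textbook preconditioned conjugate gradient; returns iteration count
    x = np.zeros_like(b)
    r = b - A @ x
    z = prec(r)
    p = z.copy()
    rz = r @ z
    normb = np.linalg.norm(b)
    for k in range(1, maxit + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < rtol * normb:
            return k
        z = prec(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return maxit

A = laplacian_1d(64)
b = np.ones(64)
# More "ranks" (blocks) -> weaker preconditioner -> more iterations
iters = {nb: pcg(A, b, block_jacobi_prec(A, nb)) for nb in (1, 4, 16)}
print(iters)
```

With a single block the preconditioner is an exact solve and CG converges in one iteration; cutting the same matrix into more blocks only degrades the iteration count, even though the assembled matrix is bit-for-bit identical in every case.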
Using MueLu (instead of Ifpack2) gives the same linear iteration counts in serial and on 32 ranks. For instance:
serial:
*******************************************************
***** Belos Iterative Solver: Block Gmres
***** Maximum Iterations: 100
***** Block Size: 1
***** Residual Test:
***** Test 1 : Belos::StatusTestImpResNorm<>: (2-Norm Res Vec) / (2-Norm Res0), tol = 0.0001
*******************************************************
Iter 0, [ 1] : 1.000000e+00
Iter 9, [ 1] : 3.158400e-05
vs 32 ranks:
*******************************************************
***** Belos Iterative Solver: Block Gmres
***** Maximum Iterations: 100
***** Block Size: 1
***** Residual Test:
***** Test 1 : Belos::StatusTestImpResNorm<>: (2-Norm Res Vec) / (2-Norm Res0), tol = 0.0001
*******************************************************
Iter 0, [ 1] : 1.000000e+00
Iter 9, [ 1] : 2.945532e-05
This does not explain the ALI test failures, where the preconditioner was ML or MueLu, but it at least confirms that the core assembly in Albany is not buggy. It also suggests that the choice of preconditioner type might be affecting convergence in some cases.
Edit: for completeness, these are the Ifpack2 and MueLu params:
Ifpack2:
  Overlap: 1
  Prec Type: ILUT
  Ifpack2 Settings:
    'fact: drop tolerance': 0.00000000000000000e+00
    'fact: ilut level-of-fill': 1.00000000000000000e+00
    'fact: level-of-fill': 1
MueLu:
  verbosity: none
  max levels: 5
  'coarse: max size': 512
  multigrid algorithm: sa
  'aggregation: type': uncoupled
  'smoother: type': RELAXATION
  'smoother: params':
    'relaxation: type': Jacobi
    'relaxation: sweeps': 1
    'relaxation: damping factor': 2.5e-01
Thanks for those results. FYI, if I use relative difference it's 1.0446e-13 and 2.8920e-13.
1e-13 is not quite machine precision, but it might be due to accumulated roundoff, so it may still be a roundoff-level difference. This is good, since it suggests that we're not messing up the assembly (phew).
My best guess is that we need to be careful with our preconditioner specifications. I just tried input_fo_gis_unstructT.yaml, and I also get the same linear convergence history for serial and 32 ranks. That test uses Ifpack2, but the dof ordering is column-wise, so each rank does contain a piece of the BC and the local matrices are not singular (note: I am talking a bit out of my @$$ here, since I don't really know how the Ifpack preconditioner works, and the singularity of the local matrices might not be relevant).
@jewatkins could you remind me which input file and input mesh you used?
I'm using a modified version of https://github.com/ikalash/ali-perf-tests/blob/master/perf_tests/humboldt-3-20km/input_albany_Velocity_MueLu_Wedge.yaml where I switch to "Use Serial Mesh: true" and Epetra/AztecOO/ML. Mesh files are here: https://github.com/ikalash/ali-perf-tests/tree/master/meshes/humboldt-3-20km
I could add the exact input file I'm using if you want to try running it.
Thanks! No need to add the exact input file.
So, I tried that input file, with the meshes in /home/projects/albany/ali-perf-tests-meshes. Use Serial Mesh causes little change in the convergence history (the serial mesh gives 1 fewer linear solver iteration at each solve, but the residuals look fairly similar). Furthermore, serial vs 32 ranks is fairly similar: I get about 2-3 more iterations with 32 ranks for each linear solve, compared to the serial case.
Is this what you are observing for that mesh/input file? Note: I'm not using SFad, I'm using DFad (the default in Albany).
OK, so we don't get significant changes on the Humboldt mesh. @jewatkins what solver are you using on the Greenland mesh? Belos or AztecOO? Does the choice of the solver make a difference? Have you tried the Thwaites mesh? Maybe there is something wrong with the Greenland meshes.
> So, I tried that input file, with the meshes in /home/projects/albany/ali-perf-tests-meshes. Use Serial Mesh causes little change in the convergence history (the serial mesh gives 1 fewer linear solver iteration at each solve, but the residuals look fairly similar). Furthermore, serial vs 32 ranks is fairly similar: I get about 2-3 more iterations with 32 ranks for each linear solve, compared to the serial case. Is this what you are observing for that mesh/input file? Note: I'm not using SFad, I'm using DFad (the default in Albany).
Yes, that sounds about right. I used 2 and 20 ranks and these are the results I got:
humboldt, 3-20km, ML preconditioner, 20 ranks:
Total linear iterations in first nonlinear iteration: 25
Solution average after nonlinear convergence: 0.1072498668
humboldt, 3-20km, ML preconditioner, 2 ranks:
Total linear iterations in first nonlinear iteration: 23
Solution average after nonlinear convergence: 0.107249664344
> OK, so we don't get significant changes on the Humboldt mesh. @jewatkins what solver are you using on the Greenland mesh? Belos or AztecOO? Does the choice of the solver make a difference?
On the Greenland mesh, I've tried Belos/AztecOO, muelu/ml and it didn't make a difference (I still had an inconsistency).
> Have you tried the Thwaites mesh? Maybe there is something wrong with the Greenland meshes.
I tried 8 ranks and 80 ranks with Thwaites and did not see a large discrepancy (linear iterations: 95/96, solution average: -162.729919149/-162.729919148). I've only seen a large difference when running the Greenland 1-7km and 1-10km. Maybe the large variation in the domain of those meshes causes the error to be more noticeable? There's still a small difference in the Humboldt case so the error could still be there, just not as noticeable.
Uhm, Thwaites is a reasonably sized ice sheet with non-trivial dynamics. So I suspect there's something wrong with those meshes that makes the problem very ill-conditioned. Can you put this on hold for a week or so? I'm now working on a new Greenland initialization, and we can target that when it's ready.
Another interesting bit of information. For the Humboldt case, if I use similar boundary conditions and ML settings to the nightly test (it must be both), I get exactly the same number of iterations and solution average. I.e., if I add this:
Dirichlet BCs:
  DBC on NS extruded_boundary_node_set_3 for DOF U0 prescribe Field: velocity
  DBC on NS extruded_boundary_node_set_3 for DOF U1 prescribe Field: velocity
ML:
  Base Method Defaults: none
  ML Settings:
    default values: SA
    ML output: 0
    'repartition: enable': 1
    'repartition: max min ratio': 1.32699999999999995e+00
    'repartition: min per proc': 600
    'repartition: Zoltan dimensions': 2
    'repartition: start level': 4
    'semicoarsen: number of levels': 2
    'semicoarsen: coarsen rate': 12
    'smoother: sweeps': 4
    'smoother: type': Chebyshev
    'smoother: Chebyshev eig boost': 1.19999999999999995e+00
    'smoother: sweeps (level 0)': 1
    'smoother: sweeps (level 1)': 4
    'smoother: type (level 0)': line Jacobi
    'smoother: type (level 1)': line Jacobi
    'smoother: damping factor': 5.50000000000000044e-01
    'smoother: pre or post': both
    'coarse: type': Amesos-KLU
    'coarse: pre or post': pre
    'coarse: sweeps': 4
I get this for both:
number of linear iterations on first nonlinear iteration: 17
mean value of final solution: 2.66149625626
What BCs were you using before for Humboldt? (B.t.w., when I was telling you not to prescribe any Dirichlet BC on the lateral boundary, I was referring to Greenland, not to Humboldt.)
It might be that some ranks pick up a high-speed region, while others pick up slow-speed ones. This is especially possible for highly non-uniform meshes.
> Uhm, Thwaites is a reasonably sized ice sheet with non-trivial dynamics. So I suspect there's something wrong with those meshes that makes the problem very ill-conditioned. Can you put this on hold for a week or so? I'm now working on a new Greenland initialization, and we can target that when it's ready.
I'm okay with this. Since there are cases where I'm able to get the same number of iterations and average solution between a low-rank and a high-rank run, I'll assume this is some sort of issue with the mesh and continue trying to improve the GPU runs (working on cases that work, like Thwaites). I recall seeing the same issue there, but maybe not; I can't find the runs.
> What BCs were you using before for Humboldt? (B.t.w., when I was telling you not to prescribe any Dirichlet BC on the lateral boundary, I was referring to Greenland, not to Humboldt.)
I was using this:
LandIce BCs:
  Number: 2
  BC 0:
    Type: Basal Friction
    Side Set Name: basalside
    Cubature Degree: 4
    Basal Friction Coefficient:
      Type: Exponent Of Given Field
      Given Field Variable Name: basal_friction_log
      Zero Beta On Floating Ice: true
  BC 1:
    Type: Lateral
    Cubature Degree: 4
    Side Set Name: extruded_boundary_side_set_1
Something similar is done in the nightly input file too, but it has that extra term on extruded_boundary_node_set_3. What's the difference between extruded_boundary_node_set_1 and extruded_boundary_node_set_3?
The Humboldt mesh is obtained by cutting out a piece of the Greenland ice sheet. On the internal (artificial) boundary (node_set_3) we prescribe the velocity obtained from a full Greenland simulation. On the real ice margin (side_set_1) we prescribe the usual lateral condition, which accounts for ocean backpressure if the ice is partially submerged in the ocean.
> Uhm, Thwaites is a reasonably sized ice sheet with non-trivial dynamics. So I suspect there's something wrong with those meshes that makes the problem very ill-conditioned. Can you put this on hold for a week or so? I'm now working on a new Greenland initialization, and we can target that when it's ready.

> I'm okay with this. Since there are cases where I'm able to get the same number of iterations and average solution between a low-rank and a high-rank run, I'll assume this is some sort of issue with the mesh and continue trying to improve the GPU runs (working on cases that work, like Thwaites). I recall seeing the same issue there, but maybe not; I can't find the runs.
Yeah, I think it's probably the best course of action.
Ah okay, thanks for the info. I should update my Humboldt cases.
I ran into an interesting result when comparing linear convergence and solution average for different numbers of MPI ranks in Albany Land Ice:
The difference in linear convergence and solution average seems pretty large. My nonlinear tolerance is set to 1.0e-6 and my linear tolerance to 1.0e-8; I didn't see much of a difference when tightening them. I also didn't see much of a difference when I tried the same experiment on smaller problems. I'm still investigating what might cause this, but let me know if anyone has any thoughts.
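One reason the reported solution average can drift with the rank count is simply that iterative solves stop at a finite tolerance: two runs that take different iteration paths agree only to roughly the looser tolerance, amplified by the conditioning of the system. A minimal numpy sketch (hand-written CG on a 1D Laplacian; the matrix, size, and tolerances are made up for illustration and have nothing to do with the actual ALI systems):

```python
import numpy as np

def cg(A, b, rtol, maxit=10000):
    # Unpreconditioned conjugate gradient, stopping on relative residual
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rr = r @ r
    normb = np.linalg.norm(b)
    for _ in range(maxit):
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < rtol * normb:
            break
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

n = 200
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # 1D Laplacian, SPD
b = np.sin(np.linspace(0.0, np.pi, n))

# Same system, two stopping tolerances: the solution means agree
# only up to a tolerance-and-conditioning-dependent gap.
x_loose = cg(A, b, 1e-6)
x_tight = cg(A, b, 1e-10)
avg_gap = abs(x_loose.mean() - x_tight.mean())
print(avg_gap)
```

This doesn't rule out an assembly bug, but it does mean small mean-value discrepancies between rank counts are expected whenever the iteration histories differ, and the gap grows with ill-conditioning, consistent with the problem being worst on the Greenland meshes.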